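Abstract: With the rapid development of the Internet, information resources of all kinds are growing exponentially, bringing many negative effects with them, so a safe and healthy network environment needs to be built. To this end, a sensitive information filtering algorithm for Web page text content (SWDT-IFA) is proposed. The algorithm does not rely on dictionaries or word segmentation. It builds a sensitive word decision tree, runs the Web page text through the tree as a data stream, records the frequency, region information, and level of each sensitive word, computes the overall sensitivity of the text, and filters out sensitive text. Experimental results show that SWDT-IFA achieves relatively high precision and recall, and that its execution time meets the real-time requirements of the current network environment.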
Key words: text filtering, sensitivity level, decision tree, stream splitting, word frequency
Abstract: With the development of the Internet, the exponential growth of information resources has brought many negative effects, so a more secure and healthy network environment needs to be constructed. To address this problem, this paper proposes a Sensitive Word Decision Tree for Information Filtering Algorithm (SWDT-IFA) for content-based filtering of Web pages. The algorithm relies on neither a dictionary nor word segmentation. It builds a sensitive word decision tree, passes the Web text through the tree as a data stream, records word frequency, regional information, and sensitivity level, and calculates the overall sensitivity of the text in order to filter out sensitive text. Experimental results show that SWDT-IFA achieves high precision and recall, and that its running time is low enough to meet the real-time demands of the network environment.
Key words: text filtering, sensitivity level, decision tree, stream splitting, word frequency
CLC number:
DENG Yi-gui (邓一贵), WU Yu-ying (伍玉英). Information Filtering Algorithm of Text Content-based Sensitive Words Decision Tree (基于文本内容的敏感词决策树信息过滤算法)[J]. Computer Engineering (计算机工程).
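
The filtering procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The sensitive word decision tree is modeled here as a character-level prefix tree, which is an assumption about its structure, and the scoring rule (frequency times word level times region weight), the region weights, the threshold, and all names such as `build_tree`, `scan`, and `REGION_WEIGHT` are hypothetical choices made only for illustration. The placeholder words `foo` and `foobar` stand in for real sensitive words.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    level: Optional[int] = None  # set only on nodes where a sensitive word ends


def build_tree(words: Dict[str, int]) -> Node:
    """Build a character-level tree from {sensitive_word: level}."""
    root = Node()
    for word, level in words.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.level = level
    return root


def scan(root: Node, text: str) -> Counter:
    """Stream the text through the tree character by character.

    No dictionary lookup or word segmentation is used: every position
    is tried as a possible start of a sensitive word.
    """
    hits: Counter = Counter()
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.level is not None:
                hits[text[start:end + 1]] += 1
    return hits


# Assumed region weights: a hit in the title counts more than one in the body.
REGION_WEIGHT = {"title": 3.0, "body": 1.0}


def sensitivity(root: Node, words: Dict[str, int], regions: Dict[str, str]) -> float:
    """Assumed score: sum over hits of frequency * word level * region weight."""
    score = 0.0
    for region, text in regions.items():
        for word, freq in scan(root, text).items():
            score += freq * words[word] * REGION_WEIGHT[region]
    return score


def is_sensitive(root: Node, words: Dict[str, int],
                 regions: Dict[str, str], threshold: float = 5.0) -> bool:
    """Flag the page when its overall score reaches the (assumed) threshold."""
    return sensitivity(root, words, regions) >= threshold


# Usage with placeholder data
words = {"foo": 1, "foobar": 2}
tree = build_tree(words)
page = {"title": "foo sale", "body": "a foobar and foo"}
print(sensitivity(tree, words, page), is_sensitive(tree, words, page))
```

Because the scan restarts from every character position, overlapping words such as `foo` inside `foobar` are both counted. Whether SWDT-IFA counts overlaps this way is not stated in the abstract.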