Computer Engineering ›› 2013, Vol. 39 ›› Issue (7): 252-256. doi: 10.3969/j.issn.1000-3428.2013.07.056

• Artificial Intelligence and Recognition Technology •

Text Content Extraction Algorithm for Green Network Webpage

LONG Long 1, DENG Wei 2

  (1. College of Computer and Information Engineering, Guangxi Teachers Education University, Nanning 530023, China; 2. Guangxi Cancer Institute, Nanning 530021, China)
  • Received: 2012-07-31 Published: 2013-07-15 Released: 2013-07-12
  • About the authors: LONG Long (born 1980), male, senior engineer, M.S., main research interest: machine learning; DENG Wei (corresponding author), associate chief physician, Ph.D.
  • Supported by:
    National Innovation Fund of China (10C26224504901); Guangxi Natural Science Foundation (2011GXNSFB0180825)

Abstract: Web pages on the Internet carry a large amount of commercial advertising, and green network systems fail to filter out the sites among them that contain harmful content. To address this problem, this paper proposes a main-text content extraction algorithm for green network Web pages. It uses the Document Object Model (DOM) tree to identify and extract the main-content blocks of a page, applies a particle-swarm-based weight optimization algorithm to score the feature weights of each block of the main text, and compares the extracted content with harmful keywords to identify and filter harmful Web pages. Experimental results show that, after extraction optimized by the particle swarm weight algorithm, the green network system identifies harmful Web pages with an accuracy rate of 86.9%, a recall rate of 95.6%, and an F value of 91.02%, a considerable improvement over the results before optimization.

Key words: green network, Internet addiction, harmful content, Particle Swarm Optimization (PSO), main-text extraction
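
As a quick consistency check on the reported figures, the sketch below assumes that the "accuracy rate" is what is usually called precision and that the F value is the balanced F1 measure, that is, the harmonic mean of precision and recall. The abstract states neither definition, and the function name `f_measure` is illustrative rather than taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision <= 0.0 or recall <= 0.0:
        # The harmonic mean is taken to be 0 when either input is 0.
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Figures from the abstract, read here as precision and recall (an assumption).
reported_precision = 0.869
reported_recall = 0.956

f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")
```

With the rounded inputs this evaluates to roughly 91.04%, which is within rounding distance of the reported 91.02%. A gap of that size is what one would expect if the F value was computed from unrounded precision and recall.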

CLC Number: