
Computer Engineering, 2014, Vol. 40, Issue (12): 199-204. doi: 10.3969/j.issn.1000-3428.2014.12.037

• Artificial Intelligence and Recognition Technology •

Chinese Webpage Encoding Identification Algorithm Based on Word Frequency Distribution

HOU Zhengfeng1, ZHANG Hao1, ZHANG Na2

  1. School of Computer & Information, Hefei University of Technology, Hefei 230009, China
  2. Huainan Branch of Anhui Mobile Limited, Huainan 232001, China
  • Received: 2013-12-05 Revised: 2014-01-22 Online: 2014-12-15 Published: 2015-01-16
  • About the authors: HOU Zhengfeng (b. 1958), male, professor, main research interest: network information security; ZHANG Hao and ZHANG Na, master's students.
  • Funding:
    Supported by the Ministry of Education and Guangdong Province Industry-University-Research Fund Project (2009B090200049).

Abstract: Encoding identification is a necessary prerequisite for webpage content filtering, and the coexistence of several Chinese encodings complicates content filtering for Chinese webpages. To address this problem, this paper presents FKI (Frequency Keyword Identification), a Chinese webpage encoding identification algorithm based on the frequency distribution of Chinese characters. According to how often Chinese characters are used, FKI selects the most frequently used characters to build high-frequency character encoding tables. Using the high-frequency character encodings as keywords, FKI scans the webpage to be identified with an improved pattern matching algorithm and counts the number of matches. The match results for each encoding then serve as the basis for determining the actual encoding of the webpage. Experimental results show that, compared with the Unigram algorithm, FKI achieves a higher recognition rate on the Chinese encodings currently in common use. FKI is suitable for identifying the encoding of Chinese webpages of unknown encoding quickly and accurately.

Key words: Chinese encoding, Web filtering, high frequency characters, pattern matching, finite state automata
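
The pipeline described in the abstract (build high-frequency character tables, scan the page for those byte patterns, pick the encoding with the most matches) can be illustrated with a minimal Python sketch. This is not the paper's FKI implementation. The 20-character lists, the three candidate encodings, and the names `build_tables`, `count_matches`, and `identify` are assumptions made for illustration, and a plain `bytes.count` scan stands in for the improved pattern matching algorithm the paper uses.

```python
# Hypothetical, tiny stand-ins for the paper's high-frequency character tables.
SIMPLIFIED = "的一是在不了有和人这中大为上个国我以要他"
TRADITIONAL = "的一是在不了有和人這中大為上個國我以要他"

# Candidate encodings and the characters used as keywords for each (assumed).
CANDIDATES = {
    "gb2312": SIMPLIFIED,
    "big5": TRADITIONAL,
    "utf-8": SIMPLIFIED + TRADITIONAL,
}


def build_tables():
    """Map each candidate encoding to the byte patterns of its keyword characters."""
    return {
        enc: {ch.encode(enc) for ch in set(chars)}
        for enc, chars in CANDIDATES.items()
    }


def count_matches(data, tables):
    """Count how often each encoding's keyword patterns occur in the raw bytes."""
    # Naive scan: one pass per pattern. An automaton-based multi-pattern
    # matcher would find all keywords in a single pass over the data.
    return {
        enc: sum(data.count(pattern) for pattern in patterns)
        for enc, patterns in tables.items()
    }


def identify(data, tables=None):
    """Return the candidate encoding whose keywords match the bytes most often."""
    tables = tables or build_tables()
    scores = count_matches(data, tables)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    page = "这是我的中国".encode("gb2312")  # raw bytes of unknown encoding
    print(count_matches(page, build_tables()))
    print(identify(page))
```

The sketch ignores character alignment, so a pattern may occasionally match across a character boundary. On realistic pages the correct encoding should still collect far more matches than the others, which is the property the match-count decision relies on.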

CLC Number: