
Computer Engineering


A Sample Characteristics Library Generation Method of Client SMS Filtering

BAO Li-qun¹, HOU Zhi-wei², LI Xiang-lin¹

  1. Department of Electronic and Information Engineering, Lanzhou Institute of Technology, Lanzhou 730050, China; 2. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Received: 2012-11-21 Online: 2014-01-15 Published: 2014-01-13

Chinese title: 一种客户端短信过滤的样本特征库生成方法 (A Sample Feature Library Generation Method for Client-Side SMS Filtering)

包理群1,侯志伟2,李祥林1   

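  1. 兰州工业学院电子信息工程系,兰州 730050;2. 兰州交通大学电子信息工程学院,兰州 730070
  • About the authors: BAO Li-qun (born 1983), female, lecturer, M.S., main research interests: intelligent information processing and embedded development; HOU Zhi-wei, master's student; LI Xiang-lin, professor
  • Funding:
    Natural Science Foundation of Gansu Province (1208RJZA186); Scientific Research Fund for Higher Education Institutions of Gansu Province (2013A-127); Science and Technology Support Program of Gansu Province (1104GKCA032); Science and Technology Program of Lanzhou City (2010-1-225)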

Abstract: To address the lack of sample libraries for Chinese SMS filtering, this paper proposes a method for generating a sample characteristics library on the client. It designs a sample characteristics database for client-side SMS spam filtering and applies text preprocessing and Chinese word segmentation to messages received on the client. To account for low-frequency words that carry a large amount of information and for terms with strong category characteristics, it improves the mutual information evaluation function used for feature extraction, extracts the sample characteristics, and forms the characteristic data. Experiments with a Bayesian algorithm test the impact of the number of features on filter performance, and the results show that accuracy reaches its maximum when the number of features is 10. Experiments also measure the database file size: when the number of key words reaches 2 000, the database file is about 714.28 KB. The library can therefore run on an ordinary mobile phone platform, and the tests show that the method is feasible.

Key words: client SMS filtering, sample characteristics library, embedded database, ARM-Linux platform, porting, mutual information
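
The pipeline summarized in the abstract (score terms by mutual information, keep the top-scoring terms as features, then classify with a Bayesian filter) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard pointwise mutual information score, not the paper's improved evaluation function, which is not given on this page. The messages are assumed to be already segmented into tokens, and the function names (`mutual_information`, `top_features`, `train_nb`, `classify`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels, target):
    """Standard pointwise MI between each term and the class `target`.

    MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from document counts.
    Assumption: this is the textbook score, not the paper's improved one.
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    df = Counter()    # number of documents containing term t
    df_c = Counter()  # number of documents of class `target` containing t
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == target:
                df_c[t] += 1
    # P(t,c) / (P(t) P(c)) simplifies to (df_c * n) / (df * n_c)
    return {t: math.log(a * n / (df[t] * n_c)) for t, a in df_c.items()}


def top_features(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def train_nb(docs, labels, features):
    """Multinomial Naive Bayes restricted to `features`, Laplace-smoothed."""
    feats = set(features)
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in feats)
    likelihood = {}
    for c in classes:
        total = sum(counts[c].values())
        likelihood[c] = {
            t: (counts[c][t] + 1) / (total + len(feats)) for t in feats
        }
    return prior, likelihood


def classify(tokens, prior, likelihood):
    """Pick the class with the highest log-posterior; unknown terms are skipped."""
    best, best_lp = None, -math.inf
    for c in sorted(prior):
        lp = math.log(prior[c])
        lp += sum(math.log(likelihood[c][t]) for t in tokens if t in likelihood[c])
        if lp > best_lp:
            best, best_lp = c, lp
    return best


# Toy, pre-segmented messages (hypothetical data, English tokens for brevity).
docs = [
    ["win", "free", "prize"],
    ["free", "offer", "call"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

# Keep the top 3 terms per class as the feature set.
features = []
for c in ("spam", "ham"):
    features += top_features(mutual_information(docs, labels, c), 3)

prior, likelihood = train_nb(docs, labels, features)
print(classify(["free", "prize", "now"], prior, likelihood))
print(classify(["lunch", "at", "noon"], prior, likelihood))
```

In the toy data every class-exclusive term receives the same score regardless of how often it occurs, which shows why plain mutual information tends to favor rare terms. The abstract states that the improved evaluation function takes low-frequency words and category characteristics into account, but the details are not available here.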

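Abstract (Chinese version): To address the lack of sample libraries in current research on Chinese SMS filtering, a method for generating a client-side sample feature library is proposed. A sample feature database for client-side SMS filtering is designed, and messages received on the client are preprocessed and segmented into Chinese words. Taking into account low-frequency words with high information content and feature terms with strong category characteristics, the mutual information evaluation function is improved to extract sample features and form the feature data. The Naive Bayes algorithm is used to test the effect of the number of features on filter performance. Experimental results show that test accuracy reaches its maximum when the number of features is 10, and that when the number of messages in the sample feature library reaches 2 000, the database file is about 714.28 KB. The library can run on an ordinary mobile phone platform, which verifies the feasibility of the generation method.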

Key words (Chinese version): client-side SMS filtering, sample feature library, embedded database, ARM-Linux platform, porting, mutual information
