Computer Engineering ›› 2019, Vol. 45 ›› Issue (10): 288-292, 300. doi: 10.19678/j.issn.1000-3428.0052123

• Development Research and Engineering Application •

Research on Uyghur Stop Words Extraction Method

SAIMAITI Maimaitimin1, ESMAEL Abdurehim2

  1. Chinese Languages School, Xinjiang University, Urumqi 830046, China;
  2. Xinjiang Research Center for Chinese-Ethnic Languages Translation, Urumqi 830046, China
  • Received: 2018-07-16 Revised: 2018-10-22 Online: 2019-10-15 Published: 2018-11-09
  • About the authors: SAIMAITI Maimaitimin (born 1980), male, associate professor, Ph.D.; main research interest: natural language information processing. ESMAEL Abdurehim, lecturer, Ph.D.
  • Funding:
    National Social Science Fund of China (17XYY034); Youth Project of Humanities and Social Sciences Research, Ministry of Education (16XJJC740001).

Abstract: To improve the efficiency of information processing, text information retrieval systems usually filter out stop words as noise, which affects the results of text processing. To address this problem, a stop word extraction method for the Uyghur language is proposed. Based on an analysis of the characteristics of Uyghur stop words, statistics are computed over a large corpus using Document Frequency (DF), Term Frequency (TF) and information entropy (EN), and the part-of-speech distribution of the candidate stop words is analyzed. The stop word threshold is determined through text classification experiments. The results show that, after stop words are filtered with the proposed method, the computational complexity of text classification is reduced and the classification precision reaches 80.8%.

Key words: information retrieval, stop words, Uyghur, text classification, corpus statistics

CLC Number:
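The abstract names three corpus statistics (document frequency, term frequency and information entropy) but does not give their formulas or say how they are combined. The following is a minimal Python sketch of one common way to compute them over a tokenized corpus. The function names, the normalizations, and the rank-sum combination are assumptions made for illustration, not the authors' method, and the toy tokens are English placeholders rather than Uyghur data.

```python
import math
from collections import Counter


def stop_word_statistics(docs):
    """Return {term: (df, tf, entropy)} for a list of tokenized documents.

    Assumed definitions (not taken from the paper):
      df      = fraction of documents that contain the term
      tf      = term count divided by the total number of tokens
      entropy = Shannon entropy (base 2) of the term's distribution
                across documents
    """
    n_docs = len(docs)
    total_tokens = sum(len(d) for d in docs)
    doc_counts = [Counter(d) for d in docs]

    term_total = Counter()  # occurrences of each term in the whole corpus
    doc_freq = Counter()    # number of documents containing each term
    for counts in doc_counts:
        term_total.update(counts)
        doc_freq.update(counts.keys())

    stats = {}
    for term, total in term_total.items():
        entropy = 0.0
        for counts in doc_counts:
            c = counts.get(term, 0)
            if c:
                p = c / total
                entropy -= p * math.log2(p)
        stats[term] = (doc_freq[term] / n_docs, total / total_tokens, entropy)
    return stats


def candidate_stop_words(docs, top_k):
    """Rank terms by the sum of their DF, TF and entropy ranks.

    The rank-sum combination is a hypothetical choice; the paper sets its
    threshold through text classification experiments instead.
    """
    stats = stop_word_statistics(docs)

    def rank_by(index):
        # Higher statistic first; ties broken alphabetically for determinism.
        ordered = sorted(stats, key=lambda t: (-stats[t][index], t))
        return {t: r for r, t in enumerate(ordered)}

    ranks = [rank_by(i) for i in range(3)]
    combined = sorted(stats, key=lambda t: (sum(r[t] for r in ranks), t))
    return combined[:top_k]


if __name__ == "__main__":
    toy_docs = [["the", "cat", "the"], ["the", "dog"], ["the", "bird", "dog"]]
    print(candidate_stop_words(toy_docs, 2))
```

Terms that appear in most documents, occur often, and are spread evenly across documents score high on all three statistics, which is the usual intuition behind treating them as stop word candidates. In practice `top_k` would be tuned, for example by checking classification performance after filtering, as the abstract describes.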