
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Tibetan Text Representation Method Based on an Improved Chi-squared Statistic

XU Tao, YU Hong-zhi, JIA Yang-ji

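  1. (Key Lab of China’s National Linguistic Information Technology, Northwest University for Nationalities, Lanzhou 730030)
  • Received: 2013-04-17 Publication date: 2014-06-15 Release date: 2014-06-13
  • About the authors: XU Tao (born 1986), male, Ph.D. candidate, main research interests: natural language processing and machine learning; YU Hong-zhi, professor and doctoral supervisor; JIA Yang-ji, Ph.D. candidate.
  • Funding:
    National Basic Research Program of China ("973" Program) (2013CB329303); National Natural Science Foundation of China (61032008); National Key Technology R&D Program of China (2009BAH41B07); Fundamental Research Funds for the Central Universities (ycx13014).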

Tibetan Document Representation Method Based on Improved Chi-squared Statistic

XU Tao, YU Hong-zhi, JIA Yang-ji   

  1. (Key Lab of China’s National Linguistic Information Technology, Northwest University for Nationalities, Lanzhou 730030, China)
  • Received: 2013-04-17 Online: 2014-06-15 Published: 2014-06-13

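Abstract: Tibetan text representation converts unstructured Tibetan text into a data form that a computer can process, and it is a prerequisite for feature extraction in fields such as Tibetan text classification and text clustering. Traditional Tibetan text representation methods rarely consider the degree of association between feature terms, which easily causes semantic loss. To address this, a new Tibetan text representation method is proposed in combination with the vector space model. A subset of the terms in the text with high TF-IDF values is extracted as comparison terms, the Tibetan text is segmented into sentences, each sentence is treated as one context topic, and the Chi-squared statistic is used to compute the degree of association between the terms in the text and the comparison terms. Experimental results show that, compared with the traditional vector space model, this method represents Tibetan text more accurately.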

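Key words: Tibetan information processing, improved Chi-squared statistic, text representation, automatic sentence segmentation, vector space model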

Abstract: Tibetan document representation converts unstructured Tibetan text into a form that a computer can process, and it is a prerequisite for the categorization and clustering of Tibetan text. Traditional Tibetan document representation methods rarely take the degree of association between feature items into account. As a result, some semantic information is lost and the accuracy of the document representation is reduced. Building on the Vector Space Model (VSM), a classical model in information retrieval, this paper proposes a new document representation method. Terms with high TF-IDF values are first extracted as comparison terms. The Tibetan document is then segmented into sentences, each of which serves as a context topic, and the Chi-squared statistic is used to compute the degree of association between each term and the comparison terms. Experimental results show that this method represents Tibetan documents more accurately than the traditional VSM.

Key words: Tibetan information processing, improved Chi-squared statistic, document representation, automatic sentence segmentation, Vector Space Model (VSM)
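
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the document is already split into sentences and segmented into words, it uses the standard 2x2 Chi-squared statistic because the abstract does not specify the paper's improvement, and the function names (`top_tfidf_terms`, `chi_square`, `represent`) and the smoothed IDF formula are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, corpus, k):
    """Return the k terms of one document with the highest TF-IDF weight.

    doc_tokens: tokens of the target document.
    corpus: list of token lists, one per document.
    The smoothed IDF used here is an assumption, not the paper's formula.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    term_freq = Counter(doc_tokens)
    scores = {
        t: (f / len(doc_tokens)) * math.log((1 + n_docs) / (1 + doc_freq[t]))
        for t, f in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def chi_square(term, ref, sentences):
    """Standard 2x2 Chi-squared statistic, with each sentence as one context.

    sentences: list of sets of tokens.
    """
    n = len(sentences)
    a = b = c = d = 0
    for sent in sentences:
        has_term, has_ref = term in sent, ref in sent
        if has_term and has_ref:
            a += 1  # both terms occur in the sentence
        elif has_term:
            b += 1  # only the candidate term occurs
        elif has_ref:
            c += 1  # only the comparison term occurs
        else:
            d += 1  # neither occurs
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a term that occurs nowhere or everywhere carries no signal
    return n * (a * d - b * c) ** 2 / denom


def represent(sentences, corpus, k=2):
    """Map each term of a document to its scores against k comparison terms."""
    doc_tokens = [t for sent in sentences for t in sent]
    refs = top_tfidf_terms(doc_tokens, corpus, k)
    sent_sets = [set(sent) for sent in sentences]
    vocab = sorted(set(doc_tokens))
    return refs, {t: [chi_square(t, r, sent_sets) for r in refs] for t in vocab}
```

Calling `represent(sentences, corpus)` returns the chosen comparison terms and, for every term in the document, one score per comparison term. Note that the plain statistic is large for both strong co-occurrence and strong mutual exclusion, which is one reason a modified variant may be preferable in practice.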
