Tibetan Document Representation Method Based on Improved Chi-squared Statistic

doi:10.3969/j.issn.1000-3428.2014.06.040

Computer Engineering

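XU Tao, YU Hong-zhi, JIA Yang-ji

(Key Lab of China’s National Linguistic Information Technology, Northwest University for Nationalities, Lanzhou 730030, China)

Received: 2013-04-17 Online: 2014-06-15 Published: 2014-06-13
About the authors: XU Tao (born 1986), male, Ph.D. candidate, main research interests: natural language processing and machine learning; YU Hong-zhi, professor and doctoral supervisor; JIA Yang-ji, Ph.D. candidate.
Funding:
National Basic Research Program of China (973 Program) (2013CB329303); National Natural Science Foundation of China (61032008); National Key Technology R&D Program of China (2009BAH41B07); Fundamental Research Funds for the Central Universities (ycx13014).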

Abstract

Abstract: Tibetan document representation converts unstructured Tibetan text into a data form that a computer can process, and it is a prerequisite for feature extraction in Tibetan text categorization, text clustering, and related tasks. Traditional Tibetan document representation methods rarely take the degree of association between feature items into account, so semantic information is easily lost and the accuracy of the representation is reduced. To address this, this paper proposes a new Tibetan document representation method built on the Vector Space Model (VSM), a classical model in information retrieval. Terms with high TF-IDF values are first extracted from the text as compared terms. The Tibetan text is then segmented into sentences, each sentence is treated as one context topic, and the Chi-squared statistic is used to compute the degree of association between each term in the text and the compared terms. Experimental results show that this method represents Tibetan documents more accurately than the traditional VSM.

Key words: Tibetan information processing, improved Chi-squared statistic, document representation, automatic sentence segmentation, Vector Space Model (VSM)
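To make the idea in the abstract concrete, the following is a minimal sketch of how a term could be scored against a set of compared terms using sentence-level co-occurrence. It uses the standard 2x2 Chi-squared statistic, not the improved statistic of the paper, whose exact form is not given on this page. The function names `chi_square` and `represent` are hypothetical, and the input is assumed to be already segmented into sentences and words; the TF-IDF step that selects the compared terms is also assumed to have been done beforehand.

```python
from typing import Dict, List, Sequence


def chi_square(sentences: Sequence[Sequence[str]], term: str, compared: str) -> float:
    """Standard 2x2 Chi-squared statistic, counting each sentence as one observation.

    a: sentences containing both words
    b: sentences containing only `term`
    c: sentences containing only `compared`
    d: sentences containing neither
    """
    a = b = c = d = 0
    for sentence in sentences:
        words = set(sentence)
        has_term = term in words
        has_compared = compared in words
        if has_term and has_compared:
            a += 1
        elif has_term:
            b += 1
        elif has_compared:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # One of the words occurs in every sentence or in none: no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def represent(
    sentences: Sequence[Sequence[str]], compared_terms: Sequence[str]
) -> Dict[str, List[float]]:
    """Map every word to a vector of its scores against each compared term."""
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    return {
        word: [chi_square(sentences, word, compared) for compared in compared_terms]
        for word in vocabulary
    }


# Toy data with placeholder tokens; real input would be segmented Tibetan words.
toy_sentences = [["a", "b"], ["a", "b"], ["c"], ["c", "d"]]
toy_vectors = represent(toy_sentences, compared_terms=["b", "d"])
```

Each word ends up with one score per compared term, so the document is described by how its words relate to a small set of salient terms rather than by isolated term weights alone. Note that the plain statistic is symmetric: it grows for strong negative association as well as strong positive association.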

CLC number: TP18

XU Tao, YU Hong-zhi, JIA Yang-ji. Tibetan Document Representation Method Based on Improved Chi-squared Statistic[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.06.040.

https://www.ecice06.com/CN/Y2014/V40/I6/185

