Abstract: Tibetan text representation converts unstructured Tibetan text into a form that computers can process, and is a prerequisite for feature extraction in tasks such as Tibetan text classification and clustering. Traditional Tibetan text representation methods pay little attention to the degree of association between feature terms, which tends to lose semantic information and reduces the accuracy of the representation. To address this, this paper proposes a new Tibetan text representation method built on the Vector Space Model (VSM), a classical model in information retrieval. Terms with high TF-IDF values are first extracted as comparison terms. The Tibetan text is then segmented into sentences, each sentence is treated as a context topic, and the Chi-squared statistic is used to compute the degree of association between each term in the text and the comparison terms. Experimental results show that the method represents Tibetan text more accurately than the traditional VSM.
Key words:
Tibetan information processing,
improved Chi-squared statistic,
text representation,
automatic sentence segmentation,
Vector Space Model (VSM)
Chinese Library Classification (CLC) number:
XU Tao (徐涛), YU Hong-zhi (于洪志), JIA Yang-ji (加羊吉). Tibetan Document Representation Method Based on Improved Chi-squared Statistic [J]. Computer Engineering (计算机工程).
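The pipeline described in the abstract (sentence segmentation, then a Chi-squared association between each term and each comparison term, with sentences as contexts) can be illustrated with a minimal Python sketch. Several points here are assumptions and not taken from the paper: the function names `split_sentences`, `chi_square` and `association_matrix` are hypothetical; sentences are split on the Tibetan shad marks `།` and `༎`, which is a simplification of automatic sentence segmentation; tokens are assumed to be already word-segmented and separated by spaces; the comparison terms are assumed to have been selected beforehand by TF-IDF; and the standard 2x2 Chi-squared statistic is used, since the abstract does not specify what the improved statistic changes.

```python
import re


def split_sentences(text):
    """Split text into sentences on the Tibetan shad marks (assumed rule).

    Each sentence is returned as a list of whitespace-separated tokens,
    i.e. the input is assumed to be word-segmented already.
    """
    parts = re.split(r"[།༎]+", text)
    return [p.split() for p in parts if p.strip()]


def chi_square(sentences, term, ref):
    """Standard 2x2 Chi-squared statistic between `term` and `ref`.

    Every sentence is one observation:
      a = sentences containing both term and ref
      b = sentences containing term only
      c = sentences containing ref only
      d = sentences containing neither
    """
    a = b = c = d = 0
    for sent in sentences:
        tokens = set(sent)
        has_term = term in tokens
        has_ref = ref in tokens
        if has_term and has_ref:
            a += 1
        elif has_term:
            b += 1
        elif has_ref:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # One of the terms occurs in every sentence or in none:
        # the statistic is undefined, so report no association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def association_matrix(sentences, terms, comparison_terms):
    """Map each term to its Chi-squared score against every comparison term."""
    return {
        t: {r: chi_square(sentences, t, r) for r in comparison_terms}
        for t in terms
    }


if __name__ == "__main__":
    # Toy tokens stand in for segmented Tibetan words.
    sents = split_sentences("a b། a b། c། c།")
    print(chi_square(sents, "a", "b"))  # 4.0
    print(association_matrix(sents, ["a", "c"], ["b"]))
```

Note that the standard statistic squares the term `a * d - b * c`, so two terms that never co-occur can score as high as two terms that always co-occur. Whether the improved statistic in the paper addresses this is not stated in the abstract.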