Computer Engineering ›› 2008, Vol. 34 ›› Issue (8): 90-92. doi: 10.3969/j.issn.1000-3428.2008.08.031

• Software Technology and Database •

Consistency Check on POS Tagging of Chinese Corpus Based on Classification

ZHANG Hu (张虎), ZHENG Jia-heng (郑家恒)

  1. (School of Computer & Information Technology, Shanxi University, Taiyuan 030006)
  • Online: 2008-04-20 Published: 2008-04-20

Abstract: The consistency of part-of-speech (POS) tagging for words that can take more than one tag is a key factor limiting the quality of an annotated Chinese corpus. Based on an analysis of how multi-category words are tagged in a large-scale corpus, this paper proposes a method for checking POS tagging consistency. The method analyzes the features of POS tag sequences, builds a context vector model for multi-category words, and uses the k-nearest-neighbor (k-NN) method to classify the context vectors and judge whether each multi-category word is tagged consistently, which yields the tagging consistency of every text. The method is evaluated on the 1.5-million-word Peking University corpus.

Key words: classification, POS tagging, multi-category words, consistency of POS tagging
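
The abstract gives only the outline of the approach: build a context vector for each multi-category word, classify it with k-NN, and compare the result with the tag actually assigned. The following is a minimal sketch of that outline, not the authors' implementation. The one-hot encoding of neighboring POS tags, the window size, the cosine similarity measure, the toy tag set, and all function names (`context_vector`, `knn_tag`, `is_consistent`) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy tag set: noun, verb, adverb, auxiliary, pronoun.
TAGSET = ["n", "v", "d", "u", "r"]

def context_vector(tags, i, tagset=TAGSET, window=2):
    """One-hot encode the POS tags around position i (the target tag is skipped)."""
    vec = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # never encode the tag that is being checked
        j = i + offset
        neighbor = tags[j] if 0 <= j < len(tags) else None
        vec.extend(1.0 if neighbor == t else 0.0 for t in tagset)
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_tag(query, labeled, k=3):
    """Majority tag among the k labeled vectors most similar to the query."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]

def is_consistent(query, assigned_tag, labeled, k=3):
    """An occurrence counts as consistent when k-NN agrees with its assigned tag."""
    return knn_tag(query, labeled, k) == assigned_tag

# Toy reference occurrences of one multi-category word (position 2 in each sequence).
reference = [
    (["r", "d", "v", "n"], "v"),
    (["n", "d", "v", "n"], "v"),
    (["r", "d", "v", "u"], "v"),
    (["n", "u", "n", "v"], "n"),
    (["r", "u", "n", "v"], "n"),
    (["v", "u", "n", "d"], "n"),
]
labeled = [(context_vector(tags, 2), tag) for tags, tag in reference]

# An occurrence tagged "n" in a context that looks like the verb contexts.
suspect = context_vector(["n", "d", "n", "n"], 2)
flagged = not is_consistent(suspect, "n", labeled)
```

In this toy data the suspect occurrence sits in an adverb-preceded context shared by the verb examples, so it is flagged for review. A per-text consistency figure could then be taken as the share of occurrences that are not flagged, although the abstract does not say how the paper computes it.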
