Abstract:
Ensuring consistent part-of-speech (POS) tagging is a key problem in building Chinese corpora. Based on an analysis of the POS tagging of multi-category words in a large-scale corpus, this paper proposes a new method for checking POS tagging consistency. The method analyzes the features of POS tag sequences, builds a vector model of the context of each multi-category word, uses the k-nearest-neighbor (k-NN) algorithm to classify these context vectors, judges whether the word's POS tagging is consistent, and reports the POS tagging consistency of each text. The method is evaluated on a 1.5-million-word corpus from Peking University.
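As a rough illustration of the idea summarized above, the following minimal sketch encodes the POS tags surrounding a multi-category word as a sparse context vector, finds the k most similar labelled contexts, and flags an occurrence as inconsistent when its assigned tag differs from the majority tag of those neighbors. The feature scheme (tag and offset pairs), the 1/distance weighting, cosine similarity, the window size, and all function names are assumptions made for this example; the abstract does not specify the paper's actual vector model or parameters, and the tag sequences are toy data.

```python
import math
from collections import Counter


def context_vector(tags, index, window=2):
    """Sparse vector of (offset, tag) features around position `index`.

    The tag of the target word itself (offset 0) is deliberately excluded.
    """
    vec = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = index + offset
        tag = tags[pos] if 0 <= pos < len(tags) else "<pad>"
        # Assumed weighting: nearer neighbours contribute more.
        vec[(offset, tag)] = 1.0 / abs(offset)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_tag(query, labelled, k=3):
    """Majority tag among the k labelled contexts most similar to `query`."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]


def is_consistent(tags, index, labelled, k=3, window=2):
    """True if the tag assigned at `index` matches the k-NN prediction."""
    predicted = knn_tag(context_vector(tags, index, window), labelled, k)
    return predicted == tags[index]


# Toy labelled occurrences of one multi-category word: (tag sequence, position).
# Tags such as "n", "v", "d", "u", "w" are illustrative only.
examples = [
    (["d", "v", "n"], 1),
    (["d", "v", "u"], 1),
    (["d", "v", "w"], 1),
    (["n", "u", "n", "v"], 2),
    (["v", "u", "n", "w"], 2),
]
labelled = [(context_vector(seq, i), seq[i]) for seq, i in examples]

# Check two new occurrences of the same word against the labelled contexts.
print(is_consistent(["d", "n", "n"], 1, labelled))
print(is_consistent(["p", "u", "n", "v"], 2, labelled))
```

Aggregating the per-occurrence results over all multi-category words in a text would give a per-text consistency figure of the kind the abstract mentions; how the paper itself aggregates is not stated here.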
Key words:
classification,
POS tagging,
multi-category words,
consistency of POS tagging
摘要: 制约语料库加工质量的一个重要方面是多标记词语的词性标注一致性问题。该文通过对大规模语料库兼类词的词性标注结果的分析,提出一种语料库词性标注一致性检查的方法,分析词性标记序列的特征并建立兼类词语境向量模型,运用k最近邻法,对兼类词语境进行向量分类,判定兼类词词性标注是否一致,得出每篇文章的词性标注的一致性情况,并测试了北京大学的150万语料。
关键词:
分类,
词性标注,
兼类词,
词性标注一致性
ZHANG Hu; ZHENG Jia-heng. Consistency Check on POS Tagging of Chinese Corpus Based on Classification[J]. Computer Engineering, 2008, 34(8): 90-92.
张 虎;郑家恒. 基于分类的汉语语料库词性标注一致性检查[J]. 计算机工程, 2008, 34(8): 90-92.