Abstract: An analysis of several statistical evaluation methods shows that mutual information can measure the independence of the two words in a bigram and can therefore be used to discard bigrams that co-occur only by chance; the χ² test evaluates the selection preference of lexical combinations more reasonably and is suited to acquiring frequent bigrams; and the log-likelihood ratio test effectively acquires sparse bigrams, compensating for the sparse-data problem that the other methods cannot overcome. By combining mutual information, the χ² test, and the log-likelihood ratio test with heuristic rules based on subcategorization frames, a clearly layered lexical acquisition method that integrates multiple statistical evaluation methods is proposed.
Key words:
natural language processing,
lexical acquisition,
unknown word detection,
selection preference,
statistical evaluation method
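The three statistics named in the abstract can be illustrated on a single bigram's 2×2 contingency table. The following is a minimal sketch using the standard textbook forms of pointwise mutual information, Pearson's χ² statistic, and the log-likelihood ratio (G²); the function names, the table layout, and the example counts are assumptions made for illustration, and the abstract does not state the paper's exact formulas, thresholds, or heuristic rules.

```python
import math

# Assumed table layout for a bigram (w1, w2), illustrative only:
#   o11 = count(w1 followed by w2)
#   o12 = count(w1 followed by some other word)
#   o21 = count(some other word followed by w2)
#   o22 = count(bigrams containing neither w1 first nor w2 second)

def pmi(o11, o12, o21, o22):
    """Pointwise mutual information in bits; requires o11 > 0."""
    n = o11 + o12 + o21 + o22
    return math.log2(o11 * n / ((o11 + o12) * (o11 + o21)))

def chi_square(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 table (no continuity correction)."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

def log_likelihood_ratio(o11, o12, o21, o22):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells; zero cells contribute 0."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    observed = [o11, o12, o21, o22]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    table = (30, 10, 20, 940)
    print("PMI:", pmi(*table))
    print("chi-square:", chi_square(*table))
    print("log-likelihood ratio:", log_likelihood_ratio(*table))
```

For a table whose cells match the counts expected under independence, all three scores are zero; larger values indicate stronger association. How the scores are thresholded and layered is a design decision of the method and is not reproduced here.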
CLC number:
WANG Da-liang (王大亮); JIANG Hong-chao (蒋宏潮); TU Xu-yan (涂序彦); ZHENG Xue-feng (郑雪峰); TONG Zi-jian (佟子健). Lexical Acquisition Method Based on Selection Preference (基于选择倾向性的词汇获取方法)[J]. Computer Engineering (计算机工程), 2008, 34(12): 169-171.