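Abstract: Automatically extracting hyponymy (hypernym-hyponym) relations from massive text corpora is an important task in natural language processing. Once candidate relations have been extracted by simple pattern matching, validating and filtering them is the difficult part. To address this, the paper computes the context similarity and the Brown clustering similarity of words and proposes a hyponymy validation method that clusters the set of candidate hyponyms using features that combine the two similarities. The validation model and the combination weight coefficient of the two similarities are obtained by computing the context similarity and Brown clustering similarity on a small amount of labeled training data. The method requires no existing lexical relation dictionary or knowledge base and effectively filters hyponymy extraction results. Experiments on the CCF NLP&CC 2012 lexical semantic relation evaluation corpus show that the method clearly improves the F-measure compared with methods such as pattern matching and context comparison.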
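Keywords:
hyponymy relation,
context similarity,
Brown clustering similarity,
pointwise mutual information,
pattern matching,
clustering validation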
Abstract: Hyponymy has many important applications in Natural Language Processing (NLP), and the automatic extraction of hyponymy relations from massive text corpora is an important NLP research task. The main difficulty is validating whether a candidate hyponym extracted by simple pattern matching is actually correct. By calculating the context feature similarity (SimCF) and the Brown clustering similarity (SimBrown), this paper proposes a novel approach to hyponymy validation. It clusters the hyponym candidates, using a clustering similarity feature obtained by combining SimCF and SimBrown. The combination coefficient of the two similarities is derived from the SimCF and SimBrown values between the labeled training words and their hyponyms. The model can filter coarse extraction results without any existing lexical relation dictionary or knowledge base. Evaluation on the CCF NLP&CC 2012 word semantic relation corpus shows that the proposed approach clearly improves the F-measure compared with other approaches, including pattern matching and simple context comparison.
Key words:
hyponymy relation,
context similarity,
Brown clustering similarity,
Pointwise Mutual Information (PMI),
pattern matching,
clustering validation
CLC number: