摘要: 为提高维吾尔文网络内容查询的扩展性能,提出一种将维语同义词和互联网资源相结合的扩展词构建算法。利用维吾尔语同义词词典、近义词词典和反义词词典等建立基本候选词库,将互联网作为超大规模语料库,以搜索引擎为工具,使用改进的点互信息对基本扩展词进行相似度评价,选取前N个词形成候选扩展词库1,对包含关键词的互联网语料,基于局部共现和点互信息分析,构建候选扩展词库2,对上述2种候选扩展词库加权求和,按顺序选择部分词为扩展词。通过搜索引擎实现扩展查询验证,结果表明,与常规查询和同义词查询扩展算法相比,该算法能明显提高查询的准确率。
关键词: 查询扩展, 局部共现分析, 点互信息算法, 扩展词, 大规模语料库
Abstract: To improve the performance of query expansion for Uyghur web content, this paper presents an expansion-word construction algorithm that combines Uyghur synonym resources with Internet resources. An initial candidate word set is built from Uyghur synonym, near-synonym, and antonym dictionaries. Treating the Internet as a very large-scale corpus and using a search engine as the tool, the similarity between the keyword and each word in the initial candidate set is computed with an improved pointwise mutual information (PMI) measure. The words are ranked by similarity, and the top N are selected to form candidate expansion word set 1. In parallel, Internet text containing the keyword is analyzed using local co-occurrence and PMI to build candidate expansion word set 2. The two candidate sets are combined by weighted summation, and the highest-ranked words are chosen as the final expansion words. Expanded queries were verified through a search engine. The results show that, compared with ordinary keyword queries and synonym-based query expansion, the proposed algorithm clearly improves query accuracy.
Key words: query expansion, local co-occurrence analysis, pointwise mutual information algorithm, expansion word, large-scale corpus
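The pipeline summarized in the abstract (score candidates by PMI, keep the top N, then merge two candidate sets by weighted summation) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard PMI estimated from page counts, not the paper's improved variant, which the abstract does not specify. The function names, the hit-count inputs, and the weight `alpha` are all hypothetical.

```python
import math


def pmi_from_hits(hits_xy, hits_x, hits_y, total_pages):
    """Standard PMI from page counts: log2(p(x, y) / (p(x) * p(y))).

    The counts are assumed to come from a search engine; how they are
    obtained is outside this sketch.
    """
    if min(hits_xy, hits_x, hits_y) <= 0:
        return float("-inf")  # no evidence that the two words are related
    return math.log2((hits_xy * total_pages) / (hits_x * hits_y))


def top_n(scores, n):
    """Keep the n highest-scoring words, dropping non-finite scores."""
    finite = {w: s for w, s in scores.items() if math.isfinite(s)}
    ranked = sorted(finite.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def combine(set1, set2, alpha=0.5):
    """Weighted sum of two candidate sets; a missing word contributes 0.

    alpha is an assumed weight for set 1; set 2 receives 1 - alpha.
    Returns the words ordered from highest to lowest combined score.
    """
    words = set(set1) | set(set2)
    merged = {
        w: alpha * set1.get(w, 0.0) + (1 - alpha) * set2.get(w, 0.0)
        for w in words
    }
    return sorted(merged, key=lambda w: merged[w], reverse=True)
```

In this sketch, dictionary-derived candidates would be scored with `pmi_from_hits` and filtered with `top_n` to give set 1, co-occurrence-derived candidates would give set 2, and `combine` would produce the final ordering from which the expansion words are taken.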
中图分类号: