Abstract:
As more and more research papers appear on the Internet, accurately extracting header and citation information from them has become very important. This paper proposes an algorithm based on the hidden Markov model (HMM) for extracting header and citation information from Chinese research papers, and analyzes how the model structure is learned and how the model parameters are estimated. During extraction, the algorithm uses format information such as list separators and special labels to segment the text into blocks, and then applies the hidden Markov model to extract the specified fields. Experimental results show that the algorithm achieves good precision and recall.
Key words:
hidden Markov model,
information extraction,
paper header information
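
The two-stage idea described in the abstract (segment the text on separators, then label each block with a hidden Markov model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three field labels, the separator set, the coarse observation symbols, and the hand-set probabilities. In the paper the model structure and parameters are learned from data. Decoding uses the standard Viterbi algorithm.

```python
import math
import re

# Hypothetical field labels; the paper's actual state set is not given here.
STATES = ["title", "author", "affiliation"]

# Toy, hand-set parameters (assumptions). A real system would estimate them
# from labelled headers.
START = {"title": 0.8, "author": 0.1, "affiliation": 0.1}
TRANS = {
    "title": {"title": 0.2, "author": 0.7, "affiliation": 0.1},
    "author": {"title": 0.05, "author": 0.55, "affiliation": 0.4},
    "affiliation": {"title": 0.05, "author": 0.15, "affiliation": 0.8},
}
EMIT = {
    "title": {"long": 0.8, "short": 0.15, "inst": 0.05},
    "author": {"long": 0.05, "short": 0.9, "inst": 0.05},
    "affiliation": {"long": 0.2, "short": 0.1, "inst": 0.7},
}


def segment(text):
    """Split header text into blocks on an assumed set of separators."""
    return [b.strip() for b in re.split(r"[;；\n]+", text) if b.strip()]


def observe(block):
    """Map a block to a coarse observation symbol (illustrative features)."""
    if re.search(r"University|Institute|College|大学|学院|研究所", block):
        return "inst"
    return "short" if len(block.split()) <= 3 else "long"


def viterbi(obs):
    """Return the most probable state sequence for obs, in log space."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract(text):
    """Segment the text, then label each block with its decoded field."""
    blocks = segment(text)
    if not blocks:
        return []
    return list(zip(blocks, viterbi([observe(b) for b in blocks])))


if __name__ == "__main__":
    # Made-up header, not a real paper.
    header = (
        "A Study of Field Extraction with Hidden Markov Models\n"
        "ZHANG San; LI Si\n"
        "Example University"
    )
    for block, label in extract(header):
        print(f"{label}: {block}")
```

Working on blocks rather than single characters keeps the observation sequence short and lets format cues (separators, institution keywords) carry most of the signal, which is the role the abstract assigns to format information.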
YU Jiang-de; FAN Xiao-zhong; YIN Ji-hao; GU Yi-jun. Information Extraction from Chinese Research Papers Based on Hidden Markov Model[J]. Computer Engineering, 2007, 33(19): 190-192.
于江德;樊孝忠;尹继豪;顾益军. 基于隐马尔可夫模型的中文科研论文信息抽取[J]. 计算机工程, 2007, 33(19): 190-192.