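Abstract: A Web record extraction model based on mixed two-dimensional conditional random fields is proposed. It overcomes the inability of linear-chain conditional random fields to make full use of the two-dimensional dependencies among Web entities, and training the conditional random field model does not require a large amount of manually labeled sample data. The model is used to extract 742 data records from Dangdang, and is compared with other models under the same conditions. Experimental results show that the mixed two-dimensional conditional random field model performs better when extracting the TDS data set.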
Keywords:
conditional random fields,
mixed conditional random fields,
information extraction,
Web records
Abstract: This paper presents a two-dimensional Mixed Conditional Random Fields (MCRF) model for extracting Web records. It overcomes a shortcoming of linear-chain conditional random fields, which cannot take full advantage of the two-dimensional dependencies among the elements of Web entities. In addition, training the model does not require a large amount of hand-labeled sample data. In the experiment, the model is used to extract 742 data records from Dangdang and is compared with other models under the same conditions. Experimental results show that the two-dimensional MCRF model performs better than the other models when extracting the TDS data set.
Key words:
Conditional Random Fields (CRF),
Mixed CRF (MCRF),
information extraction,
Web records
卓林, 杨舟, 赵朋朋, 崔志明. 基于二维混合条件随机场的Web记录抽取模型[J]. 计算机工程, 2011, 37(5): 59-61,64.
ZHUO Lin, YANG Zhou, ZHAO Peng-Peng, CUI Zhi-Ming. Web Records Extraction Model Based on 2D Mixed Conditional Random Fields[J]. Computer Engineering, 2011, 37(5): 59-61,64.