Abstract: This paper presents a Web object extraction method based on the Hierarchical Correlative-chain Conditional Random Fields (HCC-CRFs) model. The method merges data block detection and attribute labeling into a single label assignment problem, which avoids error propagation between the two steps. By adding conditional dependencies between data blocks, the HCC-CRFs model can make full use of the content hierarchy of a Web page. Experimental results show that the method achieves good extraction performance.
Key words:
Web object,
information extraction,
data block detection,
attribute labeling,
Conditional Random Fields (CRFs),
hierarchical correlative-chain
CLC number:
HU Li-Juan (胡丽娟), LIANG Jiu-Zhen (梁久祯). Web Object Extraction Based on Hierarchical Correlative-chain Conditional Random Fields[J]. Computer Engineering (计算机工程), 2012, 38(20): 45-48.