Abstract: Most existing extraction methods target the topic information block as a whole and do not reach the individual information blocks within it. To address this, a video metadata extraction system based on the DOM tree is designed. The system improves the link filtering function and the URL queue management strategy of Heritrix and, using the node types of the Web page DOM tree, extracts Web page metadata from each individual information block. Experimental results show that the system achieves an average Web page precision of 95.7% and an average extraction accuracy of 98.4%, both higher than those of comparable systems.
Key words:
Web crawler,
information collection,
URL scheduling,
incremental update,
DOM tree
CLC number:
唐朝伟, 李俊, 苗光胜, 杜欣慧. 基于DOM树的视频元数据抽取系统[J]. 计算机工程, 2012, 38(08): 268-270.
TANG Chao-Wei, LI Jun, MIAO Guang-Sheng, DU Xin-Hui. Video Metadata Extraction System Based on DOM Tree[J]. Computer Engineering, 2012, 38(08): 268-270.