Abstract: This paper proposes an automatic Web information extraction method, based on the Document Object Model (DOM) tree and visual features, for list-style merchant information on life-service information websites. The method first uses the DOM tree structure and visual features of the data regions in merchant list pages to search for candidate target data regions. It then uses visual features to identify the true target data region and extracts the data records it contains. In tests on 10 life-service information websites, recall and precision both reached 100% on 8 of them, which indicates that the method performs well on this kind of page.

Key words: Document Object Model (DOM) tree, visual feature, automatic extraction, data record, data region, mining algorithm
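The pipeline summarized in the abstract (find candidate data regions in the DOM tree, pick the target region, extract its records) can be illustrated with a minimal sketch. The abstract does not give the paper's actual algorithm, so everything here is an assumption made for illustration: candidate regions are taken to be nodes whose children all share one structural signature, the visual-feature step is replaced by a crude proxy (total text length, standing in for rendered area), and the names `Node`, `TreeBuilder`, `candidate_regions`, `pick_target_region`, `extract_records` and `precision_recall` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag, parent, children and directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def full_text(self):
        # Own text followed by the text of all descendants.
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)

    def signature(self):
        # Structural signature: the tag plus the signatures of the children.
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.text = (self.current.text + " " + data).strip()


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def candidate_regions(root, min_records=3):
    # Assumption: a candidate region has at least min_records children
    # that all share the same structural signature.
    found = []
    for node in iter_nodes(root):
        sigs = [c.signature() for c in node.children]
        if len(sigs) >= min_records and len(set(sigs)) == 1:
            found.append(node)
    return found


def pick_target_region(candidates):
    # Assumption: text length is a stand-in for real visual features
    # such as rendered area or position, which need a layout engine.
    return max(candidates, key=lambda n: len(n.full_text()), default=None)


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    region = pick_target_region(candidate_regions(builder.root))
    return [] if region is None else [c.full_text() for c in region.children]


def precision_recall(extracted, expected):
    correct = len(set(extracted) & set(expected))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up sample page: a navigation list and a merchant list.
SAMPLE = """
<div>
  <ul class="nav"><li>Home</li><li>Deals</li><li>Help</li></ul>
  <div class="list">
    <div class="shop"><h3>Noodle House</h3><p>12 Main St</p></div>
    <div class="shop"><h3>Tea Room</h3><p>8 Park Rd</p></div>
    <div class="shop"><h3>Book Cafe</h3><p>3 Lake Ave</p></div>
  </div>
</div>
"""

records = extract_records(SAMPLE)
print(records)
```

On this sample both the navigation list and the merchant list qualify as candidates, and the text-length proxy selects the merchant list. A real implementation would need rendered layout information to apply visual features, and a looser similarity measure than exact signature equality, since real records rarely share identical markup.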
HUANG Wu-guan (黄武冠), ZHU Ming (朱明), YIN Wen-ke (尹文科). Web Information Automatic Extraction Based on DOM Tree and Visual Feature[J]. Computer Engineering (计算机工程).