Computer Engineering ›› 2014, Vol. 40 ›› Issue (12): 161-165, 171. doi: 10.3969/j.issn.1000-3428.2014.12.030

• Artificial Intelligence and Recognition Technology •

Web Text Feature Algorithm Based on Similar Image Clustering

FANG Shuang, YIN Junjie, XU Wuping

  1. School of Computer, Wuhan University, Wuhan 430072, China
  • Received: 2013-11-14 Revised: 2014-01-22 Online: 2014-12-15 Published: 2015-01-16
  • About the authors: FANG Shuang (born 1982), male, master's student, main research interest: natural language processing; YIN Junjie, master's degree; XU Wuping, associate professor.

Abstract: For low-quality Web pages whose images and text do not match, existing image search engines based on text keywords return poorly relevant results. To address this problem, this paper incorporates similar-image clustering information and Web page quality factors into text analysis and proposes a Web text feature algorithm based on similar image clustering. Different weights are assigned according to each page's PageRank value, the HTML tag category of each keyword, and the part-of-speech category of each keyword. These weights are substituted into the calculation formula to obtain the text feature value of every keyword in the cluster, and highly relevant text is then extracted by setting a threshold. Experiments on 15 randomly selected image clusters show that, compared with the image search algorithms currently used by Baidu and Google, the proposed algorithm accurately finds the text that truly reflects image content and improves the precision of image retrieval.

Key words: Web text feature, image search engine, Text-based Image Retrieval (TBIR), Content-based Image Retrieval (CBIR), inverted index, Web text analysis
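
The weighting-and-threshold idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the weight values or the calculation formula, so everything concrete here is an assumption made for illustration: the weight tables `TAG_WEIGHTS` and `POS_WEIGHTS`, the multiplicative combination of the three factors, the normalization over the cluster, and the names `Occurrence`, `cluster_text_features`, and `extract_relevant`. It is not the authors' formula.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights: the abstract states that weights depend on the HTML tag
# category and the part-of-speech category, but does not give their values.
TAG_WEIGHTS = {"title": 3.0, "alt": 2.5, "h1": 2.0, "p": 1.0}
POS_WEIGHTS = {"noun": 1.0, "verb": 0.5, "adj": 0.3}


@dataclass
class Occurrence:
    """One keyword occurrence on a page that hosts an image of the cluster."""
    keyword: str
    pagerank: float  # PageRank value of the hosting page
    tag: str         # HTML tag category in which the keyword appears
    pos: str         # part-of-speech category of the keyword


def cluster_text_features(occurrences):
    """Return a normalized feature value for each keyword in one image cluster.

    Assumed formula: sum over occurrences of pagerank * tag weight * POS weight,
    divided by the total over all keywords in the cluster.
    """
    scores = defaultdict(float)
    for occ in occurrences:
        tag_w = TAG_WEIGHTS.get(occ.tag, 1.0)   # unknown tags get a neutral weight
        pos_w = POS_WEIGHTS.get(occ.pos, 0.1)   # unknown POS gets a small weight
        scores[occ.keyword] += occ.pagerank * tag_w * pos_w
    total = sum(scores.values())
    if total == 0:
        return {}
    return {kw: value / total for kw, value in scores.items()}


def extract_relevant(features, threshold):
    """Keep keywords whose feature value reaches the threshold, best first."""
    kept = [kw for kw, value in features.items() if value >= threshold]
    return sorted(kept, key=lambda kw: features[kw], reverse=True)
```

Under these assumptions, a keyword that appears in prominent tags on well-ranked pages across many images of the cluster accumulates a high value, while text that appears only on a single low-quality page falls below the threshold and is discarded.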

CLC Number: