Abstract:
This paper introduces a new text representation algorithm for Web document information filtering systems. Compared with the traditional vector space model (VSM), the improved algorithm, which is built on VSM, offers faster filtering and higher filtering precision. The algorithm takes attributes (feature terms) directly from the attribute set of the filtering template and processes only the places in the Web document where each attribute appears. It then assigns a different weighting coefficient according to the Web tag in which the attribute occurs, which gives a more accurate weight for each attribute. Finally, it builds the representation model of the Web document from these weights.
Key words:
Web document; Text representation; VSM; Attribute; Weighting
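The tag-dependent weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the coefficient values in TAG_WEIGHTS, the names TemplateTermWeigher and represent, and the restriction to single-word attributes are all assumptions made for illustration, since the abstract gives no concrete values. For brevity the sketch tokenizes the page text and checks each token against the template, whereas the abstract describes looking up each template attribute directly in the page.

```python
from html.parser import HTMLParser
import re

# Hypothetical coefficients per tag; the abstract does not state actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class TemplateTermWeigher(HTMLParser):
    """Accumulates a tag-weighted count for each attribute of the filtering template."""

    def __init__(self, template_terms):
        super().__init__()
        self.terms = {t.lower() for t in template_terms}
        self.open_tags = []  # stack of currently open tags
        self.weights = {t: 0.0 for t in self.terms}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest coefficient among the enclosing tags (an assumption).
        coeff = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in re.findall(r"\w+", data.lower()):
            if word in self.terms:  # words outside the template are ignored
                self.weights[word] += coeff


def represent(html, template_terms):
    """Return the document vector, one weight per template attribute, in template order."""
    parser = TemplateTermWeigher(template_terms)
    parser.feed(html)
    parser.close()
    return [parser.weights[t.lower()] for t in template_terms]


page = (
    "<html><head><title>Web filtering</title></head>"
    "<body><h1>Filtering</h1><p>A <b>web</b> page about text filtering.</p></body></html>"
)
vector = represent(page, ["web", "filtering", "vector"])
print(vector)
```

Because the vector has one component per template attribute rather than one per word in the page, its dimension stays fixed by the template, which is consistent with the speed advantage the abstract claims over a conventional VSM representation.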
ZENG Zhiyuan, ZHANG Li. Improved Algorithm of Web Document Representation Based on Vector Space Model[J]. Computer Engineering, 2006, 32(3): 134-135,139.