Document Similarity Strategy Based on Document Index Graph Model

doi:10.3969/j.issn.1000-3428.2008.07.007

Computer Engineering, 2008, Vol. 34, Issue (7): 19-22. doi: 10.3969/j.issn.1000-3428.2008.07.007

GAO Mao-ting1, WANG Zheng-ou2

(1. Computer Science and Engineering Department, Shanghai Maritime University, Shanghai 200135; 2. Institute of Systems Engineering, Tianjin University, Tianjin 300072)

Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-04-05 Published: 2008-04-05

Abstract

Abstract: The Document Index Graph (DIG) is a phrase-based, graph-structured model for representing text features. It expresses text feature information more completely and accurately, and it supports incremental text clustering and information processing. Building on the DIG feature model, this paper proposes an additive strategy and a multiplicative strategy for computing document similarity, and applies transform functions to adjust the resulting similarity values. The adjustment increases the distinguishability between documents and improves the performance of text clustering, classification, and related processing. Worked examples demonstrate the effectiveness of the strategies.

Key words: text clustering, Document Index Graph (DIG), document similarity, text feature model
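The abstract names an additive strategy, a multiplicative strategy, and a transform function, but it does not give their formulas. The sketch below is only an illustration of what such a combination could look like, not the authors' definitions. It assumes two partial similarity scores in [0, 1] (a phrase-based score and a single-term score), a hypothetical weight `alpha`, and a hypothetical S-shaped transform `stretch` with exponent `gamma`. None of these names or formulas come from the paper.

```python
import math


def additive_similarity(sim_phrase: float, sim_term: float, alpha: float = 0.5) -> float:
    """Assumed additive strategy: weighted sum of two partial similarities."""
    return alpha * sim_phrase + (1.0 - alpha) * sim_term


def multiplicative_similarity(sim_phrase: float, sim_term: float, alpha: float = 0.5) -> float:
    """Assumed multiplicative strategy: weighted geometric combination."""
    return (sim_phrase ** alpha) * (sim_term ** (1.0 - alpha))


def stretch(sim: float, gamma: float = 2.0) -> float:
    """Hypothetical transform: a monotone S-curve on [0, 1].

    Values above 0.5 move toward 1 and values below 0.5 move toward 0,
    which widens the gap between similar and dissimilar document pairs.
    """
    a = sim ** gamma
    b = (1.0 - sim) ** gamma
    return a / (a + b)


if __name__ == "__main__":
    # Made-up partial scores for one document pair, for illustration only.
    phrase_score, term_score = 0.8, 0.4
    added = additive_similarity(phrase_score, term_score)
    multiplied = multiplicative_similarity(phrase_score, term_score)
    print("additive:", added, "after transform:", stretch(added))
    print("multiplicative:", multiplied, "after transform:", stretch(multiplied))
```

Under these assumptions, the multiplicative form penalizes a pair more strongly than the additive form when one of the two partial scores is low, and the transform leaves the ranking of pairs unchanged because it is monotone.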

CLC number: TP311.13

GAO Mao-ting; WANG Zheng-ou. Document Similarity Strategy Based on Document Index Graph Model[J]. Computer Engineering, 2008, 34(7): 19-22.

http://www.ecice06.com/CN/Y2008/V34/I7/19

