Feature Dimension Reduction Short Text Clustering Combined with Semantic and Statistics

doi:10.3969/j.issn.1000-3428.2012.22.042

Computer Engineering, 2012, Vol. 38, Issue (22): 171-175.

• Networks and Communications •

YANG Wan-xia^1,2, SUN Li-he^3, HUANG Yong-feng^2

(1. College of Technology, Gansu Agricultural University, Lanzhou 730070, China; 2. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; 3. College of Foreign Languages and Literature, Northwest Normal University, Lanzhou 730070, China)

Received: 2012-06-14  Revised: 2012-09-12  Online: 2012-11-20  Published: 2012-11-17

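About the authors: YANG Wan-xia (born 1979), female, lecturer and master's degree candidate, main research interests: information processing and machine learning; SUN Li-he, lecturer, master's degree; HUANG Yong-feng, associate professor, Ph.D.
Supported by:
National High Technology Research and Development Program of China ("863" Program) (2011AA010704, 2012AA011004); Tsinghua University Independent Scientific Research Fund project "Key Technologies for Cross-Media Distributed Vertical Search and Public Opinion Analysis" (20111081023)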

Abstract: The primary difficulty in text clustering is the high dimensionality and sparsity of text representations. To address it, this paper proposes a short text clustering algorithm that combines semantic and statistical features. A first dimension reduction is achieved by analyzing the semantic relatedness of words with a semantic dictionary. A second dimension reduction is achieved through feature selection with statistical methods. The features from the two reductions are then combined to cluster the short texts. Experimental results show that the algorithm achieves relatively good clustering quality and efficiency on short texts.

Key words: feature selection, clustering, short text, Vector Space Model (VSM), semantics, dimension reduction
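
The two-stage reduction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the semantic dictionary, the statistical selection criterion, or the clustering method, so everything concrete here is an assumption made for illustration: SEMANTIC_DICT is a hypothetical toy dictionary, the statistical step is a simple document-frequency threshold (min_df), and the clustering step is k-means with cosine similarity over Vector Space Model vectors. The sketch also applies the two reductions one after the other, which may differ from how the paper combines them.

```python
import math
from collections import Counter

# Hypothetical toy semantic dictionary (word -> concept label).
# A real system would use a full thesaurus; this one is invented for the demo.
SEMANTIC_DICT = {
    "car": "VEHICLE", "automobile": "VEHICLE",
    "cheap": "LOW_PRICE", "inexpensive": "LOW_PRICE",
    "film": "MOVIE", "movie": "MOVIE",
    "great": "GOOD", "excellent": "GOOD",
}

def semantic_reduce(tokens):
    """First reduction: map semantically related words to one concept feature."""
    return [SEMANTIC_DICT.get(t, t) for t in tokens]

def select_features(docs, min_df=2):
    """Second reduction (assumed criterion): keep features whose document
    frequency is at least min_df."""
    df = Counter(f for doc in docs for f in set(doc))
    return sorted(f for f, n in df.items() if n >= min_df)

def vectorize(doc, features):
    """Build a term-count vector over the selected features (VSM)."""
    counts = Counter(doc)
    return [float(counts[f]) for f in features]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10):
    """Plain k-means with cosine similarity and deterministic
    farthest-first initialisation (assumed clustering method)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assign each vector to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_short_texts(texts, k, min_df=2):
    docs = [semantic_reduce(t.lower().split()) for t in texts]
    features = select_features(docs, min_df)
    vectors = [vectorize(d, features) for d in docs]
    return features, kmeans(vectors, k)

texts = [
    "cheap car today",
    "inexpensive automobile",
    "great film tonight",
    "excellent movie",
]
features, labels = cluster_short_texts(texts, k=2)
print(features)
print(labels)
```

In this toy input the ten distinct surface words collapse to six features after the dictionary lookup and to four after the frequency threshold, so texts that share no surface words can still end up with identical vectors.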

CLC Number: TP391

YANG Wan-Xia, SUN Li-He, HUANG Yong-Feng. Feature Dimension Reduction Short Text Clustering Combined with Semantic and Statistics[J]. Computer Engineering, 2012, 38(22): 171-175.

URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2012.22.042

http://www.ecice06.com/EN/Y2012/V38/I22/171
