Abstract:
Automatic text genre classification has particular requirements for feature selection and weight calculation. To address them, this paper introduces the content category information of a text and uses it to improve the conventional CHI feature selection method and the tf.idf feature weighting formula. A Support Vector Machine (SVM) is then used to automatically classify a Chinese text corpus covering five genres. Experimental results show that the scheme is feasible.
Key words: Chinese information processing, genre classification, feature selection, Support Vector Machine (SVM)
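For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of the conventional CHI statistic and the conventional tf.idf weight in their common textbook forms. It does not reproduce the improved versions proposed in the paper, which the abstract does not specify. The function names and the document counts `a`, `b`, `c`, `d` are illustrative assumptions, not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Conventional CHI statistic for one term and one class.

    Assumed notation (not from the paper):
    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def tf_idf(tf, df, n_docs):
    """Conventional tf.idf weight: term frequency times log(N / df)."""
    if df == 0:
        return 0.0
    return tf * math.log(n_docs / df)
```

A term that occurs in every document of one class and in no other document gets a high CHI score, while a term spread evenly across classes scores zero. In the same way, a term that appears in every document gets a tf.idf weight of zero.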
DENG Qi; SU Yi-dan; CAO Bo; BI Jian-ting. Research on Feature Selection in Chinese Text Genre Classification[J]. Computer Engineering, 2008, 34(23): 89-91.
邓 琦;苏一丹;曹 波;闭剑婷. 中文文本体裁分类中特征选择的研究[J]. 计算机工程, 2008, 34(23): 89-91.