
Computer Engineering, 2019, Vol. 45, Issue (8): 178-183. doi: 10.19678/j.issn.1000-3428.0051516

• Artificial Intelligence and Recognition Technology •

A Chinese Real-word Error Detection and Repairing Method

YE Junmin (a), XU Song (a), LUO Daxiong (a), WANG Zhifeng (b), CHEN Shu (a)

  1. (a) School of Computer; (b) School of Educational Information Technology, Central China Normal University, Wuhan 430070, China
  • Received: 2018-05-10 Revised: 2018-07-11 Online: 2019-08-15 Published: 2018-07-17
  • About the authors: YE Junmin (born 1965), male, professor, Ph.D., main research direction: machine learning; XU Song and LUO Daxiong, master's students; WANG Zhifeng, associate professor; CHEN Shu, lecturer.
  • Supported by:
    National Social Science Fund of China (17BTQ061).

Abstract: Chinese real-word errors in online learning communities make the semantics of Chinese text harder to understand, which in turn degrades learning analytics built on that text. To address this, the paper proposes a real-word error detection and repair method for short texts in online learning communities. First, a confusion word set and a knowledge base of the fixed collocations associated with each confusion word are constructed automatically. Then, for each confusion word, an n-gram score, a context score, and a fixed collocation score are computed from an n-gram probability statistical model, a context model, and the fixed collocation knowledge base, respectively. Finally, the weighted sum of these scores serves as the basis for judging whether the original text contains an error, and the confusion word with the highest score is offered as the repair suggestion. Experimental results show recall, precision, and correction rates of 85.6%, 86.3%, and 92.9%, respectively, indicating that the method detects and repairs Chinese real-word errors in learning communities accurately and effectively.

Key words: real-word error, confusion word set, n-gram probability statistical model, context, Chinese fixed collocation
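
The scoring step summarized in the abstract (a weighted sum of three per-candidate scores, with the top-scoring confusion word as the suggestion) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `CandidateScores`, `combined_score`, and `detect_and_repair`, the weight values, the example scores, and the rule "flag an error when the best candidate differs from the original word" are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class CandidateScores:
    """Scores for one word in a confusion set (all values hypothetical)."""
    word: str
    ngram: float        # score from the n-gram probability model
    context: float      # score from the context model
    collocation: float  # score from the fixed collocation knowledge base


def combined_score(c, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three scores; the weights are assumed, not from the paper."""
    w_ngram, w_context, w_colloc = weights
    return w_ngram * c.ngram + w_context * c.context + w_colloc * c.collocation


def detect_and_repair(original, candidates, weights=(0.4, 0.3, 0.3)):
    """Return (is_error, suggestion).

    `candidates` is expected to cover the whole confusion set, including the
    original word. Assumed decision rule: the original word is flagged as an
    error when a different confusion word gets the highest combined score.
    """
    best = max(candidates, key=lambda c: combined_score(c, weights))
    return best.word != original, best.word


# Hypothetical scores for the confusion set {"在", "再"} in one sentence position.
candidates = [
    CandidateScores("在", ngram=0.8, context=0.7, collocation=0.6),
    CandidateScores("再", ngram=0.2, context=0.3, collocation=0.1),
]
print(detect_and_repair("再", candidates))
print(detect_and_repair("在", candidates))
```

In a real system the three scores would come from the trained models and the knowledge base, and the weights would presumably be tuned on held-out data; the sketch only shows how the scores are combined and compared.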

CLC number: