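Abstract (translated from Chinese): To improve the accuracy of Chinese-Uyghur sentence alignment, a segmented sentence alignment method is proposed. Lexical information and length information are combined to identify sentence pairs that can serve as anchors (anchor sentence pairs), and these pairs are used as boundaries to divide the full text into segments. Within each segment, a length-based method aligns all remaining sentences. Using lexical, numeric, punctuation, and length information improves the method's portability across domains, while segmentation avoids complex computation and thereby addresses the error propagation problem. Experimental results show that the method reaches an accuracy of 95.2%, which is 2.7% higher than the length-based sentence alignment method.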
Keywords (translated from Chinese):
parallel corpora,
sentence alignment,
anchor,
length-based method,
lexicon-based method
Abstract: A step-by-step sentence alignment method, which aligns the text segment by segment, is introduced to improve the accuracy of existing Chinese-Uyghur sentence alignment. Lexical and length information are combined to identify anchor sentence pairs. The text is divided into sections using these anchor sentence pairs as boundaries, and the sentences in each section are then aligned with a length-based method. Because it draws on words, numbers, punctuation marks, and length, the method remains effective on multi-domain text. Its sectioning technique avoids complex computation and keeps alignment errors from spreading. Experimental results show that the precision of this method is 95.2% on Chinese-Uyghur multi-domain texts, which is 2.7% higher than the length-based method.
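The pipeline summarized in the abstract (find anchor sentence pairs, split the text at the anchors, align each section by length) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the anchor test (identical numbers plus a plausible length ratio), the greedy anchor search, the absolute-difference length cost, the penalty values in `BEADS`, and all function names. The paper's own lexical cues, thresholds, and length model may differ.

```python
import re

# Hypothetical penalties for each alignment type (source count, target count).
BEADS = {(1, 1): 0, (1, 2): 5, (2, 1): 5, (1, 0): 20, (0, 1): 20}

def numbers(sentence):
    """Sorted list of the numeric tokens in a sentence."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", sentence))

def is_anchor(src, tgt, ratio, tol=0.3):
    """Assumed anchor test: same non-empty numbers and a plausible length."""
    nums = numbers(src)
    if not nums or nums != numbers(tgt):
        return False
    expected = ratio * len(src)
    return abs(len(tgt) - expected) <= tol * max(len(tgt), expected)

def find_anchors(src, tgt, ratio):
    """Greedy left-to-right search that keeps anchors in monotone order."""
    anchors, j_start = [], 0
    for i, s in enumerate(src):
        for j in range(j_start, len(tgt)):
            if is_anchor(s, tgt[j], ratio):
                anchors.append((i, j))
                j_start = j + 1
                break
    return anchors

def align_segment(src, tgt, ratio):
    """Length-based dynamic programming over one section."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                ls = sum(len(s) for s in src[i:ni])
                lt = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(lt - ratio * ls) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj], back[ni][nj] = c, (i, j)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):  # trace the cheapest path back to the start
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

def align(src, tgt):
    """Anchor the text, then align each section independently."""
    ratio = sum(map(len, tgt)) / max(1, sum(map(len, src)))
    result, pi, pj = [], 0, 0
    # A sentinel past the end closes the final section.
    for ai, aj in find_anchors(src, tgt, ratio) + [(len(src), len(tgt))]:
        for s_idx, t_idx in align_segment(src[pi:ai], tgt[pj:aj], ratio):
            result.append(([pi + k for k in s_idx], [pj + k for k in t_idx]))
        if ai < len(src):  # the anchor pair itself is a 1-1 alignment
            result.append(([ai], [aj]))
        pi, pj = ai + 1, aj + 1
    return result
```

Calling `align(src_sentences, tgt_sentences)` returns a list of index pairs such as `([2], [2, 3])`, meaning source sentence 2 aligns with target sentences 2 and 3. Because each section is aligned on its own, a misalignment inside one section cannot shift the alignments in the sections after the next anchor, which is the error-containment idea the abstract describes.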
Key words:
parallel corpora,
sentence alignment,
anchor,
length-based method,
lexicon-based method
CLC number: