Computer Engineering ›› 2008, Vol. 34 ›› Issue (15): 211-213. doi: 10.3969/j.issn.1000-3428.2008.15.076

• Artificial Intelligence and Recognition Technology •

Chinese Document Layout Segmentation Method Based on Minimal Spanning Tree Clustering

ZHANG Chong (张充), MIAO Xiu-fen (苗秀芬), SI Jian-hui (司建辉), SHI Qing-xuan (史青宣), TIAN Xue-dong (田学东)

  1. (College of Mathematics and Computer, Hebei University, Baoding 071002)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-08-05 Published: 2008-08-05

Abstract: Chinese document layouts often mix horizontally and vertically typeset text. To address this, a layout segmentation method based on minimal spanning tree clustering is presented. Run-length smoothing is applied to the original image in the horizontal and vertical directions, and the connected components obtained after smoothing are pre-classified so that the text is separated into horizontally and vertically typeset classes. A minimal spanning tree clustering algorithm is then applied to each class of pre-classified text. In experiments, the accuracy reaches 97%. The results show that the method segments Chinese documents well.

Key words: layout segmentation, run-length smoothing, minimal spanning tree clustering
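
The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no thresholds, distance measure, or cut rule, so the function names (`rlsa_row`, `mst_edges`, `mst_clusters`), the gap threshold, the use of Euclidean distance between component centres, and the fixed edge-length cut are all assumptions made for illustration.

```python
import math
from itertools import combinations


def rlsa_row(row, threshold):
    """Run-length smoothing of one binary row (1 = ink, 0 = background).

    Background runs of at most `threshold` pixels that lie between two
    ink pixels are filled with ink. Apply it to rows for horizontal
    smoothing and to columns for vertical smoothing.
    """
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`."""
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    tree = []
    for weight, a, b in edges:
        root_a, root_b = _find(parent, a), _find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((weight, a, b))
    return tree


def mst_clusters(points, cut):
    """Cluster points by deleting every MST edge longer than `cut`.

    `cut` is a hypothetical fixed threshold; the paper may choose it
    differently (for example, adaptively per text class).
    """
    parent = list(range(len(points)))
    for weight, a, b in mst_edges(points):
        if weight <= cut:
            parent[_find(parent, a)] = _find(parent, b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(_find(parent, i), []).append(i)
    return sorted(groups.values())


# Hypothetical centres of connected components along one text line.
centres = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=2))
print(mst_clusters(centres, cut=2))
```

In this sketch the smoothing step merges nearby ink into blocks, and the clustering step groups block centres whose spanning-tree edges are short, which is the general idea behind treating each horizontal or vertical text class separately.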
