Abstract:
Chinese document layouts frequently mix horizontally and vertically set text. To address this, a layout segmentation method based on minimal spanning tree clustering is presented. A run-length smoothing algorithm is applied to the original image in both the horizontal and the vertical direction. The connected components obtained after smoothing are then pre-classified, separating the body text into horizontally aligned and vertically aligned classes. A minimal spanning tree clustering algorithm is applied to each class of pre-classified text. In experiments, the accuracy reaches 97%, indicating that the method segments Chinese documents effectively.
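As a rough illustration of two of the steps named in the abstract, the following Python sketch shows run-length smoothing on a binary image and minimal spanning tree clustering of component positions. It is not the authors' implementation: the abstract gives no thresholds, features, or distance measure, so the function names (rlsa_row, rlsa, mst_clusters), the Euclidean distance between component centres, and the rule of cutting tree edges longer than max_edge are all assumptions made for this example. The pre-classification step is not shown.

```python
import math
from typing import List, Sequence, Tuple


def rlsa_row(row: Sequence[int], threshold: int) -> List[int]:
    """Smooth one binary row (1 = ink, 0 = background).

    A background run is filled when it lies between two ink pixels
    and its length is at most `threshold` (an assumed parameter).
    """
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa(image: List[List[int]], threshold: int, horizontal: bool = True) -> List[List[int]]:
    """Apply run-length smoothing along rows or, if horizontal is False, along columns."""
    if horizontal:
        return [rlsa_row(r, threshold) for r in image]
    cols = [rlsa_row(c, threshold) for c in zip(*image)]
    return [list(r) for r in zip(*cols)]


def mst_clusters(points: List[Tuple[float, float]], max_edge: float) -> List[List[int]]:
    """Cluster points by building an MST (Prim) and cutting edges longer than max_edge.

    `points` stand in for connected-component centres; returns lists of point indices.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u

    # Union-find over the MST edges that survive the cut.
    label = list(range(n))

    def find(x: int) -> int:
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for a, b, w in edges:
        if w <= max_edge:
            label[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

In a full pipeline, the points passed to mst_clusters would presumably come from the connected components of one pre-classified text class, with clustering run separately for the horizontal and the vertical class.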
Key words:
layout segmentation,
run-length smoothing,
minimal spanning tree clustering
摘要: 针对中文版面多横竖混排的特点,提出一种基于最小生成树聚类的版面分割方法。对原图像进行水平和垂直游程平滑,并对平滑后所得的连通域进行预分类处理,将文本进行横排、竖排分类。对预分类后的各类文本采用最小生成树聚类算法进行聚类处理。经实验,准确率达97%。实验表明,该方法对中文文档有良好的分割效果。
关键词:
版面分割,
游程平滑,
最小生成树聚类
ZHANG Chong; MIAO Xiu-fen; SI Jian-hui; SHI Qing-xuan; TIAN Xue-dong. Chinese Document Layout Segmentation Method Based on Minimal Spanning Tree Clustering[J]. Computer Engineering, 2008, 34(15): 211-213.
张 充;苗秀芬;司建辉;史青宣;田学东. 基于最小生成树聚类的中文版面分割法[J]. 计算机工程, 2008, 34(15): 211-213.