Due to the rapid growth of digital libraries, digitizing large documents has become an important topic. In a long book, similar characters, sub-words, and words occur many times. In this paper, we propose a sub-word image clustering method for applications dealing with large uniform documents. We assume that the whole document is printed in a single font and that the print quality is poor. To test our method, we created a dataset of all sub-words of a Farsi book. The book has 233 pages containing more than 111,000 manually labeled sub-words. We use an incremental clustering algorithm. Four simple features are extracted from each sub-word and compared with the corresponding features of each cluster center. If all feature differences lie within certain thresholds, the sub-word and the winning cluster center are compared in detail using a template matching algorithm. In our experiments, we show that all sub-words of the book are recognized with more than 99.7% accuracy by assigning the label of each cluster center to all of its members.
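The two-stage procedure described above (a cheap feature gate followed by template matching) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the abstract does not name the four features, the thresholds, or the matching score, so the features chosen here (height, width, ink pixel count, ink in the upper half), the XOR-based mismatch score, and the names `simple_features`, `mismatch`, `incremental_cluster`, and `propagate_labels` are all hypothetical. The sketch also checks every center that passes the gate and keeps the best template match, which is one possible reading of "winning cluster center".

```python
import numpy as np


def simple_features(img):
    """Four cheap features of a binary sub-word image (hypothetical choice)."""
    h, w = img.shape
    ink = int(img.sum())                 # number of ink pixels
    upper = int(img[: h // 2].sum())     # ink pixels in the upper half
    return np.array([h, w, ink, upper])


def mismatch(a, b):
    """Assumed template score: XOR pixels divided by OR pixels, after
    zero-padding both images to a common size (top-left aligned)."""
    h = max(a.shape[0], b.shape[0])
    w = max(a.shape[1], b.shape[1])
    pa = np.zeros((h, w), dtype=bool)
    pb = np.zeros((h, w), dtype=bool)
    pa[: a.shape[0], : a.shape[1]] = a
    pb[: b.shape[0], : b.shape[1]] = b
    union = np.logical_or(pa, pb).sum()
    return np.logical_xor(pa, pb).sum() / max(union, 1)


def incremental_cluster(images, feat_tol, match_tol):
    """Assign each image to an existing cluster or open a new one.

    feat_tol: per-feature absolute thresholds (length 4), assumed values.
    match_tol: maximum mismatch score for joining a cluster, assumed value.
    """
    centers = []       # list of (features, template image)
    assignments = []
    for img in images:
        img = np.asarray(img, dtype=bool)
        f = simple_features(img)
        best, best_score = None, match_tol
        for idx, (cf, cimg) in enumerate(centers):
            # Stage 1: all four feature differences must be within tolerance.
            if np.all(np.abs(f - cf) <= feat_tol):
                # Stage 2: the costlier template comparison.
                score = mismatch(img, cimg)
                if score <= best_score:
                    best, best_score = idx, score
        if best is None:
            # No center matched: this image becomes a new cluster center.
            centers.append((f, img))
            best = len(centers) - 1
        assignments.append(best)
    return assignments, centers


def propagate_labels(assignments, center_labels):
    """Give every member the label of its cluster center."""
    return [center_labels[i] for i in assignments]
```

In this sketch the first image of each cluster stays fixed as its center, so only the centers would need manual labels before `propagate_labels` is applied; whether the centers are updated as members arrive is not stated in the abstract.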