Sub-Word Image Clustering in Old Printed Documents Using Template Matching
Subject Areas: electrical and computer engineering
M. R. Soheili 1 *, E. Kabir 2
1 - Tarbiat Modares University
2 - Tarbiat Modares University
Keywords: document image analysis, large document, incremental clustering, segmentation, dataset
Abstract:
Due to the rapid growth of digital libraries, digitizing large documents has become an important topic. In a long book, similar characters, sub-words, and words occur many times. In this paper, we propose a sub-word image clustering method for applications dealing with large uniform documents. We assume that the whole document is printed in a single font and that the print quality is poor. To test our method, we created a dataset of all sub-words of a Farsi book. The book has 233 pages and contains more than 111,000 manually labeled sub-words. We use an incremental clustering algorithm. Four simple features are extracted from each sub-word and compared with the corresponding features of each cluster center. If all feature differences lie within certain thresholds, the sub-word and the winning cluster center are compared in detail using a template matching algorithm. Our experiments show that all sub-words of the book are recognized with more than 99.7% accuracy by assigning the label of each cluster center to all of its members.
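The two-stage comparison described in the abstract (a cheap feature gate followed by template matching, inside an incremental clustering loop) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the abstract does not name the four features, so height, width, ink count, and top-half ink count are used as stand-ins; the tolerances `feat_tol`, the threshold `max_mismatch`, and the XOR-over-union mismatch score are hypothetical; and the "winner" is taken to be the gated cluster center with the lowest mismatch.

```python
from dataclasses import dataclass, field

def features(img):
    """Four coarse features of a binary image (1 = ink). Illustrative stand-ins only."""
    h, w = len(img), len(img[0])
    ink = sum(map(sum, img))
    top_ink = sum(map(sum, img[: h // 2]))  # ink in the upper half
    return (h, w, ink, top_ink)

def mismatch(a, b):
    """Assumed template-matching score: differing pixels / pixels inked in either image."""
    h, w = max(len(a), len(b)), max(len(a[0]), len(b[0]))

    def px(img, r, c):
        # Treat pixels outside an image as background (top-left alignment).
        return img[r][c] if r < len(img) and c < len(img[0]) else 0

    diff = union = 0
    for r in range(h):
        for c in range(w):
            p, q = px(a, r, c), px(b, r, c)
            diff += p != q
            union += p or q
    return diff / union if union else 0.0

@dataclass
class Cluster:
    center: list                      # image of the first member, used as the center
    feats: tuple                      # cached features of the center
    members: list = field(default_factory=list)

def cluster_subwords(images, feat_tol=(1, 1, 3, 2), max_mismatch=0.2):
    """Incremental clustering: join the best-matching cluster or start a new one."""
    clusters = []
    for idx, img in enumerate(images):
        f = features(img)
        best, best_score = None, None
        for cl in clusters:
            # Stage 1: every feature difference must lie within its tolerance.
            if all(abs(x - y) <= t for x, y, t in zip(f, cl.feats, feat_tol)):
                # Stage 2: fine comparison by template matching.
                score = mismatch(img, cl.center)
                if best_score is None or score < best_score:
                    best, best_score = cl, score
        if best is not None and best_score <= max_mismatch:
            best.members.append(idx)
        else:
            clusters.append(Cluster(img, f, [idx]))
    return clusters

def propagate_labels(clusters, center_labels):
    """Assign each cluster center's label to all members of that cluster."""
    return {i: lab for cl, lab in zip(clusters, center_labels) for i in cl.members}
```

In this sketch only the cluster centers would need manual labels; `propagate_labels` then copies each label to every member, which mirrors the labeling step mentioned at the end of the abstract. The feature gate exists to keep the expensive pixel-level comparison off most cluster centers, which matters when a book yields more than 100,000 sub-word images.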