Publisher: Regional Information Center for Science and Technology
Journal: Iranian Journal of Electrical and Computer Engineering (quarterly), ISSN 1682-3745, vol. 7, no. 4, 2009-12-21, pp. 267-280
Title: Evaluating Two Approaches for Farsi OCR Based on Sub-Word Shape Recognition (original Persian title: ارزیابی روش‌های بازشناسی متون فارسی بر مبنای شکل کلی زیرکلمات)
Language: fa
Authors: Hossein Khosravi, Ehsanollah Kabir
Date on record: 2015-11-25

Abstract (English): Two approaches to the recognition of printed Farsi documents based on sub-word shape recognition are proposed. The first approach recognizes the sub-word shape as a whole; the second recognizes the body of the sub-word. The sub-word body is obtained by removing the dots and signs of the sub-word. In the second approach, the dot and sign information is added back after the body has been recognized. Both approaches have two phases: training and test. In the training phase, sub-words are clustered with the ISODATA algorithm. The initial cluster centers are computed by a hierarchical clustering algorithm. In the first approach, sub-word recognition is performed in two stages: finding the clusters close to the input sub-word, then finding the best match among the sub-words of those clusters. The second approach requires a further stage to find the final sub-word, including its dots and signs. Experimental results show that on clean images the first algorithm performs better: 94% versus 93% at the word level. On low-quality and noisy images, however, both algorithms suffer from reduced accuracy, and the reduction is sometimes significant. The reasons for this behavior are examined and some solutions are presented. Finally, the two methods are compared and the pros and cons of Farsi OCR based on sub-word shape are discussed.

Abstract (translated from Persian): Two approaches to recognizing Farsi text from the overall shape of sub-words are presented and compared, and the advantages and disadvantages of methods based on overall shape are stated. The first approach recognizes sub-words without removing their dots and signs. The second is based on the shape of the sub-word body, which is obtained by removing the dots and signs of the sub-word; after the body is recognized, the dot and sign information is added. Both approaches consist of a training phase and a test phase. In the training phase, the sub-words of the training set are clustered. The ISODATA algorithm is used for clustering, and the initial cluster centers are computed by a hierarchical clustering algorithm. In the first approach, recognition takes two stages: finding the clusters close to the input, then finding the nearest sub-word among those clusters. The second approach has, in addition, an extra stage that finds the final sub-word from its dot pattern. Both methods give acceptable results on clean images: the with-dots approach reaches about 94% accuracy and the dot-free approach about 93% at the word level. On low-quality and noisy images, however, accuracy drops, and in some cases the drop is very severe. The causes of this loss of accuracy are evaluated and a remedy for improving it is proposed. The two approaches are also compared, and the advantages and disadvantages of recognition based on overall shape are presented.
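
The two-stage search described in the abstract (nearby clusters first, then the best match inside them) can be illustrated with a minimal sketch. The abstract does not state the features, the distance measure, or how many clusters are kept, so the Euclidean distance, the parameter `k`, the function names, and the toy data below are all assumptions made for illustration; the ISODATA training step and the dot-pattern stage of the second approach are not shown.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_clusters(x, centers, k):
    """Stage 1: indices of the k cluster centers closest to feature vector x."""
    order = sorted(range(len(centers)), key=lambda i: dist(x, centers[i]))
    return order[:k]


def recognize(x, centers, members, k=2):
    """Stage 2: best match among the sub-words stored in the k nearest clusters.

    members[i] is a list of (label, feature_vector) pairs for cluster i.
    """
    candidates = [m for i in nearest_clusters(x, centers, k) for m in members[i]]
    label, _ = min(candidates, key=lambda m: dist(x, m[1]))
    return label


# Toy data: the 2-D feature vectors and labels are hypothetical.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
members = [
    [("sw_a", (0.0, 1.0)), ("sw_b", (1.0, 0.0))],
    [("sw_c", (5.0, 4.0)), ("sw_d", (6.0, 5.0))],
    [("sw_e", (10.0, 1.0))],
]

result = recognize((4.6, 4.2), centers, members, k=2)
print(result)
```

Restricting the second stage to a few nearby clusters keeps the number of full comparisons small when the sub-word dictionary is large, at the cost of missing the correct sub-word whenever its cluster is not among the `k` selected.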

http://ijece.org/en/Article/Download/27958