Regional Information Center for Science and Technology (مرکز منطقه ای اطلاع رسانی علوم و فناوری)
Iranian Quarterly Journal of Electrical Engineering and Computer Engineering (فصلنامه مهندسی برق و مهندسی کامپيوتر ايران)
ISSN 1682-3745, 11(4), 2014-03-21, pp. 76-84, language: fa

Phrase Segmentation on Persian Texts Using Neural Networks
قطعه‌بندی عبارات متون فارسی با استفاده از شبکه‌های عصبی

Mohammad Mahdi Mirdamadi (محمدمهدی میردامادی), Ali Mohammad Zareh Bidoki (علی‌محمد زارع بیدکی), Mehdi Rezaeian (مهدی رضائیان)
2015-11-29

Word and phrase segmentation is one of the main activities in natural language processing (NLP). Many NLP programs need a preprocessing step that extracts the words of a text and identifies its phrases. The main and final goal of segmentation is to obtain meaningful words together with their prefixes and suffixes. Depending on the natural language, this task can be easy or hard. Persian is among the languages with complex preprocessing tasks. One source of complexity is the handling of different writing scripts. Written Persian has two kinds of spaces: the short space and the white space. There are also various scripts for writing Persian texts, which differ in how words are written, in whether spaces are used or omitted within or between words, in the character forms they use, and so on. In this paper, we propose a statistical method, based on neural networks, for phrase segmentation of Persian texts for use in search engines. For this purpose, we use the occurrence likelihood of uniwords and biwords in a corpus. The proposed algorithm consists of four steps and detects about 89.6% of tokens correctly. Experimental results show that this method can improve on the performance of the usual methods.

Abstract (translated from the Persian):
Segmenting the words and phrases of a text is one of the main activities in natural language processing. Most natural language processing programs need a preprocessing step to extract the words of a text and identify its phrases. The main and final goal of phrase segmentation is to obtain meaningful words together with their prefixes and suffixes, and depending on the natural language this task can be hard or easy.
In Persian, because both the space and the half-space exist, because users pay little attention to spacing, and because there are no precise rules for writing multi-part words, detecting and segmenting multi-part and compound words comes with its own difficulties and complexities. In this paper we aim to present a statistical method, based on neural networks, for segmenting phrases in Persian texts for use in search engines. The proposed algorithm consists of 4 phases and performs segmentation with 89.6% accuracy, using the occurrence probability of the single words and two-word sequences found in the corpus. Experimental results showed that, by segmenting phrases better, this method can bring a relative improvement in the performance of the usual methods.
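The abstract does not spell out the four phases or the network itself, so the following is only a minimal sketch of how occurrence likelihoods of single words and word pairs might be computed from a corpus and turned into input features. The function names `count_ngrams` and `pair_features` and the choice of features are assumptions made for illustration, not the authors' method.

```python
from collections import Counter

# U+200C (zero-width non-joiner) is the Persian half-space. Python's
# str.split() does not treat it as whitespace, so a word written with a
# half-space stays a single token in this sketch.
ZWNJ = "\u200c"


def count_ngrams(sentences):
    """Count single words and adjacent word pairs in whitespace-split text."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pair_features(w1, w2, unigrams, bigrams):
    """Return hypothetical features for deciding whether w1 and w2 belong
    to one segment: P(w1), P(w2), P(w1 w2), and P(w2 | w1)."""
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    pair_count = bigrams[(w1, w2)]
    p_w1 = unigrams[w1] / total_uni if total_uni else 0.0
    p_w2 = unigrams[w2] / total_uni if total_uni else 0.0
    p_pair = pair_count / total_bi if total_bi else 0.0
    p_cond = pair_count / unigrams[w1] if unigrams[w1] else 0.0
    return [p_w1, p_w2, p_pair, p_cond]
```

A feature vector of this kind could then be passed to any classifier, for example a small neural network that predicts whether two neighbouring tokens should be joined. That step is left out here because the abstract gives no details about it.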

http://ijece.org/fa/Article/Download/28067