Publisher: Regional Information Center for Science and Technology
Journal: Iranian Journal of Electrical and Computer Engineering (quarterly)
ISSN: 1682-3745
Volume 18, Issue 1, 2020-05-11, pages 42-48
Language: fa
Title: A Distributed Method for Extracting Persian-English Chunks
Authors: Seyedeh Sara Mirmobin, Mohammad Ghasemzadeh, Amin Nezarat
Date (as given in the record): 2019-05-26

Abstract: This research belongs to the field of machine translation and concerns the extraction of Persian-English chunks from bilingual corpora using Spark. The main challenge is that the operation must be carried out on large text corpora, so it has to be designed and implemented in a distributed manner, using big data analysis techniques and tools. When translating text, we frequently encounter chunks whose counterparts in the target language must be found and inserted into the translation; this can be done by searching corpora that contain the chunks together with their translations. Existing methods perform this operation in a non-distributed way, so they run slowly and cannot make use of very large corpora. To overcome this shortcoming, this research presents a distributed method that also takes the distance between the parts of a chunk into account. The proposed method extracts all possible chunks from the sentences of the monolingual corpus in a distributed manner, uses a correlation coefficient to identify the valid chunks, and translates them using the bilingual corpus. The method was implemented in Spark on a computing cluster with 64 GB of main memory and a 24-core processor. The experimental data consisted of a Persian and an English monolingual corpus together with an English-Persian bilingual corpus, each containing on average 100,000 sentences. Experimental results show that run time is greatly reduced and that translation quality also improves significantly.
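The idea summarized in the abstract, extracting candidate chunks and pairing them across languages with a correlation coefficient, can be illustrated with a small single-process sketch. Everything specific here is an assumption for illustration rather than the authors' method: the abstract does not say which correlation coefficient is used, how gaps inside chunks are handled, or how the work is partitioned in Spark. The sketch uses contiguous word n-grams, the phi coefficient over sentence-level co-occurrence, hypothetical function names (`chunks`, `phi_scores`, `best_translation`), and a toy corpus with transliterated Persian.

```python
from collections import Counter
from itertools import product
from math import sqrt


def chunks(sentence, max_len=2):
    """Return the set of contiguous word n-grams (length 1..max_len)."""
    words = sentence.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def phi_scores(pairs, max_len=2):
    """Score each co-occurring (source chunk, target chunk) pair by phi.

    Assumption: phi over sentence-pair co-occurrence stands in for the
    unspecified correlation coefficient mentioned in the abstract.
    """
    n = len(pairs)
    src_count, tgt_count, both_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s, t = chunks(src, max_len), chunks(tgt, max_len)
        src_count.update(s)               # sentence pairs containing s
        tgt_count.update(t)               # sentence pairs containing t
        both_count.update(product(s, t))  # sentence pairs containing both
    scores = {}
    for (s, t), n11 in both_count.items():
        a, b = src_count[s], tgt_count[t]
        denom = sqrt(a * (n - a) * b * (n - b))
        if denom > 0:  # skip chunks that occur in every sentence pair
            scores[(s, t)] = (n * n11 - a * b) / denom
    return scores


def best_translation(chunk, scores):
    """Pick the target chunk with the highest phi score for a source chunk."""
    candidates = {t: v for (s, t), v in scores.items() if s == chunk}
    return max(candidates, key=candidates.get) if candidates else None


# Toy bilingual corpus (hypothetical data, transliterated Persian).
pairs = [
    ("red book", "ketab ghermez"),
    ("red car", "mashin ghermez"),
    ("blue book", "ketab abi"),
    ("blue car", "mashin abi"),
]
scores = phi_scores(pairs)
print(best_translation("red book", scores))
```

The counting loop is the part that would need to be distributed for a large corpus; in Spark it could plausibly be expressed with operations such as `flatMap` and `reduceByKey`, but this sketch makes no claim about how the paper does it.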

http://ijece.org/ar/Article/Download/28481