A Distributed Method for Extracting Persian-English Chunks

Subject Areas : electrical and computer engineering

Seyedeh Sara Mirmobin ¹ , Mohammad Ghasemzadeh ² * , Amin Nezarat ³

1 - Yazd University
2 - University faculty member
3 -

Received: 2019-05-26 Accepted : 2019-11-19 Published : 2020-05-11

Keywords: Distributed algorithms, corpus, machine translation, chunks

Abstract :

This research is in the field of machine translation and concerns the extraction of Persian-English chunks from a bilingual corpus using Spark. The main challenge is that the operation must be carried out on a large corpus; it therefore requires distributed computing together with big data analysis techniques and tools. When translating a text, we usually encounter chunks for which we must find the corresponding chunk in the target language and insert it into the translation; this is done by looking the chunk up in a corpus that contains chunks and their translations. Existing methods perform this operation in a non-distributed way, so they run slowly and cannot use a very large corpus. To overcome this shortcoming, this research presents a distributed method that also takes the distance between the parts of a chunk into account. The proposed method extracts all possible chunks from the input sentences of the monolingual corpus and uses a correlation coefficient to translate those chunks using the bilingual corpus. We implemented the proposed algorithm in Spark on a computing cluster with 64 GB of memory and a 24-core processor. The experimental data consisted of a Persian monolingual corpus, an English monolingual corpus, and an English-Persian bilingual corpus, each containing 100,000 sentences. Experimental results show that run time is greatly reduced and that translation quality is also significantly improved.
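
The idea described in the abstract can be illustrated with a small single-machine sketch in plain Python. It is not the authors' implementation: the abstract does not say which correlation coefficient is used, so the phi coefficient over sentence-level co-occurrence is assumed here, and the names `candidate_chunks`, `phi`, `best_translations`, the parameters `max_len` and `max_gap`, and the transliterated toy sentence pairs are all hypothetical. The per-sentence counting has the shape of a map step followed by a reduce-by-key step, which is what a Spark version would distribute.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def candidate_chunks(tokens, max_len=2, max_gap=1):
    """Return the set of candidate chunks found in one sentence.

    A chunk is a tuple of up to max_len tokens taken in sentence order,
    where neighbouring parts may be separated by at most max_gap skipped
    tokens (the "distance between the parts" of a chunk).
    """
    chunks = set()
    for size in range(1, max_len + 1):
        for idx in combinations(range(len(tokens)), size):
            if all(b - a - 1 <= max_gap for a, b in zip(idx, idx[1:])):
                chunks.add(tuple(tokens[i] for i in idx))
    return chunks


def phi(n11, n_src, n_tgt, n):
    """Phi coefficient of two binary events over n sentence pairs.

    n11   = pairs containing both chunks
    n_src = pairs containing the source chunk
    n_tgt = pairs containing the target chunk
    """
    denom = sqrt(n_src * (n - n_src) * n_tgt * (n - n_tgt))
    return 0.0 if denom == 0 else (n * n11 - n_src * n_tgt) / denom


def best_translations(pairs, max_len=2, max_gap=1):
    """Map each source chunk to its highest-scoring target chunk."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:  # "map" over sentence pairs
        s = candidate_chunks(src.split(), max_len, max_gap)
        t = candidate_chunks(tgt.split(), max_len, max_gap)
        src_count.update(s)  # "reduce by key": sum counts per chunk
        tgt_count.update(t)
        joint.update((a, b) for a in s for b in t)
    n = len(pairs)
    best = {}
    for (a, b), n11 in joint.items():
        score = phi(n11, src_count[a], tgt_count[b], n)
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return best


# Made-up toy data (transliterated Persian, English), for illustration only.
toy_pairs = [
    ("ketab khub", "good book"),
    ("ketab bad", "bad book"),
    ("ghalam khub", "good pen"),
    ("ghalam bad", "bad pen"),
]
```

With this toy data, `best_translations(toy_pairs)[("ketab",)]` selects `("book",)`, because the two chunks occur in exactly the same sentence pairs. Enumerating every chunk pair per sentence grows quickly with sentence length and `max_len`, which is one reason a large corpus calls for distributed counting.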

