Improving Ranking Using BERT
Subject area: Electrical and Computer Engineering
Shokoufeh Bostan 1*, Ali Mohammad Zare Bidoki 2, Mohammadreza Pajoohan 3
1- Department of Computer Engineering, Yazd University, Iran
2- Department of Computer Engineering, Yazd University, Iran
3- Department of Computer Engineering, Yazd University, Iran
Keywords: semantic vector, word embedding, ranking, deep learning
Abstract:
In today's information age, efficient document ranking plays a crucial role in information retrieval systems. This article proposes a new approach to document ranking using embedding models, focusing on the BERT language model to improve ranking results. The proposed approach uses word embedding methods to capture semantic representations of user queries and document content. By converting textual data into semantic vectors, the relevance and similarity between queries and documents are evaluated under the proposed ranking formulas at lower cost. The proposed ranking formulas consider several factors to improve accuracy, including word embedding vectors, the position of keywords, and the impact of high-value words on ranking based on semantic vectors. Experiments and comparative analyses were conducted to evaluate the effectiveness of the proposed formulas. The empirical results demonstrate the effectiveness of the proposed approach, achieving higher accuracy than common ranking methods. These results indicate that using embedding models and combining them in the proposed ranking formulas significantly improves ranking accuracy, reaching 0.87 in the best case. This study contributes to improving document ranking and demonstrates the potential of the BERT embedding model for improving ranking performance.
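The core step described above — converting queries and documents into semantic vectors and scoring their similarity — can be illustrated with a minimal sketch. This is not the paper's actual ranking formula: the toy 4-dimensional vectors stand in for real BERT embeddings, and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 4-dimensional vectors standing in for real BERT embeddings.
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = [
    np.array([0.9, 0.1, 1.1, 0.0]),  # nearly parallel to the query
    np.array([0.0, 1.0, 0.0, 1.0]),  # orthogonal to the query
    np.array([0.5, 0.5, 0.5, 0.5]),  # partially related
]
print(rank_documents(query, docs))  # → [0, 2, 1]
```

In practice the vectors would come from a BERT encoder, and the paper's proposed formulas additionally weight keyword position and high-value words rather than relying on cosine similarity alone.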
[1] Y. Yum, et al., "A word pair dataset for semantic similarity and relatedness in Korean medical vocabulary: reference development and validation," JMIR Medical Informatics, vol. 9, no. 6, Article ID: e29667, Jun. 2021.
[2] E. Hindocha, V. Yazhiny, A. Arunkumar, and P. Boobalan, "Short-text semantic similarity using GloVe word embedding," International Research J. of Engineering and Technology, vol. 6, no. 4, pp. 553-558, Apr. 2019.
[3] J. Zhang, Y. Liu, J. Mao, W. Ma, and J. Xu, "User behavior simulation for search result re-ranking," ACM Trans. on Information Systems, vol. 41, no. 1, Article ID: 5, 35 pp., Jan. 2023.
[4] V. Zosimov and O. Bulgakova, "Usage of inductive algorithms for building a search results ranking model based on visitor rating evaluations," in Proc. IEEE 13th Int. Scientific and Technical Conf. on Computer Sciences and Information Technologies, CSIT'18, pp. 466-469, Lviv, Ukraine, 11-14 Sept. 2018.
[5] B. Mitra and N. Craswell, Neural Models for Information Retrieval, arXiv preprint arXiv:1705.01509, vol. 1, 2017.
[6] V. Gupta, A. Dixit, and S. Sethi, "A comparative analysis of sentence embedding techniques for document ranking," J. of Web Engineering, vol. 21, no. 7, pp. 2149-2186, 2022.
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: global vectors for word representation," in Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP'14, pp. 1532-1543, Doha, Qatar, 25-29 Oct. 2014.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Int. Conf. on Learning Representations, ICLR'13, 12 pp., Scottsdale, AZ, USA, 2-4 May 2013.
[9] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Trans. of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[10] M. E. Peters, et al., "Deep contextualized word representations," in Proc. Conf. of the North American Chapter of the Association of Computational Linguistics, NAACL-HLT'18, 11 pp., New Orleans, LA, USA, 1-6 Jun. 2018.
[11] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT'19, 16 pp., Minneapolis, MN, USA, 2-7 Jun. 2019.
[12] T. Brown, et al., "Language models are few-shot learners," in Proc. 34th Conf. on Neural Information Processing Systems, NeurIPS'20, 25 pp., Vancouver, Canada, 6-12 Dec. 2020.
[13] P. Sherki, S. Navali, and R. Inturi, "Retaining semantic data in binarized word embedding," in Proc. IEEE 15th Int. Conf. on Semantic Computing, ICSC'21, pp. 130-133, Laguna Hills, CA, USA, 27-29 Jan. 2021.
[14] S. Li, T. S. Chua, J. Zhu, and C. Miao, Generative Topic Embedding: A Continuous Representation of Documents (Extended Version with Proofs), arXiv preprint arXiv:1606.02979, vol. 1, 2016.
[15] B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana, "A dual embedding space model for document ranking," in Proc. 25th Int. Conf. Companion on World Wide Web, WWW'16, 10 pp., Montreal, Canada, 11-15 Apr. 2016.
[16] M. Dehghani, H. Zamani, A. Severyn, and J. Kamps, "Neural ranking models with weak supervision," in Proc. of the 40th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '17, pp. 65-74, Tokyo, Japan, 7-11 Aug. 2017.
[17] C. Xiong, Z. Dai, and J. Callan, "End-to-end neural ad-hoc ranking with kernel pooling," in Proc. of the 40th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 55-64, Tokyo, Japan, 7-11 Aug. 2017.
[18] R. Brochier, A. Guille, and J. Velcin, "Global vectors for node representations," in Proc. ACM World Wide Web Conf., WWW'19, pp. 2587-2593, San Francisco, CA, USA, 13-17 May 2019.
[19] A. Gourru and J. Velcin, "Gaussian embedding of linked documents from a pretrained semantic space," in Proc. 29th Int. Joint Conf. on Artificial Intelligence, IJCAI'20, pp. 3912-3918, Yokohama, Japan, 8-10 Jan. 2021.
[20] R. Menon, J. Kaartik, and K. Nambiar, "Improving ranking in document based search systems," in Proc. 4th Int. Conf. on Trends in Electronics and Informatics, ICOEI'20, pp. 914-921, Tirunelveli, India, 15-17 Jun. 2020.
[21] J. Li, C. Guo, and Z. Wei, "Improving document ranking with relevance-based entity embeddings," in Proc. 8th Int. Conf. on Big Data and Information Analytics, BigDIA'22, pp. 186-192, Guiyang, China, 24-25 Aug. 2022.
[22] S. Han, X. Wang, M. Bendersky, and M. Najork, Learning-to-Rank with BERT in TF-Ranking, Google Research Tech Report, 2020.
[23] S. Bostan, A. M. Zare Bidoki, and M. R. Pajoohan, "Semantic word embedding using BERT on the Persian web," Iranian J. of Electrical and Computer Engineering, Part B: Computer Engineering, vol. 21, no. 2, pp. 89-100, Summer 2023.
[24] M. Farahani, M. Gharachorloo, M. Farahani, and M. Manthouri, "ParsBERT: transformer-based model for Persian language understanding," Neural Processing Letters, vol. 53, pp. 3831-3847, 2021.
[25] D. Yang and Y. Yin, "Evaluation of taxonomic and neural embedding methods for calculating semantic similarity," Natural Language Engineering, vol. 28, no. 6, pp. 733-761, Nov. 2022.
[26] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," in Proc. 21st National Conf. on Artificial Intelligence, vol. 1, pp. 775-780, Boston, MA, USA, 16-20 Jul. 2006.
[27] K. Jarvelin and J. Kekalainen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. on Information Systems, vol. 20, no. 4, pp. 422-446, Oct. 2002.