Improving Ranking Using BERT
Subject area: Electrical and Computer Engineering
Shokoufeh Bostan 1*, Ali Mohammad Zare Bidoki 2, Mohammadreza Pajoohan 3
1- Department of Computer Engineering, Yazd University, Iran
2- Department of Computer Engineering, Yazd University, Iran
3- Department of Computer Engineering, Yazd University, Iran
Keywords: semantic vector, word embedding, ranking, deep learning
Abstract:
In today's information age, efficient document ranking plays a crucial role in information retrieval systems. This article proposes a new approach to document ranking using embedding models, focusing on the BERT language model to improve ranking results. The proposed approach uses word embedding methods to capture semantic representations of user queries and document content. By converting textual data into semantic vectors, the relevance and similarity between queries and documents are evaluated under the proposed ranking formulas at lower cost. The proposed ranking formulas consider several factors to improve accuracy, including word embedding vectors, the position of keywords, and the impact of high-value words on ranking based on semantic vectors. Experiments and comparative analyses were conducted to evaluate the effectiveness of the proposed formulas. The empirical results demonstrate the effectiveness of the proposed approach, achieving higher accuracy than common ranking methods. These results indicate that using embedding models and combining them in the proposed ranking formulas significantly improves ranking accuracy, reaching 0.87 in the best case. This study contributes to improving document ranking and demonstrates the potential of the BERT embedding model for improving ranking performance.
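The core step described above — converting queries and documents into semantic vectors and scoring their similarity — can be illustrated with a minimal sketch. This is not the paper's actual ranking formula: the toy 4-dimensional vectors stand in for real BERT embeddings, and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 4-dimensional vectors standing in for real BERT embeddings.
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = [
    np.array([0.9, 0.1, 1.1, 0.0]),  # nearly parallel to the query
    np.array([0.0, 1.0, 0.0, 1.0]),  # orthogonal to the query
    np.array([0.5, 0.5, 0.5, 0.5]),  # partially related
]
print(rank_documents(query, docs))  # → [0, 2, 1]
```

In practice the vectors would come from a BERT encoder, and the paper's proposed formulas additionally weight keyword position and high-value words rather than relying on cosine similarity alone.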
[1] Y. Yum, et al., "A word pair dataset for semantic similarity and relatedness in Korean medical vocabulary: reference development and validation," JMIR Medical Informatics, vol. 9, no. 6, Article ID: e29667, Jun. 2021.
[2] E. Hindocha, V. Yazhiny, A. Arunkumar, and P. Boobalan, "Short-text semantic similarity using GloVe word embedding," International Research J. of Engineering and Technology, vol. 6, no. 4, pp. 553-558, Apr. 2019.
[3] J. Zhang, Y. Liu, J. Mao, W. Ma, and J. Xu, "User behavior simulation for search result re-ranking," ACM Trans. on Information Systems, vol. 41, no. 1, Article ID: 5, 35 pp., Jan. 2023.
[4] V. Zosimov and O. Bulgakova, "Usage of inductive algorithms for building a search results ranking model based on visitor rating evaluations," in Proc. IEEE 13th Int. Scientific and Technical Conf. on Computer Sciences and Information Technologies, CSIT'18, pp. 466-469, Lviv, Ukraine, 11-14 Sept. 2018.
[5] B. Mitra and N. Craswell, Neural Models for Information Retrieval, arXiv preprint arXiv:1705.01509, vol. 1, 2017.
[6] V. Gupta, A. Dixit, and S. Sethi, "A comparative analysis of sentence embedding techniques for document ranking," J. of Web Engineering, vol. 21, no. 7, pp. 2149-2186, 2022.
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: global vectors for word representation," in Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP'14, pp. 1532-1543, Doha, Qatar, 25-29 Oct. 2014.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Int. Conf. on Learning Representations, ICLR'13, 12 pp., Scottsdale, AZ, USA, 2-4 May 2013.
[9] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Trans. of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[10] M. E. Peters, et al., "Deep contextualized word representations," in Proc. Conf. of the North American Chapter of the Association of Computational Linguistics, NAACL-HLT'18, 11 pp., New Orleans, LA, USA, 1-6 Jun. 2018.
[11] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT'19, 16 pp., Minneapolis, MN, USA, 2-7 Jun. 2019.
[12] T. Brown, et al., "Language models are few-shot learners," in Proc. 34th Conf. on Neural Information Processing Systems, NeurIPS'20, 25 pp., Vancouver, Canada, 6-12 Dec. 2020.
[13] P. Sherki, S. Navali, and R. Inturi, "Retaining semantic data in binarized word embedding," in Proc. IEEE 15th Int. Conf. on Semantic Computing, ICSC'21, pp. 130-133, Laguna Hills, CA, USA, 27-29 Jan. 2021.
[14] S. Li, T. S. Chua, J. Zhu, and C. Miao, Generative Topic Embedding: A Continuous Representation of Documents (Extended Version with Proofs), arXiv preprint arXiv:1606.02979, vol. 1, 2016.
[15] B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana, "A dual embedding space model for document ranking," in Proc. 25th Int. Conf. Companion on World Wide Web, WWW'16, 10 pp., Montreal, Canada, 11-15 Apr. 2016.
[16] M. Dehghani, H. Zamani, A. Severyn, and J. Kamps, "Neural ranking models with weak supervision," in Proc. of the 40th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '17, pp. 65-74, Tokyo, Japan, 7-11 Aug. 2017.
[17] C. Xiong, Z. Dai, and J. Callan, "End-to-end neural ad-hoc ranking with kernel pooling," in Proc. of the 40th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 55-64, Tokyo, Japan, 7-11 Aug. 2017.
[18] R. Brochier, A. Guille, and J. Velcin, "Global vectors for node representations," in Proc. ACM World Wide Web Conf., WWW'19, pp. 2587-2593, San Francisco, CA, USA, 13-17 May 2019.
[19] A. Gourru and J. Velcin, "Gaussian embedding of linked documents from a pretrained semantic space," in Proc. 29th Int. Joint Conf. on Artificial Intelligence, IJCAI'20, pp. 3912-3918, Yokohama, Japan, 8-10 Jan. 2021.
[20] R. Menon, J. Kaartik, and K. Nambiar, "Improving ranking in document based search systems," in Proc. 4th Int. Conf. on Trends in Electronics and Informatics, ICOEI'20, pp. 914-921, Tirunelveli, India, 15-17 Jun. 2020.
[21] J. Li, C. Guo, and Z. Wei, "Improving document ranking with relevance-based entity embeddings," in Proc. 8th Int. Conf. on Big Data and Information Analytics, BigDIA'22, pp. 186-192, Guiyang, China, 24-25 Aug. 2022.
[22] S. Han, X. Wang, M. Bendersky, and M. Najork, Learning-to-Rank with BERT in TF-Ranking, Google Research Tech Report, 2020.
[23] S. Bostan, A. M. Zare Bidoki, and M. R. Pajoohan, "Semantic word embedding using BERT on the Persian web," Iranian J. of Electrical and Computer Engineering, Part B: Computer Engineering, vol. 21, no. 2, pp. 89-100, Summer 2023.
[24] M. Farahani, M. Gharachorloo, M. Farahani, and M. Manthouri, "ParsBERT: transformer-based model for Persian language understanding," Neural Processing Letters, vol. 53, pp. 3831-3847, 2021.
[25] D. Yang and Y. Yin, "Evaluation of taxonomic and neural embedding methods for calculating semantic similarity," Natural Language Engineering, vol. 28, no. 6, pp. 733-761, Nov. 2022.
[26] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," in Proc. 21st National Conf. on Artificial Intelligence, vol. 1, pp. 775-780, Boston, MA, USA, 16-20 Jul. 2006.
[27] K. Jarvelin and J. Kekalainen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. on Information Systems, vol. 20, no. 4, pp. 422-446, Oct. 2002.