A Semi-Supervised Learning Framework for Accurate Test Case Classification Using Language Embeddings and Semantic Text Features

Subject areas: Electrical and Computer Engineering

1 - Department of Computer Engineering, Arvand International Branch, Islamic Azad University, Abadan, Iran
2 - Faculty member, Islamic Azad University, Arvand International Branch

Received: 1404/06/08 Accepted: 1404/11/03 Published: 1405/02/22 (Solar Hijri calendar)

Keywords: semi-supervised learning, natural language processing, semantic learning, S3VM, SVM

Abstract:

With the growing use of artificial intelligence in software engineering, intelligent methods for classifying test cases have become a key necessity. One of the main challenges in this area is the strong dependence of models on labeled data, which is costly and time-consuming to produce. To examine the effectiveness of semi-supervised learning under these conditions, this study designed a pseudo-labeling framework that incorporates unlabeled data into model training and assigns an appropriate weight to the unsupervised term of the loss function. For evaluation, the AG News dataset was used, comprising 120,000 training samples and 7,600 test samples; of the training data, 20% (24,000 samples) served as labeled data and 80% (96,000 samples) as unlabeled data. Features were extracted with the BERT-base model, which produces 768-dimensional vectors. Results in terms of accuracy, precision, recall, and F1-score showed that the semi-supervised method yields a small but significant improvement over the supervised support vector machine. These findings indicate that unlabeled data can be used effectively to improve machine learning models in low-data settings.
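The abstract does not spell out the loss function, so the following is only a minimal sketch of how a weighted pseudo-label objective is commonly written: a supervised cross-entropy term plus an unsupervised term scaled by a weight. The names `alpha` and `threshold`, their values, and the confidence filter itself are assumptions made for illustration and are not taken from the paper.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability that the model assigns to `label`."""
    return -math.log(max(probs[label], 1e-12))


def pseudo_label(probs, threshold):
    """Return the most likely class if it is confident enough, else None."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best if probs[best] >= threshold else None


def combined_loss(labeled, unlabeled, alpha, threshold=0.9):
    """Supervised loss plus alpha times the pseudo-label loss.

    labeled:   list of (class_probabilities, true_label) pairs
    unlabeled: list of class_probabilities (no labels available)
    alpha:     hypothetical weight of the unsupervised term
    threshold: hypothetical confidence cut-off for accepting a pseudo-label
    """
    sup = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)

    # Keep only unlabeled samples whose predicted class is confident enough.
    confident = [(p, pseudo_label(p, threshold)) for p in unlabeled]
    confident = [(p, y) for p, y in confident if y is not None]

    if confident:
        unsup = sum(cross_entropy(p, y) for p, y in confident) / len(confident)
    else:
        unsup = 0.0
    return sup + alpha * unsup


# Four classes, matching the four AG News categories.
labeled_batch = [([0.7, 0.1, 0.1, 0.1], 0)]
unlabeled_batch = [[0.95, 0.02, 0.02, 0.01], [0.4, 0.3, 0.2, 0.1]]
loss = combined_loss(labeled_batch, unlabeled_batch, alpha=0.5)
```

In this toy batch only the first unlabeled sample passes the confidence filter, so the second one contributes nothing to the loss. Setting `alpha` to zero reduces the objective to the purely supervised case.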

English abstract:

As artificial intelligence is increasingly integrated into software testing, intelligent automation of evaluation processes and test case classification has become a necessity. A key challenge in this domain is the strong dependence on labeled data, which is costly and time-consuming to produce. In this study, a semi-supervised learning framework based on pseudo-labeling was designed and implemented; it incorporates unlabeled data into the training process and weights the unsupervised loss term. The dataset used was AG News, which consists of four news categories; only 20% of the training data was treated as labeled and the remaining 80% as unlabeled. For feature extraction, the BERT-base model was employed as a language embedder, producing 768-dimensional vectors (default configuration). Data preprocessing included tokenization with BertTokenizer, removal of punctuation and irrelevant characters, and text normalization. Evaluation with accuracy, precision, recall, and F1-score showed that the semi-supervised approach outperformed the supervised SVM under limited labeled data, with an average improvement of 5–10% across the metrics.
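The preprocessing steps named above (punctuation removal, removal of irrelevant characters, and normalization) might look roughly like the sketch below. The exact rules used in the paper are not given, so the choices here are assumptions: the function name `normalize_text` is hypothetical, and lowercasing, NFKC normalization, and restricting the text to ASCII letters and digits are illustrative decisions only. Tokenization with BertTokenizer would follow this step and is not shown.

```python
import re
import string
import unicodedata

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_text(text):
    """Clean one news snippet before it is handed to a tokenizer.

    Assumed steps (not confirmed by the paper):
    1. Unicode NFKC normalization
    2. lowercasing
    3. punctuation replaced by spaces
    4. any remaining character outside a-z, 0-9, whitespace replaced by a space
    5. whitespace collapsed to single spaces
    """
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_text("Wall St. Bears Claw Back!!  (Reuters)")
```

Replacing punctuation with spaces rather than deleting it keeps words such as "U.S.-based" from fusing into a single token. The trade-off is that abbreviations are split into separate letters.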

