A Semi-Supervised Learning Framework for Accurate Test Case Classification Using Language Embeddings and Semantic Text Features
Subject areas: Electrical and Computer Engineering
Mohammad Hossein Parvaneh 1, Maryam Nooraei 2 *
1 - Department of Computer Engineering, Arvand International Branch, Islamic Azad University, Abadan, Iran
2 - Faculty member, Islamic Azad University, Arvand International Branch
Keywords: semi-supervised learning, natural language processing, semantic learning, S3VM, SVM
Abstract:
With the expanding use of artificial intelligence in software engineering, intelligent methods for classifying test cases have become a key necessity. One of the main challenges in this area is the strong dependence of models on labeled data, which is costly and time-consuming to produce. In this study, to examine the effectiveness of semi-supervised learning under such conditions, a framework based on pseudo-labeling was designed that incorporates unlabeled data into model training and assigns an appropriate weight to the unsupervised part of the loss function. For evaluation, the AG News dataset was used, comprising 120,000 training samples and 7,600 test samples; of the training data, 20% (24,000 samples) served as labeled data and 80% (96,000 samples) as unlabeled data. Features were extracted with the BERT-base model, which produces 768-dimensional vectors. Results on accuracy, precision, recall, and F1-score showed that the semi-supervised method yields a small but significant improvement over the supervised support vector machine. These findings indicate that unlabeled data can be used effectively to improve machine learning models in low-data settings.
As artificial intelligence is increasingly integrated with software testing, intelligent automation of evaluation processes and test case classification has become a necessity. A key challenge in this domain is the strong dependency on labeled data, which is costly and time-consuming to produce. In this study, a semi-supervised learning framework based on pseudo-labeling was designed and implemented; it incorporates unlabeled data into the training process and weights the unsupervised loss. The dataset was AG News, which consists of four news categories; only 20% of the training data was treated as labeled and the remaining 80% as unlabeled. For feature extraction, the BERT-base model was employed as a language embedder, producing 768-dimensional vectors (default configuration). Data preprocessing included tokenization with BertTokenizer, removal of punctuation and irrelevant characters, and text normalization. Evaluation with accuracy, precision, recall, and F1-score showed that the semi-supervised approach outperformed the supervised SVM under limited labeled data, with an average improvement of 5–10% across the metrics.
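The idea of adding a weighted unsupervised term to the loss can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weight `alpha`, and the confidence `threshold` are all assumptions introduced here, and the abstract does not say whether a confidence threshold was used. The sketch takes per-class probabilities for four classes (as in AG News), assigns each confident unlabeled sample its most probable class as a pseudo-label, and adds the resulting cross-entropy, scaled by `alpha`, to the supervised cross-entropy.

```python
import math
from typing import List, Sequence, Tuple


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class for one sample."""
    return -math.log(max(probs[target], 1e-12))


def select_pseudo_labels(
    unlabeled_probs: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (sample index, predicted class) for confident predictions.

    The threshold is an illustrative assumption, not taken from the paper.
    """
    selected = []
    for i, probs in enumerate(unlabeled_probs):
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        if probs[predicted] >= threshold:
            selected.append((i, predicted))
    return selected


def semi_supervised_loss(
    labeled_probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    unlabeled_probs: Sequence[Sequence[float]],
    alpha: float,
    threshold: float = 0.9,
) -> float:
    """Supervised loss plus alpha times the pseudo-label loss."""
    supervised = sum(
        cross_entropy(p, y) for p, y in zip(labeled_probs, labels)
    ) / len(labels)
    selected = select_pseudo_labels(unlabeled_probs, threshold)
    if not selected:
        # No confident pseudo-labels: fall back to the supervised term only.
        return supervised
    unsupervised = sum(
        cross_entropy(unlabeled_probs[i], c) for i, c in selected
    ) / len(selected)
    return supervised + alpha * unsupervised


# Toy probabilities over four classes (hypothetical values).
labeled_probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]]
labels = [0, 1]
unlabeled_probs = [[0.95, 0.02, 0.02, 0.01], [0.4, 0.3, 0.2, 0.1]]
loss = semi_supervised_loss(labeled_probs, labels, unlabeled_probs, alpha=0.5)
```

In this toy run only the first unlabeled sample clears the threshold, so the second contributes nothing to the loss. Setting `alpha` to zero recovers purely supervised training, which is one way to read the comparison against the supervised baseline.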