Combining Instance Selection and Data Augmentation Techniques to Solve the Imbalanced Dataset Classification Problem

Subject areas: Electrical and Computer Engineering

Parastoo Mohaghegh ¹, Samira Noferesti ²*, Mehri Rajaei ³

1 - Faculty of Electrical and Computer Engineering, University of Sistan and Baluchestan
2 - Faculty of Electrical and Computer Engineering, University of Sistan and Baluchestan
3 - Faculty of Electrical and Computer Engineering, University of Sistan and Baluchestan

Received: 1401/12/19   Accepted: 1402/07/10   Published: 1403/01/29 (Solar Hijri dates)

Keywords: instance selection, data augmentation, classification, imbalanced dataset, data mining, machine learning

Abstract (translated from the Persian):

In the era of big data, automated analysis techniques such as data mining are widely used for decision-making and have proven highly effective. One such technique is classification, a common method for decision-making and prediction. Classification algorithms generally perform well on balanced datasets. However, one problem they face is correctly predicting the labels of new samples after learning from imbalanced datasets. In such datasets, the uneven distribution of data across classes causes samples of the class with fewer samples to be ignored when the classifier is trained, even though this class is the more important one in some prediction problems. To address this problem, this paper presents an efficient method for balancing imbalanced datasets: by balancing the number of samples of the different classes, it improves the machine learning algorithm's ability to correctly predict the class labels of new samples. According to the evaluations carried out, the proposed method outperforms other methods on two criteria commonly used to evaluate classification of imbalanced datasets, namely "balanced accuracy" and "specificity".

Abstract (original English):

P. Mohaghegh, S. Noferesti*, and M. Rajaei

Abstract: In the era of big data, automatic data analysis techniques such as data mining have been widely used for decision-making and have proven very effective. Among data mining techniques, classification is a common method for decision-making and prediction. Classification algorithms usually work well on balanced datasets. However, one of the challenges for classification algorithms is correctly predicting the labels of new samples after learning from imbalanced datasets. In this type of dataset, the uneven distribution of the data across classes causes examples of the minority class to be ignored in the learning process, even though this class is more important in some prediction problems. To deal with this issue, this paper presents an efficient method for balancing imbalanced datasets, which improves the ability of machine learning algorithms to correctly predict the class labels of new samples. According to the evaluations, the proposed method performs better than other methods on two criteria commonly used to evaluate classification of imbalanced datasets, namely "Balanced Accuracy" and "Specificity".
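
The two evaluation criteria named in the abstract can be illustrated with a minimal sketch for the binary case. The abstract does not give the paper's exact definitions, so the code below uses the standard ones, and it assumes (as an illustration only) that the minority class is labelled 1 and treated as the positive class. The function names are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fn, tn, fp) for a binary problem.

    Assumption: `positive` marks the minority class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive (minority) samples that are predicted correctly."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred, positive=1):
    """Fraction of negative (majority) samples that are predicted correctly."""
    _, _, tn, fp = confusion_counts(y_true, y_pred, positive)
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity, so each class counts equally."""
    return (sensitivity(y_true, y_pred, positive)
            + specificity(y_true, y_pred, positive)) / 2


# Illustrative data (not from the paper): 8 majority samples, 2 minority
# samples, and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(specificity(y_true, y_pred))        # 1.0
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this toy case plain accuracy is 0.8 even though every minority sample is missed, while balanced accuracy drops to 0.5. This is the kind of effect that motivates class-aware metrics on imbalanced data.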

