Engineering

A Novel Resampling Technique for Imbalanced Classification in Software Defect Prediction by a re-sampling method with filtering

Authors

  • Kamal Bashir

    Karary University
    Author
  • Mohamed Mosadag

    Karary University
    Author

DOI:

https://doi.org/10.71107/1v443128

Keywords:

Software defect prediction, Rough sets theory, Data sampling, Noise filtering

Abstract

Imbalanced data is one of the greatest difficulties faced by most classifier-learning algorithms. Some contemporary studies claim, however, that class imbalance is not inherently detrimental and that the decline in performance is not solely attributable to it: other aspects of the data distribution, such as noise and borderline cases near the class boundaries, also contribute. In this study, we propose a new hybrid preprocessing technique that handles class imbalance, noise, and borderline samples in software defect data. The technique combines hyperparameter optimization based on Rough Set Theory (RST), the Iterative-Partitioning Filter (IPF), and the Synthetic Minority Oversampling Technique (SMOTE). It first applies SMOTE, which generates synthetic examples by linear interpolation between two defect-prone examples that are k-nearest neighbors. Next, RST is used to remove the majority-class examples that fall in the lower approximation of the original and newly generated minority-class examples. IPF is then applied to filter noisy examples out of the resulting data. We evaluate the algorithm in experiments that use C4.5 as the learning algorithm, and statistical tests show that the proposed method outperforms state-of-the-art sampling methods.
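The abstract gives no implementation details, so the following is only a minimal sketch of the generic SMOTE interpolation step it mentions, not the authors' implementation. The function names, the value of `k`, the random seed, and the toy two-metric data are all assumptions made for illustration. The RST and IPF stages are not shown.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by linear interpolation.

    Each synthetic example lies on the segment between a randomly chosen
    minority example and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy defect-prone (minority) examples with two hypothetical metrics each.
defective = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_examples = smote(defective, n_synthetic=4, k=2)
```

Because every synthetic point is a convex combination of two existing minority examples, it always stays inside the region already spanned by the minority class. This is also why synthetic points can land among majority-class examples, which is the situation the later filtering stages are meant to address.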

Author Biographies

  • Kamal Bashir, Karary University

    Assistant Professor at Karary University

  • Mohamed Mosadag, Karary University

    Assistant Professor at Karary University

Published

2025-02-21