Towards a software defect proneness model: feature selection
DOI:
https://doi.org/10.15276/aait.04.2021.5
Keywords:
Software reliability, machine learning algorithms, defect, feature selection, software defect prediction
Abstract
This article focuses on improving static models of software reliability by using machine learning methods to select the
software code metrics that most strongly affect reliability. The study used a merged dataset from the PROMISE Software
Engineering repository, which contained testing data on the software modules of five programs and twenty-one code metrics. For the
prepared sample, the features that most strongly affect the quality of software code were selected using the following
feature selection methods: Boruta, Stepwise Selection, Exhaustive Feature Selection, Random Forest Importance, LightGBM
Importance, Genetic Algorithms, Principal Component Analysis, and the Xverse Python package. Based on a vote over the results of
these feature selection methods, a static (deterministic) model of software reliability was built, which establishes the
relationship between the probability of a defect in a software module and the metrics of its code. It has been shown that this model
includes the following code metrics: branch count of the program, McCabe’s lines of code and cyclomatic complexity, and Halstead’s total
number of operators and operands, intelligence, volume, and effort. The effectiveness of the different feature selection methods
was compared; in particular, the effect of the feature selection method on classification accuracy was studied
using the following classifiers: Random Forest, Support Vector Machine, k-Nearest Neighbors, Decision Tree,
AdaBoost, and Gradient Boosting. It has been shown that the use of any method of feature selection
increases classification accuracy by at least ten percent compared to the original dataset, which confirms the importance of this
procedure for predicting software defects from metric datasets that contain a significant number of highly correlated software
code metrics. For most classifiers, the best prediction accuracy was reached using the set of features obtained
from the proposed static model of software reliability. In addition, it has been shown that individual methods,
such as Autoencoder, Exhaustive Feature Selection, and Principal Component Analysis, can also be used with an insignificant loss of
classification and prediction accuracy.
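
The voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact voting rule, so the threshold `min_votes`, the helper name `vote_on_features`, and the per-method selections below are all hypothetical and chosen only for illustration. The metric names follow the style of PROMISE datasets, but the values are not results from the study.

```python
from collections import Counter


def vote_on_features(selections, min_votes):
    """Return the features chosen by at least `min_votes` selection methods.

    `selections` maps a method name to the list of features it selected.
    The threshold rule is an assumption, not the rule used in the article.
    """
    votes = Counter()
    for features in selections.values():
        # Each method casts at most one vote per feature.
        votes.update(set(features))
    kept = [name for name, count in votes.items() if count >= min_votes]
    # Order by vote count (descending), then alphabetically for stable output.
    return sorted(kept, key=lambda name: (-votes[name], name))


# Hypothetical per-method selections (illustrative only, not study results).
selections = {
    "boruta": ["branchCount", "loc", "v(g)", "v"],
    "random_forest": ["loc", "v(g)", "e", "lOComment"],
    "lightgbm": ["branchCount", "loc", "e", "v"],
    "stepwise": ["loc", "v(g)", "n"],
}

selected = vote_on_features(selections, min_votes=2)
print(selected)
```

Raising `min_votes` yields a smaller, more conservative feature set; lowering it keeps features that only a few methods agree on.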