Within the design of a machine learning-based solution for classification or regression problems, variable selection techniques are often applied to identify the input variables, which mainly affect the considered target. The selection of such variables provides very interesting advantages, such as lower complexity of themodel and of the learning algorithm, reduction of computational time and improvement of performances. Moreover, variable selection is useful to gain a profound knowledge of the considered problem. High correlation in variables often produces multiple subsets of equally optimal variables, which makes the traditional method of variable selection unstable, leading to instability and reducing the confidence of selected variables. Stability identifies the reproducibility power of the variable selection method. Therefore, having a high stability is as important as the high precision of the developed model. The paper presents an automatic procedure for variable selection in classification (binary and multi-class) and regression tasks, which provides an optimal stability index without requiring any a priori information on data. The proposed approach has been tested on different small datasets, which are unstable by nature, and has achieved satisfactory results.

Improving the Stability of the Variable Selection with Small Datasets in Classification and Regression Tasks

Cateni S.;Colla V.
;
Vannucci M.
2022-01-01

Abstract

Within the design of a machine learning-based solution for classification or regression problems, variable selection techniques are often applied to identify the input variables, which mainly affect the considered target. The selection of such variables provides very interesting advantages, such as lower complexity of themodel and of the learning algorithm, reduction of computational time and improvement of performances. Moreover, variable selection is useful to gain a profound knowledge of the considered problem. High correlation in variables often produces multiple subsets of equally optimal variables, which makes the traditional method of variable selection unstable, leading to instability and reducing the confidence of selected variables. Stability identifies the reproducibility power of the variable selection method. Therefore, having a high stability is as important as the high precision of the developed model. The paper presents an automatic procedure for variable selection in classification (binary and multi-class) and regression tasks, which provides an optimal stability index without requiring any a priori information on data. The proposed approach has been tested on different small datasets, which are unstable by nature, and has achieved satisfactory results.
2022
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11382/547831
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
social impact