Investigating variable selection techniques under missing data: A simulation study

Catherine Bain, Dingjing Shi

2024 · Open Access · DOI: https://doi.org/10.31234/osf.io/d4c2k · OA: W4394874386

Variable selection is one of the most pervasive problems researchers face, especially now that online data collection strategies have made data easier to gather. Machine learning methods such as LASSO and elastic net regression have gained traction in the field but are limited in the types of problems for which they are suitable. As a result, researchers have borrowed more complex techniques, such as the genetic algorithm, from fields like computer science. Although there is strong support in the literature for the use of each of these methods on complete data (McNeish, 2015; Schroeders, Wilhelm, & Olaru, 2016), less is known about their relative performance in the presence of missing data. Using a large-scale Monte Carlo simulation, this study evaluates the performance of the LASSO, the elastic net, and the genetic algorithm in solving variable selection problems in the presence of ignorable missing data. In particular, the study incorporates the state-of-the-art missing data handling technique, multiple imputation, into each of the studied tools. All techniques were found to perform at satisfactory levels (as measured by MSE, precision, false positive rate, and computation time) under missing completely at random (MCAR) and missing at random (MAR) conditions. The genetic algorithm was the most robust to changes in the data.
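To make the design concrete, the following is a minimal Python sketch of one replication of this kind of simulation: generate data with known true predictors, delete predictor values under MCAR, impute several times, run the LASSO on each imputed data set, and score the selected variables. It is not the authors' code. The sample size, coefficients, 10% missingness rate, penalty `lam`, the hot-deck draw standing in for a full multiple imputation model, and the majority-vote rule for combining selections across imputations are all assumptions chosen for illustration.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 on centered data."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]  # residual while all coefficients are zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / z
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_b
    return beta

def make_mcar(X, rate, rng):
    """Delete each cell independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in X]

def impute_once(X_miss, rng):
    """Assumed stand-in for one imputation: hot-deck draw from observed column values."""
    p = len(X_miss[0])
    observed = [[row[j] for row in X_miss if row[j] is not None] for j in range(p)]
    return [[rng.choice(observed[j]) if v is None else v for j, v in enumerate(row)]
            for row in X_miss]

def select_with_mi(X_miss, y, lam, m, rng):
    """Run the LASSO on m imputed data sets; keep variables chosen in a majority."""
    counts = [0] * len(X_miss[0])
    for _ in range(m):
        beta = lasso_cd(impute_once(X_miss, rng), y, lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return {j for j, c in enumerate(counts) if c > m / 2}

# One replication with illustrative (assumed) settings.
rng = random.Random(0)
n, true_beta = 300, [2.0, -2.0, 0.0, 0.0, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 1) for row in X]
X_miss = make_mcar(X, 0.10, rng)

selected = select_with_mi(X_miss, y, lam=0.3, m=5, rng=rng)
truth = {j for j, b in enumerate(true_beta) if b != 0.0}
noise = set(range(len(true_beta))) - truth

precision = len(selected & truth) / len(selected) if selected else 0.0
fpr = len(selected & noise) / len(noise)
print("selected:", sorted(selected), "precision:", precision, "FPR:", fpr)
```

A full Monte Carlo study would repeat this replication many times across sample sizes, missingness rates, and mechanisms (including MAR), and would use an imputation model that draws on the relationships among variables. The hot-deck draw here ignores those relationships and is used only to keep the sketch dependency-free.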