
The article featured today is from the Canadian Journal of Statistics with the full article now available to read here.
Wieczorek, J. and Lei, J. (2022), Model selection properties of forward selection and sequential cross-validation for high-dimensional regression. Can J Statistics. https://doi.org/10.1002/cjs.11635
Forward Selection (FS) is a popular variable selection method for linear regression. FS is a greedy algorithm that creates a path of nested models, adding one variable at a time to minimize the residual sum of squares at each step. But despite its wide use, the properties of FS for choosing the true model (if there is one) have not been well understood. In high-dimensional settings, where there may be more variables than observations, we derive conditions under which the FS path will include the true model with probability going to 1.

The next challenge is knowing when to stop adding variables along the model path. Several popular stopping rules for FS are known to fail, so we provide a stopping rule based on a sequential variant of cross-validation (CV): split your data into training and test sets, fit the FS path one step at a time on the training set, and stop as soon as these trained models become worse at predicting the test set. We derive conditions for this stopping rule to select the true model with probability going to 1. Briefly, we cannot expect FS to recover the true model if pairs of variables are highly correlated or if the test set is not sufficiently large. We demonstrate our methods using the Million Song Dataset.
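The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation: it assumes centered data (no intercept), and the function name and interface are my own. At each step it picks the candidate variable that most reduces the training residual sum of squares, then stops as soon as the enlarged model predicts the held-out test set worse than the previous one.

```python
import numpy as np

def forward_selection_cv(X_train, y_train, X_test, y_test):
    """Greedy forward selection with a sequential CV stopping rule.

    A sketch of the procedure described above (hypothetical interface,
    not the paper's code). Assumes X and y are centered, so no
    intercept term is fit.
    """
    n, p = X_train.shape
    selected, remaining = [], list(range(p))
    # Baseline: the empty (mean-zero) model's test error.
    best_err = np.mean(y_test ** 2)
    while remaining:
        # Step 1: among remaining variables, find the one whose
        # addition minimizes the training residual sum of squares.
        best_rss, best_j, best_beta = np.inf, None, None
        for j in remaining:
            Xs = X_train[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y_train, rcond=None)
            rss = np.sum((y_train - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j, best_beta = rss, j, beta
        # Step 2: sequential CV stopping rule -- stop as soon as the
        # newly fitted model predicts the test set worse.
        preds = X_test[:, selected + [best_j]] @ best_beta
        test_err = np.mean((y_test - preds) ** 2)
        if test_err >= best_err:
            break
        best_err = test_err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

On simulated data with a sparse true model and modest noise, the rule tends to recover the active variables early on the path and then halt once further additions stop improving test-set prediction.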