Highway to the danger zone? Effects of sample size, number of parameters, and collinearity on error estimates from cross-validation
Nick led a study examining how cross-validation methods and model properties affect estimates of prediction error for multiple linear regression models. By simulating data with known properties, he examined three model properties (sample size, number of variables in the model, and degree of correlation among predictor variables) and two common cross-validation methods (k-fold cross-validation and the bootstrap). Nick expected more variation among cross-validation methods at the low sample size (10 observations) than at the medium (30 observations) or high (100 observations) sample sizes, and that was exactly what he found. However, both cross-validation methods worked well when there were >30 observations for each independent variable. Surprisingly, the results held regardless of the degree of correlation among independent variables.
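As a rough illustration of the kind of simulation described above, the following Python sketch draws correlated predictors, builds a linear response, and estimates prediction error with k-fold cross-validation at the three sample sizes. It is not Nick's code: the function names, the equal pairwise correlation `rho`, the unit coefficients, and the noise level are all assumptions made for the example.

```python
import numpy as np


def simulate_data(n_obs, n_predictors, rho, rng, noise_sd=1.0):
    """Draw predictors with pairwise correlation rho and a linear response."""
    # Assumed covariance structure: 1 on the diagonal, rho everywhere else.
    cov = np.full((n_predictors, n_predictors), rho)
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(n_predictors), cov, size=n_obs)
    beta = np.ones(n_predictors)  # assumed "true" coefficients
    y = X @ beta + rng.normal(0.0, noise_sd, size=n_obs)
    return X, y


def fit_ols(X, y):
    """Least-squares coefficients, with an intercept in position 0."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def kfold_mse(X, y, k, rng):
    """Mean squared prediction error over all held-out observations."""
    idx = rng.permutation(len(y))
    sq_errors = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        coef = fit_ols(X[train_idx], y[train_idx])
        sq_errors.append((y[test_idx] - predict(coef, X[test_idx])) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))


rng = np.random.default_rng(0)
for n_obs in (10, 30, 100):  # low, medium, and high sample sizes
    X, y = simulate_data(n_obs, n_predictors=2, rho=0.5, rng=rng)
    print(n_obs, kfold_mse(X, y, k=10, rng=rng))
```

Repeating the loop over many simulated data sets, and over different values of `rho` and `n_predictors`, would show how much the error estimate varies at each sample size.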
Based on Nick's simulations, cross-validation methods failed at the smallest sample size because of non-linear relationships, outliers, and influential points. This finding suggests ecologists may successfully enter the 'Danger Zone', defined as <30 observations per independent variable, provided they carefully check the underlying assumptions of the model they are trying to fit. When inside the Danger Zone, Nick recommends using the bootstrap method or k-fold cross-validation with at least 10 folds. Nick cautions against fitting or cross-validating a model with <10 observations: there simply aren't enough data to evaluate model assumptions or assess the model fit.
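The recommendation above can be sketched in Python as a check for whether a data set sits inside the Danger Zone, followed by a bootstrap estimate of prediction error. The summary does not say which bootstrap variant Nick used, so this example assumes a simple out-of-bag version (fit on each resample, score on the observations left out of it). The helper names and the example data are hypothetical.

```python
import numpy as np


def observations_per_predictor(X):
    """Number of rows divided by number of predictor columns."""
    return X.shape[0] / X.shape[1]


def in_danger_zone(X, threshold=30):
    """True when there are fewer than `threshold` observations per predictor."""
    return observations_per_predictor(X) < threshold


def bootstrap_oob_mse(X, y, n_boot, rng):
    """Out-of-bag bootstrap estimate of mean squared prediction error."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept plus predictors
    errors = []
    for _ in range(n_boot):
        boot_idx = rng.integers(0, n, size=n)  # resample with replacement
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[boot_idx] = False  # rows never drawn are out-of-bag
        if not oob_mask.any():
            continue  # no held-out rows in this resample
        coef, *_ = np.linalg.lstsq(design[boot_idx], y[boot_idx], rcond=None)
        resid = y[oob_mask] - design[oob_mask] @ coef
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))


# Hypothetical example: 20 observations, 2 predictors (10 per predictor).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X.sum(axis=1) + rng.normal(size=20)

if in_danger_zone(X):
    print("Inside the Danger Zone; check model assumptions carefully.")
print(bootstrap_oob_mse(X, y, n_boot=200, rng=rng))
```

Neither function replaces the assumption checks the paragraph calls for: residual plots and influence diagnostics still need to be examined before the error estimate is trusted.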