Highway to the danger zone? Effects of sample size, number of parameters, and collinearity on error estimates from cross-validation

Multiple linear regression is commonly used by ecologists to fit predictive models of species distributions and biodiversity patterns. Cross-validation methods can provide estimates of the prediction error of these models, but the quality of those estimates depends greatly on the ratio of observations to independent variables. As a general rule, cross-validation works best when there are at least 30 observations for each variable in the model; for example, a model with three predictors calls for at least 90 observations.

[Figure: Percent deviation from true prediction error (PE) for various ratios of sample size to number of variables in the model.]

Ecologists often approach their research as a 'measure everything, predict everything' endeavor. Life is good when the models work. When they don't... is it the data? The model? The cross-validation method? To answer these questions, it's (past) time to call in a statistician for help. Enter Nick Keuler, resident statistician in the SILVIS lab.

Nick led a study examining how cross-validation methods and model properties affect estimates of the prediction error of multiple linear regression models. Using simulations with data generated from known models, he examined three model properties (sample size, number of variables in the model, and degree of correlation among predictor variables) and two common cross-validation methods (k-fold cross-validation and the bootstrap). Nick expected more variation among cross-validation methods at the low sample size (10 observations) than at the medium (30 observations) or high (100 observations) sample sizes, and that was exactly what he found. However, both cross-validation methods worked well when there were >30 observations for each independent variable. Surprisingly, the results held regardless of the degree of correlation among the independent variables.
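To make the comparison concrete, here is a minimal sketch in Python (not the study's actual code) of how such a simulation might look: data are generated from a known linear model with correlated predictors, prediction error is estimated by 10-fold cross-validation and by a simple bootstrap, and both estimates are compared against the true error approximated on a large independent test set. The sample size, number of predictors, correlation, and number of bootstrap replicates below are illustrative choices, not the study's settings.

```python
# Sketch: compare k-fold and bootstrap estimates of prediction error against
# the (approximate) true error for a simulated linear model. All parameter
# values are illustrative assumptions, not the study's actual design.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

rng = np.random.default_rng(42)
n, p, rho, sigma = 30, 3, 0.5, 1.0  # observations, predictors, correlation, noise SD

# Correlated predictors: equicorrelation structure with off-diagonal rho
cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta = np.ones(p)
y = X @ beta + rng.normal(0, sigma, size=n)

model = LinearRegression()

# k-fold cross-validation estimate of mean squared prediction error
kfold_mse = -cross_val_score(
    model, X, y, cv=10, scoring="neg_mean_squared_error"
).mean()

# Simple bootstrap estimate: refit on each resample, evaluate on the original
# data (note this basic variant is optimistic; refined versions correct for it)
boot_errors = []
for _ in range(200):
    Xb, yb = resample(X, y)
    fit = LinearRegression().fit(Xb, yb)
    boot_errors.append(np.mean((y - fit.predict(X)) ** 2))
boot_mse = np.mean(boot_errors)

# "True" prediction error, approximated on a large independent test set
X_test = rng.multivariate_normal(np.zeros(p), cov, size=100_000)
y_test = X_test @ beta + rng.normal(0, sigma, size=100_000)
true_mse = np.mean((y_test - model.fit(X, y).predict(X_test)) ** 2)

print(f"k-fold: {kfold_mse:.3f}  bootstrap: {boot_mse:.3f}  true: {true_mse:.3f}")
```

Rerunning this sketch while varying n, p, and rho mimics the structure of the simulation study: how far the cross-validated estimates deviate from the true error as the ratio of observations to variables shrinks.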

Based on Nick's simulations, cross-validation methods failed at the smallest sample size because of non-linear relationships, outliers, and influential points. This finding suggests ecologists may successfully enter the 'Danger Zone', defined as <30 observations per independent variable, provided they are careful to check the underlying assumptions of the model they are trying to fit (a sketch of such checks follows below). When inside the Danger Zone, Nick recommends using the bootstrap or k-fold cross-validation with at least 10 folds. Nick cautions against fitting or cross-validating a model with <10 observations - there simply aren't enough data to evaluate model assumptions or assess the model fit.
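As a rough illustration of those checks, the sketch below fits an ordinary least squares model to a deliberately small simulated dataset and screens for influential points and outliers. The 4/n Cook's distance cutoff and the |t| > 2 residual threshold are common rules of thumb, not recommendations from the study.

```python
# Sketch: basic assumption checks before trusting a model fit inside the
# Danger Zone. Data and thresholds are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 20, 2                      # deliberately small: inside the Danger Zone
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)
y[0] += 6                         # plant an outlier so the checks have something to find

fit = sm.OLS(y, sm.add_constant(X)).fit()
influence = fit.get_influence()

# Cook's distance flags observations that strongly influence the fit;
# 4/n is a common screening threshold, not a hard rule
cooks_d, _ = influence.cooks_distance
print("Potentially influential points:", np.where(cooks_d > 4 / n)[0])

# Externally studentized residuals screen for outliers
student = influence.resid_studentized_external
print("Large residuals (|t| > 2):", np.where(np.abs(student) > 2)[0])

# Curvature in a plot of fit.resid vs. fit.fittedvalues would suggest a
# non-linear relationship; with this few points, inspect the plot directly
# rather than relying on a formal test.
```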