When researchers develop statistical models or machine learning algorithms, they typically train their models on some data and then assess model quality on a separate, held-out dataset. This is called data splitting or cross-validation (CV). It is a standard way to estimate how well a model will perform on new data, often in order to choose one production model from several contenders. However, appropriate use of CV requires some nuance.
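To fix ideas, here is a minimal base-R sketch of ordinary K-fold CV on simulated data; this is the unmodified procedure that the rest of this section builds on:

```r
# Minimal sketch of standard K-fold cross-validation in base R.
# It implicitly assumes the rows are a simple random sample.
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

K <- 5
folds <- sample(rep_len(1:K, nrow(df)))  # shuffle a balanced fold assignment
mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = df[folds != k, ])  # train on the other K - 1 folds
  test <- df[folds == k, ]
  mean((test$y - predict(fit, test))^2)      # error on the held-out fold
})
mean(mse)  # CV estimate of out-of-sample prediction error
```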
The most common variants of CV assume that the data are simply a random sample from the population or process under study. Yet many government surveys and political polls do not invite a simple random sample of people to respond: they build extra structure into the sampling process. For instance, they might oversample demographic minority groups to ensure adequate representation, or interview households only within a limited set of geographic clusters to reduce costs. There are very good reasons to collect data this way, but it means that standard CV will not be quite right for assessing models built from such data.
We propose “Survey CV,” a modified version of CV that works for the most common approaches to sampling for surveys and polls: stratified sampling, cluster sampling, and unequal-probability sampling. In Survey CV, the data-splitting step mimics the original sampling design used to collect the data (for example, folds are formed separately within each stratum, and whole clusters are kept together within a fold), and survey weights (if available) are used in calculating measures of model performance.
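To make this concrete, here is a minimal base-R sketch (our own illustration of the idea, not the internals of the package described below) of design-mimicking fold assignment and a survey-weighted test loss:

```r
# Stratified design: assign folds separately within each stratum, so every
# fold preserves the strata proportions of the full sample.
stratified_folds <- function(strata, K = 5) {
  folds <- integer(length(strata))
  for (s in unique(strata)) {
    idx <- which(strata == s)
    folds[idx] <- rep_len(1:K, length(idx))[sample.int(length(idx))]
  }
  folds
}

# Cluster design: assign whole clusters to folds, so observations from the
# same cluster never straddle the train/test boundary.
cluster_folds <- function(clusters, K = 5) {
  ids <- unique(clusters)
  fold_of_cluster <- rep_len(1:K, length(ids))[sample.int(length(ids))]
  fold_of_cluster[match(clusters, ids)]
}

# Survey-weighted test loss: mean squared error weighted by survey weights.
weighted_mse <- function(y, yhat, w) sum(w * (y - yhat)^2) / sum(w)
```

Dropping these into the earlier CV loop only requires swapping the fold assignment (e.g., `folds <- stratified_folds(df$stratum, K)`) and replacing the unweighted MSE with `weighted_mse()` on the held-out fold.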
We provide an intuitive rationale for our new method, simulations that demonstrate its benefits, and an illustration of its use on real survey data. Our open-source R software package “surveyCV” makes it easy to apply this Survey CV approach to several common tasks.
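As a rough usage sketch (the argument and column names below, such as `strataID`, `clusterID`, and `weightsID`, reflect our reading of the package documentation and may differ from the released version), comparing two candidate linear models under a stratified, clustered, weighted design might look like:

```r
# Hypothetical sketch of model comparison with the surveyCV package;
# argument and column names are assumptions, so check the package docs.
library(surveyCV)
cv.svy(mydata,
       c("income ~ age", "income ~ age + education"),  # candidate models
       nfolds = 5,
       strataID = "stratum",    # column identifying design strata
       clusterID = "cluster",   # column identifying sampling clusters
       weightsID = "weight")    # column of survey weights
```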