# Multiscale Analysis of Survey Data: Recent Developments and Exciting Prospects

## Features

**Authors:** Jonathan R. Bradley\*, Christopher K. Wikle\*\* and Scott H. Holan\*\*\*

**Date:** 24 Mar 2015

**Copyright:** Image appears courtesy of Getty Images

The cross-over between survey statistics and spatio-temporal statistics leads to important inferential questions regarding the spatio-temporal resolution of data and latent processes, primarily because survey data are often collected over space and time. For example, the US Census Bureau’s American Community Survey (ACS) offers 1-year, 3-year, and 5-year period estimates (i.e., three different temporal scales) of key demographics over many different geographic regions (or areal units), including counties, census tracts, and combined statistical areas, among others.

In practice, the set of areal units of interest is problem specific, and may differ from the areal units provided by the ACS. For example, New York City’s Department of City Planning makes policy decisions to positively influence community districts in New York City. Unfortunately, the ACS does not provide estimates at this spatial scale. This leads to an important inferential question for data users: How does one predict on a spatial support (i.e., a target support) that is misaligned with the support of the data (i.e., a source support)?

In general, this is known as the spatial change of support (COS) problem (Cressie, 1993; Waller and Gotway, 2004, among others). There are two primary approaches currently available to solve this problem. The first approach involves partitioning the source support based on the target support, and then estimating finer-resolution parameters defined on this partitioning (Mugglin and Carlin, 1998). This “top-down” method requires bookkeeping of the partitioning formed by the source support and target support. In contrast, the second approach involves defining the latent spatial process on a point-level spatial domain. This allows one to estimate parameters at the point level and aggregate the latent spatial process to any coarser-resolution target support (Cressie and Wikle, 2011, Chap. 4). This “bottom-up” method is computationally advantageous and, in a Bayesian setting, avoids additional Markov chain Monte Carlo (MCMC) simulations every time a new target support is considered. This makes the second approach ideal for analyzing estimates from the ACS, where different data users may be interested in defining their own distinct target support.
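The reuse that makes the “bottom-up” approach attractive can be illustrated with a minimal sketch. It assumes that draws of the latent process are already stored on a regular fine grid with equal-area cells, so that an areal value can be approximated by an unweighted average of the grid values in that unit. The function name `aggregate_draws`, the simulated draws, and both target supports are hypothetical and only stand in for real model output.

```python
import numpy as np

def aggregate_draws(point_draws, labels):
    """Average point-level draws within each target areal unit.

    point_draws: array (n_draws, n_points), e.g. stored MCMC draws of a
                 latent process on a fine grid (simulated stand-in below).
    labels:      integer array (n_points,), target unit id of each grid point.
    Returns a dict {unit id: array of n_draws areal-level draws}.
    """
    point_draws = np.asarray(point_draws, dtype=float)
    labels = np.asarray(labels)
    return {
        int(u): point_draws[:, labels == u].mean(axis=1)
        for u in np.unique(labels)
    }

rng = np.random.default_rng(0)
# Stand-in for stored draws: 500 draws at 6 grid points (not real output).
draws = rng.normal(loc=[1, 1, 2, 2, 3, 3], scale=0.1, size=(500, 6))

# Two hypothetical target supports reuse the same stored draws.
support_a = np.array([0, 0, 0, 1, 1, 1])
support_b = np.array([0, 0, 1, 1, 2, 2])
agg_a = aggregate_draws(draws, support_a)
agg_b = aggregate_draws(draws, support_b)

print({u: round(float(d.mean()), 2) for u, d in agg_a.items()})
print({u: round(float(d.mean()), 2) for u, d in agg_b.items()})
```

Switching from `support_a` to `support_b` only changes the labels passed to the aggregation step; the point-level draws are not regenerated, which is the sense in which no additional MCMC run is needed for a new target support.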

The spatial COS problem becomes compounded when one recognizes that survey estimates often include measures of sampling variability (i.e., survey variances). Incorporating this additional source of information is straightforward in the Gaussian spatial linear model setting (Porter et al., 2014) if one is willing to assume that the survey variances are equal to the true and unobserved variances of the survey estimates. In the Bayesian context, one can easily drop the assumption of known variances by modeling the sampling variability of the survey variances themselves. By specifying a model for the survey variances, standard MCMC techniques can be used to sample from the distribution of the latent spatial process given both the survey data and the survey variances. Recently, Bradley et al. (2014a) used the “bottom-up” approach for spatial change of support of survey count data, and incorporated survey variances using this Bayesian approach.

Considering the efforts needed for spatial COS, one might ask: For a given variable, on what spatial support should a data user perform inference? This is known as “regionalization” (Openshaw, 1977; Spielman, 2014), and the answer to this inferential question is intimately related to the modifiable areal unit problem (MAUP). For example, in Figure 1 we plot 2013 ACS 5-year period estimates of median household income over selected states. State-level ACS estimates exhibit noticeable MAUP error. For example, Figure 1(b) suggests that households in Virginia have moderately high income; however, the county-level estimates in Figure 1(a) show that households in counties near Richmond have high income while southern Virginian counties have low income. We can visually assess the MAUP error in Virginia (and in other states where it is obvious upon comparison of Figures 1(a) and 1(b)); however, we would prefer to quantify this error.

Figure 1: ACS 5-year period estimates of median household income for 2013 over selected states. Panel (a) displays ACS estimates by county, and panel (b) displays ACS estimates by state. The state boundaries are overlaid in each panel as a reference. The color scales are different for each panel.

To see how one might quantify the MAUP, consider the example presented in Figure 2. Here, we specify a 2 by 2 grid in Figure 2(a) and, in the remaining panels, show regionalizations that have 2 or 3 areal units. Many regionalizations tell a different story from the one told in panel (a): that the largest value is in the lower right-hand corner of the grid, the smallest value is in the upper left-hand corner, and intermediate values are in the remaining grid cells. For some regionalizations, one might reach this same conclusion but still have noticeable MAUP error. For example, Figure 2(e) shows that larger (smaller) values are in the lower right-hand (upper left-hand) corner of the grid, but the functional complexity of the values in Figure 2(a) is missing.

Figure 2: Example. In panel (a) the values are (from left to right, top to bottom) 0, 5, 6, and 10. Different regionalizations based on 2 or 3 areal units are given in the remaining panels. Panels (b) through (h) assume 2 areal units, and panels (i) through (n) assume 3 areal units.

For this example, the regionalization that is closest in squared error to panel (a) is given by panel (i), where the top-right and bottom-left grid cells form one areal unit and the sum (over the four grid cells) of squared differences between 2(a) and 2(i) is 0.5: the merged unit takes the average value 5.5, so each of its two cells contributes 0.25. Hence, for this example, spatial aggregation error is quantified by the sum of squared differences between the finest-resolution process (i.e., panel 2(a)) and the spatial process at coarser resolutions (i.e., panels (b) through (n) of Figure 2), and the preferred regionalization is the one that minimizes it. Bradley et al. (2014b) formalized this idea to quantify the MAUP in a more general setting, where the finest-resolution process can be observed at the point level. In particular, they define the criterion for spatial aggregation error (CAGE), which they minimize to obtain an optimal regionalization. The intuition behind the use of CAGE for regionalization follows the same logic used to select 2(i), since CAGE can be written as the integral of the squared difference between a point-level spatial process and an areal spatial process.
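The selection of panel (i) can be checked by brute force. The sketch below is only an illustration of the toy example, not the CAGE procedure of Bradley et al. (2014b): it assumes each areal unit takes the average of its grid cells, enumerates every way of grouping the four cells into 2 or 3 units (ignoring spatial contiguity), and reports the grouping with the smallest sum of squared differences from panel (a). The helper names `set_partitions` and `aggregation_error` and the cell labels are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition

def aggregation_error(values, partition):
    """Sum of squared differences between cell values and their unit means."""
    total = 0.0
    for block in partition:
        mean = sum(values[c] for c in block) / len(block)
        total += sum((values[c] - mean) ** 2 for c in block)
    return total

# Cell values from panel (a), read left to right, top to bottom.
values = {"top-left": 0, "top-right": 5, "bottom-left": 6, "bottom-right": 10}
cells = list(values)

# Keep only regionalizations with 2 or 3 areal units, as in Figure 2.
candidates = [p for p in set_partitions(cells) if len(p) in (2, 3)]
best = min(candidates, key=lambda p: aggregation_error(values, p))

print(len(candidates))  # 7 two-unit + 6 three-unit groupings = 13
print(best, aggregation_error(values, best))
```

The 13 candidates match the 13 panels (b) through (n), and the minimizer groups the top-right and bottom-left cells into one unit with an error of 0.5, in agreement with the text.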

Notice that the recent developments in COS for survey data and in regionalization were established for the spatial-only setting; however, these problems manifest in the spatio-temporal setting as well. In the spatio-temporal COS context, we have developed a similar approach and applied it to the ACS period estimates. Nevertheless, spatio-temporal COS is an area of ongoing research.

**References**

1) Bradley, J., Wikle, C. K., and Holan, S. H. (2014a). “Bayesian spatial change of support for count-valued survey data.” arXiv preprint: 1405.7227.

2) Bradley, J., Wikle, C. K., and Holan, S. H. (2014b). “Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error.” arXiv preprint: 1502.01974.

3) Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Hoboken, NJ: Wiley.

4) Cressie, N. (1993). Statistics for Spatial Data, rev. edn. New York, NY: Wiley.

5) Porter, A. T., Holan, S. H., Wikle, C. K., and Cressie, N. (2014). “Spatial Fay-Herriot Models for Small Area Estimation with Functional Covariates.” Spatial Statistics, 10, 27–42.

6) Mugglin, A. and Carlin, B. (1998). “Hierarchical modeling in Geographic Information Systems: Population interpolation over incompatible zones.” Journal of Agricultural, Biological, and Environmental Statistics, 3, 111–130.

7) Openshaw, S. (1977). “A geographical solution to scale and aggregation problems in region-building, partitioning and spatial modelling.” Transactions of the Institute of British Geographers, 2, 459–472.

8) Spielman, S. and Logan, J. (2014). “Reducing Uncertainty in the American Community Survey through Data-Driven Regionalization.” PLOS ONE, in press.

9) Waller, L. and Gotway, C. (2004). Applied Spatial Statistics for Public Health Data. New York: Wiley.

\* (to whom correspondence should be addressed) Department of Statistics, University of Missouri, 146 Middlebush Hall, Columbia, MO 65211, bradleyjr@missouri.edu

\*\* Department of Statistics, University of Missouri, 146 Middlebush Hall, Columbia, MO 65211-6100

\*\*\* Department of Statistics, University of Missouri, 146 Middlebush Hall, Columbia, MO 65211-6100
