Author: Carlos Grajales
A year ago this month, Donald Trump became the 45th President of the United States of America. As a statistician, I am currently reviewing five poll results dating from September 2016, all concerning that presidential race [1]. As you would expect, results from different pollsters vary somewhat. Four of the five polls have Hillary Clinton winning: the first gives the former secretary of state a three-point advantage, another pair shows a one-point margin, and the fourth shows Clinton with a wider four-point lead, while the fifth poll has Trump ahead by one point. Anyone with some notion of survey methodology could reasonably argue that those differences are attributable to sampling error: different selected samples, each with its own sampling distribution for the estimators. But let me assure you that this is not the case, as there is something quite remarkable about these results: they are all based on the same raw data. Same sample, same interviews. We are seeing five different election outcomes, and the differences are attributable to one of the least understood universal practices of the polling industry: weighting.
The American Association for Public Opinion Research (AAPOR) defines weighting as adjustments to poll data made in an attempt to ensure that the sample more accurately reflects the characteristics of the population from which it was drawn [2]. In fact, weights are not exclusive to political polls; they are also very important in most official surveys around the globe, as any survey statistician in the world knows [3]. The main reason weighting became ubiquitous is that, over five decades ago, a remarkable statistician named Leslie Kish showed us that random samples can be made practical thanks to the use of weighting [4].
Kish was a Hungarian-American statistician, born in 1910 in what is now Slovakia. He and his family migrated to the United States in 1925 [5]. Kish's life was rather interesting for a survey researcher: he fought in the Spanish Civil War with the International Brigades and volunteered again in World War II. In 1947 he moved to the University of Michigan at Ann Arbor, where he became one of the founders of the Institute for Social Research.
Kish's work at the University of Michigan drastically changed the way surveys are done, even today [6]. Before his work, most studies were conducted on non-probabilistic samples, which, by definition, are not fully representative of the population. Researchers used quotas and other methods to improve their results, but probabilistic sampling, a method only a few statisticians were using at the time, was widely disregarded. Kish, a fervent supporter of probabilistic sampling, promoted it in a way that we pollsters still use for marketing to this day: by forecasting elections. Kish was among the first pollsters ever to predict election results using a probabilistic poll, when in 1948 he and his team at Michigan published a poll predicting a Truman win in that year's U.S. presidential election, even as news media and other pollsters were saying that Dewey would win easily.
As Kish defined them, weights are a fundamental concept in the jargon of probabilistic samples: they are primarily used to ensure that every respondent is given the appropriate importance in the estimation, according to their chance of being in the sample. As such, a person with a low probability of being sampled, i.e. someone hard to reach, is weighted higher than someone more likely to appear in the sample. For those not familiar with sampling, the idea of different selection probabilities might seem odd, but the concept is closely related to the basic idea of sampling. Consider first a simple random sample, one in which every member of the population has the same probability of being part of your sample (which means every sample of the same size has the same chance of being selected). It is like those examples with a little urn holding the names of all individuals in the population: you draw names out of the urn, add them to your sample, and return each name to the urn before the next draw. Under this scenario, weighting is unnecessary for most estimation, as every selection probability is equal.
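To make the urn idea concrete, here is a minimal Python sketch; the population and sample sizes are invented for illustration, and a real sampling frame is of course not a list of names in memory:

```python
import random

N = 66_000_000  # illustrative frame size, roughly the U.K. population
n = 1_000       # target sample size

# Simple random sample: every individual has the same selection probability n/N.
sample_ids = random.sample(range(N), n)

# The design weight is the inverse of that probability, N/n. It is the same
# constant for every respondent, so weighting changes no estimate here.
weight = N / n
print(f"every respondent carries a weight of {weight:g}")  # 66000
```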
In practice, such a means of selecting a sample is impractical, to say the least. Just think about sampling the whole U.K. population with face-to-face interviews (almost certainly the only method that would guarantee a chance to reach everyone). This would mean writing 66 million names on paper. Well, to be fair, we statisticians are so lazy we would end up selecting at random from a dataset holding all the names. Still, after drawing such a sample, we would end up with about a thousand names, surely scattered all around Britain, Northern Ireland and smaller islands. You might end up visiting a town for only a couple of interviews. You might need to travel to the Channel Islands for a single interview. I am not quite familiar with travel costs around Britain, but this idea surely sounds expensive. To reduce costs, an effective and commonly used scheme is cluster sampling.
Cluster sampling is like simple random sampling, but cheating. Instead of randomly selecting persons for our U.K. survey, we randomly select groups of people. How these groups are defined is up to the researcher, but postal-code areas, counties, statistically defined areas and electoral districts are all popular choices. Once you have randomly selected the clusters for your sample, an additional selection step picks people within those clusters, which is why this procedure is also called multi-stage sampling. The advantage of using clusters lies in the fact that they are usually geographically constrained, i.e. people in the same cluster tend to live close to each other, which makes fieldwork more cost-effective. Sadly, this also means the probability of selection is no longer the same for every individual. It also means there is a chance your sample will no longer produce independent observations, though I won't go into much detail on that one. Since selection probabilities may differ between interviewees, this has to be accounted for in the estimation, and weighting is the appropriate adjustment. These kinds of weights, usually called sampling weights, are calculated as the inverse of the probability of selection. Even though that may sound easy, for some sampling designs the calculation is far from trivial and can require a detailed 36-page methodology document just to justify it [7].
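As a rough sketch of where such weights come from, assume a hypothetical two-stage design: clusters drawn by simple random sampling, then a fixed number of interviews per selected cluster. All the numbers below are invented, and real designs (with stratification or probability-proportional-to-size selection) are far messier, hence the 36-page manuals:

```python
TOTAL_CLUSTERS = 9_000       # e.g. postal-code areas in the frame (invented)
SAMPLED_CLUSTERS = 150       # clusters drawn at random in stage one
INTERVIEWS_PER_CLUSTER = 7   # people selected within each cluster, stage two

def sampling_weight(cluster_population: int) -> float:
    """Inverse of one respondent's overall probability of selection."""
    p_cluster = SAMPLED_CLUSTERS / TOTAL_CLUSTERS           # stage one
    p_person = INTERVIEWS_PER_CLUSTER / cluster_population  # stage two
    return 1.0 / (p_cluster * p_person)

# Someone in a big cluster had a smaller chance of being reached,
# so they carry a larger weight:
print(sampling_weight(cluster_population=500))    # ~4286
print(sampling_weight(cluster_population=2_000))  # ~17143
```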
Almost all household surveys conducted worldwide use some form of complex design, so weighting is an integral part of every major survey. Some designs are even planned so that all the weights come out equal, a self-weighting sample, avoiding the complications and calculations. Still, these kinds of weights, probability weights, are not the reason why five pollsters gave different results from the same raw data.
In fact, not all weighting schemes relate to the probability of selection. To be accurate, the kind of weighting U.S. pollsters did to alter their results so much is called post-stratification, and, technically, with it they are not calculating weights per se; they are merely adjusting them. Most U.S. election polls are telephone interviews. Since an interview costs almost the same in Hawaii as in New York, pollsters have few incentives to use a complex design, so they assume equal probabilities of selection, which means the sampling weights can be set to any constant, usually 1. What they do next is something also called "weight calibration". The AAPOR describes it as "making adjustments after data are collected to bring certain features of the sample into line with other known characteristics of the population" [2]. This is where things get tricky, but let me briefly explain what they are doing.
Weight calibration is a popular way of correcting two common problems in surveys: non-response and under-coverage. The first relates to the fact that some groups tend to be more willing to participate in polls than others. With door-to-door surveys in Mexico, for example, although women are making great advances in their rights, cultural expectations still dictate that many remain at home, so most surveys there end up with more women than men among their respondents. The second issue is something pollsters like to discuss less, particularly phone-based pollsters: you cannot reach everyone with a survey. Household surveys are the international standard because you can reach almost everyone at home, whether or not they have internet access or a cell phone (though, of course, you cannot interview homeless people, a particularly complicated population to study). Phone polls, by contrast, cannot reach any home without a landline; there's simply no way. Weight calibration can help here by giving more weight to the people you notice you are not reaching, though the method is not infallible, and different ways of using it can yield drastically different results, as we have already seen.
The most common form of weight calibration is post-stratification: adjusting the weights so that the proportions estimated from the sample match the distribution of the population on some key characteristics, usually demographics. The adjustment is a simple arithmetic correction. Imagine you have the following distribution in your sample:
Sex | Sample Distribution
--- | ---
Male | 41%
Female | 59%
I've seen numbers like those a lot in my job, as women tend to participate in surveys more than men. According to official data in Mexico, about 52% of the population is actually female, so our sample is a bit off. Since we happened to get more women, we will have to weight them a bit lower than the men. How much lower depends on how different our sample is from the actual population. In this case, the following adjustment would make our sample distribution fit the population totals:
Sex | Sample Distribution | Pop Distribution | Weight Adjustment | Calibrated Weight
--- | --- | --- | --- | ---
Male | 41% | 48% | 0.48/0.41 | 1.2
Female | 59% | 52% | 0.52/0.59 | 0.9
A calibrated weight of 1.2 means that each man in our sample counts as 1.2 observations, a bit more than expected, given that we interviewed fewer men than we would have wished. Each woman, in turn, is weighted a bit below one.
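Here is the same arithmetic in a minimal Python sketch; the dictionaries simply restate the table above, and the respondent names are made up:

```python
# Post-stratification: rescale each group's weight by population share / sample share.
sample_share = {"male": 0.41, "female": 0.59}  # observed in the poll
pop_share = {"male": 0.48, "female": 0.52}     # known from census data

adjustment = {g: pop_share[g] / sample_share[g] for g in sample_share}
print(adjustment)  # {'male': 1.17..., 'female': 0.88...}

# Applied to respondents who all start from a base weight of 1:
respondents = [("Ana", "female"), ("Luis", "male"), ("Sofia", "female")]
weights = {name: 1.0 * adjustment[sex] for name, sex in respondents}
print(weights)  # Luis counts as ~1.2 observations, Ana and Sofia as ~0.9 each
```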
You don't have to choose just one variable to calibrate weights. Most people use gender and age, though some pollsters also consider education, marital status or occupation. In theory, you should only post-stratify on variables you know are highly correlated with the variables you are estimating. In practice, it is hard to know what is, or will be, correlated with election results. The AAPOR's 2016 election report argued that one of the reasons Trump's strength was underestimated in some states was the lack of weight calibration based on education [8] [9]. Another risk lies in the fact that using too many groups for the adjustment can produce unstable weights. This happened in the 2016 election as well: in one national poll, a single Trump supporter in Illinois was weighted as much as 30 times more than the average respondent, and as much as 300 times more than the least-weighted respondent [3]. Some pollsters trim the weights to ensure they cannot exceed a defined threshold; others decide not to. There is really no consensus among pollsters on the best way to calibrate weights, and many keep the details of their methodology a professional secret (see the sketch below for one illustrative trimming rule).
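Because there is no consensus, any trimming rule is a judgment call. The sketch below caps weights at five times the median and rescales them to average one; both choices are arbitrary illustrations, not any pollster's actual rule:

```python
import statistics

def trim_weights(weights, cap_ratio=5.0):
    """Cap any weight at cap_ratio times the median, then rescale so the
    trimmed weights average one. Both steps are illustrative conventions."""
    cap = cap_ratio * statistics.median(weights)
    trimmed = [min(w, cap) for w in weights]
    mean_trimmed = sum(trimmed) / len(trimmed)
    return [w / mean_trimmed for w in trimmed]

# One extreme respondent, echoing the Illinois case:
print(trim_weights([1.0, 0.8, 1.2, 1.0, 0.9, 30.0]))
```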
Defining which variables you will use to adjust the weights isn't the only concern: you also have to decide which population totals you adjust to, something remarkably hard for election polls. You could match your poll's distribution to the national population, using the latest available census data, which sounds fairly appropriate. However, remember that we are interested in voters rather than in the total population, so a more sensible decision might be to match the distribution of the last election's results, or perhaps voter registration data, or even the latest exit poll numbers. You decide.
So, if you thought we pollsters had it easy after all these decisions, there is yet another methodological choice that modifies the weights: the likely-voter adjustment. Let me tell you a secret about this one: it is where we pollsters agree the least. Some don't use the adjustment at all, some use models to calculate it, and some use self-reported responses from the poll itself [1]. The procedure further adjusts the weights (even after post-stratification) by a factor that reduces the weight of the people we believe are least likely to vote and increases the importance in the sample of those more likely to cast a ballot. This makes sense in practice and has proven a reliable way of adjusting for voter turnout, yet how useful it is varies from place to place and even from election to election. Besides, this is another "secret sauce" for most pollsters, so we hardly ever get the chance to discuss how such adjustments are made or how good they are. Still, it is worth noting that all of these weight adjustments are based on sound methodology and are not exclusive to election polls. Both weighting and weight calibration are standard practice in most official surveys: the EU Labour Force Survey relies on a multi-stage design with sampling weights calculated as the inverse of the probability of selection, plus further calibration on variables such as sex, age and region [10].
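Mechanically, the likely-voter step is just one more multiplication. In this hypothetical sketch the turnout probabilities are invented stand-ins for whatever a pollster's model or self-reported measure would produce:

```python
# (post-stratified weight, estimated probability of voting) per respondent;
# the probabilities here are invented for illustration.
respondents = [
    (1.2, 0.9),  # very likely voter
    (0.9, 0.4),  # unlikely voter
    (1.0, 0.7),
]

# Likely-voter adjustment: scale each weight by the turnout probability,
# shrinking the influence of those least likely to cast a vote.
lv_weights = [w * p_vote for w, p_vote in respondents]
print(lv_weights)  # roughly [1.08, 0.36, 0.7]
```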
Weighting is such an important part of the estimation process, and can have such a huge impact on poll results, that it is a bit shocking it remains one of the least understood parts of survey methodology. Weighting does not mean arbitrarily modifying results, or using your partisan ideals to move the outcome the way you see fit. It is a concerted effort by survey professionals to ensure results are accurate and unbiased. The science behind weighting adjustments is not perfect; as I have explained, there is still a lack of consensus on many of the procedures in use. Still, understanding what weights are and what they mean is crucial for everyone interested in the field. Everyone who follows election results, from journalists to analysts, needs to realize that this procedure is one of the most important differences between pollsters (only the sample selection procedure might matter as much). In an industry that has focused too much on phrasing questions that make headlines, these methodological differences remain our main obligation to the audience; if only we pollsters could talk more about this.
[1] Cohn, Nate. We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results. The New York Times (Sept, 2016)
https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?mcubz=0
[2] Weighting. AAPOR Education for Researchers Website, 2017
http://www.aapor.org/Education-Resources/For-Researchers/Poll-Survey-FAQ/Weighting.aspx
[3] Cohn, Nate. How One 19-Year-Old Illinois Man Is Distorting National Polling Averages. The New York Times (Oct, 2016)
https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html?mcubz=0
[4] Kish, Leslie. Survey Sampling. New York: Wiley (1965)
[5] Fellegi, Ivan. Leslie Kish (1910-2000). International Statistical Institute (ISI) Newsletter, Volume 25, No. 73
https://ww2.amstat.org/about/statisticiansinhistory/index.cfm?fuseaction=biosinfo&BioID=9
[6] Pace, Eric. Leslie Kish, 90; Improved Science of Surveys. The New York Times (Oct, 2000)
http://www.nytimes.com/2000/10/14/us/leslie-kish-90-improved-science-of-surveys.html?mcubz=0&module=ArrowsNav&contentCollection=U.S.&action=keypress&region=FixedLeft&pgtype=article
[7] Global Adult Tobacco Survey Collaborative Group. Global Adult Tobacco Survey (GATS): Sample Weights Manual, Version 2.0. Atlanta, GA: Centers for Disease Control and Prevention, 2010.
http://www.who.int/tobacco/surveillance/9_GATS_SampleWeightsManual_v2_FINAL_15Dec2010.pdf?ua=1
[8] An Evaluation of 2016 Election Polls in the U.S. AAPOR. Ad Hoc Committee on 2016 Election Polling (May, 2017)
http://www.aapor.org/Education-Resources/Reports/An-Evaluation-of-2016-Election-Polls-in-the-U-S.aspx
[9] Grajales, Carlos. Why did polls fail to predict Trump’s election? StatisticsViews Website (June, 2017)
http://www.statisticsviews.com/details/news/10543642/Why-did-polls-fail-to-predict-Trumps-election.html?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_recent_activity_details_shares%3BiX0n4HDqSL6jvrKk%2BLMxSg%3D%3D
[10] Quality report of the European Union Labour Force Survey. Eurostat Methodologies and Working Papers. 2013 Edition
http://ec.europa.eu/eurostat/documents/3888793/5856989/KS-RA-13-008-EN.PDF/0dbeb003-d79b-4e16-a825-683ea4eebb0a