Does sample size matter?


  • Author: Dr Olena Kaminska
  • Date: 19 Apr 2013
  • Copyright: Image appears courtesy of iStock Photo.

In the field of social statistics one gets used to working with social surveys with sample sizes of between 1,000 and 2,000 respondents. Regardless of the mode of data collection (telephone, face-to-face, postal or web), and therefore of whether clustering or stratification is involved, a sample size in this range has proven very practical for the social sciences. It provides sufficient statistical power for many estimates for the whole population, as well as for comparisons of major subgroups.
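As a rough illustration of why this range is practical, the sketch below computes the approximate 95% margin of error for an estimated proportion. It assumes simple random sampling and the worst case of a 50% proportion; the function name `margin_of_error` is introduced here for illustration only, and real surveys with clustering or weighting would typically have somewhat wider margins.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes simple random sampling; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1000, 2000):
    moe = margin_of_error(n)
    print(f"n = {n}: +/- {100 * moe:.1f} percentage points")
```

Under these assumptions the margin is about 3.1 percentage points for 1,000 respondents and about 2.2 for 2,000, so doubling the sample narrows the margin only by a factor of about 1.4.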


Social surveys are often conducted by first defining the population of interest and then selecting and interviewing 1,000-2,000 respondents from it. While this sounds straightforward in theory, for some populations it is not so simple in practice. For example, the National Immunization Survey (NIS) in the US collects information about children between 19 and 35 months of age. This is a rare group, and there is no good sampling frame listing children of this age, or households with such children, together with their contact information. The National Opinion Research Center (NORC) at the University of Chicago, which collects the survey data for this study, uses a screening method to find households with children in the age range of interest. Essentially, the organization calls random telephone numbers and asks each household whether a child between 19 and 35 months lives there. If so, the child's parents are interviewed; if not, another number is dialled. According to the NORC report, in 2011 the organization successfully reached and screened 1,141,212 households via landline phones. Of these, 1,113,511 did not have a child in the age range of interest, and only 27,701 had such a child and were eligible for an interview (NORC, 2011). With an eligibility rate of only about 2.4%, data collection becomes very costly. Even if the study aimed for just 2,000 households with children in the age range, the data collection organization would have needed to successfully screen more than 82,000 households. Similar challenges arise in studies of people with diabetes, ethnic minority groups and children of lone mothers, among many other groups about which policy needs to be informed.
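The screening arithmetic can be reproduced in a few lines. The sketch below uses only the NORC (2011) counts quoted above; the variable names are illustrative. It computes the required number of screened households twice: once with the eligibility rate taken as 2.42%, which gives the figure of 82,645, and once with the unrounded rate, which gives a slightly smaller number.

```python
import math

# Counts quoted from the NORC (2011) report
screened = 1_141_212      # households reached and screened
not_eligible = 1_113_511  # no child aged 19-35 months
eligible = 27_701         # at least one child aged 19-35 months
target = 2_000            # desired number of eligible households

# The two groups should account for every screened household
assert eligible + not_eligible == screened

rate = eligible / screened
print(f"Eligibility rate: {100 * rate:.2f}%")

# Households to screen so that the expected eligible count reaches the target
needed_quoted = math.ceil(target / 0.0242)  # rate taken as 2.42%
needed_unrounded = math.ceil(target / rate)  # rate left unrounded

print(f"Screens needed at 2.42%: {needed_quoted}")
print(f"Screens needed at the unrounded rate: {needed_unrounded}")
```

Either way, more than 40 households must be screened for every eligible household found.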

An alternative often used when the budget is limited is non-random data collection. For example, patients with a particular diagnosis in a particular hospital may be interviewed and followed to examine the recovery process. Alternatively, one immigrant can be asked to name others, and each of those can be asked to name more, thus reaching more members of the group (snowball sampling). Because such samples are not probability samples, it is unpredictable how well their results generalize to the population of interest.


An interesting solution can be found when we think about this issue from a broader perspective. If a government wants specific information on rare subgroups, it may not make sense to launch 20 very expensive studies, each looking for a different rare subgroup. Instead, one may field a single large survey covering the general population, with a sample size large enough to represent the smaller rare groups. For example, a survey of 82,645 households from the US general population can be expected to contain about 2,000 households with children between 19 and 35 months. So, instead of hanging up each time we reach a non-eligible household, we could interview it. This way we can also obtain a sufficient number of interviews from people with diabetes or with any other rare attribute.
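One way to gauge how likely such a sample is to reach the 2,000 target is a simple binomial model. The sketch below assumes that each household is independently eligible with the rate observed in the 2011 NIS landline screening, and it uses a normal approximation with continuity correction. The helper `prob_at_least` is introduced here for illustration and is not part of any survey software.

```python
import math

def prob_at_least(k, n, p):
    """Normal approximation to P(X >= k) for X ~ Binomial(n, p),
    using a continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mean) / sd
    # Upper-tail probability of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))

n_households = 82_645
rate = 27_701 / 1_141_212  # eligibility rate from the NORC (2011) counts

expected = n_households * rate
chance = prob_at_least(2_000, n_households, rate)

print(f"Expected eligible households: {expected:.0f}")
print(f"Approximate chance of at least 2,000: {chance:.2f}")
```

Under these assumptions the expected count is just above 2,000 and the chance of reaching the target is a little over one half, so a design that must guarantee 2,000 eligible households would need a somewhat larger sample.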

Such a model was used to design the UK Household Longitudinal Study (UKHLS), which follows and interviews over 70,000 respondents each year. The study covers the UK general population and has sample boosts to provide better statistical power for the analysis of Northern Ireland and of five ethnic minority groups. Because the study is longitudinal, it can spread some question modules across the years, thereby avoiding the limitations of a single interview. So, with an abundance of detailed information on health and wellbeing, financial situation, life satisfaction and many other aspects of respondents’ lives, this one study can give information on many of those rare subgroups that would be unaffordable to study on their own.

Table 1 shows the counts of respondents with different characteristics. Looking at the age distribution, we find that the study has between 700 and 1,100 members born in each year between 1945 and 2009, and at least 500 born in each year between 1936 and 1944. This provides good statistical power for studying narrow age cohorts, including those born in a particular year. Thanks to the ethnic minority sample boost, the study also interviews over 1,000 Indians, over 1,000 Bangladeshis, and over 800 each of Pakistanis, Afro-Caribbeans and Africans.


Yet what is more interesting is to explore the numbers for rare subgroups that are of particular policy and research interest but have few, if any, alternative sources of data. We find, for example, that there are 2,520 adult respondents (over 15 years old) who have been diagnosed with diabetes and 730 who have been diagnosed with cancer. Additionally, there are 1,385 respondents with substantial sight difficulties. Each of these subgroups can be compared to others on a number of dimensions to understand the causes and policy needs of such subgroups. The study also identifies 1,160 respondents who first started smoking before reaching 12 years of age, and 2,580 respondents who currently smoke over 20 cigarettes per day. In addition, there are over 2,000 respondents who had an alcoholic drink on each of the 7 days before their interview.

We don’t have to limit our interest to social issues – there are a number of other interesting rare groups that are identified in the study. For example, among respondents we find over 1,500 people who have been to see ballet in the last year, over 1,000 who have gone horse riding, and a similar number who have played basketball. There are even 1,010 people who have written their own music, and 466 who practiced or learned circus skills in the last year. One can easily observe the pattern: large sample sizes allow us to study rare and extreme subgroups that have not been studied much before. And, importantly, the inferences would be representative of the corresponding subgroup in the population.

Table 1. Counts of respondents within different subgroups in UKHLS wave 2 dataset

Subgroup                                                         Count
Total                                                            72320
England                                                          55436
Wales                                                             5134
Scotland                                                          6334
Northern Ireland                                                  5416
1 year-olds                                                       1016
2 year-olds                                                       1057
3 year-olds                                                       1089
Over 85 years old                                                  931
Separated but legally married                                      957
No colour TV at home                                               917
No landline and no household member owning a mobile phone          638

Among adults (over 15 years old)
Ethnic group: Indian                                              1445
Ethnic group: Bangladeshi                                         1147
Ethnic group: Pakistani                                            834
Ethnic group: Caribbean                                            823
Ethnic group: African                                              991
Diagnosed with diabetes                                           2520
Diagnosed with cancer                                              730
Substantial sight difficulties                                    1385
People working from home                                          1077
Completely dissatisfied with job                                   554
Written music in past 12 months                                   1010
Learned or practiced circus skills in past 12 months               466
Attended ballet in past 12 months                                 1584
Gone horse-riding in past 12 months                               1090
Played basketball                                                 1317
Drank alcoholic drink each day in past week                       2162
Smokes over 20 cigarettes a day                                   2580
First started smoking before age of 12                            1160

*The information is based on wave 2 of UKHLS Original Sample Members.

Reference:
NORC (2011). National Immunization Survey: A User’s Guide for the 2011 Public-Use Data File. Chicago, IL: NORC at the University of Chicago. Accessed from ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NIS/NISPUF11_DUG.PDF

Dr Olena Kaminska is a Survey Statistician at the Institute of Social & Economic Research.


This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS), to whom Wiley and StatisticsViews.com express their gratitude. The panel members are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.