Sample size determination

Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is determined by the expense of data collection and the need to have sufficient statistical power. In complicated studies there may be several different sample sizes involved: for example, in a stratified survey there would be a different sample size for each stratum. In a census, data are collected on the entire population, hence the sample size is equal to the population size. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.

Sample sizes may be chosen in several different ways:

expedience – for example, including those items that are readily available or convenient to collect. A choice of small sample sizes, though sometimes necessary, can result in wide confidence intervals or risks of errors in statistical hypothesis testing.
using a target variance for an estimate to be derived from the sample eventually obtained.
using a target for the power of a statistical test to be applied once the sample is collected.
using a target confidence level: for a fixed precision requirement, a higher confidence level calls for a larger sample.

Introduction

Larger sample sizes generally lead to increased precision when estimating unknown parameters. For example, if we wish to know the proportion of a certain species of fish that is infected with a pathogen, we would generally have a more precise estimate of this proportion if we sampled and examined 200 rather than 100 fish. Several fundamental facts of mathematical statistics describe this phenomenon, including the law of large numbers and the central limit theorem.

In some situations, the increase in precision for larger sample sizes is minimal, or even non-existent. This can result from the presence of systematic errors or strong dependence in the data, or from data that follow a heavy-tailed distribution.

Sample sizes are judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test. For example, if we are comparing the support for a certain political candidate among women with the support for that candidate among men, we may wish to have 80% power to detect a difference in the support levels of 0.04 units.

Estimation

Estimation of a proportion

A relatively simple situation is estimation of a proportion. For example, we may wish to estimate the proportion of residents in a community who are at least 65 years old.

The estimator of a proportion is $\hat p = X/n$, where X is the number of 'positive' observations (e.g. the number of people out of the n sampled people who are at least 65 years old). When the observations are independent, this estimator has a (scaled) binomial distribution (and is also the sample mean of data from a Bernoulli distribution). The maximum variance of this distribution is 0.25/n, which occurs when the true parameter is p = 0.5. In practice, since p is unknown, the maximum variance is often used for sample size assessments.

For sufficiently large n, the distribution of ${\hat {p}}$ will be closely approximated by a normal distribution.^[1] Using this approximation, it can be shown that around 95% of this distribution's probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution, an interval of the form

\left({\hat {p}}-2{\sqrt {\frac {0.25}{n}}},{\hat {p}}+2{\sqrt {\frac {0.25}{n}}}\right)

will form a 95% confidence interval for the true proportion. If this interval needs to be no more than W units wide, the equation

4{\sqrt {\frac {0.25}{n}}}=W

can be solved for n, yielding^[2]^[3] n = 4/W² = 1/B² where B is the error bound on the estimate, i.e., the estimate is usually given as within ± B. So, for B = 10% one requires n = 100, for B = 5% one needs n = 400, for B = 3% the requirement approximates to n = 1000, while for B = 1% a sample size of n = 10000 is required. These numbers are quoted often in news reports of opinion polls and other sample surveys.
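
The rule n = 1/B² above can be checked with a few lines of Python. This is only a minimal sketch: the function name `sample_size_for_proportion` is illustrative rather than taken from any library, and it assumes the worst case p = 0.5 and the "2 standard deviations" approximation used in the text.

```python
import math


def sample_size_for_proportion(margin: float) -> int:
    """Smallest n with 1/n <= margin**2, i.e. n = 1 / B^2 rounded up.

    Assumes the worst-case variance 0.25/n (p = 0.5) and an approximate
    95% interval of +/- 2 standard deviations, as in the text.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be strictly between 0 and 1")
    # The tiny offset guards against floating-point noise such as
    # 1 / 0.1**2 evaluating to a hair above or below 100.
    return math.ceil(1.0 / margin ** 2 - 1e-9)


for b in (0.10, 0.05, 0.03, 0.01):
    print(f"B = {b:.0%}: n >= {sample_size_for_proportion(b)}")
```

For B = 3% the formula gives 1112, which is the figure usually rounded to "about 1000" in poll reports.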

Estimation of a mean

A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance σ², the standard error of the sample mean is:

{\frac {\sigma }{\sqrt {n}}}.

This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form

\left({\bar {x}}-{\frac {2\sigma }{\sqrt {n}}},{\bar {x}}+{\frac {2\sigma }{\sqrt {n}}}\right).

If we wish to have a confidence interval that is W units in width, we would solve

{\frac {4\sigma }{\sqrt {n}}}=W

for n, yielding the sample size n = 16σ²/W².

For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is six units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is 100.
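
The blood pressure example can be reproduced with a short sketch of n = 16σ²/W². The function name `sample_size_for_mean` is hypothetical, and the calculation assumes, as the text does, that σ is known and that the "2 standard errors" approximation to a 95% interval is acceptable.

```python
import math


def sample_size_for_mean(sigma: float, width: float) -> int:
    """Smallest n for which the interval x_bar +/- 2*sigma/sqrt(n)
    is at most `width` units wide, i.e. n = 16 * sigma^2 / W^2 rounded up."""
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # Small offset so exact results are not bumped up by rounding noise.
    return math.ceil(16.0 * sigma ** 2 / width ** 2 - 1e-9)


# Standard deviation 15, interval six units wide, as in the example above.
print(sample_size_for_mean(sigma=15, width=6))
```

Halving the desired width quadruples the required sample size, since n grows with 1/W².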

Required sample sizes for hypothesis tests

A common problem faced by statisticians is calculating the sample size required to yield a certain power for a test, given a predetermined Type I error rate α. As described below, this can be estimated from pre-determined tables for certain values, by Mead's resource equation, or, more generally, from the cumulative distribution function:

Tables

Sample sizes by power and Cohen's d^[4]
Power	d = 0.2	d = 0.5	d = 0.8
0.25	84	14	6
0.50	193	32	13
0.60	246	40	16
0.70	310	50	20
0.80	393	64	26
0.90	526	85	34
0.95	651	105	42
0.99	920	148	58

The table above can be used for a two-sample t-test to estimate the sample sizes of an experimental group and a control group of equal size; that is, the total number of individuals in the trial is twice the number given. The desired significance level is 0.05.^[4] The parameters used are:

The desired statistical power of the trial, shown in the leftmost column.
Cohen's d (the effect size), which is the expected difference between the means of the target values of the experimental group and the control group, divided by the expected standard deviation.
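
Values close to those in the table can be approximated with the standard normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)²/d² per group. The sketch below is not the method of the cited source: it assumes a two-sided test at significance level 0.05 (the text does not say whether the table is one- or two-sided), and the function name `n_per_group` is illustrative. Because it uses the normal rather than the t distribution, it tends to land one or two below the tabulated figures for larger effect sizes.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float, alpha: float = 0.05) -> int:
    """Approximate per-group sample size for comparing two means.

    Normal approximation, equal group sizes, two-sided test (an
    assumption made for this sketch, not stated in the text).
    """
    if d <= 0:
        raise ValueError("effect size d must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d, power=0.80)} per group")
```

At 80% power this gives 393, 63 and 25 per group, against 393, 64 and 26 in the table.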

Mead's resource equation

Mead's resource equation is often used for estimating sample sizes of laboratory animals, as well as in many other laboratory experiments. It may not be as accurate as other methods of estimating sample size, but it gives an indication of the appropriate sample size where parameters such as expected standard deviations or expected differences in values between groups are unknown or very hard to estimate.^[5]

All the parameters in the equation are degrees of freedom, so each one is entered as the corresponding count minus 1.

The equation is:^[5]

E = N - B - T,

where:

N is the total number of individuals or units in the study (minus 1)
B is the blocking component, representing environmental effects allowed for in the design (minus 1)
T is the treatment component, corresponding to the number of treatment groups (including control group) being used, or the number of questions being asked (minus 1)
E is the degrees of freedom of the error component, and should be somewhere between 10 and 20.

For example, if a study using laboratory animals is planned with four treatment groups (T=3), with eight animals per group, making 32 animals total (N=31), without any further stratification (B=0), then E would equal 28, which is above the cutoff of 20, indicating that sample size may be a bit too large, and six animals per group might be more appropriate.^[6]
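
The worked example can be written as a small helper. This is a sketch only: the function and argument names are hypothetical, and the "no further stratification" case is represented as a single block so that B = 0.

```python
def mead_error_df(total_units: int, treatment_groups: int, blocks: int = 1) -> int:
    """Error degrees of freedom E = N - B - T from Mead's resource equation.

    Each argument is a raw count; the 'minus 1' conversion to degrees of
    freedom is done here. blocks=1 means no blocking, giving B = 0.
    """
    n = total_units - 1
    b = blocks - 1
    t = treatment_groups - 1
    return n - b - t


# Four treatment groups of eight animals, no blocking: E = 31 - 0 - 3.
print(mead_error_df(total_units=32, treatment_groups=4))
# Six animals per group instead: E = 23 - 0 - 3.
print(mead_error_df(total_units=24, treatment_groups=4))
```

With six animals per group E drops from 28 to 20, the upper end of the suggested 10 to 20 range.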

Cumulative distribution function

Let X_i, i = 1, 2, ..., n be independent observations taken from a normal distribution with unknown mean μ and known variance σ². Let us consider two hypotheses, a null hypothesis:

H_0:\mu=0

and an alternative hypothesis:

H_a:\mu=\mu^*

for some 'smallest significant difference' μ^* >0. This is the smallest value for which we care about observing a difference. Now, if we wish to (1) reject H₀ with a probability of at least 1-β when H_a is true (i.e. a power of 1-β), and (2) reject H₀ with probability α when H₀ is true, then we need the following:

If z_α is the upper α percentage point of the standard normal distribution, then

\Pr(\bar x >z_{\alpha}\sigma/\sqrt{n}|H_0 \text{ true})=\alpha

and so

'Reject H₀ if our sample average ($\bar{x}$) is more than $z_{\alpha}\sigma/\sqrt{n}$' is a decision rule which satisfies (2). (Note that this is a one-tailed test.)

Now we wish for this to happen with a probability at least 1-β when H_a is true. In this case, our sample average will come from a Normal distribution with mean μ^*. Therefore, we require

\Pr(\bar x >z_{\alpha}\sigma/\sqrt{n}|H_a \text{ true})\geq 1-\beta

Through careful manipulation, this can be shown (see the example in the article on statistical power) to happen when

n \geq \left(\frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\mu^{*}/\sigma}\right)^2

where $\Phi$ is the normal cumulative distribution function.
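
The bound on n can be evaluated numerically. The sketch below uses `statistics.NormalDist` from the Python standard library for Φ and its inverse; the function names and the example values (α = 0.05, β = 0.20, μ*/σ = 0.5) are chosen for illustration and do not come from the text.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def required_n(mu_star: float, sigma: float,
               alpha: float = 0.05, beta: float = 0.20) -> int:
    """Smallest n satisfying n >= ((z_alpha + Phi^-1(1 - beta)) / (mu*/sigma))^2
    for the one-tailed test of H0: mu = 0 against Ha: mu = mu*."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # upper alpha point
    z_beta = STD_NORMAL.inv_cdf(1 - beta)    # Phi^-1(1 - beta)
    return math.ceil(((z_alpha + z_beta) / (mu_star / sigma)) ** 2)


def achieved_power(n: int, mu_star: float, sigma: float,
                   alpha: float = 0.05) -> float:
    """Pr(x_bar > z_alpha * sigma / sqrt(n)) when the true mean is mu*."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - mu_star * math.sqrt(n) / sigma)


n = required_n(mu_star=0.5, sigma=1.0)
print(n, round(achieved_power(n, 0.5, 1.0), 3))
```

Checking `achieved_power` at n and n - 1 is a quick way to confirm that the rounded-up n is the smallest sample size that reaches the target power.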

Stratified sample size

With more complicated sampling techniques, such as stratified sampling, the sample can often be split up into sub-samples. Typically, if there are H such sub-samples (from H different strata) then each of them will have a sample size n_h, h = 1, 2, ..., H. These n_h must conform to the rule that n₁ + n₂ + ... + n_H = n (i.e. that the total sample size is given by the sum of the sub-sample sizes). Selecting these n_h optimally can be done in various ways, using (for example) Neyman's optimal allocation.

There are many reasons to use stratified sampling:^[7] to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually. A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.^[8]

In general, for H strata, a weighted sample mean is

\bar x_w = \sum_{h=1}^H W_h \bar x_h,

with

\operatorname{Var}(\bar x_w) = \sum_{h=1}^H W_h^2 \,\operatorname{Var}(\bar x_h).

^[9]

The weights, $W_h$, frequently, but not always, represent the proportions of the population elements in the strata, and $W_h=N_h/N$. For a fixed sample size, that is $n = \sum{n_h}$,

\operatorname{Var}(\bar x_w) = \sum_{h=1}^H W_h^2 \,Var_h \left(\frac{1}{n_h} - \frac{1}{N_h}\right),

^[10]

which is minimized if the sampling rate within each stratum is made proportional to the standard deviation within each stratum: $n_h/N_h=k S_h$, where $S_h = \sqrt{Var_h}$ and $k$ is a constant such that $\sum{n_h} = n$.

An "optimum allocation" is reached when the sampling rates within the strata are made directly proportional to the standard deviations within the strata and inversely proportional to the square root of the sampling cost per element within the strata, $C_h$:

\frac{n_h}{N_h} = \frac{K S_h}{\sqrt{C_h}},

^[11]

where $K$ is a constant such that $\sum{n_h} = n$, or, more generally, when

n_h = \frac{K' W_h S_h}{\sqrt{C_h}}.

^[12]
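
The allocation rules above amount to making n_h proportional to N_h S_h / √C_h and scaling so that the n_h sum to n. The following sketch illustrates this under the assumption that W_h = N_h/N; the function name `allocate` and the example strata are hypothetical. It returns unrounded values and does not check that n_h stays below N_h, both of which a real design would need to handle.

```python
import math
from typing import Optional, Sequence


def allocate(n: int,
             stratum_sizes: Sequence[float],
             stratum_sds: Sequence[float],
             unit_costs: Optional[Sequence[float]] = None) -> list:
    """Split a total sample size n across strata with
    n_h proportional to N_h * S_h / sqrt(C_h).

    With unit_costs omitted, every C_h is 1 and the rule reduces to
    n_h / N_h = k * S_h.
    """
    if unit_costs is None:
        unit_costs = [1.0] * len(stratum_sizes)
    if not (len(stratum_sizes) == len(stratum_sds) == len(unit_costs)):
        raise ValueError("all stratum inputs must have the same length")
    scores = [N_h * S_h / math.sqrt(C_h)
              for N_h, S_h, C_h in zip(stratum_sizes, stratum_sds, unit_costs)]
    total = sum(scores)
    return [n * score / total for score in scores]


# Two hypothetical strata: the smaller one is twice as variable.
print(allocate(100, stratum_sizes=[1000, 2000], stratum_sds=[10, 5]))
# Same strata, but sampling in the second stratum costs four times as much.
print(allocate(100, [1000, 2000], [10, 5], unit_costs=[1, 4]))
```

In the first call both strata receive 50 units because N_h S_h is equal; adding the cost term shifts the allocation toward the cheaper stratum.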

Qualitative research

Sample size determination in qualitative studies takes a different approach. It is generally a subjective judgment, taken as the research proceeds.^[13] One approach is to continue to include further participants or material until saturation is reached.^[14] The number needed to reach saturation has been investigated empirically.^[15]^[16]^[17]^[18]

There is a paucity of reliable guidance on estimating sample sizes before starting the research, with a range of suggestions given.^[16]^[19]^[20]^[21] A tool akin to a quantitative power calculation, based on the negative binomial distribution, has been suggested for thematic analysis.^[22]^[21]

Software for power and sample size calculations

Numerous free and/or open source programs are available for performing power and sample size calculations. These include

G*Power (http://www.gpower.hhu.de/)
powerandsamplesize.com: free and open source online calculators
PS
PowerUp! provides convenient Excel-based functions to determine minimum detectable effect size and minimum required sample size for various experimental and quasi-experimental designs.
PowerUpR is an R package version of PowerUp! and additionally includes functions to determine sample size for various multilevel randomized experiments with or without budgetary constraints.
R package pwr
Russ Lenth's power and sample-size page
WebPower: free online statistical power analysis (http://webpower.psychstat.org)
SampSize app for Android and iOS (iPhone and iPad) (https://www.epigenesys.org.uk/portfolio/sampsize/)

Notes

[1] NIST/SEMATECH, "7.2.4.2. Sample sizes required", e-Handbook of Statistical Methods.
[2] "Inference for Regression". utdallas.edu.
[3] "Confidence Interval for a Proportion".
[4] Chapter 13, page 215, in: Kenny, David A. (1987). Statistics for the social and behavioral sciences. Boston: Little, Brown. ISBN 0-316-48915-8.
[5] Kirkwood, James; Robert Hubrecht (2010). The UFAW Handbook on the Care and Management of Laboratory and Other Research Animals. Wiley-Blackwell. p. 29. ISBN 1-4051-7523-0.
[6] Isogenic.info, "Resource equation", by Michael FW Festing. Updated Sept. 2006.
[7] Kish (1965), Section 3.1.
[8] Kish (1965), p. 148.
[9] Kish (1965), p. 78.
[10] Kish (1965), p. 81.
[11] Kish (1965), p. 93.
[12] Kish (1965), p. 94.
[13] Sandelowski, M. (1995). Sample size in qualitative research. Research in Nursing & Health, 18, 179–183.
[14] Glaser, B. (1965). The constant comparative method of qualitative analysis. Social Problems, 12, 436–445.
[15] Francis, J. J., Johnston, M., Robertson, C., Glidewell, L., Entwistle, V., Eccles, M. P., & Grimshaw, J. M. (2010). What is an adequate sample size? Operationalising data saturation for theory-based interview studies. Psychology and Health, 25, 1229–1245. doi:10.1080/08870440903194015
[16] Guest, G., Bunce, A., & Johnson, L. (2006). How many interviews are enough?: An experiment with data saturation and variability. Field Methods, 18, 59–82. doi:10.1177/1525822X05279903
[17] Wright, A., Maloney, F. L., & Feblowitz, J. C. (2011). Clinician attitudes toward and use of electronic problem lists: a thematic analysis. BMC Medical Informatics and Decision Making, 11, 36. doi:10.1186/1472-6947-11-36
[18] "Sample Size and Saturation in PhD Studies Using Qualitative Interviews – Mason – Forum Qualitative Sozialforschung / Forum: Qualitative Social Research". qualitative-research.net.
[19] Emmel, N. (2013). Sampling and choosing cases in qualitative research: A realist approach. London: Sage.
[20] Onwuegbuzie, A. J., & Leech, N. L. (2007). A call for qualitative power analyses. Quality & Quantity, 41, 105–121. doi:10.1007/s11135-005-1098-1
[21] Fugard, A. J. B.; Potts, H. W. W. (10 February 2015). "Supporting thinking on sample sizes for thematic analyses: A quantitative tool". International Journal of Social Research Methodology. doi:10.1080/13645579.2015.1005453.
[22] Galvin, R. (2015). How many interviews are enough? Do qualitative interviews in building energy consumption research produce reliable knowledge? Journal of Building Engineering, 1, 2–12.

References

Bartlett, J. E., II; Kotrlik, J. W.; Higgins, C. (2001). "Organizational research: Determining appropriate sample size for survey research" (PDF). Information Technology, Learning, and Performance Journal. 19 (1): 43–50.
Kish, L. (1965). Survey Sampling. Wiley. ISBN 0-471-48900-X.
Smith, Scott (8 April 2013). "Determining Sample Size: How to Ensure You Get the Correct Sample Size | Qualtrics". Qualtrics. Retrieved 15 November 2016.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.