Jackknife resampling

In statistics, the jackknife is a resampling technique especially useful for variance and bias estimation. The jackknife predates other common resampling methods such as the bootstrap. The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset and calculating the estimate and then finding the average of these calculations. Given a sample of size $N$ , the jackknife estimate is found by aggregating the estimates of each $N-1$ estimate in the sample.

The jackknife technique was developed by Maurice Quenouille (1949, 1956). John Tukey (1958) expanded on the technique and proposed the name "jackknife" since, like a Boy Scout's jackknife, it is a "rough and ready" tool that can solve a variety of problems even though specific problems may be more efficiently solved with a purpose-designed tool.^[1]

The jackknife is a linear approximation of the bootstrap.^[1]

Estimation

The jackknife estimate of a parameter can be found by estimating the parameter for each subsample omitting the ith observation to estimate the previously unknown value of a parameter (say $\bar{x}_i$ ).^[2]

\bar{x}_i =\frac{1}{n-1} \sum_{j \neq i}^n x_j

Variance estimation

An estimate of the variance of an estimator can be calculated using the jackknife technique.

\operatorname{Var}_\mathrm {(jackknife)}=\frac{n-1}{n} \sum_{i=1}^n (\bar{x}_i - \bar{x}_\mathrm{(.)})^2

where $\bar{x}_i$ is the parameter estimate based on leaving out the ith observation, and $\bar{x}_\mathrm{(.)}$ is the estimator based on all of the subsamples.^[3]

Bias estimation and correction

The jackknife technique can be used to estimate the bias of an estimator calculated over the entire sample. Say $\hat{\theta}$ is the calculated estimator parameter of interest based on all ${n}$ observations. Here,

\hat{\theta}_\mathrm{(.)}=\frac{1}{n} \sum_{i=1}^n \hat{\theta}_\mathrm{(i)}

where $\hat{\theta}_\mathrm{(i)}$ is the estimation of parameter of interest based on sample omitting ith observation (jackknife estimator), and $\hat{\theta}_\mathrm{(.)}$ is the mean jackknife estimator. The parameter bias estimate is,

\widehat{\text{Bias}}_\mathrm{(\theta)}=n\hat{\theta} - (n-1)\hat{\theta}_\mathrm{(.)}

This removes the bias in the special case that the bias is $O(N^{-1})$ and to $O(N^{-2})$ in other cases.^[1]

This provides an estimated correction of bias due to the estimation method. The jackknife does not correct for a biased sample.

Notes

References

Cameron, Adrian; Trivedi, Pravin K. (2005). Microeconometrics : methods and applications. Cambridge New York: Cambridge University Press. ISBN 9780521848053.
Efron, B.; Stein, C. (May 1981). "The Jackknife Estimate of Variance". The Annals of Statistics 9 (3): 586–596. doi:10.1214/aos/1176345462. JSTOR 2240822.
Efron, Bradley (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia, Pa: Society for Industrial and Applied Mathematics. ISBN 9781611970319.
Quenouille, M. H. (September 1949). "Problems in Plane Sampling". The Annals of Mathematical Statistics 20 (3): 355–375. doi:10.1214/aoms/1177729989. JSTOR 2236533.
Quenouille, M. H. (1956). "Notes on Bias in Estimation". Biometrika 43 (3-4): 353–360. doi:10.1093/biomet/43.3-4.353. JSTOR 2332914.
Tukey, J. W. (1958). "Bias and confidence in not quite large samples". The Annals of Mathematical Statistics 29: 614–623. doi:10.1214/aoms/1177706647. .

Statistics

Descriptive statistics

Continuous data

Location	Mean arithmetic geometric harmonic Median Mode

Dispersion	Range Standard deviation Coefficient of variation Percentile Interquartile range

Shape	Variance Skewness Kurtosis Moments L-moments

Count data

Index of dispersion

Summary tables

Dependence

Statistical graphics

Data collection

Study design	Effect size Standard error Statistical power Sample size determination

Survey methodology	Sampling stratified cluster Opinion poll Questionnaire

Controlled experiments	Design control optimal Controlled trial Randomized Random assignment Replication Blocking Factorial experiment

Uncontrolled studies	Observational study Natural experiment Quasi-experiment

Statistical inference

Statistical theory

Frequentist inference

Confidence interval Testing hypotheses Power

Unbiased estimators	Mean unbiased minimum-variance Median unbiased

Biased estimators	Maximum likelihood Method of moments Minimum distance Density estimation

Parametric tests	Likelihood-ratio Wald Score

Specific tests

Z (normal) Student's t-test F Shapiro–Wilk Kolmogorov–Smirnov

Goodness of fit	Chi-squared G Sample source (Anderson–Darling) Sample normality (Shapiro–Wilk) Skewness / kurtosis normality (Jarque-Bera) Model comparison (Likelihood-ratio) Model quality (Akaike criterion)

Signed-rank	1-sample (Wilcoxon) 2-sample (Mann–Whitney U) 1-way anova (Kruskal–Wallis)

Bayesian inference

Correlation	Pearson product–moment Partial correlation Confounding variable Coefficient of determination

Regression analysis	Errors and residuals Regression model validation Mixed effects models Simultaneous equations models Multivariate adaptive regression splines (MARS)

Linear regression	Simple linear regression Ordinary least squares General linear model Bayesian regression

Non-standard predictors	Nonlinear regression Nonparametric Semiparametric Isotonic Robust Heteroscedasticity Homoscedasticity

Generalized linear model	Exponential families Logistic (Bernoulli) / Binomial / Poisson regressions

Partition of variance	Analysis of variance (ANOVA, anova) Analysis of covariance Multivariate ANOVA Degrees of freedom

Categorical / Multivariate / Time-series / Survival analysis

Categorical

Multivariate

Time-series

General	Decomposition Trend Stationarity Seasonal adjustment Exponential smoothing Cointegration Structural break Granger causality

Specific tests	Dickey–Fuller Johansen Q-statistic (Ljung–Box) Durbin–Watson Breusch–Godfrey

Time domain	Autocorrelation (ACF) partial (PACF) Cross-correlation (XCF) ARMA model ARIMA model (Box–Jenkins) Autoregressive conditional heteroskedasticity (ARCH) Vector autoregression (VAR)

Frequency domain	Spectral density estimation Fourier analysis Wavelet

Survival

Survival function	Kaplan–Meier estimator (product limit) Proportional hazards models Accelerated failure time (AFT) model First hitting time

Hazard function	Nelson–Aalen estimator

Test	Log-rank test

Applications

Biostatistics	Bioinformatics Clinical trials / studies Epidemiology Medical statistics

Engineering statistics	Chemometrics Methods engineering Probabilistic design Process / quality control Reliability System identification

Social statistics	Actuarial science Census Crime statistics Demography Econometrics National accounts Official statistics Population Psychometrics

Spatial statistics	Cartography Environmental statistics Geographic information system Geostatistics Kriging

Category
Portal
Commons
WikiProject

This article is issued from Wikipedia - version of the Wednesday, February 03, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.