Instrumental variable

From Wikipedia, the free encyclopedia

In statistics, an instrumental variable (IV, or instrument) can be used in regression analysis to produce a consistent estimator when the explanatory variables (covariates) are correlated with the error term. Such correlation, known as endogeneity, can arise from simultaneity, from omitted covariates, or from measurement errors in the covariates. In this situation, ordinary linear regression generally produces biased and inconsistent estimates. However, if an instrument is available, consistent estimates may still be obtained. An instrument is a variable that does not itself belong in the regression, that is correlated with the suspect explanatory variable, and that is uncorrelated with the error term.

There are three main requirements for using an IV:

The instrument must be correlated with the model's predicting (explanatory) variable.
The instrument cannot be correlated with the error term in the second stage model (that is, the instrument cannot suffer from the same problem as the original predicting variable).
The instrument must act on the outcome only through the predicting variable, not directly.

1 Econometrics
2 Applications and problems
3 Hypothesis testing
4 References

Econometrics

The ordinary least squares estimator ($\widehat{\beta}_\mathrm{OLS}$) is used to estimate the mean structure of a model of the form

$y_i = \beta x_i + \varepsilon_i$

and takes the form

$\widehat{\beta}_\mathrm{OLS} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i x_i (x_i \beta + \varepsilon_i)}{\sum_i x_i^2} = \beta + \frac{\sum_i x_i \varepsilon_i}{\sum_i x_i^2}.$

When x and $\varepsilon$ are uncorrelated, the second term converges to zero as the number of sampled units increases, so the estimator is consistent; under the stronger condition that $\varepsilon$ has mean zero given x, it is also unbiased. When x and $\varepsilon$ are correlated, however, the estimator is biased and inconsistent.

An instrumental variable z is one that is correlated with the independent variable x but not with the error term. The IV estimator is

$\widehat{\beta}_\mathrm{IV} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i} = \frac{\sum_i z_i (x_i \beta + \varepsilon_i)}{\sum_i z_i x_i} = \beta + \frac{\sum_i z_i \varepsilon_i}{\sum_i z_i x_i}.$

When z and $\varepsilon$ are uncorrelated (and z and x are correlated, so that the denominator does not vanish), the final term approaches zero in the limit, providing a consistent estimator. Note that when x is uncorrelated with the error term, x is a valid instrument for itself. In this light, under certain assumptions, OLS is a special case of the IV estimator.
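The contrast between the two estimators can be checked numerically. The following is a minimal sketch on simulated data, not a reference implementation: the data-generating design (true $\beta = 2$, the coefficient 0.8 linking x to the error) and the function names `simulate`, `ols_slope`, and `iv_slope` are assumptions made for illustration. In this design the OLS estimate should settle near $2 + 0.8/2.64 \approx 2.30$, while the IV estimate should settle near 2.

```python
import random


def simulate(n=20000, beta=2.0, seed=0):
    """Draw (x, y, z) with an endogenous x and a valid instrument z (toy design)."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)    # instrument: drawn independently of the error
        eps = rng.gauss(0.0, 1.0)  # structural error
        x = z + 0.8 * eps + rng.gauss(0.0, 1.0)  # x depends on eps: endogenous
        xs.append(x)
        ys.append(beta * x + eps)
        zs.append(z)
    return xs, ys, zs


def ols_slope(x, y):
    """No-intercept OLS slope: sum(x*y) / sum(x*x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def iv_slope(x, y, z):
    """Simple IV slope: sum(z*y) / sum(z*x)."""
    return sum(c * b for c, b in zip(z, y)) / sum(c * a for c, a in zip(z, x))


x, y, z = simulate()
print("OLS:", ols_slope(x, y))    # pulled away from beta = 2 by the x-eps correlation
print("IV: ", iv_slope(x, y, z))  # close to beta = 2
print("x as its own instrument:", iv_slope(x, y, x))  # reproduces the OLS slope
```

Using x as its own instrument reproduces the OLS slope exactly, which is the sense in which OLS is a special case of IV.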

The approach above generalizes in a straightforward way to a regression with multiple explanatory variables. Suppose X is the $T \times K$ matrix of explanatory variables resulting from T observations on K variables. Let Z be a $T \times K$ matrix of instruments, so that there are exactly as many instruments as explanatory variables. Then

$\widehat{\beta}_\mathrm{IV} = (Z'X)^{-1}Z'Y = (Z'X)^{-1}Z'(X\beta+\varepsilon) = \beta + (Z'X)^{-1}Z'\varepsilon.$
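The matrix formula translates almost directly into NumPy. The sketch below assumes a just-identified setup with one endogenous and one exogenous covariate; the simulated data, the coefficient values, and the name `iv_estimator` are illustrative assumptions rather than part of the method.

```python
import numpy as np


def iv_estimator(Z, X, y):
    """Just-identified IV estimate (Z'X)^{-1} Z'y for T x K matrices Z and X."""
    if Z.shape != X.shape:
        raise ValueError("Z and X must both be T x K in the just-identified case")
    # Solve (Z'X) b = Z'y rather than forming the inverse explicitly.
    return np.linalg.solve(Z.T @ X, Z.T @ y)


# Hypothetical simulated data: x1 is endogenous, x2 is exogenous.
rng = np.random.default_rng(0)
T = 50_000
beta = np.array([1.5, -0.5])
z1 = rng.normal(size=T)                         # instrument for x1
x2 = rng.normal(size=T)                         # exogenous, so it instruments itself
eps = rng.normal(size=T)
x1 = 0.7 * z1 + 0.5 * eps + rng.normal(size=T)  # correlated with eps
X = np.column_stack([x1, x2])
Z = np.column_stack([z1, x2])
y = X @ beta + eps

beta_iv = iv_estimator(Z, X, y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print("IV: ", beta_iv)   # close to the true coefficients
print("OLS:", beta_ols)  # first coefficient is pulled away from 1.5
```

The exogenous covariate x2 appears in both X and Z, which is how a variable serves as its own instrument in the matrix setting.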

One computational method often used to implement the technique is two-stage least squares (2SLS). One advantage of this approach is that it can efficiently combine information from multiple instruments in over-identified regressions, where there are more instruments than covariates. Under the 2SLS approach, in the first stage, each endogenous covariate (predictor variable) is regressed on all valid instruments, including the full set of exogenous covariates in the main regression. Since the instruments are exogenous, the fitted values of the endogenous covariates are not correlated with the error term, so intuitively they provide a way to analyze the relationship between the outcome variable and the endogenous covariates. In the second stage, the regression of interest is estimated as usual, except that each endogenous covariate is replaced with its fitted value from the first stage. The slope estimator thus obtained is consistent. A small correction must be made to the sum of squared residuals in the second-stage fitted model so that the associated standard errors are computed correctly: the residuals should be formed from the original covariates rather than from their first-stage fitted values.

Stage 1: $\widehat{X}= Z(Z'Z)^{-1}Z'X$

Stage 2: $\widehat{\beta}_\mathrm{IV} = (\widehat{X}'\widehat{X})^{-1}\widehat{X}'Y$

Mathematically, this estimator is identical to the single stage estimator presented above when the number of instruments is the same as the number of covariates.
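The two stages, and their equivalence to the single-stage estimator in the just-identified case, can be sketched as follows. The simulated data and the name `two_stage_least_squares` are assumptions for illustration; the function returns point estimates only and does not apply the standard-error correction.

```python
import numpy as np


def two_stage_least_squares(Z, X, y):
    """2SLS: project X onto the columns of Z, then regress y on the projection."""
    # Stage 1: X_hat = Z (Z'Z)^{-1} Z'X, the first-stage fitted values.
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Stage 2: beta = (X_hat'X_hat)^{-1} X_hat'y.
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)


# Hypothetical data: one endogenous covariate x1, one exogenous covariate x2,
# and two outside instruments z1 and z2.
rng = np.random.default_rng(1)
T = 20_000
beta = np.array([1.0, 2.0])
z1, z2, x2, eps = (rng.normal(size=T) for _ in range(4))
x1 = 0.6 * z1 + 0.4 * z2 + 0.5 * eps + rng.normal(size=T)
X = np.column_stack([x1, x2])
y = X @ beta + eps

Z_over = np.column_stack([z1, z2, x2])  # over-identified: 3 instruments, 2 covariates
Z_just = np.column_stack([z1, x2])      # just-identified: 2 instruments, 2 covariates

beta_over = two_stage_least_squares(Z_over, X, y)
beta_just = two_stage_least_squares(Z_just, X, y)
beta_single = np.linalg.solve(Z_just.T @ X, Z_just.T @ y)  # (Z'X)^{-1} Z'y

print("2SLS, over-identified:", beta_over)
print("2SLS, just-identified:", beta_just)
print("single-stage IV:      ", beta_single)  # matches the just-identified 2SLS
```

With `Z_over` the single-stage formula cannot be used, since $Z'X$ is not square; 2SLS handles that case by working with the projected covariates.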

Applications and problems

Instrumental variables estimation (IVE) often provides a useful, convenient and ethical alternative to the classical randomized experiment. In the randomized experiment, exogenous variation in treatment is provided by the random assignment of participants to the treatment and control conditions, which requires the investigator to deny the treatment to the control participants. Using IVE, participants can be permitted to self-select into treatment and control, and the investigator can subsequently tease out the exogenous component of the treatment variation using the instrument. Of course, one does not get anything for nothing: the IVE technique is only as good as the instruments it employs.

In contrast to randomized experiments, IV estimates local average treatment effects (LATE) rather than average treatment effects (ATE). The effect of a program is only identified for the subpopulation that is affected by the instrument. For example, using financial aid as an instrument for college attendance (assuming financial aid changed exogenously due to a policy change) only identifies the returns to education for students who attend college solely because of financial aid. Students whose attendance does not depend on financial aid are not affected by the instrument.

The technique is useful for solving the errors-in-variables problem and for recovering structural parameters from simultaneous equations models such as supply and demand. Unfortunately, there is no way to prove that a variable, whether a covariate or a candidate instrument, is uncorrelated with the error term, since the error is by definition unobservable. Consequently, one problem is the selection and defense of suitable instruments. Good instruments are often created by exogenous policy changes (e.g., the cancellation of a federal student aid scholarship program), geographic differences in the application of standards (e.g., different states implementing different passing standards for a common exam), or naturally occurring randomness (e.g., the Vietnam draft lottery), each of which leads to exogenous disruptions in the values of the construct being measured by the selected instrument.

Another problem is caused by the selection of "weak" instruments. These are instruments that are very poor predictors of the endogenous predictor of interest in the first-stage equation. In this case, the predicted values obtained from the first stage will have very little variation. Consequently, they are unlikely to have much success in predicting the ultimate outcome when they are used to replace the endogenous predictor in the second-stage equation.
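A small simulation can illustrate the point. The sketch below is only indicative: the design, the two instrument strengths (1.0 and 0.05), and the helper names `iv_slope`, `fitted_share`, and `replicate` are assumptions chosen for illustration. For each strength it reports how much of the variation in the endogenous predictor is captured by the first-stage fitted values, and how far the IV estimates typically fall from the true slope of 1.

```python
import numpy as np


def iv_slope(z, x, y):
    """Simple IV slope sum(z*y) / sum(z*x)."""
    return (z @ y) / (z @ x)


def fitted_share(z, x):
    """Share of the variation in x captured by its first-stage fitted values."""
    x_hat = z * (z @ x) / (z @ z)  # no-intercept first-stage prediction of x
    return (x_hat @ x_hat) / (x @ x)


def replicate(strength, n=500, reps=300, beta=1.0, seed=0):
    """Return IV estimates and fitted-value shares across simulated samples."""
    rng = np.random.default_rng(seed)
    estimates, shares = [], []
    for _ in range(reps):
        z = rng.normal(size=n)
        eps = rng.normal(size=n)
        x = strength * z + 0.8 * eps + rng.normal(size=n)  # endogenous predictor
        y = beta * x + eps
        estimates.append(iv_slope(z, x, y))
        shares.append(fitted_share(z, x))
    return np.array(estimates), np.array(shares)


for label, strength in [("strong", 1.0), ("weak", 0.05)]:
    est, share = replicate(strength)
    err = np.median(np.abs(est - 1.0))
    print(f"{label}: median fitted share = {np.median(share):.3f}, "
          f"median absolute error = {err:.3f}")
```

Medians are used rather than means because, with a weak instrument, the denominator of the IV ratio can land close to zero and produce occasional extreme estimates.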

Hypothesis testing

The estimator can be written as

$\widehat{\beta}=\left(Z' X\right)^{-1} Z' y$

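By using the fact that $y=X \beta + \varepsilon$, it follows that $\widehat{\beta} = \beta + (Z'X)^{-1}Z'\varepsilon$. If the errors are independent and normally distributed with constant variance, and X and Z are treated as fixed, then $\widehat{\beta}$ is normally distributed with mean $\beta$ and covariance matrix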

$\Sigma = \sigma^2 \left( Z' X\right)^{-1} \left(Z' Z \right) \left(X' Z \right)^{-1} = \sigma^2 A$

where $\sigma^2$ is the variance of $\varepsilon$.

The residual sum of squares is computed as

$RSS=\widehat{\varepsilon}'\widehat{\varepsilon}=y' \left(I - Z \left( X' Z \right)^{-1} X' \right) \left( I - X \left( Z' X \right)^{-1} Z' \right) y$
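The quantities above can be assembled into standard errors as in the following sketch. Two choices here are assumptions not stated in the text: $\sigma^2$ is estimated by $RSS/(T-K)$, and the coefficient-to-standard-error ratios are read as t-type statistics for the hypothesis that each coefficient is zero. The simulated data and the name `iv_inference` are likewise illustrative.

```python
import numpy as np


def iv_inference(Z, X, y):
    """IV estimate, residual sum of squares, covariance matrix, standard errors."""
    T, K = X.shape
    ZX_inv = np.linalg.inv(Z.T @ X)    # (Z'X)^{-1}
    beta_hat = ZX_inv @ (Z.T @ y)
    resid = y - X @ beta_hat           # residuals are formed with X itself
    rss = resid @ resid
    sigma2_hat = rss / (T - K)         # assumed estimate of sigma^2
    A = ZX_inv @ (Z.T @ Z) @ ZX_inv.T  # (Z'X)^{-1} Z'Z (X'Z)^{-1}
    cov = sigma2_hat * A
    return beta_hat, rss, cov, np.sqrt(np.diag(cov))


# Hypothetical simulated data with unit error variance.
rng = np.random.default_rng(2)
T = 1000
beta = np.array([1.0, 0.0])            # the second coefficient is truly zero
z1, x2, eps = (rng.normal(size=T) for _ in range(3))
x1 = 0.8 * z1 + 0.5 * eps + rng.normal(size=T)
X = np.column_stack([x1, x2])
Z = np.column_stack([z1, x2])
y = X @ beta + eps

beta_hat, rss, cov, se = iv_inference(Z, X, y)
print("estimates:      ", beta_hat)
print("standard errors:", se)
print("ratios:         ", beta_hat / se)  # compare against normal critical values
```

Since $(X'Z)^{-1}$ is the transpose of $(Z'X)^{-1}$, only one matrix inverse is needed to build $A$.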