User:Petergans/a/sandbox

From Wikipedia, the free encyclopedia

Linear least squares, also known as linear regression, is the form of least squares analysis which is used to find an optimal solution for an Overdetermined system of m equations which are linear in terms of n unknown parameters (m>n). It is used in many disciplines where quantitative observations are modelled, in order to obtain better estimates of the parameters of the model than could be obtained by making only n observations.

1 Problem statement and solution
- 1.1 Straight line fitting
  - 1.1.1 Example
2 Computation
3 Properties of the least-squares estimator
- 3.1 Limitations
4 Parameter errors, correlation and confidence limits
5 Residual values and correlation
6 Objective function
7 Applications
8 External links
9 References

[edit] Problem statement and solution

The objective consists of finding the minimum value of a sum of m squared residuals,r_i, with respect to a set of n parameters of a model.

$S=\sum_{i=1}^{i=m}r_i^2$

Each residual is defined as the difference between the values of a dependent variable, y_i and of the model function, $f(x_i, \boldsymbol \beta)$ , where x_i is an independent variable and $\boldsymbol \beta$ is a vector of n parameters. The values of the dependent variable are obtained from experimental measurements.

$r_i= y_i - f(x_i, \boldsymbol \beta)$

The minimum value of S occurs when the gradient is zero. Since the model contains n parameters there are n gradient equations.

$\frac{\partial S}{\partial \beta_j}=2\sum_i r_i\frac{\partial r_i}{\partial \beta_j}=0 \ (j=1,n)$

Now, in linear least squares, the model comprises a linear combination of the parameters.

$f(x_i,\boldsymbol \beta)=\sum_{j=1}^{j=n} X_{ij}\beta_j$

The coefficients $X i j$ are constants or functions of the independent variable, x_i. So, $r_i= y_i - \sum_{j=1}^{j=n} X_{ij}\beta_j$ and $\frac{\partial r_i}{\partial \beta_j}=-X_{ij}$ . Substitution of these expressions into the gradient equations gives

$\frac{\partial S}{\partial \beta_j}=-2\sum_{i=1}^{i=m}X_{ij} \left( y_i-\sum_{j=1}^{j=n} X_{ij}\beta_j \right)=0$

which, on rearrangement, become n simultaneous linear equations, the normal equations.

$\sum_{i=1}^{i=m}\sum_{k=1}^{k=n} X_{ij}X_{ik}\hat \beta_k=\sum_{i=1}^{i=m} X_{ij}y_i\ (j=1,n)\,$

The normal equations are written in matrix notation as

$\mathbf{\left(X^TX\right)\hat \boldsymbol \beta=X^Ty}$

Solution of the normal equations yields the least squares estimators, $\hat \boldsymbol \beta$ , of the parameter values. See regression analysis(Prediction..) for a worked-out numerical example with three parameters.

When the observations are not equally reliable, a weighted sum of squares may be minimized.

$S=\sum_{i=1}^{i=m}W_{ii}r_i^2$

Each element of the diagonal weight matrix, W should,ideally, be equal to the reciprocal of the variance of the measurement.^[1] The normal equations are then

$\mathbf{\left(X^TWX\right)\hat \boldsymbol \beta=X^TWy}$

[edit] Straight line fitting

For straight line fitting there are only two normal equations. This means that a complete algebraic solution may be worked out with relative ease. For the model

$f(x_i,\boldsymbol \beta)=\alpha+\beta x_i$

$X_{i1}=1, X_{i2}=x_i\,$

With unit weights the normal equations are therefore

$\left( \sum 1^2\right) \alpha+ \left( \sum x_i\right) \beta=\sum y_i$

$\left( \sum x_i\right) \alpha+ \left( \sum x_i^2\right) \beta=\sum x_i y_i$

All the summations go from i=1 to i=m. Each summation can be representd by a single symbol

$\sum 1^2=m$

$\sum x_i=S_x$

$\sum y_i=S_y$

$\sum x_i^2=S_{x^2}$

$\sum x_i y_i=S_{xy}$

In terms of these symbols the normal equations become

$m \alpha+ S_x \beta=S_y\,$

$S_x \alpha+ S_{x^2} \beta=S_{xy}\,$

and the solution, by Cramer's rule is

$\hat\alpha = \frac {S_{x^2} S_y - S_x S_{xy}} {D} \,$

$\hat\beta = \frac {m S_{xy} - S_x S_y} {D} \,$

$D=m S_{x^2} - (S_x)^2$

These expressions have been used in hand calculators because each time a data point is added or removed, the five sums are adjusted and the parameters are recalculated, only seven operations in all. Standard deviations and correlation coefficient

$\sigma(\alpha)=\sqrt{\frac{S}{m-n}\frac{S_{x^2}}{D}}$

$\sigma(\beta)=\sqrt{\frac{S}{m-n}\frac{m}{D}}$

$\rho_{12}=\frac{-S_x}{\sqrt{mS_{x^2}}}$

[edit] Example

With a set of observed data points y=2,3,3,4 obtained with the independent variable, x at values -1,0,2,4.

$m=4\!$

$\sum x=5\!$

$\sum y=12\!$

$\sum x^2=21\!$

$\sum xy=20\!$

$D=59\!$

$\alpha=\frac{21 \times 12-5 \times 20}{59}=\frac{152}{59}$

$\beta=\frac{-5 \times 12+4 \times 20}{59}=\frac{20}{59}$

Now, the residuals are calculated

$\begin{matrix} r_1=&0.42 \\ r_2=&-0.25 \\ r_3=&0.07 \\ r_4=&-0.24 \\ \end{matrix}$

and S=0.305. After calculating the standard deviations the final result is obtained.

$\alpha =2.6 \pm 0.2$

$\beta =0.3 \pm 0.1$

Note that the error is only quoted to one significant figure.

[edit] Computation

Although the algebraic solution of the normal equations can be written as

$\mathbf{ \hat \boldsymbol\beta=\left(X^TX \right)^{-1}X^Ty}$

it is not good practice to invert the normal equations matrix. An exception occurs in numerical smoothing and differentiation where an analytical expression is required.

If the matrix $\mathbf{X^TX}$ is is well-conditioned and positive definite, that is, it has full rank, the normal equations can be solved directly by using the Cholesky decomposition $\mathbf{X^TX=R^TR}$ , where R is an upper triangular matrix, giving

$\mathbf{ R^T R \hat \boldsymbol\beta = X^Ty}$

The solution is obtained in two stages, a forward substitution, $\mathbf{R^Tz=X^Ty}$ , followed by a backward substitution $\mathbf{R\hat \boldsymbol\beta=z}$ . Both subtitutions are facilitated by the triangular nature of R.

In some cases the (weighted) normal equations matrix $\mathbf{X^TX}$ is ill-conditioned; this occurs when the measurements have only a marginal effect on one or more of the estimated parameters.^[2] In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Furthermore, Choleski decompostion may fail when attempting to take the square root of a negative number which has been produced because of round-off error. Various regularization techniques can be applied in such cases, the most common of which is called Tikhonov regularization. If further information about the parameters is known, for example, a range of possible values of x, then minimax techniques can also be used to increase the stability of the solution.

For large systems the iterative Gauss-Seidel method may be used to solve the normal equations.

[edit] Properties of the least-squares estimator

If the experimental errors, $\epsilon \,$ , are uncorrelated, have a mean of zero and a constant variance, $σ$ , the Gauss-Markov theorem states that the least-squares estimator, $\hat \beta$ , has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a Normal distribution.

For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.

However, in the case that the experimental errors do belong to a Normal distribuition, the least-squares estimator is also a maximum likelihood estimator.^[3]

These two properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.

[edit] Limitations

An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares also known as Errors-in-variables model, or Rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependant and independent variables and then following the standard procedure.^[4]^[5]

Another drawback of the least squares estimator is the fact that the norm of the residuals, $\|\mathbf{y-X\boldsymbol\beta}\|$ is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter $\mathbf{\boldsymbol\beta}$ , e.g., a small value of $\|\boldsymbol\beta-\hat\boldsymbol\beta\|$ . However, since $\boldsymbol\beta$ is unknown, this quantity cannot be directly minimized. If a prior probability on $\boldsymbol\beta$ is known, then a Bayes estimator can be used to minimize the mean squared error, $E \left\{ \| \boldsymbol\beta - \hat\boldsymbol\beta \|^2 \right\}$ . The least squares method is often applied when no prior is known. Surprisingly, however, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James-Stein estimator.

[edit] Parameter errors, correlation and confidence limits

The parameter values are linear combinations of the observed values

$\mathbf{\hat \beta=(X^TWX)^{-1}X^TWy}$

Therefore an expression for the errors on the parameter can be obtained by error propagation from the errors on the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by M^$β$. Then,

$\mathbf{M^\beta=(X^TWX)^{-1}X^TW M W^TX(X^TWX)^{-1}}$

When $\mathbf{W=M^{-1}}$ , this simplifies to

$\mathbf{M^\beta=(X^TWX)^{-1}}$

When unit weights are used ( $\mathbf{W=I, \hat \beta=(X^TX)^{-1}X^Ty}$ ) it is implied that the experimental errors are uncorrelated and all equal: $\mathbf{M}=\sigma^2 \mathbf{I}$ , where $\sigma^2\,$ is known as the variance of an observation of unit weight, and $\mathbf{I}$ is an identity matrix. In this case $\sigma^2\,$ is approximated by $\frac{S}{m-n}$ , where S is the minimum value of the objective function.

$\mathbf{M^\beta=}\frac{S}{m-n}\mathbf{(X^TX)^{-1}}$

In all cases, the variance of the parameter $β i$ is given by $M^\beta_{ii}$ and the covariance between parameters $β i$ and $β j$ is given by $M^\beta_{ij}$ . Standard deviation is the square root of variance and the correlation coefficient is given by $\rho_{ij} = M^\beta_{ij}/\sigma_i/\sigma_j$ . These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors which, by definition, cannot be quantified. Note that even though the observations may be un-correlated, the parameters are always correlated.

It is often assumed, for want of any concrete evidence, that the error on a parameter belongs to a Normal distribution with a mean of zero and standard deviation $σ$ . Under that assumption the following confidence limits can be derived.

68% confidence limits, $\hat \beta \pm \sigma$

95% confidence limits, $\hat \beta \pm 2\sigma$

99% confidence limits, $\hat \beta \pm 2.5\sigma$

The assumption is not unreasonable when m>>n. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with m-n degrees of freedom. When m>>n Student's t-distribution approximates to a Normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.^[6]

When the number of observations is relatively small, Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.

[edit] Residual values and correlation

The residuals are related to the observations by

$\mathbf{\hat r=y-X \hat \beta=y-X \left(X^TWX \right)^{-1}X^T y}$

The symmetric, idempotent matrix $\mathbf{X \left(X^TWX \right)^{-1}X}$ is known in the statistics literature as the hat matrix, $\mathbf{H}$ . Thus,

$\mathbf{\hat r=\left(I-H \right) y}$

where I is an identity matrix. The variance-covariance matrice of the residuals, M^r is given by

$\mathbf{M^r=\left(I-H \right) M \left(I-H \right)}$

This shows that even though the observations may be un-correlated, the residuals are always correlated.

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals, ^[7] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

[edit] Objective function

The objective function can be written as

$S=\mathbf{ y^T(I-H)^T(I-H)y=y^T(I-H)y}$

since $\mathbf{ (I-H)}$ is also symmetric and idempotent. It can be shown from this,^[8] that the expected value of S is m-n. Note, however, that this is true only if the weights have been assigned correctly. If unit weights are assumed, the expected value of S is $(m - n)σ 2$ , where $σ 2$ is the variance of an observation.

If it is assumed that the residuals belong to a Normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a Chi-square ( $χ 2$ ) distribution with m-n degrees of freedom. Some illustrative percentile values of $χ 2$ are given in the following table.^[9]

m-n	$\chi ^2 _{0.50}$	$\chi ^2 _{0.95}$	$\chi ^2 _{0.99}$
10	9.34	18.3	23.2
25	24.3	37.7	44.3
100	99.3	124	136

These values can be used for a statistical criterion as to the goodness-of-fit. When unit weights are used, the numbers should be divided by the variance of an observation.

[edit] Applications

Polynomials in an independent variable, x
- Straight line: $f(x_i, \boldsymbol \beta)=\alpha +\beta x$ .^[10]
- Quadratic: $f(x_i, \boldsymbol \beta)=\alpha + \beta x +\gamma x^2$ .
- Cubic, quartic and higher polynomials. For high-order polynomials the use of orthogonal polynomials is recommended.^[2]^[11]
Numerical smoothing and differentiation This is an application of polynomial fitting.
Multinomials in more than one independent variable, including surface fitting
Curve fitting with B-splines ^[4]
Chemometrics, Calibration curve, Standard addition

[edit] External links

[edit] References

^ This implies that the observations are uncorrelated. If the observations are correlated, the expression

$S=\sum_k \sum_j r_k W_{kj} r_j\,$

applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.
^ ^a ^b When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermode matrices become increasingly ill-conditioned as the order of the matrix increases.
^ H. Margenau and G.M. Murphy, The Mathematics of Physics and Chemistry, Van Nostrand, 1943, 1956
^ ^a ^b P. Gans, Data fitting in the Chemical Sciences, Wiley, 1992
^ W.E. Deming, Statistical adjustment of Data, Wiley, 1943
^ J. Mandel, The Statistical Analysis of Experimental Data, Interscience, 1964
^ K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate analysis, Academic Press, 1979
^ W. C. Hamilton, Statistics in Physical Science, The Ronald Press, New York, 1964
^ M.R. Spiegel, Probability and Statistics, Schaum's Outline Series, McGraw-Hill 1982
^ F.S. Acton, Analysis of Straight-Line Data, Wiley, 1959
^ P.G. Guest, Numerical Methods of Curve Fitting,Cambridge University Press, 1961.

v • d • e Least squares and regression analysis

Least squares	Linear least squares - Non-linear least squares - Partial least squares -Total least squares - Gauss–Newton algorithm - Levenberg–Marquardt algorithm

Regression analysis	Linear regression - Nonlinear regression - Linear model - Generalized linear model - Robust regression - Least-squares estimation of linear regression coefficients- Mean and predicted response - Poisson regression - Logistic regression - Isotonic regression - Ridge regression - Segmented regression - Nonparametric regression - Regression discontinuity

Statistics	Gauss–Markov theorem - Errors and residuals in statistics - Goodness of fit - Studentized residual - Mean squared error - R-factor (crystallography) - Mean squared prediction error - Minimum mean-square error - Root mean square deviation - Squared deviations - M-estimator

Applications	Curve fitting - Calibration curve - Numerical smoothing and differentiation - Least mean squares filter - Recursive least squares filter - Moving least squares - BHHH algorithm

Categories: Optimization | Linear algebra | Singular value decomposition