Least squares
From Wikipedia, the free encyclopedia
Least squares or ordinary least squares (OLS) is a mathematical optimization technique which, given a series of measured data, attempts to find a function that closely approximates the data (a "best fit"). It does so by minimizing the sum of the squares of the ordinate differences (called residuals) between points generated by the function and the corresponding points in the data. The related least mean squares (LMS) method applies when each update uses a single measured data point and gradient descent is used to minimize the squared residual. LMS minimizes the expectation of the squared residual with very few operations per iteration, but it requires a large number of iterations to converge.
An implicit requirement for the least squares method to work is that the errors in the measurements be random with zero mean. The Gauss–Markov theorem shows that, when the errors are also uncorrelated and have equal variance, the least squares estimators are unbiased and have the smallest variance among linear unbiased estimators; in particular, the data need not follow a normal distribution. It is also important that the collected data be well chosen, so as to allow visibility into the variables to be solved for (to give more weight to particular data, see weighted least squares).
The least squares technique is commonly used in curve fitting. Many other optimization problems can also be expressed in a least squares form, by either minimizing energy or maximizing entropy.
History
On New Year's Day in 1801, the Italian astronomer Giuseppe Piazzi discovered the asteroid Ceres and was able to track its path for 40 days. In the course of the year, many scientists tried to estimate its trajectory on the basis of Piazzi's observations (solving Kepler's nonlinear equations of motion is very difficult). Most of these estimates were useless; the only calculation precise enough to allow the German astronomer Franz Xaver von Zach to relocate Ceres at the end of the year was that of the 24-year-old Carl Friedrich Gauss, who had already worked out the fundamentals of his approach in 1795, at the age of 18. His method of least squares was not published until 1809, however, when it appeared in volume two of his work on celestial mechanics, Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium. The Frenchman Adrien-Marie Legendre independently developed and published the same method in 1805, as did the American Robert Adrain in 1808.
In 1829 Gauss was able to state the reason for this procedure's outstanding success: the method of least squares is optimal in many respects. The precise argument is known as the Gauss–Markov theorem.
Formulation of the problem
Suppose that the data set consists of the points (x_i, y_i) with i = 1, …, n. We want to find a function f such that

f(x_i) \approx y_i, \qquad i = 1, \dots, n.

To attain this goal, we suppose that the function f is of a particular form containing some parameters which need to be determined. For instance, suppose that it is quadratic, meaning that f(x) = ax^2 + bx + c, where a, b and c are not yet known. We now seek the values of a, b and c that minimize the sum of the squares of the residuals (S):

S = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - (a x_i^2 + b x_i + c) \right)^2.
This explains the name least squares.
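For illustration, here is a minimal Python sketch of this idea; the data values are invented for the example, and the quadratic is fitted by solving the linear least squares problem for its coefficients with NumPy.

```python
import numpy as np

# Hypothetical data points (x_i, y_i); any small data set works for the example.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])

# Design matrix with columns x^2, x, 1, so that f(x) = a*x^2 + b*x + c.
A = np.column_stack([x**2, x, np.ones_like(x)])

# np.linalg.lstsq minimizes S = sum_i (y_i - f(x_i))^2 over (a, b, c).
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
a, b, c = coeffs

print("a, b, c =", a, b, c)
print("sum of squared residuals:", np.sum((y - A @ coeffs) ** 2))
```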
Solving the least squares problem
In the above example, f is linear in the parameters a, b and c. The problem simplifies considerably in this case and essentially reduces to a system of linear equations. This is explained in the article on linear least squares.
The problem is more difficult if f is not linear in the parameters to be determined. We then need to solve a general (unconstrained) optimization problem, and any algorithm for such problems, such as Newton's method or gradient descent, can be used. Another possibility is to apply an algorithm developed specifically for least squares problems, for instance the Gauss–Newton algorithm or the Levenberg–Marquardt algorithm.
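As a sketch of the Gauss–Newton idea, the following Python example fits an assumed model f(x) = a·exp(b·x), which is nonlinear in its parameters; the data and the starting point are made up, and convergence depends on starting reasonably close to the solution.

```python
import numpy as np

def gauss_newton(x, y, params, n_iter=20):
    """Minimal Gauss-Newton iteration for the model f(x) = a * exp(b * x)."""
    a, b = params
    for _ in range(n_iter):
        f = a * np.exp(b * x)                # model predictions
        r = y - f                            # residuals
        # Jacobian of the residuals with respect to (a, b), one row per data point.
        J = np.column_stack([-np.exp(b * x), -a * x * np.exp(b * x)])
        # Solve the linearized least squares problem J * step = -r for the update.
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        a, b = a + step[0], b + step[1]
    return a, b

# Hypothetical data roughly following 2 * exp(0.5 * x).
x = np.linspace(0.0, 3.0, 10)
y = 2.0 * np.exp(0.5 * x) + np.array([0.05, -0.1, 0.02, 0.0, 0.1,
                                      -0.05, 0.03, -0.02, 0.08, -0.04])

print(gauss_newton(x, y, params=(1.5, 0.6)))
```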
Least squares and regression analysis
In regression analysis, one replaces the relation

f(x_i) \approx y_i

by

y_i = f(x_i) + \varepsilon_i,
where the noise term ε is a random variable with mean zero. Note that we are assuming that the x values are exact, and that all the errors are in the y values. Again, we distinguish between linear regression, in which the function f is linear in the parameters to be determined (e.g., f(x) = ax^2 + bx + c), and nonlinear regression. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name linear regression is that the graph of the function f(x) = ax + b is a line. But fitting a curve f(x) = ax^2 + bx + c, estimating a, b, and c by least squares, is also an instance of linear regression, because the vector of least-squares estimates of a, b, and c is a linear transformation of the vector whose components are y_i = f(x_i) + ε_i.)
One frequently estimates the parameters (a, b and c in the above example) by least squares: one takes the values that minimize the sum S. The Gauss–Markov theorem states that the least squares estimates are optimal in a certain sense, if we take f(x) = ax + b with a and b to be determined and the noise terms ε are independent and identically distributed (see the article for a more precise statement and less restrictive conditions on the noise terms).
Least squares estimation for linear models is notoriously non-robust to outliers. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely so. When outliers occur in the data, methods of robust regression are more appropriate.
Parameter estimates
Parameter estimation using least-squares
By recognizing that the regression model is a system of linear equations, we can express the model using the data matrix X, the target vector Y and the parameter vector δ. The ith row of X and Y contains the x and y value for the ith data sample. For the straight-line model the ith equation is

y_i = \delta_1 + \delta_2 x_i + \varepsilon_i, \qquad i = 1, \dots, n,

which in pure matrix notation becomes

Y = X\delta + \varepsilon,

where the ith row of X is (1, x_i) and ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ^2 I_n, where I_n is the n×n identity matrix.
The least-squares estimator for δ is

\hat{\delta} = (X^T X)^{-1} X^T Y

(where X^T is the transpose of X), and the minimized sum of squares of residuals is

(Y - X\hat{\delta})^T (Y - X\hat{\delta}) = Y^T \left( I_n - X (X^T X)^{-1} X^T \right) Y.
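As a numerical illustration with invented data, the estimator can be computed directly from this formula and cross-checked against a library solver:

```python
import numpy as np

# Hypothetical data: y roughly equal to 1 + 2x plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 2.8, 5.1, 7.2, 8.9, 11.1])

# Data matrix with a column of ones (intercept) and a column of x values.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimator: delta_hat = (X^T X)^{-1} X^T Y.
delta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Sum of squares of residuals.
residuals = Y - X @ delta_hat
print("delta_hat:", delta_hat)
print("residual sum of squares:", residuals @ residuals)

# Cross-check with a library least-squares solver.
print("lstsq    :", np.linalg.lstsq(X, Y, rcond=None)[0])
```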
One of the properties of least squares is that the vector of fitted values

X\hat{\delta} = X (X^T X)^{-1} X^T Y

is the orthogonal projection of Y onto the column space of X.
The fact that the matrix X(X^T X)^{-1} X^T is a symmetric idempotent matrix is relied on repeatedly in proofs of theorems. The linearity of \hat{\delta} as a function of the vector Y, expressed above by writing

\hat{\delta} = (X^T X)^{-1} X^T Y,

is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.
The matrix I_n − X(X^T X)^{-1} X^T that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix G^T M G is diagonal. If the matrix M is also idempotent, then the diagonal entries of G^T M G must themselves be idempotent, and the only idempotent real numbers are 0 and 1. So I_n − X(X^T X)^{-1} X^T, after diagonalization, has n − 2 ones and two zeros on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with n − 2 degrees of freedom.
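These properties are easy to verify numerically; the following sketch (reusing the hypothetical two-column X from the example above) checks symmetry, idempotence, and the rank and eigenvalues of the residual-maker matrix.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])    # n x 2 data matrix (intercept, slope)
n = X.shape[0]

# Hat matrix H = X (X^T X)^{-1} X^T projects onto the column space of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                            # residual-maker matrix I_n - H

print(np.allclose(H, H.T))                   # True: symmetric
print(np.allclose(H @ H, H))                 # True: idempotent
print(np.linalg.matrix_rank(M))              # n - 2, here 4
print(np.round(np.linalg.eigvalsh(M), 8))    # n - 2 ones and two zeros
```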
Regression parameters can also be estimated by Bayesian methods. This has the advantages that
- credible intervals (the Bayesian counterpart of confidence intervals) can be produced for parameter estimates without the use of asymptotic approximations,
- prior information can be incorporated into the analysis.
Suppose that in the linear regression

y = \alpha + \beta x + \varepsilon

we know from domain knowledge that α can only take one of the values {−1, +1}, but we do not know which. We can build this information into the analysis by choosing a prior for α which is a discrete distribution placing probability 0.5 on −1 and 0.5 on +1. The posterior for α will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data.
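A small sketch of that updating step, under extra simplifying assumptions not made above (β and the noise standard deviation σ are treated as known, and the data are invented), shows how the prior weights on α ∈ {−1, +1} are reweighted by the Gaussian likelihood:

```python
import numpy as np

# Hypothetical data from y = alpha + beta * x + noise, with alpha actually +1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.3, 2.4, 3.6, 4.4, 5.7])

beta = 1.0     # assumed known for this sketch
sigma = 0.3    # assumed known noise standard deviation

def log_likelihood(alpha):
    r = y - (alpha + beta * x)
    return -0.5 * np.sum(r**2) / sigma**2

# Discrete prior: P(alpha = -1) = P(alpha = +1) = 0.5.
alphas = np.array([-1.0, +1.0])
log_post = np.array([np.log(0.5) + log_likelihood(a) for a in alphas])
post = np.exp(log_post - log_post.max())
post /= post.sum()

print(dict(zip(alphas, post)))   # posterior weights on {-1, +1}
```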
In modern computer applications, the least-squares estimate is actually computed using the QR decomposition, or by more careful methods such as the singular value decomposition when X^T X is nearly singular. The code for MATLAB's \ (backslash) operator is an excellent example of a robust method.
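For illustration (this is not the MATLAB source), a QR-based least-squares solve might look like the following sketch, which avoids forming X^T X explicitly; the data are the same hypothetical values used earlier.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Hypothetical data matrix and target vector.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.2, 2.8, 5.1, 7.2, 8.9, 11.1])

# Thin QR factorization X = Q R with Q^T Q = I and R upper triangular.
Q, R = np.linalg.qr(X)

# The normal equations reduce to R * delta = Q^T Y, solved by back substitution.
delta_hat = solve_triangular(R, Q.T @ Y)

print(delta_hat)
```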
Summarizing the data
We sum the observations, the squares of the Ys and Xs, and the products XY to obtain the following quantities:

S_x = \sum_{i=1}^{n} x_i, \qquad S_y = \sum_{i=1}^{n} y_i,

S_{xx} = \sum_{i=1}^{n} x_i^2, \qquad S_{yy} = \sum_{i=1}^{n} y_i^2, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i.
Estimating beta (the slope)
We use the summary statistics above to calculate \hat{\beta}, the estimate of β:

\hat{\beta} = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2}.
Estimating alpha (the intercept)
We use the estimate of β and the other statistics to estimate α by

\hat{\alpha} = \frac{S_y - \hat{\beta} S_x}{n} = \bar{y} - \hat{\beta}\bar{x}.

A consequence of this estimate is that the regression line will always pass through the "center" (\bar{x}, \bar{y}).
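These closed-form expressions are straightforward to evaluate; the following sketch (with invented data) computes the summary sums, the slope and intercept estimates, and checks that the fitted line passes through the center of the data.

```python
import numpy as np

# Hypothetical data for a simple linear regression y = alpha + beta * x + noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Summary sums.
S_x, S_y = x.sum(), y.sum()
S_xx, S_yy, S_xy = (x**2).sum(), (y**2).sum(), (x * y).sum()

# Slope and intercept estimates.
beta_hat = (n * S_xy - S_x * S_y) / (n * S_xx - S_x**2)
alpha_hat = (S_y - beta_hat * S_x) / n

print("beta_hat :", beta_hat)
print("alpha_hat:", alpha_hat)
# The fitted line passes through the "center" (x_bar, y_bar).
print(np.isclose(alpha_hat + beta_hat * x.mean(), y.mean()))
```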
See also
- Isotonic regression
- Least mean squares filter
- Least-squares estimation of linear regression coefficients
- Linear regression
- Measurement uncertainty
- Moving least squares
- Regression analysis
- Robust regression
- Root mean square
- Total least squares or errors-in-variables model
- Weighted least squares
External links
- http://www.physics.csbsju.edu/stats/least_squares.html
- Zunzun.com - Online curve and surface fitting
- http://www.orbitals.com/self/least/least.htm
- Least squares on PlanetMath