Quasi-Newton method

From Wikipedia, the free encyclopedia

In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method for finding a stationary point of a function, where the gradient is 0. Newton's method assumes that the function can be locally approximated as a quadratic in the region around the optimum, and it uses the first and second derivatives to find the stationary point. In higher dimensions, Newton's method uses the gradient and the Hessian matrix of second derivatives of the function to be minimized.

In quasi-Newton methods the Hessian matrix does not need to be computed; instead, an approximation to it is updated by analyzing successive gradient vectors. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.

The first quasi-Newton algorithm was proposed in 1959 by William C. Davidon, a physicist working at Argonne National Laboratory: the DFP updating formula, which was later popularized by Fletcher and Powell in 1963, but is rarely used today. The most common quasi-Newton algorithms are currently the SR1 formula (for symmetric rank one), the BHHH method, the widespread BFGS method (suggested independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970), and its low-memory extension, L-BFGS. The Broyden class is a linear combination of the DFP and BFGS methods.

The SR1 formula does not guarantee that the update matrix maintains positive-definiteness, and it can be used for indefinite problems. Broyden's method does not require the update matrix to be symmetric; it is used to find the root of a general system of equations (rather than the gradient) by updating the Jacobian (rather than the Hessian).

One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) B does not need to be inverted. Newton's method, and its derivatives such as interior point methods, require the Hessian to be inverted, which is typically implemented by solving a system of linear equations and is often quite costly. In contrast, quasi-Newton methods usually generate an estimate of B^{-1} directly.
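For illustration, the difference can be sketched in a few lines of Python/NumPy (this example is not part of the original article, and the function names are illustrative): a Newton step requires a linear solve with the true Hessian, whereas a quasi-Newton step only multiplies the gradient by the maintained approximation of B^{-1}.

    import numpy as np

    def newton_step(grad, hessian):
        # Newton's method: solve B * dx = -grad using the true Hessian
        # (a linear solve, and the Hessian itself must be available).
        return np.linalg.solve(hessian, -grad)

    def quasi_newton_step(grad, H_inv):
        # Quasi-Newton methods: multiply by a maintained approximation
        # H ~ B^{-1}; no Hessian evaluation and no linear solve is needed.
        return -H_inv @ grad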

Description of the method

As in Newton's method, one uses a second-order approximation to find the minimum of a function f(x). The Taylor series of f(x) around an iterate x_k is:

f(x_k + \Delta x) \approx f(x_k) + \nabla f(x_k)^T \Delta x + \tfrac{1}{2} \Delta x^T B \, \Delta x,

where \nabla f is the gradient and B is an approximation to the Hessian matrix. The gradient of this approximation (with respect to \Delta x) is

\nabla f(x_k + \Delta x) \approx \nabla f(x_k) + B \, \Delta x

and setting this gradient to zero provides the Newton step:

\Delta x = -B^{-1} \nabla f(x_k).

The Hessian approximation B is chosen to satisfy

\nabla f(x_k + \Delta x) = \nabla f(x_k) + B \, \Delta x,

which is called the secant equation (the Taylor series of the gradient itself). In more than one dimension B is underdetermined. In one dimension, solving for B and applying the Newton step with the updated value is equivalent to the secant method. The various quasi-Newton methods differ in their choice of the solution to the secant equation (in one dimension, all the variants are equivalent). Most methods (but with exceptions, such as Broyden's method) seek a symmetric solution (B^T = B); furthermore, the variants listed below can be motivated by finding an update B_{k+1} that is as close as possible to B_k in some norm; that is, B_{k+1} = \operatorname{argmin}_B \|B - B_k\|_V, where V is some positive definite matrix that defines the norm. An approximate initial value such as a multiple of the identity, B_0 = \beta I, is often sufficient to achieve rapid convergence. The unknown x_k is updated by applying the Newton step calculated using the current approximate Hessian matrix B_k:

  • \Delta x_k = -\alpha_k B_k^{-1} \nabla f(x_k), with \alpha_k chosen to satisfy the Wolfe conditions;
  • x_{k+1} = x_k + \Delta x_k;
  • The gradient is computed at the new point, \nabla f(x_{k+1}), and
y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

is used to update the approximate Hessian B_{k+1}, or directly its inverse H_{k+1} = B_{k+1}^{-1}, using the Sherman–Morrison formula.

  • A key property of the BFGS and DFP updates is that if B_k is positive definite and \alpha_k is chosen to satisfy the Wolfe conditions, then B_{k+1} is also positive definite. A minimal implementation sketch of this iteration is given below.
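The following Python/NumPy sketch of the iteration above is not part of the original article: for simplicity it replaces the Wolfe-condition line search with a plain backtracking (Armijo) search, uses the BFGS inverse update listed below, and tests on the Rosenbrock function as an illustrative problem.

    import numpy as np

    def bfgs(f, grad_f, x0, tol=1e-8, max_iter=200):
        n = x0.size
        x = x0.astype(float)
        H = np.eye(n)                      # H_0 = I approximates the inverse Hessian
        g = grad_f(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            p = -H @ g                     # quasi-Newton direction: Delta x = -H_k grad f(x_k)
            alpha = 1.0                    # backtracking (Armijo) line search, a simplification
            while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
                alpha *= 0.5
            s = alpha * p                  # Delta x_k
            x_new = x + s
            g_new = grad_f(x_new)
            y = g_new - g                  # y_k = grad f(x_{k+1}) - grad f(x_k)
            sy = s @ y
            if sy > 1e-12:                 # curvature condition keeps H positive definite
                rho = 1.0 / sy
                I = np.eye(n)
                H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                    + rho * np.outer(s, s)   # BFGS update of the inverse Hessian
            x, g = x_new, g_new
        return x

    # Example: minimize the Rosenbrock function; the iterates approach (1, 1).
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                               200 * (x[1] - x[0]**2)])
    print(bfgs(f, grad, np.array([-1.2, 1.0])))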

The most popular update formulas, for the Hessian approximation B_{k+1} and for its inverse H_{k+1} = B_{k+1}^{-1}, are:

  • DFP:
    B_{k+1} = \left(I - \frac{y_k \, \Delta x_k^T}{y_k^T \, \Delta x_k}\right) B_k \left(I - \frac{\Delta x_k \, y_k^T}{y_k^T \, \Delta x_k}\right) + \frac{y_k \, y_k^T}{y_k^T \, \Delta x_k}
    H_{k+1} = H_k + \frac{\Delta x_k \, \Delta x_k^T}{y_k^T \, \Delta x_k} - \frac{H_k y_k y_k^T H_k^T}{y_k^T H_k y_k}
  • BFGS:
    B_{k+1} = B_k + \frac{y_k \, y_k^T}{y_k^T \, \Delta x_k} - \frac{B_k \Delta x_k (B_k \Delta x_k)^T}{\Delta x_k^T B_k \, \Delta x_k}
    H_{k+1} = \left(I - \frac{y_k \, \Delta x_k^T}{y_k^T \, \Delta x_k}\right)^T H_k \left(I - \frac{y_k \, \Delta x_k^T}{y_k^T \, \Delta x_k}\right) + \frac{\Delta x_k \, \Delta x_k^T}{y_k^T \, \Delta x_k}
  • Broyden:
    B_{k+1} = B_k + \frac{y_k - B_k \Delta x_k}{\Delta x_k^T \, \Delta x_k} \, \Delta x_k^T
    H_{k+1} = H_k + \frac{(\Delta x_k - H_k y_k) \Delta x_k^T H_k}{\Delta x_k^T H_k \, y_k}
  • Broyden family:
    B_{k+1} = (1 - \varphi_k) B_{k+1}^{BFGS} + \varphi_k B_{k+1}^{DFP}, \qquad \varphi \in [0, 1]
  • SR1:
    B_{k+1} = B_k + \frac{(y_k - B_k \Delta x_k)(y_k - B_k \Delta x_k)^T}{(y_k - B_k \Delta x_k)^T \, \Delta x_k}
    H_{k+1} = H_k + \frac{(\Delta x_k - H_k y_k)(\Delta x_k - H_k y_k)^T}{(\Delta x_k - H_k y_k)^T y_k}
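All of these updates satisfy the secant equation B_{k+1} \Delta x_k = y_k. As a quick numerical illustration (not from the original article), the following Python/NumPy check verifies this for the SR1 and BFGS formulas with arbitrary data:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    B = np.eye(n)                          # current Hessian approximation B_k
    dx = rng.standard_normal(n)            # Delta x_k
    y = rng.standard_normal(n)             # y_k

    # SR1 update
    r = y - B @ dx
    B_sr1 = B + np.outer(r, r) / (r @ dx)

    # BFGS update
    B_bfgs = B + np.outer(y, y) / (y @ dx) \
               - np.outer(B @ dx, B @ dx) / (dx @ B @ dx)

    print(np.allclose(B_sr1 @ dx, y))      # True: secant equation holds
    print(np.allclose(B_bfgs @ dx, y))     # True: secant equation holds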

Implementations

Owing to their success, there are implementations of quasi-Newton methods in almost all programming languages. The NAG Library contains several routines[1] for minimizing or maximizing a function[2] which use quasi-Newton algorithms.

In MATLAB's Optimization Toolbox, the fminunc function uses (among other methods) the BFGS quasi-Newton method. Many of the constrained methods of the Optimization Toolbox use BFGS and the variant L-BFGS. Many user-contributed quasi-Newton routines are available on MATLAB's file exchange.

Mathematica includes quasi-Newton solvers. R's general-purpose optimizer routine optim uses the BFGS method when called with method="BFGS". In the SciPy extension to Python, the scipy.optimize.minimize function includes, among other methods, a BFGS implementation.
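A minimal usage sketch of the SciPy interface mentioned above (the Rosenbrock test function rosen and its gradient rosen_der ship with SciPy):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([-1.2, 1.0])
    res = minimize(rosen, x0, method='BFGS', jac=rosen_der)
    print(res.x)        # approximate minimizer, close to [1, 1]
    print(res.nit)      # number of BFGS iterations taken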

See also

  • SR1 formula
  • Broyden's method

References

  1. The Numerical Algorithms Group. "Keyword Index: Quasi-Newton". NAG Library Manual, Mark 23. Retrieved 2012-02-09. 
  2. The Numerical Algorithms Group. "E04 – Minimizing or Maximizing a Function". NAG Library Manual, Mark 23. Retrieved 2012-02-09. 
