Talk:Levenberg–Marquardt algorithm


Possible replacement article?

Could I please recommend that people interested in this article look at Sam Roweis' document on Levenberg–Marquardt at <http://www.cs.toronto.edu/~roweis/notes/lm.pdf>. It seems to be a very complete description. If it were simply pasted in (pdf2html?) it would answer the questions below and add detail. This would have to be checked with the author, of course, which I will be happy to do. I am new to this, so my etiquette is probably short of the mark, but I hope this is useful.

I suggest that if you think there are valuable insights in Roweis' article, you use the information there to improve the existing Wikipedia article. The questions below are not fundamentally important, but the article can be improved. Certainly no case for wholesale replacement. Billlion 09:49, 5 September 2006 (UTC)

correction

Should

(J^T J + \lambda)q = -J^T f

be

(J^T J + \lambda I)q = -J^T f ?

The preceding unsigned comment was added by 129.13.70.88 (talk • contribs) on 08:18, 29 August 2005.

Both formulas have the same meaning, namely J^TJ q + λq = −J^Tf. As far as I am concerned, it is a matter of taste which you prefer, but the second is perhaps clearer. -- Jitse Niesen (talk) 20:46, 30 August 2005 (UTC)
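
For what it is worth, the step defined by either form can be computed directly; here is a minimal NumPy sketch (the function name lm_step is only illustrative, not from any particular library):

 import numpy as np

 def lm_step(J, f, lam):
     # Solve (J^T J + lam*I) q = -J^T f for the update step q.
     # J is the (m, n) Jacobian, f the length-m residual vector, lam >= 0.
     n = J.shape[1]
     A = J.T @ J + lam * np.eye(n)      # the "lambda I" form, written out explicitly
     return np.linalg.solve(A, -J.T @ f)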

Transpose

Where f and p are introduced:

f^T = (f_1, ..., f_m), and p^T = (p_1, ..., p_n).

they are presented with superscript T, which I think indicates that they are transposed. But f extends over 1..m, and p extends over 1..n, so I don't understand why both are transposed. Can someone please explain what I am missing here? Thanks. --CarlManaster 22:49, 15 December 2005 (UTC)

I think the reasoning is that (f_1, ..., f_m) is a row vector. However, we want f to be a column vector. So the transpose is used to convert from rows to columns. It is perhaps clearer to write
f = (f_1, ..., f_m)^T, and p = (p_1, ..., p_n)^T.
To be honest, many mathematical texts do not bother to distinguish between row and column vectors, hoping that the reader can deduce from the context which one is meant.
Perhaps you know this, and your question is: why do we want to use a column vector? Well, f has to be a column vector because f^Tf further down the article should be an inner product, and p has to be a column vector because it is added to q, which is a column vector because otherwise the product Jq does not make sense. -- Jitse Niesen (talk) 23:19, 15 December 2005 (UTC)
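
Spelling out the shapes that this column-vector convention implies (just a restatement of the above):

 f = (f_1, \ldots, f_m)^T \in \mathbb{R}^m, \qquad p, q \in \mathbb{R}^n, \qquad J \in \mathbb{R}^{m \times n}
 f^T f = \sum_{i=1}^m f_i^2 \quad \text{(a scalar)}, \qquad Jq \in \mathbb{R}^m, \qquad p + q \in \mathbb{R}^n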

Improvement by Marquardt

What about the improvement Marquardt made to the algorithm? He replaced I by diag[H], i.e. the diagonal of the (approximated) Hessian, to incorporate some local curvature estimation. This makes the algorithm go further in directions of smaller gradient to get out of narrow valleys on the error surface.

last entry by G.k. 13:46, 27 April 2006 (UTC)

I believe this is correct. It also has the important practical benefit of making λ dimensionless, so λ=1 is (I think) a sensible general starting point.

A way to address this might be to add a section called (say) "Improved solution" after the "Choice of damping parameter" section, with the new formula. The original "solution" section and this new one should probably be better referenced as well. Thoughts? I may give it a shot soon. Baccyak4H (Yak!) 18:29, 25 July 2007 (UTC)
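
For concreteness, a minimal NumPy sketch of the diag[H] damping described above, with H approximated by J^TJ; the function name is only illustrative:

 import numpy as np

 def marquardt_step(J, f, lam):
     # Damp with the diagonal of the approximate Hessian instead of lam*I:
     #   (J^T J + lam * diag(J^T J)) q = -J^T f
     # The damping is scaled by the local curvature in each parameter
     # direction, and lam becomes dimensionless.
     JTJ = J.T @ J
     D = np.diag(np.diag(JTJ))          # diag[H] with H approximated by J^T J
     return np.linalg.solve(JTJ + lam * D, -J.T @ f)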

I believe Marquardt's suggested improvement was actually to scale the approximate Hessian matrix, J^TJ. The scaling had an effect similar to replacing the identity matrix with the diagonal of the Hessian approximation. Marquardt suggests scaling the Hessian approximation by an amount that makes the diagonal elements equal to one. Adding a constant diagonal matrix to the scaled matrix is similar to adding a proportion of the diagonal elements to the unscaled matrix. The scaling applied to the other elements of the Hessian approximation improves the condition of the matrix. I suggest the following:

 q = \Sigma_J[\hat{J}^T\hat{J} + \lambda{}I]^{-1}\hat{J}^T [y - f(p)]

where \Sigma_J is the square diagonal matrix:

 \Sigma_J = [\mbox{diag}[J^TJ]]^{-\frac{1}{2}}
 

and the scaled Jacobian, \hat{J}, is:

 \hat{J} = J\Sigma_J

The square matrix \hat{J}^T\hat{J} is then a scaled version of the approximate Hessian J^TJ, in which element (i, j) is divided by the square root of the product of the i-th and j-th diagonal elements of J^TJ (i.e. by the Euclidean norms of the corresponding columns of J). The result of the scaling is ones on all the diagonal elements of \hat{J}^T\hat{J}; a short numerical sketch of this step is given below.

Note that more recent implementations that use the Levenberg-Marquardt tag do not include Marquardt's suggestion. It seems that clever choices of λ result in reasonable robustness and fewer function evaluations. Stephen.scott.webb (talk) 22:29, 4 February 2008 (UTC)
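
A corresponding NumPy sketch of the scaled step proposed above (the function name is only illustrative, and it assumes no column of J is identically zero):

 import numpy as np

 def scaled_lm_step(J, r, lam):
     # q = Sigma_J [Jhat^T Jhat + lam*I]^{-1} Jhat^T (y - f(p)),
     # with Sigma_J = diag(J^T J)^(-1/2) and Jhat = J Sigma_J,
     # so that Jhat^T Jhat has ones on its diagonal.
     # r is the residual vector y - f(p); assumes no column of J is all zeros.
     d = np.sqrt(np.diag(J.T @ J))                  # sqrt of diag(J^T J)
     Sigma = np.diag(1.0 / d)                       # Sigma_J
     Jhat = J @ Sigma                               # scaled Jacobian
     A = Jhat.T @ Jhat + lam * np.eye(J.shape[1])
     return Sigma @ np.linalg.solve(A, Jhat.T @ r)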

Reference to nonlinear programming

Reference to nonlinear programming seems rather less appropriate here than reference to nonlinear regression. However, the article on the latter is inadequate (but developing). Dfarrar 09:08, 10 March 2007 (UTC)

explain terminology better

A short explanation (or a link to a page with an explanation) of the terminology f(t_i|p) would be in order here. I don't think it is common usage outside of statistical groups. —Preceding unsigned comment added by 130.225.102.1 (talk) 15:26, 25 September 2007 (UTC)

A major proposal

Please see talk:least squares#A major proposal for details. Petergans (talk) 17:39, 22 January 2008 (UTC)

Choice of damping parameter

- should discuss the approach of Moré, the standard used in most production-quality numerical libraries. It tends to be both faster and more robust than earlier, more naive methods. Jheald (talk) 22:26, 30 January 2008 (UTC)