Talk:Matrix calculus

Is the differentiation given here really different from the Fréchet derivative?

The definition using partial derivatives can be written down for functions that are not differentiable (not even Gateaux or Fréchet differentiable), but I cannot see any use of this article's definition in those cases. After checking several references, I think this article should really be understood in the Fréchet sense. According to the Universalis Encyclopedia, the article "calcul infinitésimal à plusieurs variables" (multivariable infinitesimal calculus) says that the Fréchet derivative is the usual derivative in Banach spaces, and in particular in finite-dimensional real vector spaces with the norm induced by the usual scalar product. The expression "Fréchet derivative", still according to Universalis, is no longer used: it is simply called the differential. I have read a bit about tensors, and if I understand correctly, a tensor can be seen as the representation of a linear map between vector spaces: for example, a linear map from a matrix space to a matrix space is a rank-4 tensor, and a linear map from n-dimensional vectors to p-dimensional vectors is a rank-2 tensor, equivalent to a matrix. I also found a presentation of Taylor's theorem in several dimensions which defines the mixed partial derivatives as multilinear maps, and thus uses the Fréchet definition of the derivative: http://gold-saucer.afraid.org/math/taylor/taylor.pdf. If we invoke tensors only as linear maps, I don't think we need any tensor theory (which should be avoided here, as this is a practical article). Ashigabou 08:56, 6 February 2006 (UTC)
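For concreteness, a minimal sketch of the finite-dimensional Fréchet definition being referred to (the identification with the Jacobian matrix is standard and is added here only for illustration): for f\colon \mathbb{R}^n \to \mathbb{R}^p, the derivative at \mathbf{x} is the linear map Df(\mathbf{x})\colon \mathbb{R}^n \to \mathbb{R}^p satisfying
f(\mathbf{x}+\mathbf{h}) = f(\mathbf{x}) + Df(\mathbf{x})\,\mathbf{h} + o(\lVert\mathbf{h}\rVert), \qquad [Df(\mathbf{x})]_{ij} = \frac{\partial f_i}{\partial x_j},
i.e. it is represented by the p \times n Jacobian matrix, which is why the partial-derivative formulas in this article can be read in the Fréchet sense.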

As far as I know, the expression "Frechet derivative" is still used in English for derivatives between vector spaces. Whether you call it a tensor or a multi-linear map depends on your background. However, I don't think that a tensor treatment is more theoretical than a "multi-linear map" treatment; actually, my guess would be that physicists would prefer tensors and mathematicians multi-linear maps. -- Jitse Niesen (talk) 12:22, 15 February 2006 (UTC)

Inconsistent definition of derivative

In [1], on 1 January 2006, User:Lethe changed the definition of the derivative of a vector function with respect to another vector from the transpose of the Jacobian to simply the Jacobian matrix. Subsequent edits have since transposed the results of most of the equations listed to reflect this change.

This new definition is, however, not consistent with the definition used in most textbooks, including two of the references listed in this article. The inconsistency severely limits the applicability of the formulas listed in the article for deriving solutions to many common statistical problems, such as ML parameter estimation, Kalman filtering, and MMSE estimation.

Is there a strong reason for using the current definition? --Fredrik Orderud 16:27, 29 May 2006 (UTC)

Consistency with other formulations was my motivation for changing it. Basically, in linear algebra and differential geometry, by convention, vectors are represented as columns and dual vectors as rows. As I recall, this is mentioned in the footnotes of one of the textbook sources, where the issue is brushed aside without giving a justification for choosing the wrong convention. The notation I chose for this article is more consistent with other Wikipedia articles, although you can also find articles that prefer to have vectors as row vectors. I don't know what any of the applications you mention are, and I don't understand your point about the limitations of this convention. How can a convention limit its usefulness? Do these applications have some inherent preference for row vectors? -lethe talk + 16:43, 29 May 2006 (UTC)
An example of the problems caused by the different definitions is the derivative of \textbf{A}\textbf{x} with respect to \textbf{x}, which in most estimation and pattern-recognition textbooks is equal to \textbf{A}^T. This article, however, gives \textbf{A} as the solution due to the different definition. Similar differences also occur in the equations listed for the derivatives of quadratic forms and matrix traces.
Application of this article's equations therefore leads to different results compared to the derivations found in most textbooks, which can be VERY confusing.--Fredrik Orderud 17:29, 29 May 2006 (UTC)
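To make the difference explicit, here is a sketch of the two layout conventions being contrasted (this summary reflects standard usage and is not taken from either set of textbooks): for \mathbf{y} = \mathbf{A}\mathbf{x},
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{A} \quad \text{(Jacobian / numerator layout, as this article currently uses)}, \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{A}^T \quad \text{(transposed Jacobian / denominator layout, as in the estimation-theory texts).}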

I didn't like your previous statement that this convention "limits the applicability" of the formulas, something which cannot be true: a formula doesn't lose validity or applicability based on how you arrange its symbols on the page. Nevertheless, I will admit that different notational conventions can be very confusing, and that may be a reason to switch this article over, which is certainly possible to do.

But there is indeed a strong reason to prefer the standard convention: the way matrix multiplication is defined and our convention for composing functions demand it. Let me explain what I mean. Suppose you have a column vector

\mathbf{v}=\begin{bmatrix} a \\ b \end{bmatrix}

and a dual vector

f=\begin{bmatrix} x & y \end{bmatrix}

Dual vectors act on vectors to yield scalars. In this case, we have


f(\mathbf{v})=\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix}=ax+by.

If, on the other hand, you take the alternate convention, with

\mathbf{v}=\begin{bmatrix} a & b \end{bmatrix}

and

f=\begin{bmatrix} x \\ y \end{bmatrix}

Then you have two choices: either take a nonstandard definition of matrix multiplication (written with an asterisk) which forces


f(\mathbf{v})=\begin{bmatrix} x \\ y \end{bmatrix}*\begin{bmatrix} a & b \end{bmatrix}=ax+by

(normal matrix multiplication (denoted by juxtaposition) requires that this be rather


\begin{bmatrix} x \\ y \end{bmatrix}\begin{bmatrix} a & b \end{bmatrix}=\begin{bmatrix} ax & bx \\ ay & by \end{bmatrix}

so this is weird). Or else you can keep normal matrix multiplication if you adopt the alternate notation for composition of functions. That is to say, instead of denoting a functional f acting on a vector v as f(v), use the notation (v)f. This results in


(\mathbf{v})f=\begin{bmatrix} a & b \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}=ax+by

using normal matrix multiplication. Thus, you are faced with three alternatives:

  1. Use columns for your vectors (as the article currently does)
  2. Change the definition of matrix multiplication for this article (a bizarre proposition)
  3. Reverse the convention for composition of functions (there was a movement in the 60s to switch all of mathematics to this convention, and I've dabbled with it myself, but it's not very popular)

Thus you see that no matter what we do, we have a source of confusion for someone, and it's my opinion that using the standard convention of most mathematicians (that vectors be represented as columns) rather than the convention of the authors of the matrix calculus texts (where vectors are rows) is the best solution. I suppose there is a fourth solution, which is to simply list a bunch of formulas and ignore their mathematical meaning as maps. I suppose this must be what those matrix calculus text authors do? I regard this as not a very good solution; I think it will cause just as much confusion as it saves. -lethe talk + 18:16, 29 May 2006 (UTC)

I've now added a "notice" section in the article, which explains the alternative definition used in estimation theory and pattern recognition. This would at least make readers aware of the potential for confusion. --Fredrik Orderud 21:08, 29 May 2006 (UTC)
It's a good idea to include some discussion about the alternate notation. I need to stare at those other references for a while to see if they have some way of getting around the problems I listed above. I have a suspicion that what those sources do is redefine matrix multiplication (my option number 2), but they hide this fact by throwing around a lot of extra transpose symbols. Once I figure out exactly what is going on, I'll try to add something to the article to make it clear. Stay tuned. -lethe talk + 22:14, 29 May 2006 (UTC)
Great! I'll look forward to hearing back from you :-). --Fredrik Orderud 22:55, 29 May 2006 (UTC)
This article is potentially very useful to a lot of people. Thank you for working on it, guys! Could the section now called "Notice" be expanded with an explanation of why the current definition is used? Is it used anywhere but in pure math? Maybe "...within the field of estimation theory and pattern recognition" could be generalised? The reason the current definition is slightly awkward is that you often get row vectors out when you expect column vectors, which is how you usually represent your data. An example is differentiating a normal distribution with respect to the mean (you need \frac{\partial \textbf{x}^T A \textbf{x}}{\partial \textbf{x}}). If you solve this with the currently used definition, you get equations with row vectors. Of course, you only have to transpose the answer, but it makes it a bit harder to see the solution. I see that this is not a very compelling argument. What is very important is that the article makes it very clear that there are two ways (or more?) of defining the derivatives, why the current definition is given, and what the difference is between them. Maybe the above explanation by Lethe could be stored somewhere and linked to from the Notice? My guess is that this article will be used mostly by "non-math people". -- Nils Grimsmo 06:16, 31 May 2006 (UTC)
I will attest to the fact that this notation is not really used by pure mathematicians. This seems to be corroborated by the fact that the references are all by engineers. Thus my reasons for preferring my notation may not be very relevant to the people who would derive the most use from this article. I'm still considering what the best solution is. But the definitions currently in the article make \frac{\partial f}{\partial\mathbf{x}} a row vector, so \frac{\partial \textbf{x}^T A \textbf{x}}{\partial \textbf{x}} is also a row vector, not a column vector. -lethe talk + 15:45, 31 May 2006 (UTC)
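For reference, the standard result for that derivative under both conventions (the general formula, not something quoted from the article) is
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^T(\mathbf{A}+\mathbf{A}^T) \quad \text{(a row vector, with this article's current definition)} \qquad \text{or} \qquad (\mathbf{A}+\mathbf{A}^T)\,\mathbf{x} \quad \text{(a column vector, with the transposed convention)},
which reduces to 2\mathbf{A}\mathbf{x} (or 2\mathbf{x}^T\mathbf{A}) when \mathbf{A} is symmetric, as it is for a covariance matrix in the normal-distribution example above.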
One thing I do not understand: Gradient#Formal_definition says "By definition, the gradient is a column vector whose components are the partial derivatives of f. That is: \nabla f = \left(\frac{\partial f}{\partial x_1 }, \dots, \frac{\partial f}{\partial x_n } \right)." Am I missing something here? Isn't this the opposite of what is currently used in this article? (BTW: do round parentheses always mean a column vector, while square brackets mean a row vector?) -- Nils Grimsmo 08:27, 1 June 2006 (UTC)
It's either the opposite or it's the same. In the text of that article, it says "column vector", but the equation shows a row vector. In other words, the article is inconsistent, so it's hard to tell whether it contradicts or agrees with this article. I'll go fix it now. And as for parentheses versus square brackets, that's simply a matter of taste; you can use whichever you like, it doesn't change the meaning. -lethe talk + 08:51, 1 June 2006 (UTC)
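For comparison, the column-vector form of the gradient would be written
\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T,
i.e. the transpose of the row vector \frac{\partial f}{\partial \mathbf{x}} as defined in this article.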

Matrix differential equation

Does the matrix differential equation A' = AX - XA have a solution for fixed X? --HappyCamper 18:37, 19 August 2006 (UTC)

Yes, it has a solution. I guess you want to know how to find the solution. There happens to be a nice trick for this. Start with the Ansatz
 A(t) = S(t) A(0) S(t)^{-1}. \,
Differentiating this, using the formula at Matrix inverse#The derivative of the matrix inverse, gives
 A'(t) = S'(t) A(0) S(t)^{-1} - S(t) A(0) S(t)^{-1} S'(t) S(t)^{-1} = S'(t) S(t)^{-1} A(t) - A(t) S'(t) S(t)^{-1}. \,
Comparing with the original differential equation, we find
 S'(t) S(t)^{-1} = -X, \,
which can be solved with the matrix exponential:
 S(t) = \exp(-Xt). \,
Substituting back yields the solution:
 A(t) = \exp(-Xt) A_0 \exp(Xt). \,
This shows that A evolves by similarity. In particular, the eigenvalues of A are constant. -- Jitse Niesen (talk) 03:33, 20 August 2006 (UTC)
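For anyone who wants to sanity-check this numerically, here is a small sketch (the particular matrices are arbitrary random choices, and scipy.linalg.expm is used for the matrix exponential; none of this comes from the derivation above):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))    # arbitrary fixed X
A0 = rng.standard_normal((3, 3))   # arbitrary initial condition A(0)

def A(t):
    # proposed solution A(t) = exp(-Xt) A(0) exp(Xt)
    return expm(-X * t) @ A0 @ expm(X * t)

t, h = 0.7, 1e-6
lhs = (A(t + h) - A(t - h)) / (2 * h)   # central-difference estimate of A'(t)
rhs = A(t) @ X - X @ A(t)               # right-hand side AX - XA
print(np.max(np.abs(lhs - rhs)))        # should be tiny (finite-difference error only)

# A evolves by similarity, so its eigenvalues stay constant:
print(np.sort_complex(np.linalg.eigvals(A0)))
print(np.sort_complex(np.linalg.eigvals(A(0.7))))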
Just a comment: a particular case of this is the (normalized, i.e. with Planck's constant dropped) Liouville equation for density matrices, where X is the Hamiltonian times i and A is the density matrix. Then the solution is precisely the time evolution of an (isolated) quantum system in the Schrödinger picture. Mct mht 04:11, 20 August 2006 (UTC)

Product Rule Question

In general both \frac{\partial\mathbf{Y}}{\partial \mathbf{X}} and \frac{\partial\mathbf{Z}}{\partial \mathbf{X}} have 4 dimensions. Along which of the four dimensions is the multiplication performed in each of the two terms of the product rule? Since this is not clear from this article alone, maybe a note or link to another article would be useful? —The preceding unsigned comment was added by 129.82.228.14 (talk • contribs) 17:06, August 1, 2007 (UTC)

Good point. I don't have any idea how to do it, though. — Arthur Rubin | (talk) 17:26, 1 August 2007 (UTC)
If you work out the derivative at the component level, you'll see that the multiplication in the first term of the product rule is performed along the fourth dimension and in the second term along the third dimension. While this is the only way the derivative works, I agree that the notation is lacking, since anyone trying to learn from this page would not know this. Either the notation needs to show this explicitly, or a note needs to be made on the page.
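At the component level, a sketch of what the product rule for \mathbf{Y}\mathbf{Z} looks like (just the ordinary scalar product rule written out with indices, not a formula from the article):
\frac{\partial (\mathbf{Y}\mathbf{Z})_{kl}}{\partial X_{ij}} = \sum_m \left( \frac{\partial Y_{km}}{\partial X_{ij}}\, Z_{ml} + Y_{km}\, \frac{\partial Z_{ml}}{\partial X_{ij}} \right),
so in each term the summation runs over the index m shared by \mathbf{Y} and \mathbf{Z}, which corresponds to one of the four dimensions of the rank-4 arrays \frac{\partial\mathbf{Y}}{\partial \mathbf{X}} and \frac{\partial\mathbf{Z}}{\partial \mathbf{X}}.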