Talk:Matrix calculus


This isn't conventional differentiation. There needs to be a discussion of what differentiation with respect to x means here. Charles Matthews 22:51, 19 Apr 2005 (UTC)

Definition of differentiation with regard to a vector is now added.--Fredrik Orderud 14:59, 7 May 2005 (UTC)


This article needs work

This is certainly not standard analysis.

From what I see, the derivative of a vector with respect to a vector is nothing but the transposed Jacobian.

Also, the derivative of a scalar with respect to a vector is just the gradient.

As such, an important question is why on earth is this new calculus needed? I wonder if Fredrik Orderud can explain this.

The way the article is now, it is just a mumbo-jumbo of formulas, without a reason to exist. If it stays this way, I would think it would need to be nominated for deletion. Oleg Alexandrov 00:45, 8 May 2005 (UTC)

You are right in that it is the calculation of Jacobians and gradient vectors I'm talking about. This should of course be pointed out in the text.
The reason for creating this article was to create a list similar to "notable derivatives" in the derivative article. The formulas are important, and are used, among other places, in deriving linear estimators like the Wiener filter and Kalman filter; neither the Jacobian nor the gradient article contains any of these equations.
Suggestion: What about standardizing the notation and moving the content to the gradient, Jacobian and Trace (matrix) articles? --Fredrik Orderud 10:41, 8 May 2005 (UTC)
I see your point. And I have one remark. The Jacobian formula, the way it is written now, is not correct. The matrix should be transposed, see Jacobian.
How about writing some motivation at the beginning about why these are important? And maybe renaming this article to matrix calculus, which seems at least to me to sound better, as we are not talking about differentiating one matrix, rather about a set of rules about how to manipulate vectors and matrices when differentiating. Oleg Alexandrov 12:43, 8 May 2005 (UTC)
Sounds like good and constructive suggestions. I'll try to rename the article, write an introduction and look into the Jacobian bug. Feel free to help me improve this article. --Fredrik Orderud 13:14, 8 May 2005 (UTC)
I think this article is necessary, but needs work. As the author said, those formulas are crucial in many fundamental statistical algorithms. For example, to estimate the variance of a multivariate Gaussian pdf by maximum likelihood, one needs to be able to compute the derivatives of det(A) and inv(A) as functions of a matrix. The more rigorous definition is given in the article Fréchet derivative, but that article is difficult to find (most people who need these formulas have never heard of the Fréchet derivative, and the term is not often used in the statistical literature). Ashigabou 07:53, 31 January 2006 (UTC)

Part of the problem seems to be the interpretation of the words "vector" and "scalar". The derivative of a "vector" f = (f_1, …, f_n) with respect to another "vector" g = (g_1, …, g_m) here means (I am guessing) the matrix of partial derivatives with dimensions of the transpose of the Jacobian, where the f_i are regarded as functions of independent variables g_j. Then f is not a "vector" but rather a vector-valued function of the g_j. This leaves the interpretation of g, but I am rather lost on this point. Also, taking the derivative of a scalar (number) would just result in 0 in the classical sense. Perhaps what the author means here are scalar-valued functions on the vector space? - Gauge 06:56, 10 September 2005 (UTC)

To be more complete, I think the problem comes from the many possible definitions of differentiation once we talk about functions defined on multi-dimensional spaces and other "abstract spaces". Depending on the structure you are working in, you may be interested in Gâteaux differentiation, Fréchet differentiation, etc. For the applications mentioned here (statistics), it is the Fréchet derivative which is interesting. Ashigabou 07:53, 31 January 2006 (UTC)

I was reading a scientific article that used this notation (df/d\vec{v}), and I wanted to know what it meant. I found this page via google, and it was exactly what I was looking for. So for what it's worth, I found the article quite helpful the way it is now (14:22 Jan 11 2006).

I cleaned up the article a bit and restructured it to be a bit more "formal": first an introduction, with a link to the mathematical definition (Fréchet derivative), then a definition of the derivatives of real-, vector- and matrix-valued functions of scalars, vectors and matrices, followed by basic properties and formulas. The definitions are verbose: they could be written more concisely using the Kronecker product of two matrices, but this would complicate the reading, and thus work against the purpose of this article. The linked article on the Fréchet derivative needs to be completed too, taking this article into account. Ashigabou 02:54, 1 February 2006 (UTC)

my heavy rewrite

As usual, when I do a heavy rewrite, I will try to give you a point by point listing of structural or conceptual changes. If I were a better wikipedian, I would probably discuss changes I wanted to make before I start editing, but well, I'm not. And when the mood hits me to write, I do it.

  1. I've changed the notation a bit. A lot of things are not transposed that were, and some things are transposed that weren't.
  2. Emphasis on the directional derivatives, which are important
  3. Removed the possibility of complex matrices. Sure, that's possible, but why bother?
  4. I more or less killed the whole section about Fréchet derivative, mostly for the reason that I decided that this derivative is not a special case of the Fréchet derivative.
  5. Oh, I also think that the derivatives need to display where they're evaluated at. Since they have two domains, it doesn't quite parse if you don't.

and that's pretty much it. What do you old pros think? And what about you, ashigabou? This is your baby, after all. Oh, also, I got tired before I got to the examples section. I think some transposes in there still need to be changed. -lethe talk + 17:46, 1 February 2006 (UTC)

  • Great work! The article is much more readable and simple to follow now :) --Fredrik Orderud 00:35, 2 February 2006 (UTC)
Oh, I see that this was originally your article, not ashigabou's. My mistake. Well anyway. I'm glad you like the changes. A bit more work to do, but I think it's coming along nicely. -lethe talk + 00:49, 2 February 2006 (UTC)

inverse derivative

Does anyone want to try to convert the elegant

d\,\big({X^{-1}}\big) = -X^{-1}\cdot d\, X \cdot X^{-1}

into this notation? Arthur Rubin | (talk) 21:20, 1 February 2006 (UTC)

I suppose we needs must do. It's an important formula. -lethe talk + 00:49, 2 February 2006 (UTC)
there is also the determinant-based formula, linked to the inverse formula through the cofactor matrix. I still cannot get my head around all the details for those two cases... 133.186.47.9 05:48, 3 February 2006 (UTC)
Converting \frac{\partial A^{-1}}{\partial A}(X) =  - A^{-1}XA^{-1} looks really awkward to me. From my POV, that's where the notation using partial derivatives falls down (at least from a practical point of view). I think maybe we should add the derivatives of the inverse and the determinant using this notation instead? Ashigabou 06:21, 4 February 2006 (UTC)
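For what it's worth, the identity itself is easy to check numerically. Here is a minimal sketch (my own illustration, using NumPy; the test matrix and the perturbation direction dX are arbitrary choices) comparing a first-order finite difference of the inverse against -X^{-1} dX X^{-1}:

```python
# Numerical check (illustrative only) of d(X^{-1}) = -X^{-1} dX X^{-1}.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned test matrix
dX = rng.standard_normal((4, 4))                  # arbitrary perturbation direction
eps = 1e-6

Xinv = np.linalg.inv(X)
finite_diff = (np.linalg.inv(X + eps * dX) - Xinv) / eps
predicted = -Xinv @ dX @ Xinv

print(np.max(np.abs(finite_diff - predicted)))    # roughly 1e-6: agreement to first order
```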

Typo in the product rule?

There seems to be something wrong with the product-rule equation, since the first term does not differentiate on X:

\frac{\partial (\mathbf{YZ})}{\partial \mathbf{X}} = \frac{\partial\mathbf{Y}}{\partial \mathbf{Z}}\mathbf{X} + \mathbf{Y}\frac{\partial\mathbf{Z}}{\partial \mathbf{X}}

Shouldn't the equation be something similar to this instead?

\frac{\partial (\mathbf{YZ})}{\partial \mathbf{X}} = \frac{\partial\mathbf{Y}}{\partial \mathbf{X}}\mathbf{Z} + \mathbf{Y}\frac{\partial\mathbf{Z}}{\partial \mathbf{X}}

At least [1] seems to suggest this. --Fredrik Orderud 00:45, 2 February 2006 (UTC)

Ooops, of course you're right, that's a mistake. Lemme fix. -lethe talk + 00:49, 2 February 2006 (UTC)
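As a sanity check, here is a small numerical sketch (my own, with arbitrary example functions) of the corrected rule in the simplest case, where the differentiation variable is a scalar t, so that d(YZ)/dt = (dY/dt)Z + Y(dZ/dt):

```python
# Numerical check (illustrative only) of the product rule for a scalar argument t.
import numpy as np

def Y(t):
    return np.array([[t, t**2], [1.0, np.sin(t)]])

def Z(t):
    return np.array([[np.exp(t), 2.0], [t, t**3]])

def dY(t):
    return np.array([[1.0, 2*t], [0.0, np.cos(t)]])

def dZ(t):
    return np.array([[np.exp(t), 0.0], [1.0, 3*t**2]])

t, eps = 0.7, 1e-6
finite_diff = (Y(t + eps) @ Z(t + eps) - Y(t) @ Z(t)) / eps
product_rule = dY(t) @ Z(t) + Y(t) @ dZ(t)
print(np.max(np.abs(finite_diff - product_rule)))  # agrees to roughly 1e-6
```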

Differentiation of functions of matrices with respect to matrix

Moved from Wikipedia talk:WikiProject Mathematics. 09:33, 2 February 2006 (UTC)

I have some questions concerning this topic. Let me introduce the background: I was looking for a definition of the derivative of functions such as |A| and A^{-1} with respect to A when A is a matrix. This kind of derivative is common, for example, in statistics, when you want to estimate the mean and covariance matrices of Gaussian random variables by MLE (maximum likelihood). The problem is that wherever I looked, on Wikipedia and on the internet, the derivatives of such functions are given by formulas, without any link whatsoever between them (see for example http://www.ee.ic.ac.uk/hp/staff/www/matrix/calculus.html and http://www4.ncsu.edu/~pfackler/MatCalc.pdf, and how totally different they look when you don't know the rigorous definition). After some thought, it looks like those formulas are related to the Fréchet derivative, and I began to heavily edit the article Matrix calculus in this regard. Some people are not sure that those formulas coincide with the Fréchet derivative, and I would like to know what other people think. One problem with those formulas is that they mix the derivative, its matrix representation, etc. For example, I had a hard time understanding the derivative of tr(A) with respect to A being defined as I_n (the square identity matrix of dimension n), because I_n is not a representation matrix for a linear form (I_n and vec(I_n) are the same in those contexts, from what I understand). To sum up, I would like a search on the differentiability of functions defined on matrix spaces to point to the Fréchet derivative, to the practical formulas and where they come from, and to partial derivatives (which everybody has at least an intuitive idea about). Ashigabou 13:17, 1 February 2006 (UTC)

Not sure I understand your question, but here's my take. The derivative of a matrix with respect to another matrix is not, strictly speaking, another matrix. Instead it will be an element of some multilinear form (think tensor). If M is p×q and N is a×b then dM/dN will have pqab elements. Different authors choose to arrange these elements in specific ways as fits the application. The presentation in Fréchet derivative is probably the most accurate description. Matrix exponential is probably worth having a look at. If you have access to a decent library, it is probably worth searching out (K.V.M. Mardia, J.T. Kent and J.B. Bibby) "Multivariate Analysis", Academic Press, New York, 1979, which has quite a good take on the subject from a statistical POV. --Salix alba (talk) 14:57, 1 February 2006 (UTC)

I agree with you, but I had a hard time figuring it all out. In all the formulas available on the net, the derivative of a matrix with respect to another matrix is defined by a matrix. The meaning of this matrix is not clear; I figured out recently that it is the representation of a linear map in the canonical basis: can you confirm this (the big pqab matrix you are talking about being the representation of the derivative in the canonical basis of matrices)? My point: I think the link between the formulas of Matrix calculus, the definition of the Fréchet derivative, partial derivatives, and the definitions you can find on the internet is worth writing down somewhere. For example: expanding Fréchet derivative with some special cases in finite dimension, the equivalence between partial derivatives and the Fréchet derivative when the partial derivatives are continuous, plus cleaning up Matrix calculus, with formulas for traces, determinant, inverse, etc., with links to it from Matrix exponential, Determinant, etc. As I am new as a Wiki contributor, I would like to be sure I am not screwing things up. Ashigabou 15:47, 1 February 2006 (UTC)

Are you happy with the L(M,N) notation used in Fréchet? This is a linear map from M to N. If M is R^m and N is R^n then L(M,N) is an m×n matrix. Yes, I agree that this could all be given a better treatment. Actually deriving some of the formulas in Matrix calculus would be a big help in understanding the topic. You could try drafting something in your user space, say User:Ashigabou/Matrix calculus, if you want to play about first before editing actual articles. Doing the sums is the best way to learn. --Salix alba (talk) 16:31, 1 February 2006 (UTC)

Suppose A is a 3×3 matrix with entries {a_{i,j}}, where i and j run from 1 to 3. The trace is the real-valued function tr(A) = a_{1,1} + a_{2,2} + a_{3,3}. If we take the derivative of this expression with respect to each of the matrix entries in turn, and assemble the results into a 3×3 matrix with entries {d tr/d a_{i,j}}, then we get something that looks like an identity matrix. For any real-valued function, we can apply the same idea.
However, suppose f(A) = A^T, a function that maps A to a matrix (here, its transpose). Then we need to know how that matrix result depends on each of the entries of A. For example, we'll need the derivative of A^T with respect to a_{1,1}, which is not just a single numeric value, but a matrix of them. This means we'll need a "matrix of matrices". The formal way to describe these things is as a "tensor". --KSmrqT 00:51, 2 February 2006 (UTC)
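To make the first point concrete, here is a tiny numerical sketch (my own, with an arbitrary 3×3 test matrix) that assembles d tr(A)/d a_{i,j} entry by entry and recovers the identity matrix:

```python
# Numerical check (illustrative only): the entrywise derivative of tr(A) is the identity.
import numpy as np

n, eps = 3, 1e-7
A = np.random.default_rng(1).standard_normal((n, n))
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0                      # perturb the single entry a_{i,j}
        D[i, j] = (np.trace(A + eps * E) - np.trace(A)) / eps
print(np.round(D, 6))                      # numerically the 3x3 identity matrix
```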
(Yep, what KSmrq said). Normally, one does not differentiate with respect to a matrix; one differentiates with respect to the elements of that matrix: that is, one takes partial derivatives. That way, it's clear what basis you are working in, and it's clear how to change bases. One uses index notation to track things.
Re: the Fréchet derivative. First, this is not a "matrix derivative" in any sense. Second, it is identical to the ordinary partial derivative when the Banach space is finite-dimensional. A good homework problem is to understand how this is so. The point of Fréchet is to define a derivative for infinite-dimensional spaces, where ordinary partial derivatives are poorly defined. Fréchet is overkill for what you need, although understanding it will make you smarter. linas 00:58, 2 February 2006 (UTC)
Actually, linas, after thinking about it carefully, I've figured out what's going on here, and I think I have to disagree with you on several counts. First, the Frechet derivative allows you to take the derivative of one matrix with respect to another, so why do you say it's not a matrix derivative in any sense? Surely in some sense, it is indeed a matrix derivative. Second, the partial derivative matrix is not equivalent to the Frechet derivative even in the finite dim case. Third, I think lots of people use the Frechet derivative in finite dim space, and to say it's overkill is unfair. I think analysts probably define their limits in terms of norms, and define derivatives as a "limit over all directions", which is nothing other than the Frechet derivative for Banach spaces. Finally, why should partial derivatives be poorly defined in infinite dim space? I don't think the definition of partial derivative relies in any way, explicit or implicit, on the dimension of the space. -lethe talk + 01:13, 2 February 2006 (UTC)
I guess I was not clear. First, concerning the trace, of course you can compute the partial derivatives and form a matrix; that's what is written everywhere, and everybody understands it. But what does it 'mean'? Why does equating this derivative to zero give you hints about maxima of the trace (the point of the whole thing in my case)? I disagree with linas on the claim that Fréchet is overkill in finite dimension. For example, for the trace, using the Fréchet derivative it is straightforward that the trace itself is the derivative at any point, which leads to the identity matrix {d tr/d a_{i,j}} in the canonical basis of matrices (E_{ij} = δ_{(i,j)}, with δ being the Kronecker symbol). Without the Fréchet derivative, how can you have a general way of finding the derivative of a matrix in finite-dimensional normed vector spaces? In my opinion, from what I studied yesterday and the day before, there is a clear link between the Fréchet derivative and all the formulas found on the internet for derivatives of matrices which are functions of other matrices. There just seems to be an abuse of 'notation', because when you say that the derivative of the trace of A with respect to A is the identity matrix, it is actually the representation matrix of the differential, up to one vec operation (vec(I)^T is the representation matrix of the trace in the canonical basis for real matrices, the trace being the derivative of the trace in the Fréchet sense). Derivatives of matrices which are functions of matrices may not be matrices, but that's how they are defined everywhere in the applied statistical papers/books I have read: that's why I feel there is a need for an explanation of the relationship between the abuses of notation in statistics (and certainly in other fields) and the rigorous definition of those concepts, which I believe is easily explained in the Fréchet context. I will work on an extended Fréchet derivative article and Matrix calculus in my sandbox; this will be much clearer I think :). Ashigabou 02:16, 2 February 2006 (UTC)
Ashigabou, why don't you have a look at recent changes to Matrix calculus? In particular, I've decided since last night that the matrix derivative is most definitely not a special case of the Fréchet derivative, and is in fact more general. -lethe talk + 02:20, 2 February 2006 (UTC)
The article looks much better now, indeed. I think there is a good balance between rather ad-hoc definitions (using partial derivatives) and links to more mathematical views. Concerning the matrix derivative, I am not sure I agree 100% with you. The notation using partial derivatives is used only in applied formulas, right? Then, those formulas being used most of the time to maximize a function by computing its derivative, does this make sense when only the partial derivatives exist and the function itself is not continuous (the Hartog's theorem you are mentioning)? For example, when you read the Jacobian article, the Jacobian definition is linked to the best linear approximation, thus to the Fréchet derivative; you also began the article by saying that all functions are assumed C^1, so in that case the Fréchet derivative and the matrix of partial derivatives should be equivalent. Bear in mind I have no theoretical knowledge of multivariate calculus, so this is just my intuitive view. Also, in the following article, the Jacobian is linked to the Fréchet derivative: http://www.probability.net/WEBjacobian.pdf. This looks really clear to me, and straightforward. Why do you think the matrix derivative is more general than the Fréchet derivative (more exactly, why does it make sense to define the matrix derivative as the matrix of partial derivatives when a 'general derivative' in the Fréchet or Gâteaux sense does not exist)?
In the article Jacobian, notice the clause "If p is a point in R^n and F is differentiable at p, then its derivative is given by J_F(p)". When it says "F is differentiable", they mean something stronger than "F has all its partial derivatives". They mean "F is Fréchet differentiable". When both derivatives are defined, they are of course the same. But there are functions for which the matrix derivative is defined while the Fréchet derivative is not. Thus, they are not equivalent. But it's true; in those cases when they are all defined, the matrix derivative = Fréchet derivative = Gâteaux derivative = Jacobian. Take a look at 113: if the Fréchet derivative exists, then all partial derivatives exist. But the converse of the theorem does not hold (Hartog's function), so the two notions are not equivalent. -lethe talk + 03:55, 2 February 2006 (UTC)
What do you mean by 'take a look at 113'? I know that the existence of partial derivatives does not imply Fréchet differentiability (they also need to be continuous), but the formulas in matrix calculus belong to an applied article, not a theoretical one, so I think there should be more emphasis on the link between the linear approximation and the matrix derivative; this does not prevent us from saying that the equivalence between partial derivatives and differentiability is not always true. But without the intuitive idea of linear approximation, I don't see where the notation as a matrix of partial derivatives would come from, and also, when the Fréchet derivative exists, it becomes much easier to find formulas. I am right now editing a copy of matrix calculus in User:Ashigabou/Matrix calculus to show what I have in mind; maybe you will have time later to take a look and tell me if it makes sense and if it is worth adding. Ashigabou 05:16, 2 February 2006 (UTC)
Your article says "There is equivalence between the existence of Frechet derivative and the existence of continuous partial derivative. The continuous is essential." But the counterexample at Hartog's theorem gives a function which not only has partial derivatives, but even continuous partial derivatives, but is not differentiable. So I don't think that statement is correct. -lethe talk + 08:19, 2 February 2006 (UTC)
I don't agree about the continuity of the partial derivatives at (x,y) = (0,0)... Without computing them, you can see they are of the form "z^3/z^4". It is obvious when you draw the function.
Yeah, I'm sorry, I was wrong, you're right, the partials are not continuous. So your claim is that if the partials exist and are continuous, then the function is Fréchet differentiable. I might be willing to go for that. You know, another condition I found for a function to be Fréchet differentiable is that it have a Gâteaux derivative, that the Gâteaux derivative be linear, and that the linear map be bounded (and we know well that for linear maps, boundedness is the same as continuity). That, along with the fact that the Gâteaux derivative looks a lot like a partial derivative, make me think that it is actually the Gâteaux derivative, not the Fréchet derivative, that should be considered the formal version of our matrix derivative. -lethe talk + 09:13, 2 February 2006 (UTC)
PS, I think I'm going to copy this conversation to talk:Matrix calculus. It's getting quite long, and I think anyone here who wants to get involved already has. -lethe talk + 09:18, 2 February 2006 (UTC)
Sorry if I sound picky, but I had such a hard time really understanding everything that I would like to be sure it will be crystal clear for other Wikipedia readers. First, I think you can prove that the existence and continuity of the partial derivatives implies the Fréchet derivative quite easily by decomposing f(x+h) - f(x) as a sum of f(l_i) - f(l_{i-1}) for l_i = a + Σ h_i e_i, the e_i being the canonical basis; each f(l_i) - f(l_{i-1}) can be approximated by the partial derivative at the point l_{i-1} (using continuity of the derivative). This looks like the proof used in theorem 115 at http://www.probability.net/WEBjacobian.pdf#differentiable; I didn't check it that closely. For the bounded linear map, this is a condition in the Fréchet derivative (this is sensible, since it imposes continuity of the differential, and you get the equivalence I was talking about before). Also, using the Fréchet derivative, I found most of the basic formulas easy to derive (see a stub at User:Ashigabou/Matrix calculus). Finally, I am still not convinced about using Gâteaux and not Fréchet, for two reasons: first, the linear approximation is intuitive, and nicely generalizes the derivative in the scalar case (remember, the derivative, in applied science, is often used for maximization problems, and in this case the Gâteaux/partial interpretation is less simple than the Fréchet one from my POV). Secondly, in all the formulas of this article, the functions are C^∞, so we have the equivalence between Fréchet and partial derivatives anyway. Ashigabou 15:16, 2 February 2006 (UTC)
I would be happy if anybody more familiar with multi-dimensional analysis would take a look at my stub User:Ashigabou/Matrix calculus, chapter 6, to tell me if all this makes sense (I did it by hand, I didn't bother checking neighborhoods and such, and I don't think this is necessary here). In a few days, I will get access to a matrix analysis book; I hope this will clear things up. Ashigabou 15:16, 2 February 2006 (UTC)

Of course I would also like it if we can create an article that is also clear. I'm going to take your word for it about the proof; it doesn't seem controversial (now that I believe it). And about Gâteaux versus Fréchet: I suggested that we have a Gâteaux derivative here, but that can't be quite right, since this derivative is linear, by definition. But this derivative is weaker than the Fréchet derivative, and that certainly deserves some exploration. I don't think it's fair to just assume C^∞ everywhere. We should have a section about linear approximations and Taylor series, and for that we will of course need to assume enough differentiability. -lethe talk + 20:10, 2 February 2006 (UTC)

I think the problem is whether we can use several approaches in the same article. This article started, I think, by giving useful and common formulas for some derivatives, i.e. a very practical article. That's what it looked like when I first looked at it. I wanted to have an explanation of the reason for these definitions, hence my search about the Fréchet derivative. If this article is meant as a reference list of formulas, we don't need to bother about differentiability (for all the formulas given, the functions are C^∞ everywhere, right?). What about first explaining the formulas as Fréchet derivatives in special cases, and then saying that in general we cannot assume that, and talking about Gâteaux, etc.? Ashigabou 02:15, 3 February 2006 (UTC)

Interpretation of rank-4 tensor as a matrix

I am not sure I like the formula

\frac{\partial\mathbf{F}} {\partial\mathbf{X}}= \begin{bmatrix} \frac{\partial\mathbf{F}}{\partial X_{1,1}} & \cdots & \frac{\partial \mathbf{F}}{\partial X_{n,1}}\\ \vdots & \ddots & \vdots\\ \frac{\partial\mathbf{F}}{\partial X_{1,m}} & \cdots & \frac{\partial \mathbf{F}}{\partial X_{n,m}}\\ \end{bmatrix},

Contrary to what one would naively expect,

\frac{\partial\mathbf{X}}{\partial\mathbf{X}}

is not a big identity matrix. The definition in the article seems to disagree with the definitions in the external links, which has the numbers in a different order:

\frac{\partial\mathbf{F}} {\partial\mathbf{X}} = \frac{\partial\operatorname{vec}(\mathbf{F})}{\partial\operatorname{vec}(\mathbf{X})} = \begin{bmatrix} \frac{\partial F_{1,1}}{\partial X_{1,1}} & \cdots & \frac{\partial F_{1,1}}{\partial X_{n,m}}\\ \vdots & \ddots & \vdots \\ \frac{\partial F_{p,q}}{\partial X_{1,1}} & \cdots & \frac{\partial F_{p,q}}{\partial X_{n,m}} \end{bmatrix}.

Furthermore, it seems to me that the chain rule

\frac{\partial \mathbf{Z}} {\partial \mathbf{X}} = \frac{\partial \mathbf{Z}} {\partial \mathbf{Y}} \frac{\partial \mathbf{Y}} {\partial \mathbf{X}}

is quite hard to interpret. The "multiplication" on the right is a tensor contraction, I guess, but the notation of the whole article (specifically, the phrase that a 4-tensor can be interpreted as a matrix of matrices) suggests that it is some kind of matrix multiplication.

Am I making my concerns clear or should I go in more detail? -- Jitse Niesen (talk) 17:11, 2 February 2006 (UTC)
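For reference, here is a rough numerical sketch (my own; the example functions, sizes and the helper name vec_jacobian are arbitrary choices) of the vec-based arrangement from the external links: stack F and X column-wise and form the ordinary Jacobian ∂vec(F)/∂vec(X) by finite differences. Under that arrangement ∂X/∂X really is a big identity matrix, and the chain rule is plain matrix multiplication of Jacobians:

```python
# Illustrative sketch of the vec-based Jacobian arrangement (not the article's convention).
import numpy as np

def vec_jacobian(F, X, eps=1e-6):
    """Finite-difference Jacobian of vec(F(X)) with respect to vec(X), column-major stacking."""
    x0, f0 = X.ravel(order='F'), F(X).ravel(order='F')
    J = np.zeros((f0.size, x0.size))
    for k in range(x0.size):
        xk = x0.copy(); xk[k] += eps
        J[:, k] = (F(xk.reshape(X.shape, order='F')).ravel(order='F') - f0) / eps
    return J

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 3))

print(np.round(vec_jacobian(lambda M: M, X), 4))   # the 6x6 identity matrix, as expected

# Chain rule: with Y = f(X) and Z = g(Y), the vec-Jacobians compose by
# ordinary matrix multiplication, J_{Z,X} = J_{Z,Y} J_{Y,X}.
A = rng.standard_normal((2, 2))
f = lambda M: A @ M          # Y = A X
g = lambda M: M @ M.T        # Z = Y Y^T
Y = f(X)
lhs = vec_jacobian(lambda M: g(f(M)), X)
rhs = vec_jacobian(g, Y) @ vec_jacobian(f, X)
print(np.max(np.abs(lhs - rhs)))                   # small (finite-difference error only)
```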


I think you're being clear. I think "our" definition (in this article) looks something like this:

If Y is a by b and X is c by d then

\mathbf{Q}= \frac{\partial\mathbf{Y}}{\partial\mathbf{X}}

is a d by c matrix of a by b matrices with coefficients q_{cdab}.

If X is a scalar, then Q is "like" a matrix, and dY = Q dX.
  If Y is a scalar, then Q is "like" a matrix of shape X^T, and dY = Tr(Q × dX).
If X and Y are (column) vectors, then Q is "like" a matrix, and dY = Q × dX.
If X and Y are row vectors, we can do something similar; I don't feel like trying to write it out.

Otherwise, Q is not like a matrix.

(Please feel free to change × to a centered dot in the expressions above.)

Arthur Rubin | (talk) 19:16, 2 February 2006 (UTC)

An alternative definition, in one of the references, involves changing Q to an (a b by c d) matrix, and using matrix operations on those. The chain rule makes sense in that space, while the product rule becomes (in our space)

\frac{\partial (\mathbf{YZ})}{\partial \mathbf{X}} = \frac{\partial\mathbf{Y}}{\partial \mathbf{X}}(1 * \mathbf{Z}) + (1 * \mathbf{Y})\frac{\partial\mathbf{Z}}{\partial \mathbf{X}},

where * represents and 1 represents the matrix of all 1's of the appropriate size. Arthur Rubin | (talk) 19:37, 2 February 2006 (UTC)

lethe reply

Hi Jitse-

  1. Firstly, you're right, according to this definition,
    \frac{\partial\mathbf{X}}{\partial\mathbf{X}}
    is not the identity matrix (it's not a matrix at all). However, you can check that
    \operatorname{tr}\left(\frac{\partial\mathbf{X}}{\partial\mathbf{X}}\mathbf{Y}\right)=\mathbf{Y}
    so that it does evaluate as the identity map on matrices. It's the identity map M(n,m)→M(n,m). So I believe everything's as it should be there.
  2. Secondly, I chose a notation that differs from the external links in one regard so that I don't have transposes in some places where they do. That's why the ordering may differ in some places. The reason for this is that the distinction between vectors and dual vectors is maintained more carefully, something the external sources don't seem to worry about. In my convention, \partial\mathbf{f}/\partial x is a (column) vector and \partial f/\partial\mathbf{x} is a dual (row) vector. This is also beneficial for the evaluation maps which use the Frobenius norm of matrices. If you saw a difference in the ordering of elements other than a difference of transpose, then it's probably a mistake. Also, as far as changing
    \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}
    to
    \frac{\partial \operatorname{vec}(\mathbf{Y})}{\partial \operatorname{vec}(\mathbf{X})}
    I don't like that. It does make the derivative easier to write down; now it's just a matrix. But it completely loses the matrix flavor of the derivative. I do mention at the outset that the matrices can be treated as vectors. This construction makes that remark explicit, so perhaps both ways can be mentioned, but I think the matrix way is more in keeping with the spirit of matrix derivatives.
  3. Thirdly, the chain rule and the product rule are hard to interpret, I agree. That section needs work, as does the section on examples, which misses many important ones. I think I can fix the chain rule by including evaluation, which is pretty standard as far as chain rules go. But I'd like to have the evaluation-free version as well if possible.
  4. Lastly, you don't have to worry about my ego. This is very much a work in progress. Other things which need work: we need to incorporate the differential form notation, as Arthur suggests. We need to flesh out the relationship to other derivatives; there's more to it than I've put in the article. A better selection of examples needs to be chosen, and that section needs to be better organized. Anyway, whatever work I may have done on the article, I'm happy to see the article improve. I only worry about whether you're getting enough sleep :-) -lethe talk + 19:41, 2 February 2006 (UTC)

Thanks. It all makes sense now. My only problem now is: why? It seems to be more complicated than using partial derivatives (or tensor index notation, which is basically the same), when the chain rule becomes

\frac{\partial Z_{ij}}{\partial X_{pq}} = \sum_{mn} \frac{\partial Z_{ij}}{\partial Y_{mn}} \frac{\partial Y_{mn}}{\partial X_{pq}}.
As for why, I can assure you, I don't really know why. I would certainly never use this notation, it's a nightmare! Tensor index notation works just fine for me, and is much more flexible. This is the reason I mentioned that alternative right in the intro, though I briefly thought that was inappropriate; proponents of this notation don't need my put-downs right in the intro. Anyway, if people do use it (and apparently they do), then we need to have an article on it. -lethe talk + 21:45, 2 February 2006 (UTC)
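As an aside, the index form of the chain rule above is easy to check numerically. Here is a rough sketch (my own, with arbitrary example functions and a hypothetical helper deriv4) that builds the 4-index derivative arrays by finite differences and does the sum over m, n with einsum:

```python
# Illustrative check of dZ_ij/dX_pq = sum_mn dZ_ij/dY_mn * dY_mn/dX_pq.
import numpy as np

def deriv4(F, X, eps=1e-6):
    """4-index array D[i, j, p, q] = dF(X)[i, j] / dX[p, q], by finite differences."""
    F0 = F(X)
    D = np.zeros(F0.shape + X.shape)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            Xp = X.copy(); Xp[p, q] += eps
            D[:, :, p, q] = (F(Xp) - F0) / eps
    return D

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 3))
f = lambda M: M @ M            # Y = X X
g = lambda M: M.T @ M          # Z = Y^T Y
Y = f(X)

lhs = deriv4(lambda M: g(f(M)), X)
rhs = np.einsum('ijmn,mnpq->ijpq', deriv4(g, Y), deriv4(f, X))
print(np.max(np.abs(lhs - rhs)))   # small: both sides agree to finite-difference accuracy
```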

And I have a slight worry that you are straying a bit into original-research terrain, but I'll let you be the judge of that.

That thought occurred to me as well. It's partly a choice of notation, though I suppose even that can be considered original research. We're not here to devise new notations, but simply to report on notations used in the field.

PS: I've no problems with your using HTML tags if you feel like it, but please close them. -- Jitse Niesen (talk) 21:17, 2 February 2006 (UTC)

Oops, that's the second time you've done that for me, thanks. I use the HTML tags instead of the wiki # when there are math tags in my list items. The wiki list markup can't deal with those. But yeah, if I'm going to use them, I should close them. Sorry, and thanks for the fix. -lethe talk + 21:45, 2 February 2006 (UTC)

Hmmm, the "more complicated" thing is probably only in the eye of the beholder. -- Jitse Niesen (talk) 21:20, 2 February 2006 (UTC)

Concerning the vector notation for matrices versus the 'normal' matrix notation, I think there is more than just a notational difference between them. For example, if you differentiate tr(A) with respect to A, you get the identity matrix using lethe's notation. But if you are using the vector notation for matrices, you get a big vector with n² entries (I only consider square matrices here for the sake of my point), which can be viewed as the matrix representation, in the canonical basis of square matrices, of the linear map from square matrices of size n to the scalars, hence the trace. Do you think this view is accurate? Ashigabou 02:29, 3 February 2006 (UTC)
Also, for \frac{\partial\mathbf{X}}{\partial\mathbf{X}}, if you say this is by definition \frac{\partial vec(\mathbf{X})}{\partial vec(\mathbf{X})}, then you get a big n²×n² identity matrix, which is the matrix representation of the identity map of M_n on itself, using the canonical basis. So the vec notation and your notation are not just mere notational differences from my POV, but different objects: you are talking about the derivative at one point (i.e. the scalar \; f^'(a)), and the big matrices are the representation of the linear map t \mapsto t.f^'(a), if you go to the case of scalar functions. I don't know if what I am saying here makes sense to you? In any case, I agree with you that the vec(X) notation is not that great, at least at the beginning. I found this article, http://www4.ncsu.edu/~pfackler/MatCalc.pdf, and it is way too much overkill for many cases from my POV. Ashigabou 02:51, 3 February 2006 (UTC)

If you don't use the matrix product anywhere, then the two notations have to be the same, since a matrix is nothing other than a vector which has a special kind of product. -lethe talk + 06:47, 3 February 2006 (UTC)

Well, the problem is that you cannot get any complicated formulas from the definitions without using products (the composition and product rules). Also, when I was comparing notations, I was not talking about the notation for the definition (I agree they are essentially the same), but about the formulas given afterwards, which were for me the main problem for some time. When you say \frac{\partial\mathbf{A^t}\mathbf{A}}{\partial\mathbf{A}} = \mathbf{A^t} + \mathbf{A}, it is not obvious to see the link with the definition you gave (a definition which I agree with), because both are matrices but not of the same size (abusing the equivalence between tensor and matrix). I feel like I am explaining my point really badly. Once again, I think my small stub User:Ashigabou/Matrix calculus#Origin of the formula shows what I am talking about. Ashigabou 09:18, 3 February 2006 (UTC)
Did Lethe say \frac{\partial\mathbf{A^t}\mathbf{A}}{\partial\mathbf{A}} = \mathbf{A^t} + \mathbf{A}? I don't remember that. By the way, I commented on User talk:Ashigabou/Matrix calculus that I think there is something fishy when you compute the derivative of the inverse. -- Jitse Niesen (talk) 17:26, 3 February 2006 (UTC)
Sorry, I made a mistake in my wording. In French, you can say 'you' as a general subject; I didn't mean that Lethe wrote the equation I wrote above. Nevertheless, the above equation is accurate in the Fréchet meaning, and is compatible with the definition given here. Concerning the inverse, it is plain wrong, you are right. Do you think adding the part about Fréchet would be useful here? At least, in my case, it helped me a lot to understand all this stuff, and you can easily find the product rule with it (I don't know about the composition rule, but I would assume the proof is not difficult either). Ashigabou
One can also use 'you' in English as a general subject, but it is ambiguous (as in French). The word 'one' ('on' in French) is not ambiguous, but a bit old-fashioned.
You say that \frac{\partial\mathbf{A^t}\mathbf{A}}{\partial\mathbf{A}} = \mathbf{A^t} + \mathbf{A} is accurate in the Frechet meaning. I don't understand that. The Frechet derivative \frac{\partial\mathbf{A^t}\mathbf{A}}{\partial\mathbf{A}} is a map from M(n) to M(n), where M(n) = space of n-by-n matrices. How do I interpret the right-hand side as a map from M(n) to M(n)? I think that the natural interpretation is not the correct one. -- Jitse Niesen (talk) 12:22, 15 February 2006 (UTC)

Is the differentiation given here really different from the Fréchet derivative?

The definition using partial derivatives can be stated for functions which are not differentiable (at least not Gâteaux or Fréchet differentiable), but I cannot see any use of the definition given in this article in those cases. After having checked several references, I think that this article should really be understood in the Fréchet sense. According to the Universalis encyclopedia, the article "calcul infinitésimal à plusieurs variables" says that the Fréchet derivative is the usual derivative in Banach spaces, and particularly in finite-dimensional real vector spaces with the norm taken from the usual scalar product. The expression Fréchet derivative, still according to Universalis, is not used anymore: it is simply called the differential. I read a bit about tensors, and if I understand correctly, a tensor can be seen as the representation of a linear map between vectors: for example, a linear map from a matrix space to a matrix space is a rank-4 tensor, and a linear map from n-dimensional vectors to p-dimensional vectors is a rank-2 tensor, equivalent to a matrix. I also found a presentation of Taylor's theorem in multiple dimensions, which defines mixed partial derivatives as multilinear maps, and as such uses the Fréchet definition of the derivative: http://gold-saucer.afraid.org/math/taylor/taylor.pdf. If we invoke tensors only as linear maps, I don't think we need any tensor theory (which should be avoided here, as this is a practical article). Ashigabou 08:56, 6 February 2006 (UTC)

As far as I know, the expression "Frechet derivative" is still used in English for derivatives between vector spaces. Whether you call it a tensor or a multi-linear map depends on your background. However, I don't think that a tensor treatment is more theoretical than a "multi-linear map" treatment; actually, my guess would be that physicists would prefer tensors and mathematicians multi-linear maps. -- Jitse Niesen (talk) 12:22, 15 February 2006 (UTC)

Inconsistent definition of derivative

User:Lethe did [2] on 1 January 2006 change the definition of the derivative of a vector function with respect to another vector, from being the transpose of the Jacobian to being simply the Jacobian matrix. Subsequent edits have since transposed the results of most of the equations listed to reflect this change.

This new definition is, however, not consistent with the definition used in most textbooks, including two of the references listed in this article. This inconsistency severely limits the applicability of the formulas listed in the article for deriving solutions to many common statistical problems, such as ML parameter estimation, the Kalman filter, and MMSE estimation.

Is there a strong reason for using the current definition? --Fredrik Orderud 16:27, 29 May 2006 (UTC)

Consistency with other formulations was my motivation for changing. Basically, in linear algebra and differential geometry, by convention, vectors are represented as columns and dual vectors are represented as rows. As I recall, this is mentioned in the footnotes of one of the textbook sources, where the issue is brushed aside without giving a justification for choosing the wrong convention. The notation I chose to write this article is more consistent with other Wikipedia articles, although you can also find articles which prefer to have vectors as row vectors. Although I don't know what any of the applications you mention are, I don't understand your point about the limitations of this convention. How can a convention limit their usefulness? Do these applications have some inherent preference for row vectors? -lethe talk + 16:43, 29 May 2006 (UTC)
An example of problems caused by the different definitions is the derivative of \textbf{A}\textbf{x}, which in most estimation and pattern-recognition textbooks is equal to \textbf{A}^T. This article does, however, have \textbf{A} as the solution due to the different definition. Similar differences also occur in the equations listed for the derivatives of quadratic forms and matrix traces.
Application of this article's equations therefore leads to different results compared to the derivations found in most textbooks, which can be VERY confusing.--Fredrik Orderud 17:29, 29 May 2006 (UTC)

I didn't like your previous statement that this convention "limits the applicability" of the formulas, something which cannot be true, a formula doesn't lose validity or applicability just based on how you arrange its symbols on your paper. Nevertheless, I will admit that different notational conventions can be very confusing, and that may be a reason to switch this article over, which is certainly possible to do.

But there is indeed a strong reason to prefer the standard convention: the way matrix multiplication is defined and our convention of composing functions demands it. Let me explain what I mean. Suppose you have a column vector

\mathbf{v}= \begin{bmatrix} a\\ b \end{bmatrix}

and a dual vector

f= \begin{bmatrix} x & y\\ \end{bmatrix}

Dual vectors act on vectors to yield scalars. In this case, we have

f(\mathbf{v})= \begin{bmatrix} x & y\\ \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} = ax+by.

If, on the other hand, you take the alternate convention, with

\mathbf{v}= \begin{bmatrix} a &b \end{bmatrix}

and

f= \begin{bmatrix} x\\ y\\ \end{bmatrix}

Then you have two choices: either take a nonstandard definition of matrix multiplication (written with an asterisk) which forces

f(\mathbf{v})= \begin{bmatrix} x\\ y\\ \end{bmatrix} * \begin{bmatrix} a & b \end{bmatrix} = ax+by

(normal matrix multiplication (denoted by juxtaposition) requires that this be rather

\begin{bmatrix} x\\ y\\ \end{bmatrix} \begin{bmatrix} a & b \end{bmatrix} = \begin{bmatrix} ax & bx\\ ay & by\\ \end{bmatrix}

so this is weird). Or else you can keep normal matrix multiplication if you adopt the alternate notation for composition of functions. That is to say, instead of denoting a functional f acting on a vector v as f(v), use the notation (v)f. This results in

(\mathbf{v})f= \begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} x\\ y\\ \end{bmatrix} = ax+by

using normal matrix multiplication. Thus, you are faced with three alternatives:

  1. Use columns for your vectors (as the article currently does)
  2. Change the definition of matrix multiplication for this article (a bizarre proposition)
  3. Reverse the convention for composition by functions (there was a movement in the 60s to switch all of mathematics to this convention, and I've dabbled with it myself, but it's not very popular)

Thus you see that no matter what we do, we have a source of confusion for someone, and it's my opinion that using standard conventions of most mathematicians (that vectors be represented as columns) rather than the conventions of the guys who wrote the matrix calculus texts (where vectors are rows) represents the best solution. I suppose there is a fourth solution, which is to simply list a bunch of formulas, and simply ignore their mathematical meaning as maps. I suppose this must be what those matrix calculus text authors do? I regard this as a not very good solution. I think it will cause just as much confusion as it saves. -lethe talk + 18:16, 29 May 2006 (UTC)
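For readers who just want to see the difference in action, here is a small numerical sketch (my own, with arbitrary data) of the two conventions applied to the examples mentioned above, the derivative of Ax and of the quadratic form x^T A x; the answers under the two conventions are simply transposes of each other:

```python
# Illustrative comparison of the column-vector and row-vector conventions.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
eps = 1e-6

# Finite-difference Jacobian of f(x) = A x (column convention: rows index f, columns index x).
J = np.zeros((3, 3))
for k in range(3):
    e = np.zeros(3); e[k] = eps
    J[:, k] = (A @ (x + e) - A @ x) / eps
print(np.allclose(J, A, atol=1e-5))       # True: the article's convention gives A
print(np.allclose(J.T, A.T, atol=1e-5))   # the estimation-text convention reports A^T

# Gradient of x^T A x: the row vector x^T (A + A^T) under the article's convention,
# equivalently the column vector (A + A^T) x under the other convention.
g = np.array([((x + (eps * np.eye(3))[k]) @ A @ (x + (eps * np.eye(3))[k]) - x @ A @ x) / eps
              for k in range(3)])
print(np.allclose(g, x @ (A + A.T), atol=1e-4))   # True, up to finite-difference error
```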

I've now added a "notice" section in the article, which explains the alternative definition used in estimation theory and pattern recognition. This would at least make readers aware of the potential for confusion. --Fredrik Orderud 21:08, 29 May 2006 (UTC)
It's a good idea to include some discussion about the alternate notation. I need to stare at those other references for a while to see if they have some way of getting around the problems I listed above. I have a suspicion that what those sources do is redefine matrix multiplication (my option number 2), but they hide this fact by throwing around a lot of extra transpose symbols. Once I figure out exactly what is going on, I'll try to add something to the article to make it clear. Stay tuned. -lethe talk + 22:14, 29 May 2006 (UTC)
Great! I'll look forward to hearing back from you :-). --Fredrik Orderud 22:55, 29 May 2006 (UTC)
This article is potentially very useful to a lot of people. Thank you for working on it, guys! Could the section now called "Notice" be expanded with an explanation of why the current definition is used? Is it used anywhere but in pure math? Maybe "...within the field of estimation theory and pattern recognition" could be generalised? The reason the current definition is slightly awkward is that you often get row vectors out when you expect column vectors, which is what you usually represent your data as. An example is if you differentiate a normal distribution with respect to the mean. (You need \frac{\partial \textbf{x}^T A \textbf{x}}{\partial \textbf{x}}.) If you solve this with the currently used definitions, you get equations with row vectors. Of course, you only have to transpose the answer, but it makes it a bit harder to see the solution. I see that this is not a very compelling argument. What is very important is that it is very clear in the article that there are two ways (or more?) of defining the derivatives, why the current definition is given, and what the difference is between them. Maybe the above explanation by Lethe could be stored somewhere and linked to from the Notice? My guess is that this article will be used mostly by "non-math people". -- Nils Grimsmo 06:16, 31 May 2006 (UTC)
I will attest to the fact that this notation is not really used by pure mathematicians. This seems to be corroborated by the fact that the references are all by engineers. Thus my reasons for preferring my notation may not be very relevant to the people who would derive the most use from this article. I'm still considering what the best solution is. But the definitions currently in the article make \frac{\partial f}{\partial\mathbf{x}} a row vector, so \frac{\partial \textbf{x}^T A \textbf{x}}{\partial \textbf{x}} is also a row vector, not a column vector. -lethe talk + 15:45, 31 May 2006 (UTC)
One thing I do not understand. From Gradient#Formal_definition: By definition, the gradient is a column vector whose components are the partial derivatives of f. That is: \nabla f  = \left(\frac{\partial f}{\partial x_1 }, \dots,  \frac{\partial f}{\partial x_n }  \right). Am I missing something here? Is this not the opposite of what is currently used in this article? (BTW: Do round parentheses always mean a column vector, while square brackets mean a row vector?) -- Nils Grimsmo 08:27, 1 June 2006 (UTC)
It's either the opposite or it's the same. In the text of that article, it says "column vector", but the equation shows a row vector. In other words, the article is inconsistent, so it's hard to tell whether it contradicts or agrees with this article. I go to fix it now. And as for parentheses versus square brackets, that's simply a matter of taste, you can use whichever you like, it doesn't change the meaning. -lethe talk + 08:51, 1 June 2006 (UTC)

Matrix differential equation

Does the matrix differential equation A' = AX - XA have a solution for fixed X? --HappyCamper 18:37, 19 August 2006 (UTC)

Yes, it has a solution. I guess you want to know how to find the solution. There happens to be a nice trick for this. Start with the Ansatz
A(t) = S(t) A(0) S(t)^{-1}. \,
Differentiating this, using the formula at Matrix inverse#The derivative of the matrix inverse, gives
A'(t) = S'(t) A(0) S(t)^{-1} - S(t) A(0) S(t)^{-1} S'(t) S(t)^{-1} = S'(t) S(t)^{-1} A(t) - A(t) S'(t) S(t)^{-1}. \,
Comparing with the original differential equation, we find
S'(t) S(t)^{-1} = -X, \,
which can be solved with the matrix exponential:
S(t) = \exp(-Xt). \,
Substituting back yields the solution:
A(t) = \exp(-Xt) A_0 \exp(Xt). \,
This shows that A evolves by similarity. In particular, the eigenvalues of A are constant. -- Jitse Niesen (talk) 03:33, 20 August 2006 (UTC)
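For the curious, here is a quick numerical sketch (my own, using NumPy/SciPy with an arbitrary 3×3 example) confirming that A(t) = exp(-Xt) A_0 exp(Xt) satisfies A' = AX - XA and that the eigenvalues of A(t) stay constant:

```python
# Illustrative check of the similarity-evolution solution of A' = AX - XA.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
X = rng.standard_normal((3, 3))
A0 = rng.standard_normal((3, 3))

def A(t):
    return expm(-X * t) @ A0 @ expm(X * t)

t, eps = 0.5, 1e-6
lhs = (A(t + eps) - A(t)) / eps            # A'(t) by finite differences
rhs = A(t) @ X - X @ A(t)                  # AX - XA
print(np.max(np.abs(lhs - rhs)))           # small: the ODE is satisfied

# Evolution by similarity: eigenvalues of A(t) equal those of A0 (up to ordering/rounding).
print(np.allclose(np.sort_complex(np.linalg.eigvals(A(t))),
                  np.sort_complex(np.linalg.eigvals(A0))))
```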
Just a comment: a particular case of this is the (say normalized, dropping Planck's constant) Liouville equation for density matrices, where X is the Hamiltonian times i, and A is the density matrix. Then the solution is precisely the time evolution of an (isolated) quantum system in the Schrödinger picture. Mct mht 04:11, 20 August 2006 (UTC)