Talk:Principal components analysis

The article seems terribly cluttered. In particular, I dislike the table of symbols. Sboehringer


[edit] Question on reduced-space data matrix

The article states: and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, WL:

\mathbf{Y}=\mathbf{W_L}^T\mathbf{X} = \mathbf{\Sigma_L}\mathbf{V_L}^T

I believe that the correct formula is:

\mathbf{Y}=\mathbf{X}\mathbf{V_L} = \mathbf{W_L}\mathbf{\Sigma_L}

Can anyone verify this? —The preceding unsigned comment was added by 216.113.168.141 (talkcontribs).

Afraid not. The way things are set up in the article, the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows.
With the reduced space, we want to find a smaller set of L new variables, which for each sampling preserves as much of the information as possible out of the original M variables.
So we're looking for an L x N matrix, with the same number of columns (the same number of samples), but a smaller number of rows (so each sample is described by fewer variables).
Matrix W_L is an M x L matrix, so W_L Σ_L is also an M x L matrix - not the shape we're looking for. But Σ_L V_L^T has the desired L x N shape.
Hope this helps. -- Jheald 11:02, 17 June 2006 (UTC).


Yes, that clarifies it. Thanks Jheald! I thought that the X row vectors were the sampling events and the column vectors were the variables -- since the definition of X is in fact the transpose of what I thought, everything makes sense. -- 12:33, 26 June 2006
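
For anyone else puzzling over the shapes, here is a minimal NumPy sketch of the layout Jheald describes; the matrix sizes and the data are invented purely for illustration:

<pre>
import numpy as np

M, N, L = 5, 100, 2           # M variables, N sampling events, keep L components
X = np.random.randn(M, N)     # toy data matrix: one column per sampling event
X = X - X.mean(axis=1, keepdims=True)    # centre each variable (row)

# SVD of the data matrix: X = W Sigma V^T
W, s, Vt = np.linalg.svd(X, full_matrices=False)

W_L = W[:, :L]                # M x L: first L left singular vectors
Y = W_L.T @ X                 # reduced-space data matrix

print(Y.shape)                # (L, N): fewer variables, same number of samples
print(np.allclose(Y, np.diag(s[:L]) @ Vt[:L]))   # equals Sigma_L V_L^T -- True
</pre>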

[edit] Separate articles on arg max and arg min notations

We probably need a small article on the arg max and arg min notations.

[edit] Missing crucial details

The article seems to be missing crucial details. I can't see where the actual dimension reduction is happening. Is the idea that you have several samples of the measurement vector x and you use these to estimate the expectations? 130.188.8.9 16:49, 20 Aug 2003 (UTC)

- There should now be a clue. However, the article still needs work

[edit] Plural versus singular title

Principal components analysis is better known as principal component analysis (singular). This should be the main title, with the plural form redirecting to this page (unfortunately I do not know how to do it).

I've always heard it with the plural. I have a PhD in statistics. I'm not saying the singular could never be used, but the plural is certainly the one that's frequently heard. Michael Hardy 21:18, 22 Mar 2004 (UTC)
To my knowledge, the only monograph solely dedicated to PCA is by Jolliffe and is titled "Principal component analysis". The naming issue is discussed in its introduction, and not in the way you indicate. Then again, naming issues are conventions and vary across the globe. Sboehringer
Google says: "Principal component analysis": 103,000 hits, "Principal components analysis": 46,300 hits. MH 13:48, 25 Mar 2004 (UTC)
I have that monograph and you are correct. It seems, however, that the analysis elucidates the principal components, plural, and so unless one is only interested in one principal component at a time, the plural appears to be more appropriate.

[edit] Article needs serious improvement

Moving Michael Hardy's comments to Talk:

This article needs some serious revamping, to say the least. One cannot assume without loss of generality that the expectation is zero. If the expectation were observable, one could subtract it from x and get something with zero expectation, and so no generality would be lost by this assumption. In practice the expectation is never observable, and one must consider the probability distribution of the difference between x and an estimate, based on data, of the expectation of x.

Excuse me, but that is absurd. If the mean were observable, then one could simply subtract the mean from X, getting something with zero mean, and then indeed no generality would be lost by assuming that. In practice, one must use a data-based and therefore uncertain estimate of the mean, and one must therefore consider the probability distribution of the difference between X and the estimate of the mean of X.

If I may respond --- PCA is a technique that is applied to empirical data sets. PCA eigendecomposes the maximum likelihood covariance matrix. Indeed, there is a distribution of PCA decompositions about the "true" decomposition that you would get in the infinite data limit. But, that does not make it absurd. Or rather, no more absurd than any other maximum likelihood estimate. Any ML technique will have a variance around the estimate from infinite data.
Are you objecting because ML is not mentioned in the article? Or is it something else? -- hike395 04:39, 5 May 2004 (UTC)
Something else. Several something elses. It doesn't seem like that good an article. I'll probably drastically edit it within a few months; it's on my list. Michael Hardy 16:31, 5 May 2004 (UTC)

[edit] PCR and PLS?

Would it be redundant to include some discussion of principal components regression? I don't think so, but I don't feel qualified to explain it.

It would also be nice to have a piece on Partial Least Squares. Geladi and Kowalski, Analytica Chimica Acta 185 (1986) 1-17, may serve as a starting point.

I disagree --- PLS and PCR are both forms of linear regression, which is supervised learning. PCA is density estimation, which is unsupervised learning. Very different sorts of algorithms --- hike395 04:35, 22 Mar 2005 (UTC)

[edit] PCA & Least Squares

Is PCA the same as a least squares fit? (Furthermore, is either the same as finding the principal moment of inertia of an n-dimensional body?) —BenFrantzDale 23:53, August 3, 2005 (UTC)

No. A least-squares fit minimizes (the squares of) the residuals, the vertical distances from the fit line (hyperplane) to the data. PCA minimizes the orthogonal projections to the hyperplane. (Or something like that; I don't really know what I'm talking about.) As for moments of inertia, well, physics isn't exactly my area of expertise. —Caesura(t) 18:44, 14 December 2005 (UTC)
Yes. PCA is equivalent to finding the principal axes of inertia for N point masses in m dimensions, and then throwing all but l of the new transformed co-ordinates away. It's also mathematically the same problem as Total Least Squares (errors in all variables), rather than Ordinary Least Squares (errors only in y, not x), if you can scale it so the errors in all the variables are uncorrelated and the same size. You're then finding the best l dimensional hyperplane that your data ought to sit on through the m dimensional space. The real power tool behind all of this to get a feel for is Singular Value Decomposition. PCA is just SVD applied to your data. -- Jheald 19:40, 12 January 2006 (UTC).
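
A small sketch of Jheald's point about total versus ordinary least squares, on invented 2-D data (the names, slope, and noise level are arbitrary):

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)    # toy 2-D data

# Ordinary least squares: minimises only the vertical (y) residuals
slope_ols = np.polyfit(x, y, 1)[0]

# PCA / total least squares: first principal axis of the centred data,
# which minimises the orthogonal distances to the fitted line
A = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(A, full_matrices=False)
slope_pca = Vt[0, 1] / Vt[0, 0]

print(slope_ols, slope_pca)   # the two slopes generally differ
</pre>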

[edit] Derivation of PCA

Shouldn't the constraint that we are looking for the maximum variance appear somewhere in that derivation? I cannot understand it clearly as it is right now. --Raistlin 12:49, 24 August 2005 (UTC)
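
For what it's worth, the constraint can be written out explicitly; the derivation presumably intends the maximisation

\mathbf{w}_1 = \arg\max_{\Vert\mathbf{w}\Vert = 1} \operatorname{Var}(\mathbf{w}^T\mathbf{x}) = \arg\max_{\Vert\mathbf{w}\Vert = 1} \mathbf{w}^T \mathbf{C} \mathbf{w}

which is solved by the eigenvector of the covariance matrix C with the largest eigenvalue; the next component is the same maximisation restricted to directions orthogonal to w_1, and so on.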

[edit] Conjugate transpose

and *T represents the conjugate transpose operation.

Why conjugate transpose instead of a normal transpose? Does it even work with complex numbers? Taw 04:18, 31 December 2005 (UTC)

As you probably know, conjugate transpose is a generalization of plain old transpose that allows these operations to work on complex numbers instead of just real numbers. If the source data X consists entirely of real numbers, then the conjugate operation is completely transparent, since the conjugate of a real number is the number itself. But if the source data includes complex numbers, then the conjugate operation is absolutely essential for the matrix operations to yield meaningful results. As far as I can tell, it does work on complex numbers. As an example where you might have complex numbers as source data, you might want to use PCA on the Fourier components of a real, discrete-time signal, which are in general complex. -- Metacomet 18:59, 1 January 2006 (UTC)
I have added a motivation paragraph at Conjugate_transpose#Motivation to try to show why it is so natural for the conjugate transpose to turn up, whenever the matrix you're transposing includes complex numbers. Hope it's helpful. -- Jheald 20:14, 12 January 2006 (UTC).
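
To illustrate Metacomet's point, here is a small NumPy sketch on invented complex-valued data (the sizes are arbitrary); with a plain transpose instead of the conjugate transpose, the "covariance" would not even be Hermitian:

<pre>
import numpy as np

rng = np.random.default_rng(1)
# toy complex-valued data, e.g. Fourier components of several real signals
X = np.fft.rfft(rng.normal(size=(8, 256)), axis=1)   # 8 variables, complex entries
B = X - X.mean(axis=1, keepdims=True)                # subtract each variable's mean

C = B @ B.conj().T / (B.shape[1] - 1)   # covariance via the conjugate transpose
print(np.allclose(C, C.conj().T))       # Hermitian -- True
print(np.all(np.linalg.eigvalsh(C) >= -1e-10))   # real, non-negative eigenvalues

C_wrong = B @ B.T / (B.shape[1] - 1)    # plain transpose instead
print(np.allclose(C_wrong, C_wrong.conj().T))    # not Hermitian in general -- False
</pre>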

[edit] Computation -- surely this is not the right way to go ?

The section on computation looks to make a real meal of things, IMO; and to be pretty dubious too, as regards its numerical analysis. As soon as you square the data matrix, you're going to reduce the accuracy of your SVD from double precision to single precision.

Is there any reason to prefer either of the methods in the text, compared to choosing which bits of the SVD you actually want to keep, and then just wheeling out R-SVD ? (Which I imagine is quicker, too). -- Jheald 19:05, 12 January 2006 (UTC).
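
A quick numerical sketch of the precision point, on an invented test matrix with widely spread singular values:

<pre>
import numpy as np

rng = np.random.default_rng(2)
# a 50 x 50 test matrix with singular values spanning six orders of magnitude
U, _ = np.linalg.qr(rng.normal(size=(50, 50)))
V, _ = np.linalg.qr(rng.normal(size=(50, 50)))
s_true = np.logspace(0, -6, 50)
X = U @ np.diag(s_true) @ V.T

# Route 1: singular values straight from the SVD of X
s_svd = np.linalg.svd(X, compute_uv=False)

# Route 2: eigendecomposition of the "squared" matrix X X^T
s_eig = np.sqrt(np.abs(np.linalg.eigvalsh(X @ X.T)))[::-1]

print(np.max(np.abs(s_svd - s_true) / s_true))   # small relative error
print(np.max(np.abs(s_eig - s_true) / s_true))   # much larger error on the small values
</pre>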

I agree that this article is unreadable. The lengthy "PCA algorithm" section is one of the main reasons - it is too long, and it doesn't agree with the equations in the introduction (where did we divide by N-1? why? what about the empirical standard deviations?). It doesn't even say what the output of the algorithm is, AFAICT. A5 13:32, 6 March 2006 (UTC)
I am working on improving the algorithm section to make it more readable. In the end, the section will still be quite long, because the algorithm is rather complicated and I think it is important to include enough detail so that people can actually implement it in software. After I have completed this upgrade, please make specific suggestions for further improvements. -- Metacomet 21:37, 9 March 2006 (UTC)
I am done for now. There is still more work to do, but it's a good start. Please provide comments and suggestions for improvement. Thanks. -- Metacomet 23:12, 9 March 2006 (UTC)
The improvement I would suggest is to delete the whole entire section completely, starting from the table, and then everything following it; and instead tell people to use SVD.
A standard SVD routine will be better written, better tested, faster, and more numerically stable.
IMO it is totally irresponsible for the article to be suggesting inefficient homespun routines, actually leading people away from the standard SVD routines. -- Jheald 00:03, 10 March 2006 (UTC).
I'm no expert Jheald, but I don't see what you're so worried about. Algorithms for SVD that I have seen on the WWW basically consist of the same algorithm that is listed on this PCA page, only done twice, once for the left singular vectors and once for the right. Is there some other algorithm for SVD that is much preferable? --Chinasaur 08:40, 25 May 2006 (UTC)
I am really glad that you took some time to carefully review the work that I did and make some thoughtful recommendations. Thanks for the constructive feedback. Oh yes, that is sarcasm, in case you were wondering. -- Metacomet 00:48, 10 March 2006 (UTC)
"...totally irresponsible..." Don't you think that is just a wee bit of hyperbole?
"...homespun routines..." Are you referring to calculating the mean, the standard deviation, or the covariance? No, that can't be right, those are well-known and well-established procedures from statistics. Or perhaps eigenvectors and eigenvalues? Hmmmm, those are standard routines in linear algebra. Sorting the basis vectors by energy content and keeping only the ones with the highest contribution? No, that's also a standard concept called the 80-20 rule (or Pareto's principle). I guess I just don't understand what you mean by homespun routines....
-- Metacomet 01:36, 10 March 2006 (UTC)


BTW, I am pretty sure that dividing by N-1 is correct, which means the introduction needs to be fixed, not the algorithm. The reason the algorithm needs to divide by N-1 is that it is computing the expected value of the product, not the product itself. -- Metacomet 21:50, 9 March 2006 (UTC)
I don't know anything about maths, but all the pages about the covariance matrix use N, so maybe N-1 is not so correct...? -- IC 18:48, 18 November 2006 (GMT+1)
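
Both conventions are in use: dividing by N gives the maximum-likelihood estimate of the covariance, while dividing by N-1 gives the unbiased sample estimate when the mean is itself estimated from the data. A small NumPy check on toy data, showing the two and which one np.cov uses by default:

<pre>
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(3, 10))               # 3 variables, N = 10 observations
B = B - B.mean(axis=1, keepdims=True)      # subtract each variable's mean

C_unbiased = B @ B.T / (B.shape[1] - 1)    # divide by N-1: unbiased sample covariance
C_ml       = B @ B.T / B.shape[1]          # divide by N: maximum-likelihood estimate

print(np.allclose(C_unbiased, np.cov(B)))         # np.cov uses N-1 by default -- True
print(np.allclose(C_ml, np.cov(B, bias=True)))    # bias=True switches to N -- True
</pre>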

[edit] Simplification

Could someone put one sentence at the top explaining this in layman's terms? It looks to me like a very fancy and statistically smart way to average a whole heap of data into some sort of dataset common to all of the data -- is this at all a correct impression? --Fastfission 04:07, 28 January 2006 (UTC)

[edit] Cov Matrix

If one is dealing with an M x N data set, i.e. N factors and M observations of each, the resulting covariance matrix will be N x N, not M x M.

It seems like everything from the mean vector subtraction to the covariance matrix calculation is done as if the data are organized as M rows of variables and N columns of observations. This is not properly explained in the "organizing the data" section, and is kind of the opposite of what most people would expect. I'm inclined to reverse everything. --Chinasaur 22:39, 19 May 2006 (UTC)

[edit] Cov Matrix size

The size of the covariance matrix C is still unclear. From the section "Find the eigenvectors and eigenvalues of the covariance matrix" on, it is considered to be N x N, while in the section "Find the covariance matrix" it is M x M, which I think is the right size, since the matrix B is an M x N matrix. 133.6.156.71 12:07, 6 June 2006 (UTC)

Shouldn't it read "inner product" instead of "outer product"? Computing C as the outer product B \cdot B^* would make it an M \times N \times N \times M tensor.

[edit] This isn't really working!

The first point I wondered about is "Calculate the empirical mean". I think the mean is not calculated in the right way. The mean is calculated over each of the M dimensions. Isn't that distorting the data? I think you have to take the mean over each observation (N-vector).
The second point is the size, first of the covariance matrix and then of the eigenvalue matrix. By calculating the eigenvalues you get one for each variable in the data set, so the size of this matrix should be M x M. And to reach this result, the covariance matrix must have the same size.
... Does anybody have an idea how it really works?
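
For what it's worth, here is a minimal NumPy sketch of what the algorithm section intends, assuming the article's layout of M variables in rows and N observations in columns (the sizes are invented): the mean is taken for each variable across its N observations, and the covariance matrix then comes out M x M.

<pre>
import numpy as np

M, N = 4, 50                        # M variables (rows), N observations (columns)
X = np.random.randn(M, N)

u = X.mean(axis=1, keepdims=True)   # empirical mean of each variable: an M x 1 vector
B = X - u                           # every variable now has zero mean over its N observations

C = B @ B.conj().T / (N - 1)        # covariance between the M variables
print(C.shape)                      # (M, M) -- one row and column per variable
</pre>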

[edit] What's the difference between PCA and ICA

Just wondering... This isn't clear to me from these articles. --137.215.6.53 12:18, 3 August 2006 (UTC)


[edit] Principal components analysis versus exploratory factor analysis

I suggest including a subsection discussing the differences between PCA and exploratory factor analysis. Based on my experience working in a stat lab, students/clients get them confused. Perhaps a description of the differences between PCA and EFA could be included; it could also be added to common factor analysis. Below is my understanding of the differences. I did not want to use "greek" symbols so that it may be more accessible to non-mathematicians. What do you think?

Exploratory factor analysis (EFA) and principal component analysis (PCA) may differ in their utility. The goal in using EFA is factor structure interpretation as well as data reduction (reducing a large set of variables to a smaller set of new variables), whereas the goal of PCA is usually only data reduction.

EFA is used to determine the number and the nature of latent factors which may account for a large part of the correlations among a large number of measured variables. On the other hand, PCA is used to reduce scores on a large set of observed (or measured) variables to a smaller set of linear composites of the original (or observed) variables that retain as much information as possible from the original (or observed) variables. That is, the components (linear combinations of the observed items) serve as a reduced set of the observed variables.

Moreover, the core theoretical assumptions are different for the two methods. EFA is based on the common factor model (FA), whereas PCA is not.

1. Common and unique variances

Common Factor Model (FA): Factors are latent variables that explain the covariances (or correlations) among the observed variables (items). That is, each observed item is a linear equation of the common factors (i.e., single or multiple latent factors) and one unique factor (latent construct affiliated with the observed variable). The latent factors are viewed as the causes of the observed variables.
Note: Total variance of variable = common variance + unique variance (in which, unique variance = specific + error variance).
Principal Components (PCA): In contrast, PCA does not distinguish between common and unique variances. The components are estimated to represent the variances of the observed variables in as economical a fashion as possible (i.e., in as small a number of dimensions as possible), and no latent (or common) variables underlying the observed variables need to be invoked. Instead, the principal components are optimally weighted sums of the observed variables (i.e., components are linear combinations of the observed items). So, in a sense, the observed variables are the causes of the composite variables.

2. Reproduction of observed variables

FA: Underlying factor structure tries to reproduce the correlations among the items
PCA: Composites reproduce the variances of observed variables

3. Assumption concerning communalities & the matrix type.

FA: Assumes that a variable's variance is composed of common variance and unique variance. For this reason, we analyze the matrix of correlations among measured variables with communality estimates (i.e., the proportion of variance accounted for in each variable by the rest of the variables) on the main diagonal. This matrix is called R_reduced.
Note: Principal axis factoring (PAF) = principal component analysis on R_reduced.
PCA: There is no place for unique variance; all variance is common. Hence, we analyze the matrix of correlations (Rxx) among measured variables with 1.0s (representing all of the variance of the observed variables) on the main diagonal. The variance of each measured variable is entirely accounted for by the linear combination of principal components.
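
A small numerical sketch of point 3, on invented data; the squared multiple correlation is used here as one common initial communality estimate (other choices exist):

<pre>
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 6))
data[:, 3:] += data[:, :3]                 # induce some correlation among the toy variables
R = np.corrcoef(data, rowvar=False)        # correlation matrix with 1.0s on the diagonal

# PCA: eigendecompose R as it stands (all variance treated as common)
pca_vals = np.linalg.eigvalsh(R)[::-1]

# Principal axis factoring: put communality estimates on the diagonal first;
# here the squared multiple correlations, 1 - 1/diag(R^-1)
R_reduced = R.copy()
np.fill_diagonal(R_reduced, 1 - 1 / np.diag(np.linalg.inv(R)))
paf_vals = np.linalg.eigvalsh(R_reduced)[::-1]

print(pca_vals.sum(), paf_vals.sum())   # the reduced matrix analyses less total variance
</pre>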

See also factor analysis.

(please bear with me, I am new to using Wikipedia).

RicoStatGuy 15:53, Sept 30, 2006(UTC)

[edit] Orthogonality of components

According to this PDF, the eigenvectors of a covariance matrix are orthogonal. The eigenvectors of an arbitrary matrix are not necessarily orthogonal, as seen in the leading picture on the eigenvector page. So what gives? Why are these eigenvectors necessarily orthogonal? —Ben FrantzDale 14:44, 7 September 2006 (UTC)

According to Symmetric matrix, "Another way of stating the spectral theorem is that the eigenvectors of a symmetric matrix are orthogonal." That explains that. 128.113.54.151 20:00, 7 September 2006 (UTC)
If the multiplicity of every eigenvalue of the covariance matrix is 1, then the eigenvectors will by necessity be orthogonal.
If there exists an eigenvalue of the covariance matrix with multiplicity greater than 1, say of dimension r, then this corresponds to an r-dimensional subspace of R^n (n being the dimension of the covariance matrix). Then the corresponding eigenvectors can in principle be any basis of this subspace. But generally speaking, the basis is chosen to be orthogonal.
So to answer the question, in some cases they must be orthogonal, and in some cases they do not all have to be, but are usually chosen to be so.
On a side note, all software packages I am aware of will return orthogonal eigenvectors in the multiplicity case. I suspect that this is because the algorithms implicitly force this by recursively projecting R^n into the nullspace of the most recent eigenvector, or something equivalent. Baccyak4H (talk) 17:56, 20 November 2006 (UTC)
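
A one-line check of the point above, on invented data: for a real symmetric covariance matrix, an eigendecomposition routine designed for symmetric input returns an orthonormal set of eigenvectors.

<pre>
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 200))          # 5 variables, 200 observations
C = np.cov(X)                          # real symmetric covariance matrix, 5 x 5

vals, vecs = np.linalg.eigh(C)         # eigh assumes symmetric/Hermitian input
print(np.allclose(vecs.T @ vecs, np.eye(5)))   # columns are orthonormal -- True
</pre>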

[edit] Abdi

The following comes from my talk page:

" Hi Brusegadi, I deleted some references in principal component analysis related to an author called Abdi because they are neither standard references nor really related to the basic and strong theory expected of the article. Unless you can prove otherwise, I will take steps such that these references would not appear ever again. Can't you understand that this author is self-promoting or may be he is someone dear to you or you are Abdi. Mind your words! PCA_Hero "

PCA_Hero, the reason why I deleted your edit is because it looked like a case of blanking. You could have been able to tell that by the message I left on your anon talk page (note that this is the standard message prescribed by wiki for blanking vandals). I see much blanking from anons. Had you provided an explanation like the one given on my talk page AFTER everything had happened on the ARTICLE'S talk page, I would NOT have deleted that. Something we can both learn is WP:AGF, since I accused you of blanking and you accused me of promoting spam. The difference is that when you look at your contributions, you see very few (as of today), but when you look at mine you see many, from a broad range of topics. I will also state that I do not appreciate your tone! Mind OUR assumptions! Brusegadi 19:51, 15 November 2006 (UTC)