Talk:Least-squares estimation of linear regression coefficients

What the hell's wrong with math tex codes in this article? All I see are red lines!!! --138.25.80.124 01:03, 8 August 2006 (UTC)

Articles for deletion: This article was nominated for deletion on 19 February 2006. The result of the discussion was keep.

It's hard to know where to begin saying what's wrong with this truly horrible article...

Wherein we show that computing the Gauss-Markov estimation of the linear regression coefficients is exactly the same as projecting orthogonally on a subspace of linear functions.

There is no context-setting at all above. Nothing to tell the reader what subject this is on, unless the reader already knows.

The Gauss-Markov theorem states that projecting orthogonally onto a certain subspace is in a certain sense optimal if certain assumptions hold. That is explained in the article titled Gauss-Markov theorem. What, then, is different about the purpose of this article?

Given the Gauss-Markov hypothesis, we can find an explicit form for the function which lies the most closely to the dependent variable Y.

"an explicit form for the function which lies the most closely to the dependent variable Y." What does that mean?? This is one of the vaguest bits of writing I've seen in a while.

Let F be the space of all random variables (\omega,\mathcal{A})\rightarrow(\Gamma,S) such that (F,d) is a metric space.

The above is completely nonsensical. It purports to define some particular space F, but it does not. It does not say what ω, \mathcal{A}, Γ, or S is, but those need to be defined before being referred to in this way. And what possible relevance to the topic does this stipulation of F have?

We can see η as the projection of Y on the subspace G of F generated by (X_1,\cdots,X_p).

What is η?? It has not been defined. A subspace of F? F has also not been defined. What is (X_1,\cdots,X_p)? Not defined. Conventionally in this topic X_1,\cdots,X_p would be column vectors in \mathbb{R}^k and the response variable Y would also be in \mathbb{R}^k. But that jars with the idea that X_1,\cdots,X_p are in some space F of random variables, stated above.

Indeed, we know that by definition Y=\eta(X;\theta)+\varepsilon. As \varepsilon and X are supposed to be independant, we have:

How do we know that? And what does it mean? And what is X? Conventionally X would be a "design matrix", and in most accounts, X is not random, so it is trivially independent of any random variable. (And it wouldn't hurt to spell "independent" correctly.)

\mathbb{E}(Y|X)=\eta(X;\theta),

What does that have to do with independence of X and anything else, and what does this weird notation η(X;θ) mean? I have a PhD in statistics, and I can't make any sense of this.

but Y\mapsto\mathbb{E}(Y|X) is a projection!

I know a context within which that would make sense, but I don't see its relevance here. The sort of projection in Hilbert space usually contemplated when this sort of thing is asserted is really not relevant to this topic.
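(Aside, for readers following along: the projection fact presumably being alluded to is the standard L^2 one, which needs integrability assumptions the quoted text never states. A sketch of the statement:
If Y\in L^2(\Omega,\mathcal{A},P), then for every square-integrable, \sigma(X)-measurable random variable Z,
\mathbb{E}\big[(Y-\mathbb{E}[Y|X])\,Z\big]=0,
so Y\mapsto\mathbb{E}[Y|X] is the orthogonal projection of L^2(\Omega,\mathcal{A},P) onto the closed subspace L^2(\Omega,\sigma(X),P) under the inner product \langle U,V\rangle:=\mathbb{E}[UV]. None of these hypotheses appear in the quoted passage.)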

Hence, η is a projection of Y.

This is just idiotic nonsense.

We will now show this projection is orthogonal. If we consider the Euclidean scalar product between two vectors (i.e. <u,v> := u^t v), we can build a scalar product in F with <X,Y>:=\mathbb{E}[X^t Y] (it is indeed a scalar product because if \mathbb{E}\|X\|^2=0, then X = 0 almost everywhere).

User:Deimos, for $50 per hour I'll sit down with you and parse the above if you're able to do it. I will require your patience. You're writing crap.

For any X_j (1\leq j\leq p), <X_j,\varepsilon>=<X_j,Y>-<X_j,\mathbb{E}[Y|X]>=\mathbb{E}[X_j^t Y] - \mathbb{E}[X_j^t \mathbb{E}[Y|X]]=X_j^t(\mathbb{E}Y-\mathbb{E}[\mathbb{E}[Y|X]])=X_j^t(\mathbb{E}Y - \mathbb{E}Y)=0. Therefore, \varepsilon is orthogonal to G which means the projection is orthogonal.

Some of the above might make some sense, but it is very vaguely written, to say the least. One concrete thing I can suggest: Please don't write

<X_j,\varepsilon>\,

when you mean

\langle X_j,\varepsilon\rangle.\,
Therefore, X^t(\eta(X;\theta) - Y) = 0. As \eta(X;\theta) = X\theta, this equation yields X^t X\theta = X^t Y.
If X is of full rank, then so is X^t X. In that case,
\theta = (X^t X)^{-1}X^t Y. Given the realizations x and y of X and Y, we choose
\hat{\theta}=(x^t x)^{-1}x^t y and \eta(X;\hat{\theta}) = X\hat{\theta}.

Sigh..... Let's see .... I could ask why we should choose anything here.

OK, looking through this carefully has convinced me that this article is 100% worthless. Michael Hardy 23:38, 5 February 2006 (UTC)
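(Editorial aside: the formula the quoted derivation is driving at, \hat{\theta}=(X^t X)^{-1}X^t Y, is at least easy to sanity-check numerically. A minimal sketch with simulated data; the variable names and the data-generating setup are mine, not the article's.)

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with a column of ones
    theta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ theta_true + rng.normal(scale=0.1, size=n)          # simulated response

    # Normal-equation estimate, i.e. the quoted (X^t X)^{-1} X^t y.
    theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

    # Reference: numpy's least-squares solver, which minimizes ||y - X theta||^2.
    theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    assert np.allclose(theta_normal, theta_lstsq)

Both routes give the same coefficients whenever X has full column rank, which is the only case the quoted text handles anyway.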

Recent edits

After the last round of edits, it is still completely unclear what is to be proved in this article, and highly implausible that it proves anything. Michael Hardy 00:51, 9 February 2006 (UTC)

OK, I'm back for the moment. The article contains this sentence:


In this article, we provide a proof for the general expression of this estimator (as seen for example in the article regression analysis):
\widehat{\theta}_n^{LS}=(X^t X)^{-1}X^t Y

What does that mean? Does it mean that the least-squares estimator actually is that particular matrix product? If so, the proof should not involve probability, but only linear algebra. Does it mean that the least-squares estimator is the one that satisfies some list of criteria? If so which criteria? The Gauss-Markov assumptions? If it's the Gauss-Markov assumptions, then this would be a proof of the Gauss-Markov theorem. But I certainly don't think that's what it is. In the present state of the article, the reader can only guess what the writer intended to prove! Michael Hardy 03:19, 9 February 2006 (UTC)
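(For the record, if the first reading is intended — the least-squares minimizer actually is that matrix product — the probability-free argument is three lines; a sketch, assuming X has full column rank:
S(\theta)=\|Y-X\theta\|^2 = Y^t Y - 2\theta^t X^t Y + \theta^t X^t X\theta,
\nabla_\theta S(\theta) = -2X^t Y + 2X^t X\theta = 0 \iff X^t X\theta = X^t Y,
and since X^t X is invertible when X has full column rank, \widehat{\theta}=(X^t X)^{-1}X^t Y.)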

Aim of the article

I have now added to the introduction that I wish to give a motivation for the criterion optimized in least squares (seeing a regression as a projection onto a linear space of random variables) and to derive the expression of this estimator. One can differentiate the sum of squares and obtain the same result, but I think that the geometrical way of seeing the problem makes it easier to understand why we use the sum of squares (because of Pythagoras' theorem, i.e. \|Y\|^2_2=\|\eta(X;\overline{\theta})\|^2_2+\|\varepsilon(\overline{\theta})\|^2_2, where \|X\|^2_2:=\mathbb{E}[X^2]). To see the regression problem in this way requires the Gauss-Markov hypothesis (otherwise we cannot show that E(.|X) is an orthogonal projection). Regards, Deimos 28 08:56, 9 February 2006 (UTC)
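(The Pythagoras identity appealed to above does hold for the fitted values, because the residual is orthogonal to the column space of the design matrix. A minimal numerical check of the empirical analogue, with simulated data and variable names of my own choosing:)

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 100, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares coefficients
    y_hat = X @ theta_hat                          # projection of y onto the column space of X
    resid = y - y_hat

    # Fitted values and residuals are orthogonal, so the squared norms add up.
    assert np.isclose(y_hat @ resid, 0.0, atol=1e-8)
    assert np.isclose(y @ y, y_hat @ y_hat + resid @ resid)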

one bit at a time...

I'm going to dissect this slowly. The following is just the first step. The article says:

(\Omega,\mathcal{A}, P) will denote a probability space and n\in\mathbb{N}^* (called number of observations). \mathcal{B}_n will be the n-dimensional Borel algebra. \Theta\subseteq\mathbb{R} is a set of coefficients.


The response variable (or vector of observations) Y is a random variable, i.e. a measurable function Y:(\Omega,\mathcal{A})\rightarrow(\mathbb{R}^n,\mathcal{B}_n).


Let p\in\mathbb{N}^*. p is called number of factors. \forall i\in \{1,\cdots,p\}, X_i:(\Omega,\mathcal{A})\rightarrow(\mathbb{R}^n, \mathcal{B}_n) is called a factor.
\forall\theta\in\Theta^{p+1}, let \eta(X;\theta):=\theta^0 + \sum_{j=1}^p \theta^j X_j.
We define the errors \varepsilon(\theta):=Y-\eta(X;\theta) with \theta:=(\theta_0,\cdots,\theta_p)\in\Theta^{p+1}. We can now write:
\forall \theta\in\Theta, Y=\theta^0 + \sum_{j=1}^p \theta^j X_j+\varepsilon(\theta)

In simpler terms, what this says is the following:

Let Y be a random variable taking values in \mathbb{R}^n, whose components we call observations, and having expected value
\eta=\theta_0 \mathbf{1}_n + \sum_{j=1}^p \theta_j X_j,
where
  • X_j \in \mathbb{R}^n for j = 1, \ldots, p is a vector called a factor,
  • \mathbf{1}_n is a column vector whose n components are all 1, and
  • \theta_j is a scalar, for j = 0, \ldots, p.
Define the vector of errors to be \varepsilon = Y - \eta.

The first version is badly written because

  • Explicit mention of the underlying probability space, and of Borel measurability, is irrelevant clutter, occupying the reader's attention but not giving the reader anything. When, in the study of statistics, do you ever see a random vector that is not Borel-measurable? Will the fact of measurability be used in the succeeding argument? A link to expected value is quite relevant to the topic; a link to measurable function is not.
  • Saying "\Theta\subseteq\mathbb{R} is a set of coefficients" makes no sense. The coefficients are the individual components of a vector θ somewhere within this parameter space. If anything, Θ must be a subset of \mathbb{R}^p in which the unobserved vector θ is known to lie. If that subset is anything other than the whole of \mathbb{R}^p, then I think you'll have trouble making the case that least-squares estimation of θ is appropriate, since the estimate presumably should be within the parameter space;
  • The column vector of n "1"s is missing;
  • It alternates between subscripts and superscripts on the letter θ, for no apparent reason;
  • Why in the world is ε asserted to depend on θ? Later the article brings in the Gauss-Markov assumptions, which would conflict with that.
  • One should use mathematical notation when it serves a purpose, not just whenever one can. It is clearer to say "For every subset A of C" than to say "\forall A\in\mathcal{P}(C), where \mathcal{P}(C) is the set of all subsets of C."

OK, this is just one small point; the article has many similar problems, not the least of which is that its purpose is still not clear. I'll be back. Michael Hardy 00:25, 20 February 2006 (UTC)

Thanks

OK, this makes sense: I'll correct the article. Except for the "having expected value" part. The way I present it, you can always write y=\eta+\varepsilon. What the Gauss-Markov assumptions add is that there exists an optimal parameter \overline{\theta} for which \varepsilon has an expectation of 0 and that its components are independent. The advantage is that you do not have to suppose that the X_j's are constants. In the case of randomized designs, this is important. Deimos 28 12:10, 20 February 2006 (UTC)
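(A small simulation of the randomized-design point above: the factors are drawn at random on every replication, the errors are drawn independently of them, and the least-squares estimates still centre on the optimal parameter. Everything here — sample sizes, distributions, names — is invented for illustration.)

    import numpy as np

    rng = np.random.default_rng(2)
    theta_bar = np.array([1.0, -2.0, 0.5])  # the "optimal" parameter of the discussion

    estimates = []
    for _ in range(500):
        n = 200
        X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # random, not fixed, factors
        eps = rng.normal(size=n)                                    # errors drawn independently of X
        y = X @ theta_bar + eps
        estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

    # Averaged over the repeated random designs, the estimates centre on theta_bar.
    print(np.mean(estimates, axis=0).round(2))  # roughly [ 1.  -2.   0.5]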