Talk:Regression analysis

From Wikipedia, the free encyclopedia

[edit] Mergers?

A variety of mergers has been proposed. Please add to the discussion at Talk:Linear regression.

[edit] Example

Is it possible to have a better example? Not only does the equation y_i = x_i^2 + 1 look obvious, but y_i = 7.763/2 − 3.221cos(x_i) + 0.339cos(2x_i) is simpler than the result given. A regression example needs to have more data points than unknowns, leading to non-zero residuals. --Henrygb 02:08, 3 Feb 2005 (UTC)

Please feel free. I just rescued this example from function approximation, but I don't think it is the best example either. -- hike395 17:42, 3 Feb 2005 (UTC)
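Henrygb's point — that a regression example needs more data points than unknowns, so the residuals come out non-zero — can be sketched numerically. A minimal illustration (the data here are made up, not taken from the article's example), assuming Python/NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten data points but only three unknown coefficients, so the fit
# cannot be exact and the residuals are non-zero.
x = np.linspace(0.0, 3.0, 10)
y = x**2 + 1.0 + rng.normal(scale=0.1, size=x.size)  # noisy y = x^2 + 1

# Design matrix for the model theta_0 + theta_1*x + theta_2*x^2.
X = np.column_stack([np.ones_like(x), x, x**2])
theta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(theta)  # roughly (1, 0, 1)
print(rss)    # non-zero residual sum of squares
```

With n > p the least-squares fit smooths over the noise instead of reproducing it, which is exactly what an instructive regression example should show.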

[edit] OLS

Should this page mention OLS regression?

Isn't it covered already under linear regression?

Feinstein 06:15, 6 May 2006 (UTC)

[edit] Appropriate level of rigor

The maths is too rigorous, to the point where few readers will be able to understand it. You wouldn't expect to see this in a general encyclopaedia, so I think it's not appropriate for Wikipedia. Blaise 22:39, 11 October 2005 (UTC)

Also, isn't alpha reported as a decimal? Saying alpha = 5% is different from saying 0.05 because of the two decimal places... every book I have used gives alpha as a decimal (percent is only really used when 1 − alpha is considered.)

It should be reported as .05 with no 0 before the decimal because, as the APA style guide suggests, any number that does not exceed 1 does not need to have a 0 before the decimal. It is always reported as a decimal in peer-reviewed research literature. Chris53516 13:20, 11 October 2006 (UTC)

[edit] Re: Appropriate level of rigor

Hi,

I think I'm largely responsible for this. I'm aware that this is probably not the simplest way to present regression analysis. However, it is hard to find rigorous definitions for all this on the Internet. You usually get fairly vague explanations, sufficient for simple applications but closer to a user's manual than to a textbook.

[Comment: is there a reason why these definitions to which you refer need originate from the Internet? If you aren't already extensively familiar with the material, it's probably best to leave the article to someone else. -Dave]
I think you missed my point, Dave: as these definitions aren't available, I thought it would be good to make them available. In your edit summary, you say the typo you corrected was one of the smallest problems of this page: could you please be more specific? Deimos 28 13:13, 24 February 2006 (UTC)

Anyway, I think the comment is pertinent and I'll leave it to others to decide what to do with this article (whether it should be left "as is", simplified, moved to another section or deleted altogether). Obviously I'd be more inclined to keep it, but it's your encyclopedia as much as mine ;)

By the way, ordinary least squares is presented in this article...


I second that - the definition is good, but a little difficult to understand, particularly if the person accessing this page is coming from a social sciences background (i.e. poli sci, sociology), and just needs to know the significance of regression analysis, not necessarily how to do it. Could a section be inserted that explains what regression analysis can be used to prove? -sarah

Try MathWorld. —James S. 09:39, 5 January 2006 (UTC)

Hi Sarah, I was thinking of adding an example with "real" data. I think it'll be ready in a few days. Cheers! -deimos

I think the article would benefit from adding a simpler introduction. There is material further down that is more accessible. But, I think a reader without mathematical background might give up. It would be nice if someone could write a couple of sentences at the top about the general idea. --Flitzer 14:36, 4 April 2006 (UTC)

In all seriousness, who is the imagined audience for this entry? Anyone who understands measure theory won't come here to learn about OLS. Most everyone else will click away after taking one look at the notation.
—The preceding unsigned comment was added by 170.223.19.150 (talk • contribs) 14:04, April 5, 2006 (UTC)
What needs to be done is to write a sizable introductory section (not a paragraph, a section) that introduces the concept to the average reader, and then to retain the "rigor" in the sections that follow. —Lowellian (reply) 07:37, 21 April 2006 (UTC)

[edit] Regression analysis

Where is the math outcome? Using SPSS, what is the significance level? If the correlation is more than .050 regarding your independent variables, your significance is too high. What and where are the simple answers? 2+2=4. If you have this number(?), you need to look at another variable, or your hypothesis is inaccurate.

—The preceding unsigned comment was added by 69.152.245.238 (talk • contribs) 08:43, December 3, 2005 (UTC)

[edit] Clean up

I've done some cleaning up on this page: I moved a lot of the theoretical details to other articles so that people can skip them if they want to. I also added a detailed example of how to use the material presented in this article. Hope this helps. Let me know what you think about it.

Cheers,

Deimos.

Great work! I had nominated this for Wikipedia:Mathematics Collaboration of the Week, but now I'm not so sure it needs much. Thanks. —James S. 20:21, 10 January 2006 (UTC)

[edit] Problems with the second example

Hi!

I found a few problems with the second example. First of all, this is an example of an interpolation problem which is a special case of regression for n = p. In this case, the regression function fits the points exactly. There's no problem with that, except that the way it is presented in the example, it can't work.

First of all, p > n, which contradicts the hypothesis given at the beginning of the article. This means we have too many coefficients to estimate for the data at hand and therefore that the design matrix will necessarily be singular. If we do not have any more data points, we can reduce the size of the G subspace by noticing that the function we are looking for has to be even, as y(x) = y(−x) for all the data at hand (which incidentally means that half the data is redundant if we choose a trigonometric function). This means we can reduce the problem to finding (θ0, θ1, θ2, θ3) such that:

f(x) = \frac{\theta_0}{2} + \theta_1\cos(x) + \theta_2\cos(2x) + \theta_3\cos(3x)

for x = (0,1,2) and y = (1,2,5). We now have four coefficients to estimate with three data points. Therefore we still have one coefficient to get rid of: the system is under-determined. We choose θ0 = 0. We can choose any other value of θ0 if, instead of the ys, we do the regression on y − θ0. Then adding θ0 to the regression formula we will have obtained will give us a function of the requested form taking the y values for x.

We build the matrix X by putting together the column-vectors cos(x),cos(2x) and cos(3x):

X=\begin{pmatrix} 1&1&1\\ 0.54&-0.42&-0.99\\ -0.42&-0.65&0.96\\ \end{pmatrix}

I find the same values as in the example, of course: (θ1, θ2, θ3) = (4.252563, −6.130016, 2.877453)

I think this is a bad example for regression and should be removed from the article. Maybe more suited for trigonometric interpolation? -- Deimos 28 13:22, 24 February 2006 (UTC)
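The square system Deimos describes can be checked numerically. A quick sketch, assuming NumPy, using the data given above:

```python
import numpy as np

# Deimos's reduced system: theta_0 fixed at 0, three cosine terms and
# three data points, so X is square and the "fit" is exact interpolation.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 5.0])

X = np.column_stack([np.cos(x), np.cos(2 * x), np.cos(3 * x)])
theta = np.linalg.solve(X, y)

print(theta)  # approximately (4.252563, -6.130016, 2.877453)

# Interpolation, not regression: the residuals are (numerically) zero.
print(X @ theta - y)
```

The zero residuals are the tell: with as many coefficients as data points there is nothing left for least squares to average over, which supports the view that this belongs under trigonometric interpolation.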

[edit] Model II regressions

Would it be good to add model II regressions? KimvdLinde 17:05, 19 March 2006 (UTC)

Sure! What is it?
Deimos 28 20:04, 19 March 2006 (UTC)

Linear regression in which the variance of both variables is included; that is, minimisation is not along the y values alone, but along the y and x values simultaneously. The two best-known versions are major axis regression and reduced major axis regression. KimvdLinde 23:56, 19 March 2006 (UTC)
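For readers landing here, a reduced major axis fit is straightforward to sketch: the slope is sign(r)·s_y/s_x, the geometric mean of the OLS slope of y on x and the reciprocal of the OLS slope of x on y. A minimal sketch (the function name is mine, not from any particular library):

```python
import numpy as np

def rma_regression(x, y):
    """Reduced major axis (geometric mean) regression: treats the
    scatter in x and y symmetrically rather than minimising vertical
    deviations only, as ordinary least squares does."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# On exactly collinear data, OLS and RMA recover the same line.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
print(rma_regression(x, y))  # (2.0, 1.0)
```

On noisy data the RMA slope is always steeper in magnitude than the OLS slope (by a factor of 1/|r|), which is the practical difference the model II literature is concerned with.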

[edit] Gauss-Markov assumptions: what does V stand for?

Under "Gauss-Markov assumptions", it is unclear what V stands for: I guess it is the derivative with respect to X? Do the formulae in that line mean
  1. eps_i = 0 for all i
  2. (d eps_i)/(dX_j) = sigma^2 * (kroneckerdelta)_ij ?
Thank you.

—The preceding unsigned comment was added by 62.206.54.162 (talk • contribs) 03:31, March 27, 2006 (UTC)

The \mathbb{V} stands for "variance". It is not a derivative (nowhere in the article is \vec{\varepsilon} assumed to be differentiable with respect to any variable). The formulae mean:
  • \int_{\Omega}\vec{\varepsilon}(\omega)\,dP(\omega)=\vec{0} (vectors of size n)
  • \forall (i,j)\in[\![1,n]\!]^2, \int_{\Omega}\varepsilon_i(\omega)\varepsilon_j(\omega)\,dP(\omega)=\sigma^2 \delta_{ij}
where \vec{\varepsilon}:\Omega\rightarrow \mathbb{R}^n is a random variable on the probability space (\Omega,P), with \vec{\varepsilon}=(\varepsilon_1,\cdots,\varepsilon_n).
Regards,
--
Deimos 28 09:20, 27 March 2006 (UTC)
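The two conditions above (zero mean and covariance σ²I) can also be checked by simulation. A small sketch, assuming Gaussian errors purely for convenience (the Gauss-Markov assumptions themselves do not require normality):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, trials = 4, 2.0, 200_000

# Draw many realisations of an error vector satisfying the assumptions:
# independent components, mean zero, common variance sigma^2.
eps = rng.normal(0.0, sigma, size=(trials, n))

mean_hat = eps.mean(axis=0)       # should be close to the zero vector
cov_hat = (eps.T @ eps) / trials  # should be close to sigma^2 * I

print(np.round(mean_hat, 2))
print(np.round(cov_hat, 2))
```

The off-diagonal entries of the estimated covariance hover near zero and the diagonal near σ² = 4, which is exactly the δ_ij structure in the second formula.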

[edit] Logistic regression

A related article that could really use some work (hint, hint, this is a call for help to the editors here for this collaboration of the week), especially on how exactly one goes about doing the iterative calculations, is logistic regression. —Lowellian (reply) 07:40, 21 April 2006 (UTC)

[edit] Scrap it and start again

I'm sorry but this article is so bad, so full of mistakes, that it really needs to be completely scrapped and replaced. It is riddled with errors. For example, consider the following sentence from the article:

"The simplest type of regression uses a procedure to find the correlation between a quantitative response variable and a quantitative predictor."

There are three problems with this sentence.

First, regression does not use "a procedure to find the correlation". This is nonsense. If one wants to measure the correlation between two variables, then one computes a correlation coefficient -- regression is not needed to do so. Regression can be used to estimate a mathematical model, an equation, that expresses the response variable as a function of the predictor. Note that it is very possible that the traditional correlation coefficient between two variables will be 0 even though there is a perfect mathematical (functional) relationship between the two variables.
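The closing claim in that paragraph — a correlation coefficient of 0 despite a perfect functional relationship — is easy to demonstrate with a quick sketch:

```python
import numpy as np

# y is a deterministic function of x, yet Pearson's r is exactly 0:
# the relationship is non-linear and x is symmetric about zero, so the
# positive and negative contributions to the covariance cancel.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2

r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0
```

A regression of y on x and x² would fit these points perfectly, which is the distinction between correlation and regression that the comment is drawing.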

Second, in regression, all variables are quantitative. That is, all variables must be expressed as quantities, as numbers. The mathematics of any kind of regression simply does not work using non-quantitative data. I believe the author(s) is confused about the distinction between types of quantitative data (ratio, interval, ordinal, and nominal) and the distinction between the two broad categories of data (quantitative and non-quantitative.)

Finally, I'm not sure how one goes about defining "simplest" in terms of types of regression analysis. In addition, the use of this adjective is potentially confusing since there is a type of regression analysis called "simple linear regression analysis". However, I'm not sure it's any more or less simple (in the sense of "simplest") than a model of the form Y = Xθ + ε.

These are only three examples from one short sentence. The article is full of errors such as these.

I hope that no one ever uses this article to learn about regression analysis. It really needs to be taken down before too much damage is done. It's really bad.

"I hope that no one ever uses this article to learn about regression analysis." Have no fear. The article is so poorly written that nobody understands it. --Ioannes Pragensis 18:55, 2 May 2006 (UTC)

[edit] ANOVA and ANCOVA

Predictor variables may be defined quantitatively or qualitatively (or categorical). Categorical predictors are sometimes called factors. Depending on the nature of these predictors, a different regression technique is used:

Untrue: no different regression technique is used. The different names appear to exist for historical reasons, but there is no real difference between the models or the means used to estimate them. http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf, top of page 12. Similar remarks appear in the book by Seber and Lee, 2003. Please amend.
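The equivalence this comment refers to can be demonstrated directly: a categorical factor coded as indicator columns is just a linear model, and OLS on it recovers the group means that a one-way ANOVA works with. A sketch with made-up numbers:

```python
import numpy as np

# Three groups of unequal size; "ANOVA" and dummy-variable regression
# are the same linear model, so OLS recovers the group means exactly.
y = np.array([3.0, 4.0, 5.0, 10.0, 12.0, 20.0, 22.0, 24.0])
group = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# One indicator column per group (cell-means coding, no intercept).
X = (group[:, None] == np.arange(3)).astype(float)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)  # group means: (4.0, 11.0, 22.0)
```

The different names survive for historical reasons, but the estimation machinery is identical, which is the point made in the Venables reference above.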

[edit] This page SUCKS

I don't think it's appropriate to try to cover every relevant issue under the auspices of 'regression analysis', especially considering that specific techniques are covered more comprehensively and better elsewhere on this Wiki. I'm cutting out most of this stuff because it's on other pages.

Feinstein 06:14, 6 May 2006 (UTC)

[edit] A word of support

I support the work that is being done here. It promises a useful definition of a basic concept in statistics. Often someone approaching a field from the outside (e.g. Soc Sci, Pol Sci) will be reluctant to ask an expert to explain concepts learnt on day one. In this case it helps to be able to refer to an encyclopedia. The technical maths also has its place, though perhaps could be put separately in a box. While MathWorld is probably the definitive online mathematics source, it is pitched at a much higher level. Wikipedia does have something to offer in terms of an introduction. Previous discussants should remove overly critical comments from this page and instead submit a constructive revision. Wikid 10:38, 24 May 2006 (UTC)

[edit] Height vs Weight Example

I don't know which is the first or second example discussed above, but regressing averages against each other leaves such a smooth data set that it gives first-time readers an unrealistically clean picture of most regressions. For example, how do we explain the r-square value of the resulting relationship? -Finn

This example is not the one which I mentioned earlier on this talk page. It is indeed a bit "artificial", but I thought it would be clearer to have a simple example which "works" well (i.e. is straightforward to compute and gives a good fit). It does not serve well for any statistical inference. -- Deimos 28 07:27, 6 June 2006 (UTC)

Isn't it a simple fallacy to equate the confidence interval from a regression with an essentially Bayesian estimate along the lines of "with probability 0.95, these parameters lie in this range", as is done at the end of this example? I know it's commonly enough done (and most people don't know the difference). Jdannan 07:17, 5 July 2006 (UTC)

[edit] Multicollinearity

I know this is not my area, but somehow the following statement in the section on Linear models does not make sense:

"Multicollinearity results in parameter estimates that are unbiased and consistent, but which may have relatively large variances."

The article on multicollinearity does not give this characterization, yet it seems to more accurately describe this phenomenon. Could someone who knows more than I do about Statistics please take a look at this and edit? Vonkje 21:26, 26 November 2006 (UTC)

The correct phrasing is, "unbiased, consistent, but inefficient." Wikiant 21:40, 26 November 2006 (UTC)
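Wikiant's phrasing can be illustrated by simulation: with collinear predictors the OLS slope estimates stay centred on the true value (unbiased) but their sampling spread grows (inefficient). A small Monte Carlo sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 2000

def slope_estimates(rho):
    """Monte Carlo spread of the OLS estimate of the first slope
    when the two predictors have correlation rho."""
    estimates = []
    for _ in range(trials):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true slopes are 1
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])
    return np.array(estimates)

low = slope_estimates(0.0)    # uncorrelated predictors
high = slope_estimates(0.95)  # strongly collinear predictors

# Both distributions are centred on the true slope of 1.0, but the
# collinear design produces a much larger standard deviation.
print(low.mean(), high.mean())
print(low.std(), high.std())
```

This is the "relatively large variances" from the sentence Vonkje quoted: the estimator remains unbiased and consistent, it just needs far more data to pin the coefficients down.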