Talk:Statistics

From Wikipedia, the free encyclopedia

Welcome! This subject is outlined on the List of basic statistics topics. That list, along with the other Lists of basic topics, is part of a map of Wikipedia. Your help is needed to complete this map! To begin, please look over this subject's list, analyze it, improve it, and place it on your watchlist. Then join the Lists of basic topics WikiProject!

This page has been cited as a source by a media organization. The citation is in:

Kathy Lange. "Differences Between Statistics and Data Mining", DM Review (http://www.dmreview.com/), December 1, 2006.

Mathematics Portal

This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.

Mathematics rating:

B Class

Top Priority

Field: Probability and statistics

A vital article.

One of the 500 most frequently viewed mathematics articles.

Please update this rating as the article progresses, or if the rating is inaccurate.
Please add to or update the comments to suggest improvements to the article.
There seem to be too many lists of things here. More prose would help I think. Geometry guy 23:03, 9 June 2007 (UTC)

This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

Statistics was a good article, but it has been removed from the list. There are suggestions below for improving the article to meet the good article criteria. Once these are addressed, the article can be renominated. Editors may also seek a reassessment of the decision if they believe there was a mistake.

Delisted version: June 11, 2006

This article has been reviewed by the Version 1.0 Editorial Team.

This article has been selected for Version 0.5 and subsequent release versions of Wikipedia.

Additional information:
B	This article has been rated as B-Class on the assessment scale.
???	This article has not yet received an importance rating on the assessment scale.
	This article is a vital article.
	This article is one of the core set of articles every encyclopedia should have.

This page is for discussion of the article about statistics. Comments and questions about the special page about Wikipedia site statistics (number of pages, edits, etc.) should be directed to Wikipedia talk:Special pages.

Please add new comments at the bottom of this page.

1 Archives
2 Number of data points
3 Questions
4 Fallacy?
5 Need Link to Reliability (statistics) page
6 Standardized coefficient for DYK
7 What is the difference between F(x) and f(x)?
8 Name of Etymology subsection
9 Criticism
10 Note about archives
11 Merge from applied statistics
12 misconceptions
13 Statistics and Accuracy
14 Three types of lies
15 Misuse of statistics
16 Statistics As Principled Argument, by Robert P. Abelson

Archives

Archive 1 - Mostly pre-2006 (Range: 2002 - 2006)
Archive 2 - Mix of 2005 - 2006 (Range: Nov 2005 - Feb 2006)
Archive 3 - 2006 (Range: Feb 2006 - August 2006)
Archive 4 - 2006 (Range: July 2006 - Aug 2006)
Current version: 2006 (Range: Aug 2006 - )

Number of data points

I was wondering if there is a name for the statistical principle that maintains that the more data points you have, the more reliable your dataset will be... Thanks. Jefferson61345 02:30, 8 August 2007 (UTC)

Yes it's the central limit theorem. —Preceding unsigned comment added by 82.32.9.240 (talk) 20:03, 30 September 2007 (UTC)

Are there any theorems or definitions related to a small number of data points? In particular, I'm wondering if there is a definition of the term "poor statistics" (or "weak statistics"), which is sometimes used by scientists when describing the statistical analysis of experimental data sets. Usually, this term is accompanied by the statement that "more data" are needed to improve the statistics. What is the limit in number of data points below which statistics are "poor"? Are there other factors to be taken into account? Is "weak statistics" equal to "poor statistics"? --Uxh (talk) 17:27, 2 May 2008 (UTC)

Questions

Question: What is the procedure for finding the number of standard n × n Latin square designs? Question: What is the definition of a non-trivial sufficient statistic? Please answer these questions if possible. Thanks a lot. —Preceding unsigned comment added by 164.100.6.9 (talk) 05:40, 5 April 2008 (UTC)

Fallacy?

Statistics can easily be deemed a fallacy. If statistics say that kids whose parents don't talk to them about not smoking are more likely to smoke (you know the common argument), that is a fallacy. Yes, it may be a true statement, but it cannot be argued that the kids whose parents tell them not to smoke would not find smoking cool, or that the kids whose parents didn't tell them not to smoke would not decide it is disgusting. Statistics as a field tends to treat all people as equal in all regards when that is clearly not true. Not everybody can throw 49 touchdown passes in an NFL season like Peyton Manning did in 2004 or be the leading goal scorer at the Soccer World Cup. I just figured this might be an idea to consider discussing in the article, even though it may be difficult to find a decent source. 205.166.61.142 00:31, 31 August 2006 (UTC)

You make some sweeping generalizations. One of the purposes of statistics is to attempt to explain an outcome with the most explanatory variables. If a certain type of person is more likely to have a certain kind of outcome (for example, black men tend to have more cardiovascular problems), it is in the best interest of such research to treat everyone differently, not the same. Statistics such as the t-test and ANOVA often differentiate people more than treat them the same. I think your football analogy may be one of the fallacies you are talking about. Football statistics are descriptive statistics--they only describe those people to which they apply (in your case, professional football players and nobody else). Inferential statistics, such as the t-test, often group people according to like kinds based on particular variables, like incidence rate of cardiovascular health problems. Chris53516 13:43, 31 August 2006 (UTC)

Let me add to that answer in case the poser of the question returns. Statistical methods are not (correctly) used to prove cause and effect or to make claims that something is always true. Statistics is more of an art of educated guessing where mathematical methods are used to make best decisions about what is most likely or what tends to be related. In fact, built into the methods of statistics are ways of determining how likely you are to make an error in your "educated guessing". Typically, someone using statistical methods correctly will say, "I am 99% sure that these two factors (such as not smoking and parents telling the child not to smoke) are related to each other." Then qualifiers will be added. Even in that case, a good statistician wouldn't claim that one factor causes the other. It could be that both items are caused by some third, unidentified, factor. But, of course, those types of misinterpretations of statistical results are made all the time. That doesn't mean, however, that the cause and effect is not logically the best interpretation of the situation. Suppose, for example, that a large number of people get sick who mostly all ate spinach. We might make a best guess that spinach caused the illness. But really it might be something else, like a common salad dressing used by spinach lovers, or the fact that spinach stuck in their teeth chased away potential romantic relationships, leaving the spinach-eaters in a heart-sick condition which eventually led to real illness. Of course, those alternatives are ridiculous. I guess they COULD be true, but most people would go with the theory that the spinach was tainted. And even if the spinach was the problem, it could be that, for some, there was another unidentified cause. So, we are left with concluding, "Probably this is the cause most of the time." --Newideas07 21:48, 3 November 2006 (UTC)
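The "third, unidentified factor" point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variable names, the sample size, and the choice of normally distributed noise are all hypothetical and are not taken from any study mentioned in this thread. Two quantities are each driven by a shared hidden factor, with no direct link between them, yet they end up correlated.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 5000                  # hypothetical sample size

# Hidden third factor that influences both observed quantities.
hidden = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Neither observed quantity depends on the other, only on the hidden factor.
factor_a = [h + rng.gauss(0.0, 1.0) for h in hidden]
factor_b = [h + rng.gauss(0.0, 1.0) for h in hidden]

# Correlation between the two observed quantities (about 0.5 in theory).
raw_corr = pearson(factor_a, factor_b)

# Simplification: we pretend the hidden factor is known exactly and
# subtract its contribution, leaving only the independent noise terms.
resid_a = [a - h for a, h in zip(factor_a, hidden)]
resid_b = [b - h for b, h in zip(factor_b, hidden)]
adjusted_corr = pearson(resid_a, resid_b)

print(f"correlation of A and B:              {raw_corr:.3f}")
print(f"correlation after removing the cause: {adjusted_corr:.3f}")
```

In a run like this the first correlation is clearly positive and the second is near zero, which is the sense in which "related" does not by itself mean "one causes the other". In real data the hidden factor is usually not observed, so it cannot simply be subtracted out as it is here.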

Need Link to Reliability (statistics) page

This page needs links to the pages on Reliability (statistics) and Factor Analysis. I'm not sure if these should be put under Statistical Techniques or See Also. I'm also wondering if there should be a link to Cronbach's Alpha (which is one type of reliability estimate).

It seems to me that there are probably quite a few statistical techniques that are not linked from this page. Perhaps it would be helpful to create a hierarchical index of statistical techniques. I see that something like this can be done in the Table of contents. Kbarchard 22:24, 16 September 2006 (UTC)

This page is not a list of statistical topics (which we link to in the "See also" section), and not every statistical technique or estimator needs to be listed here. The ones you mention seem a bit too specialised for a general article on statistics, but could be usefully added to articles like multivariate analysis and social statistics. -- Avenue 01:34, 18 September 2006 (UTC)

Standardized coefficient for DYK

I wrote an article on Standardized coefficient, but I am no expert in statistics. If this could be quickly vetted by an editor more experienced with this field, we could have a statistical WP:DYK. -- Piotr Konieczny aka Prokonsul Piotrus | talk 20:25, 7 October 2006 (UTC)

What is the difference between F(x) and f(x)?

Can somebody please explain to me with an example the difference between F(x) and f(x) for a continuous random variable? As far as I understand f(x) is a derivative of F(x), please correct me if I am wrong, but that is not sufficient enough for understanding the whole process. Many thanks. -Chetan. —Preceding unsigned comment added by Chetanpatel13 (talk • contribs)

Those two should be interchangeable, as far as I know. By the way, use four ~ to sign with your user ID. Chris53516 17:07, 18 October 2006 (UTC)

Chris, thanks for the response, BTW they are very different. Thanks for the tip and hopefully I am doing it right this time. -- Chetan M Patel 18:24, 18 October 2006 (UTC)

How are they different? Please use 4 ~ to sign your name. It's easier than what you did. Chris53516 18:31, 18 October 2006 (UTC)

f(x) is probability density function (PDF) whereas, F(x) is cumulative distribution function (CDF). Chetan M Patel 18:58, 18 October 2006 (UTC)

The names of the functions are a convention, widely used in statistics. Perhaps a better question is: what's the difference between a PDF and a CDF? It's probably easiest to understand if you know about integration with $F(u)=\int_{x=-\infty}^x f(x) dx$ . As we are working over a continuous domain, the chance of a random variable taking a particular real value, 0.123456789 say, is zero, so it only makes sense to talk of probabilities calculated over a range of values, and it's a convention to use the range $[-\infty,x]$ giving the CDF. So yes, $f(x)={dF \over dx}$ . What is the meaning of the PDF? Well, if you consider a discrete probability distribution like the binomial distribution, then the PDF is just the probability of a particular number; here the probability of a particular number 0,1,2,3 occurring is nonzero. Furthermore, the PDF is useful for visualising the shape of a distribution: for the normal distribution it gives the familiar bell-shaped curve, whereas the CDF would be S-shaped and it's harder to see what's happening. --Salix alba (talk) 20:45, 18 October 2006 (UTC)

Correction: that should be $F(u)=\int_{x=-\infty}^u f(x) dx$ . The upper bound of integration must be u if F(u) is what you're evaluating. Michael Hardy 22:47, 18 October 2006 (UTC)

In case anyone wants a "Statistics for Dummies" explanation of all that: f(x) is the drawing of a curve that defines a certain probability density function (pattern). For example, a bell-shaped curve has an equation, f(x), and represents a situation in which falling in the middle of some range is most likely, with tapering probabilities as you go to the left or right. Most measurements of objects fall in this category. But probabilities of having x in some range are found by calculating the area under the curve. To find the area under the curve, you have to integrate f(x) to get F(x). Sometimes that is impossible or just really hard, and so approximation techniques are used instead, which is one reason why you usually get probabilities out of tables instead of using equations. There are other theoretical uses for the two functions. I'm not sure if that clarified things for anyone. --Newideas07 21:23, 3 November 2006 (UTC)

In case that didn't clarify things for some people, the 'statistics for dummies for dummies' version is that the pdf is the height of the density at a given point, whereas the cdf is the area under the curve for a range of points. For example, if we want to know the probability of a person being 5'9" tall, that's a question for a pdf (f(x)); if we want to know the probability of being 5'9" or less, that's a cdf (F(x)). Plf515 02:09, 24 November 2006 (UTC)plf515
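For anyone who prefers to see the relationship numerically, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The height distribution (mean 70 inches, standard deviation 3 inches) is an illustrative assumption and not a figure from this thread; the helper functions `integrate_pdf` and `differentiate_cdf` are likewise made up for this example. It checks that integrating f up to x gives F(x), and that the slope of F at x gives f(x).

```python
from statistics import NormalDist

# Hypothetical height distribution in inches; mean and standard deviation
# are illustrative assumptions only.
heights = NormalDist(mu=70.0, sigma=3.0)

def integrate_pdf(dist, upper, lower_sigmas=8.0, steps=20000):
    """Approximate F(upper) by midpoint-rule integration of the pdf f."""
    # Start far enough into the left tail that the omitted area is negligible.
    lower = dist.mean - lower_sigmas * dist.stdev
    width = (upper - lower) / steps
    return sum(dist.pdf(lower + (i + 0.5) * width) for i in range(steps)) * width

def differentiate_cdf(dist, x, h=1e-4):
    """Approximate f(x) by a central difference of the cdf F."""
    return (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

x = 69.0  # 5'9" expressed in inches

density_at_x = heights.pdf(x)        # f(x): height of the curve at x
prob_at_most_x = heights.cdf(x)      # F(x): area under f to the left of x
area_by_integration = integrate_pdf(heights, x)
slope_of_cdf = differentiate_cdf(heights, x)

# Probability of falling in a range is a difference of two cdf values.
prob_between = heights.cdf(70.0) - heights.cdf(68.0)

print(f"f(x)                    = {density_at_x:.6f}")
print(f"slope of F at x         = {slope_of_cdf:.6f}")
print(f"F(x)                    = {prob_at_most_x:.6f}")
print(f"integral of f up to x   = {area_by_integration:.6f}")
print(f"P(68 <= height <= 70)   = {prob_between:.6f}")
```

Note that f(x) is a density, not a probability: for a continuous variable the probability of any single exact value is zero, so in practice "being 5'9" tall" means falling in a small range around 69 inches, which is again computed from F.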

Name of Etymology subsection

Etymology here is the study of the history of the word statistics, not the history of statistics itself. The first paragraph or so of the current Etymology subsection is etymology, but the later paragraphs go beyond etymology to actual history of statistics. That's why I think there are many better, broader titles for this subsection. Or maybe I am interpreting etymology too narrowly? Joshua Davis 15:11, 21 October 2006 (UTC)

I think Etymology works, even if it does go beyond simple etymology. It's still related to the word's history. -- Chris53516 16:04, 22 October 2006 (UTC)

I agree that Etymology was not an accurate description here. I've tried to remedy the situation somewhat by moving some of this material to the Statistics Today section. I also removed a reference to Michel Foucault, which does not seem to me to belong here at all. Thefellswooper 22:06, 31 March 2007 (UTC)

Criticism

I would like to propose we change the name of this section to "The Misuse and Limitations of Statistics" or something similar as Joshua suggested. I also would like to make big revisions to it if no one is working on it or attached to it as it is. I'm a statistician (M.S.) and educator. If anyone objects or has a better idea or is already working away hard on this, speak soon or I'll do it. --Newideas07 22:04, 3 November 2006 (UTC)

I think that is a good topic, but for a separate article. There are certainly lots of abuses of statistics, but this page seems fine to me, needing only minor edits. Plf515 02:34, 24 November 2006 (UTC)plf515

I agree with the opening comment. Statistics is one of the three primary branches of mathematics (Pure, Applied and Statistics), and at the moment Pure and Applied seem to get more attention. Go for it Newideas07 David —Preceding unsigned comment added by 82.32.9.240 (talk) 20:01, 30 September 2007 (UTC)

Note about archives

I used a method that others may not like. If someone else wants to change the archive, find and copy any new comments, and begin at this page to do so: Start of archiving. Thanks for being patient while I made these archives. -- Chris53516 (Talk) 23:01, 3 November 2006 (UTC)

Merge from applied statistics

There was a suggestion at Talk:Applied statistics to merge into this article - it's only a stub but it may have some potential. I'll leave it for the statisticians here to decide. Richard001 19:53, 6 February 2007 (UTC)

Merge. In my opinion, "applied statistics" is a redundant phrase. To me it appears that statistics are often applied somehow. So, the article can be merged as a new section or integrated into this article. — Chris53516 (Talk) 20:27, 6 February 2007 (UTC)

I am not a statistician and cannot really comment on the material, so I won't "formally" vote. But the long-standing stubbiness and infrequent editing suggest a merge to me. I'd add that Mathematical statistics is similarly meager, covering nothing that isn't already covered here. Joshua R. Davis 13:54, 8 February 2007 (UTC)

Merge. I don't quite agree with Chris53516 when he asserts that "applied statistics" is pleonastic, but this article already covers the distinction between "applied statistics" and "theoretical statistics" adequately, in the introduction. I looked through the applied statistics article carefully, and in my opinion a merger is overkill. Applied statistics should simply be deleted. DavidCBryant 15:48, 8 February 2007 (UTC)

(Note. If someone deletes the page, be sure to redirect it to this article. — Chris53516 (Talk) 16:00, 8 February 2007 (UTC))

Having heard no objections, I have gone ahead and changed Applied statistics into a redirect page. Don't give up on Mathematical statistics quite yet, though. I'm trying to get hold of Dcljr, who had quite a few ideas on that score. I'm sure the theoretical article can be turned into something better pretty soon. DavidCBryant 01:50, 14 February 2007 (UTC)

misconceptions

not a statistician here but maybe the article ought to have a section addressing those. statistical mechanics has nothing to do with mathematical statistics. many areas are related to rigorous formulation of statistical mechanics: probability and analysis, topology, number theory, etc., but not statistics. i also removed the reference to "sports statistics". to call computing, say, slugging percentages or ERA's or free throw percentages doing statistics seems rather abhorrent, IMHO. Mct mht 07:09, 10 February 2007 (UTC)

Thanks for taking those (See also) links out. I concur with your decisions. Do you mean to tell me that Maxwell and Boltzmann aren't just two guys who played for the Yankees? ;^> DavidCBryant 12:36, 10 February 2007 (UTC)

who's on first base, Dave? :-) Mct mht 07:26, 11 February 2007 (UTC)

I don't mind losing "statistical mechanics", but in my view removing "sports statistics" is going too far. Sure, the routine collection of free throw percentages etc is not exactly groundbreaking statistical work, but it is a (small) part of statistics. I've seen several articles on aspects of sports statistics in reputable statistical journals. They're admittedly more common in lighter fare (e.g. the ASA's Chance magazine has a regular column titled A Statistician Reads the Sports Pages), but they demonstrate that professional statisticians view sports statistics as within their ambit. -- Avenue 03:21, 11 February 2007 (UTC)

i am certainly in no position to object if that's the consensus of professional statisticians. Mct mht 07:26, 11 February 2007 (UTC)

Statistical mechanics is indeed probabilistic mechanics, but I'd be inclined to leave the link here. Sports statistics, as pointed out, is deeper than people may realize. (There was a great article on this in the WSJ around August or Sept. of last year.) There is legitimate inferential statistics going on there, e.g. attempts to correct for the effects of luck on a player's stats. JJL 03:47, 11 February 2007 (UTC)

I don't much care if sports statistics are listed in this article. At least they're comparable (in quantity) to the other kinds of data regular statisticians deal with. But let's keep the references to physics out of the "see also" list ... the meaning of "statistics" in the context of physics and thermodynamics is substantially different from the meaning this article deals with. I guess I could say I use a result from statistical mechanics (a measurement of the ambient temperature) to "make an informed decision" (whether to wear a flannel shirt, or not). But that really seems like stretching the point, to me. Oh – what's on second, and who's on third. ;^> DavidCBryant 17:25, 11 February 2007 (UTC)

Statistics and Accuracy

Can an expert out there please discuss the topic of statistics and accuracy? For example, do statistics HAVE to be accurate? Or can statistics be a general indication of a trend, reality, etc.?

In general, the data from which statistics are derived are as accurate as the observers/experimenters/statisticians can make them. I suppose that observational errors are possible (I might think the lights are off when they're really on ... maybe I just went blind, and haven't realized that yet), but in practice observational errors are fairly rare, and easily controlled.

Even though the observations are accurate, the statistics themselves may be imprecise. In general, the larger the number of observations that can be made, the more precise the statistical estimates that emerge. This tendency of the collected data in a small sample to diverge somewhat from the true characteristics of a sampled population is analyzed, in the first instance, by the statistical variance of the data collected.

Notice that certain kinds of data (mostly relating to people's opinions, and similar subjective measurements) are inherently less reliable than the measurements that can be made in fields like chemistry and physics. Such data can easily be manipulated to reach misleading conclusions, no matter how carefully statistical procedures are carried out (for example, by asking biased questions, or by limiting the allowed responses on a questionnaire, etc.) DavidCBryant 04:33, 10 August 2007 (UTC)

Actually, to qualify as a measurement, a set of observations only has to result in a reduction in uncertainty, not necessarily elimination of uncertainty. In other words, if the accuracy is greater than the accuracy of your previous uncertain estimate, then it told you something you didn't know. I just wrote a book about it called "How to Measure Anything". Hubbardaie 22:35, 10 August 2007 (UTC)
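The remark earlier in this section, that more observations give more precise estimates, can be checked with a short simulation. This is a rough sketch under assumptions chosen for illustration: the population is standard normal, and the function name `spread_of_sample_means`, the sample sizes, and the number of trials are all hypothetical. For independent observations with standard deviation sigma, the standard deviation of the sample mean is sigma divided by the square root of n, so the spread should shrink by roughly a factor of about 3.16 each time n grows tenfold.

```python
import random
import statistics

def spread_of_sample_means(sample_size, trials=1000, seed=0):
    """Draw many samples of the given size from a standard normal
    population and return the standard deviation of their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

# Hypothetical sample sizes chosen only to show the trend.
sizes = [10, 100, 1000]
observed = {n: spread_of_sample_means(n) for n in sizes}
expected = {n: 1.0 / n ** 0.5 for n in sizes}  # sigma / sqrt(n) with sigma = 1

for n in sizes:
    print(f"n = {n:5d}  observed spread = {observed[n]:.4f}  "
          f"theory = {expected[n]:.4f}")
```

The observed spread tracks the theoretical value closely. This only describes random sampling variation; it says nothing about bias from poorly worded questions or bad measurement, which more data does not fix.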

Three types of lies

Lies, damn lies and statistics —Preceding unsigned comment added by 70.80.220.247 (talk) 14:46, 28 October 2007 (UTC)

Misuse of statistics

Currently the Misuse of statistics section contains a quote from Dennis Lindley that is not referred to in the text and has nothing to do with misuse, as far as I can tell. I think that this section is also disproportionately large (roughly 20% of the text), in danger of giving the casual reader the impression that statistics as a discipline is inherently untrustworthy or controversial. It's also loaded with weasel words.

I propose that we shorten this section dramatically and leave the details to the Misuse of statistics article (so that it's similar to the short History of statistics section, with its accompanying History of statistics article).

In fact, I think that the misuse/misinterpretation paragraph of the Overview section is itself sufficient, without a Misuse section at all, but probably I'm in the minority there? Joshua R. Davis (talk) 16:42, 20 January 2008 (UTC)

I do think that a misuse of stats. section is valuable here, and a longer article on it elsewhere is also useful. While there may be a case for some rebalancing, I am fine with the section as is. Certainly, many people coming to this page will be familiar with "lying with statistics" and with the perception that stats. is, as you say, inherently controversial, and this section both addresses that and puts it in a more formal context. I agree that the Lindley quote is misplaced here and should be (re)moved. But the current section nicely transitions from the general perception of lying with stats. to the more scientific concerns over hypothesis testing, p-values, etc. JJL (talk) 17:28, 20 January 2008 (UTC)

I agree that the section is worth having, but there's also a lot of room for improvement. I've made a few changes to the second paragraph. I think the part about hypothesis testing could be reduced to a simple statement that CIs are preferable to p-values. The Bayesian bit should either be expanded or removed; just saying it's another option, but has its own critics, gives the reader very little information. Mentioning publication bias might be useful. The paragraph on the Abelson perspective is interesting, but does it really deserve this much prominence? -- Avenue (talk) 23:40, 20 January 2008 (UTC)

I have tried to make the section more concise in a manner compatible with these opinions. It still has a lot of weasel words, since I haven't verified any of the information. Joshua R. Davis (talk) 00:12, 31 January 2008 (UTC)

Statistics As Principled Argument, by Robert P. Abelson

I think that the following is interesting and deserves to be in Wikipedia. But I do not think that it should be in this article. Maybe in a more specialized (new) article on the foundations/philosophy of statistics.

In his book Statistics As Principled Argument, Robert P. Abelson articulates the position that statistics serves as a standardized means of settling disputes between scientists who could otherwise each argue the merits of their own positions ad infinitum. From this point of view, statistics is a form of rhetoric; as with any means of settling disputes, statistical methods can succeed only as long as all parties agree on the approach used.

So I have put the paragraph here, and deleted it from the article. —Preceding unsigned comment added by 86.156.222.165 (talk) 10:22, 31 January 2008 (UTC)

I have created a new article Foundations of statistics, which incorporates the above quoted paragraph. The article is currently a stub. TheSeven (talk) 11:19, 31 January 2008 (UTC)

I think the point of view that statistics is rhetoric is valid and merits inclusion in the main statistics article. The "Misuse" section may not have been the optimal place for it but I'd like to see a statement to that effect somewhere here. Abelson is an obvious reference for that viewpoint but not the only one. JJL (talk) 12:48, 31 January 2008 (UTC)

What I would ideally like is the article Foundations of statistics expanded from a stub into a real article. Then the Statistics article could include a paragraph that summarized the foundations, and linked to that as the main article on the topic. The former should be done anyway, I think; it is an important topic. TheSeven (talk) 14:34, 31 January 2008 (UTC)

Is "Foundations of statistics" a term that statisticians use to talk about this stuff, or did we just make it up? When I hear it (as a non-statistician) I think probability theory. The Abelson stuff seems better described as "Philosophy of statistics". Is there a lot to say about the philosophy of statistics? (I'm honestly asking.) Joshua R. Davis (talk) 14:38, 31 January 2008 (UTC)

"Foundations of statistics", or "foundations of mathematical statistics", are common terms. I just tried googling and got 110,000 results. There are also books with that title. There is a substantial philosophical component to this though. Googling for "philosophy of statistics" gave me 88,000 results. So perhaps there should be an article with that title, which redirects to the foundations article—? TheSeven (talk) 15:01, 31 January 2008 (UTC)

I think "Mathematical Statistics" and "Philosophy of Probability" are more common. You don't see many (separate) phil. of stat. courses; for example, try searching for it at Amazon. The viewpoint Abelson discusses at length isn't his own theory; in my experience it's reasonably common among statisticians--like a formal mathematical proof, an hypothesis test is a form of argumentation (a practical form, a la Peirce, say). I do think that the article must address the fact that an hypothesis test (etc.) is a way of settling disputes as well as a way of finding things out. When the FDA asks for statistical arguments, that's what it wants--an argument that the drug is effective and safe. JJL (talk) 15:14, 31 January 2008 (UTC)

I have not heard the phrase "Philosophy of Probability" before, as far as I can recall. I just tried googling for it, and got 766 results. Compared with 110,000 for "foundations of statistics". TheSeven (talk) 15:27, 31 January 2008 (UTC)

Wait, let's go apples-to-apples! For "Foundations of stats." I think one more commonly sees "Math. stats." as the foundations are in analysis and probability. For "Philosophy of stats." one more commonly sees it as part of a "Philosophy of Probability" course/book than on its own. Here are a few Phil. of Prob. books: [1], [2], [3], [4]. The word chance commonly appears in its place (again, Peirce is an example), and of course it also can be studied in a modern physics context. I can't find a book entitled "Philosophy of Statistics" there; Foundations of stats. does make an appearance [5]. JJL (talk) 15:41, 31 January 2008 (UTC)

This is interesting, especially because several people are claiming that the topic is too insignificant to merit a Wikipedia article of its own. See here—you can vote if you wish. TheSeven (talk) 17:38, 31 January 2008 (UTC)

Would a better name be centred around "statistical inference", rather than just "statistics" ? Melcombe (talk) 15:09, 12 February 2008 (UTC)

The discussion in this section is effectively closed. The right place would now be the discussion in Foundations of statistics. (According to that article though, this is the standard name for the topic. Moreover, it has far more Google hits.) TheSeven (talk) 21:53, 12 February 2008 (UTC)