Talk:Sample (statistics)

This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start Class High Priority  Field: Probability and statistics

I think there is enough information here so I can remove the stub tag. There are links to the many other pages that talk about the specifics of samples and sampling. Steve Simon 16:31, 27 September 2006 (UTC)

I've changed the introduction and highlighted some topics to be expanded on (e.g., stratified sample) at a later date. Steve Simon 02:53, 27 September 2006 (UTC)

Mathematical description

The "Mathematical description" and the "Empirical description" describe different things; this is not just a matter of different ways of looking at the same thing.

If X is a random variable, then the n-tuple (X_1, ..., X_n), in which the X_i are n i.i.d. clones of X, is itself a (multivariate) random variable. This is what is called "a sample" in the section Mathematical description. A single experimentally observed outcome of this multivariate random variable is what is called "a simple random sample" in the section Empirical description.

I know that the use of the term "sample" to describe outcomes obtained by a sampling process is quite common. Is the "mathematical" meaning in which a sample is a multivariate random variable also common? If so, shouldn't we make it clear that we have two different meanings here (and give citations of sources for the "mathematical" meaning)? If not, the wording of the section on "Mathematical description" should be changed.

(Can someone explain why the first section is called "Empirical description"? It is not as if we observed lots of samples in the field and are now trying to describe, based on our observations, what we saw.)

 --Lambiam 19:19, 13 December 2007 (UTC)
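
The distinction drawn in this comment can be illustrated with a minimal sketch. The function name `draw_sample`, the choice of a standard normal distribution for X, and the sample size of 5 are assumptions made only for illustration; none of them comes from the article. The function plays the role of the random vector (X_1, ..., X_n), while each call returns one observed outcome, which is a fixed tuple of numbers.

```python
import random

def draw_sample(n, rng):
    """Return one realization of (X_1, ..., X_n), with the X_i i.i.d.

    Assumption for illustration: X is standard normal.
    """
    return tuple(rng.gauss(0.0, 1.0) for _ in range(n))

rng = random.Random(0)  # seeded so that the run is reproducible

# An observed outcome: plain numbers that could be printed in a table.
observed = draw_sample(5, rng)

# Repeating the sampling experiment yields a different outcome
# of the same underlying random vector.
another = draw_sample(5, rng)

print(observed)
print(another)
```

Under these assumptions, `observed` is the kind of object the "Empirical description" talks about, and the procedure that generates it corresponds to the object in the "Mathematical description".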

Yes, you can sample from a multivariate distribution. I suppose one can interpret a sample of length n (= n iid random variables) as a sample of length one from an n-dimensional variable if convenient for some purpose, but it is not exactly standard. The definition I gave is in virtually any book on mathematical statistics or probability. I have added a citation to a book I happened to have on my desk. The last paragraph in the section is a restatement of what a random variable is: a mapping that assigns values to possible outcomes.
That is not what I mean. The sample itself, whose outcomes are vectors whose length is the sample size, has a (boring) multivariate distribution. Google Books won't let me read Wilks' definition, but I trust that it is OK, which answers my first question.  --Lambiam 23:40, 13 December 2007 (UTC)
In probability and statistics, the word "empirical" means "from observation"; see empirical probability. For the concept of a sample in the sense of mathematical statistics, it is fundamental that one can, in principle, repeat the sampling experiment, and that the results would come from some probability distribution. So, even if you may not have a lot of samples, you could, again in principle. Jmath666 (talk) 21:27, 13 December 2007 (UTC)
So it is not the description that is empirical, as the section title suggests, but the context of use: this describes the meaning of the term in empirical statistics.  --Lambiam 23:40, 13 December 2007 (UTC)
Actually, I think these are two descriptions of the same thing, one from the practitioner's view, one from the theoretical view. Otherwise one could not do anything analytical or mathematically meaningful with the practitioner's samples, such as hypothesis testing; one could only do descriptive statistics. Jmath666 (talk) 01:02, 14 December 2007 (UTC)
I think it is desirable to be able to make a distinction between a random variable, and an experimentally obtained (observed) outcome. The latter, assuming it is in numerical form, one can write on a piece of paper, put it in a table in an article, and so on. One cannot put the random variable itself in the article. Currently, a collection of such observations is also called a sample. If you look, for example, at our Mode (statistics) article, it defines the mode of a sample, where this clearly applies to a collection of observations, and not to a set of i.i.d. r.v.'s. Is it abuse of the term "sample" when authors write: "we obtained a sample of 183 specimens from ..."? Is there a better term for "a data set that has been collected by a sampling process"?  --Lambiam 12:50, 14 December 2007 (UTC)
Perhaps this second part should be moved to Random sample, which looks pretty sad now. Also, Sample (probability) should link there. Then there would be no need for the "empirical description" heading. Jmath666 (talk) 22:15, 13 December 2007 (UTC)
I support that, and, moreover, I think "simple random sample" should be merged into and then redirect to "random sample"; the different meanings of "random sample" should be cleared up better (I think a "simple random sample" is also a "random sample" in the empirical sense, even though the text suggests it is not). Is there a counterpart of the mathematical treatment for non-simple sampling methods (e.g. with no replacement)?  --Lambiam 23:40, 13 December 2007 (UTC)
I have never heard "simple random sample" before. It may be used incorrectly here, or it may be something that someone in some source made up.
As to the second question: if you sample from a population without replacement, what you get are not values of independent random variables (since something was not replaced, the distributions are not identical and they are not independent), so it is not a random sample. But if you draw n sequences of length m, each starting from the beginning, that would be a random sample of size n from a multivariate distribution of dimension m... Oh, that confusion when people use the same word for different things and it gets perpetuated in undergraduate textbooks... Jmath666 (talk) 01:02, 14 December 2007 (UTC)
Googling ["simple random sample" OR "simple random sampling"] gives more than a few hits; the terminology is sufficiently widespread that we cannot ignore it. See e.g. here. This page discusses many forms of "random sampling".  --Lambiam 12:50, 14 December 2007 (UTC)
Indeed, a simple random sample is something other than a random sample (which is defined as iid); it is not a special case of it, as the name might suggest... The article simple random sample also says so, correctly. These are not iid but only approximately so. I am sure statistics textbooks would have plenty of information about such things. One authoritative source is Kendall's Advanced Theory of Statistics. I'll look at Wilks when I get to that office again. This whole sample-related collection of articles would benefit from some coordination by someone who knows what he is doing (which in statistics would not be me, at least not yet). Jmath666 (talk) 19:50, 14 December 2007 (UTC)
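
The point about sampling without replacement can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions, not a general treatment: the four-element `population`, the sample size of two, and the helper names `joint`, `marginal`, and `is_independent` are all hypothetical. It enumerates every equally likely ordered pair of draws and tests whether the joint distribution factorizes into the product of its marginals.

```python
from fractions import Fraction
from itertools import permutations, product

population = [1, 2, 3, 4]  # hypothetical finite population

def joint(pairs):
    """Joint distribution of (first draw, second draw) when all pairs are equally likely."""
    pairs = list(pairs)
    p = Fraction(1, len(pairs))
    dist = {}
    for pair in pairs:
        dist[pair] = dist.get(pair, 0) + p
    return dist

def marginal(dist, index):
    """Marginal distribution of the draw at position `index` (0 or 1)."""
    out = {}
    for pair, p in dist.items():
        out[pair[index]] = out.get(pair[index], 0) + p
    return out

def is_independent(dist):
    """True if the joint distribution equals the product of its marginals."""
    m1, m2 = marginal(dist, 0), marginal(dist, 1)
    return all(dist.get((a, b), 0) == m1[a] * m2[b] for a in m1 for b in m2)

with_repl = joint(product(population, repeat=2))     # draws with replacement
without_repl = joint(permutations(population, 2))    # draws without replacement

print(is_independent(with_repl))
print(is_independent(without_repl))
print(marginal(without_repl, 0) == marginal(without_repl, 1))
```

In this toy case, the draws with replacement are independent and the draws without replacement are not, because a value that has been drawn cannot be drawn again. The marginal distributions of the first and second draw without replacement still coincide here; it is the conditional distribution of the second draw, given the first, that changes.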