Talk:Sufficiency (statistics)

1 General question
2 Examples
3 Confusing notation
4 Elements of Information Theory comparison
5 New proofs of factorization theorem: pointlessly heavy handed?
6 example
- 6.1 Fisher's definition
  - 6.1.1 issue 1, known/unknown variance
  - 6.1.2 issue 2, the definition
7 Misprints in proof for the continuous case

General question

Hello everybody,

I want to discuss sufficient statistics here. If anybody has good information about sufficient statistics and knows how they are used, please discuss it here. Thanks.

Do you have specific questions? Michael Hardy 19:53, 7 May 2005 (UTC)

Examples

The two examples have h(x)=1. It might be better if at least one of them did not. --Henrygb 17:43, 23 May 2006 (UTC)

I agree. An i.i.d. sample from a Poisson distribution would do it. I'll be back.... Michael Hardy 23:40, 23 May 2006 (UTC)
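As a sketch of what such an example could look like (the symbols $\lambda$, $n$, $g_\lambda$, $h$ and $T$ are chosen here to match the factorization notation used elsewhere on this page; they are assumptions, not text from the article): for an i.i.d. sample $x_1, \dots, x_n$ from a Poisson distribution with mean $\lambda$, the joint probability mass function factors as

$\prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \underbrace{e^{-n\lambda}\,\lambda^{\sum_i x_i}}_{g_\lambda(T(x))} \cdot \underbrace{\frac{1}{\prod_i x_i!}}_{h(x)}, \qquad T(x) = \sum_{i=1}^{n} x_i.$

Here $h(x)$ is not identically 1, and the sum of the observations is the sufficient statistic.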

The first sentence in the article is hard to understand

Confusing notation

The notation of the conditional probabilities in section Mathematical definition is confusing or confused. You'd expect that Pr(A|B,C) = Pr(A|C,B). So why is it the case that Pr(x|t,θ) = Pr(x|t), but not Pr(x|t,θ) = Pr(x|θ)? The non-standard notation Pr(X=x|T(X)=t,θ) is not explained. Is θ the parameter itself (a variable), or is this the value of the parameter? The example does not help. It suggests that the parameter is p. But, clearly, the joint probability distribution (given as a density) depends on p, so if we take the "precise definition" literally, this is not a sufficient statistic. The gain in using the shorthand notation is completely nullified or worse by the lack of explanation. --Lambiam ^Talk 21:47, 13 October 2006 (UTC)

I've reverted the cut material, because IMO the derivation that was cut makes it much clearer where the factorisation criterion comes from.

While some may frown on it, it's not exactly unusual for lower case letters to be used for random variables, with starred variables (eg θ*) being used to indicate a variable that is being held to some 'special' value. This is a not uncommon notation, and usefully succinct and readable; IMO it's no bad thing for WP readers to come across it from time to time. Jheald 13:43, 16 October 2006 (UTC)

Strictly speaking I suppose the use of the lower case letters implies that the (lower case) values of the (upper case) random variables are themselves being treated as variables; in practice I suspect the prevalence of lower case (and it is prevalent) may be as much because the lower case forms are easier on the eye and less out of the ordinary, and therefore make the equations quicker to read and easier to assimilate (not negligible advantages).

But it seems to me the real point for Lambiam is elsewhere. Pr(x|t,θ) is indeed fully equivalent to Pr(x|θ,t), as part of the convention of the notation. However there is no such general requirement that Pr(x|t,θ) = Pr(x|t). This equation is true, for all values of x, t and θ, if and only if T is a sufficient statistic for θ. But "T is a sufficient statistic for θ" is different to saying "θ is a sufficient statistic for T". Similarly, Pr(x|t,θ) = Pr(x|t) is very different to requiring Pr(x|θ,t) = Pr(x|θ).

The difference is perhaps clearest in the Fisher decomposition, P(x|θ) = P(x|t).P(t|θ), which holds if T is a sufficient statistic. This tells us that then t contains everything that x has to tell us about θ. But θ does not necessarily contain everything x has to tell us about t. Jheald 19:35, 16 October 2006 (UTC)
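A small numerical illustration of the point above may help. This is only a sketch under stated assumptions: an i.i.d. Bernoulli(p) sample with T(x) taken to be the sum of the observations, and hypothetical helper names `joint_pmf` and `conditional_pmf` introduced here for illustration. It computes Pr(x|t,p) by brute-force enumeration for several values of p.

```python
from itertools import product
from math import comb, isclose

def joint_pmf(x, p):
    """Pr(X = x | p) for an i.i.d. Bernoulli(p) sample x (a tuple of 0s and 1s)."""
    t = sum(x)
    return p**t * (1 - p)**(len(x) - t)

def conditional_pmf(x, p):
    """Pr(X = x | T(X) = sum(x), p), computed by enumerating all samples."""
    n, t = len(x), sum(x)
    # Pr(T = t | p): add the joint pmf over every sample with the same sum.
    pr_t = sum(joint_pmf(y, p) for y in product((0, 1), repeat=n) if sum(y) == t)
    return joint_pmf(x, p) / pr_t

x = (1, 0, 1, 1, 0)
# Pr(x | p) changes with p ...
joints = [joint_pmf(x, p) for p in (0.1, 0.5, 0.9)]
# ... but Pr(x | t, p) does not: it is 1 / C(5, 3) for every p.
values = [conditional_pmf(x, p) for p in (0.1, 0.5, 0.9)]
print(joints)
print(values, 1 / comb(5, 3))
```

The conditional probability comes out the same for every p, which is the statement Pr(x|t,θ) = Pr(x|t); the unconditional Pr(x|θ) still depends on the parameter.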

The proof was wrong and mixed discrete and continuous variables. I replaced it with a correct continuous case. Should we also include the correct discrete case? [1] Tony 23:40, 5 April 2007 (UTC)

Elements of Information Theory comparison

Hello wiki people.

Hopefully, you agree that this page needs some work. I didn't understand it at all and referred to a book instead. What I found was the following information theoretical definition:

The following is a quote directly from copyrighted material: Elements of Information Theory: Thomas M. Cover, Joy A. Thomas:

(begin quote) This section is a sidelight showing the power of the data processing inequality in clarifying an important idea in statistics. Suppose we have a family of probability mass functions $\{f_{\theta}(x)\}$ indexed by $\theta$, and let $X$ be a sample from this family. Let $T(X)$ be any statistic (function of the sample) like the mean or the sample variance. Then $\theta \rightarrow X \rightarrow T(X)$ (the notation for a Markov chain) and by the data processing inequality, we have:

$I(\theta; T(X)) \le I(\theta; X)$

for any distribution on $\theta$. However, if equality holds, no information is lost. A statistic is called sufficient for $\theta$ if it contains all the information in $X$ about $\theta$.

Definition: A function $T(X)$ is a sufficient statistic relative to the family $\{f_{\theta}(x)\}$ if $X$ is independent of $\theta$ given $T(X)$, i.e. $\theta \rightarrow T(X) \rightarrow X$ forms a Markov chain.

This is the same as the condition for equality in the data processing inequality:

$I(\theta; X) = I(\theta; T(X))$ (end quote)

I am not a statistician so I'm reluctant to change anything. Please tell me what you think, and just kick my ass if I'm not supposed to put copyrighted material on the talk page.

Jelmer Wind User:130.89.67.57

The majority of the article is just a technical proof that a certain method for identifying a sufficient statistic actually works. The merit of the proof's inclusion and quality of its exposition can be debated; the result of the proof is a very fundamental and important concept for the article.

However, the material you quoted is in some sense more complicated than what is already in the article, e.g., the article does not mention Markov chains, which is a nontrivial concept. The quoted def, up to the Markov chain reference, is essentially identical to the one here. The sentence immediately preceding Thomas's def is essentially the same as the second paragraph in the head. Both expositions are slightly longer here, but that is almost solely due to much less reliance on technical notation. The new information from Thomas would be the formulations as a Markov chain and as a data processing inequality. But their inclusion would also require additional background or wikilinks, so it is not clear to me it would make the article more readable.

About the copyright: I am no expert on such, but I do not think the quote here in Talk for purposes of discussion is problematic. However, incorporation of a quote that long, even if sourced, into the main article probably would be. But my $0.02¢; others may know better than I. Baccyak4H (Yak!) 15:01, 18 June 2007 (UTC)
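For readers who want to see the information-theoretic formulation in numbers, here is a minimal sketch. It assumes a two-point prior on θ and an i.i.d. Bernoulli(θ) sample of size 4; these choices, and the helper names `mutual_information` and `pushforward`, are illustrative assumptions and do not come from the quoted book. It compares I(θ; X) with I(θ; T(X)) for the sum of the sample (sufficient) and for the first observation alone (not sufficient).

```python
from collections import defaultdict
from itertools import product
from math import isclose, log2

def mutual_information(prior, likelihood):
    """I(theta; Y) in bits, given prior {theta: prob} and likelihood {theta: {y: prob}}."""
    marginal = defaultdict(float)
    for theta, p_theta in prior.items():
        for y, p_y in likelihood[theta].items():
            marginal[y] += p_theta * p_y
    return sum(
        p_theta * p_y * log2(p_y / marginal[y])
        for theta, p_theta in prior.items()
        for y, p_y in likelihood[theta].items()
        if p_y > 0
    )

def pushforward(statistic, dist):
    """Distribution of statistic(X) when X has the distribution dist {x: prob}."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[statistic(x)] += p
    return dict(out)

n = 4
prior = {0.2: 0.5, 0.7: 0.5}  # assumed two-point prior on theta
sample_dist = {
    theta: {x: theta**sum(x) * (1 - theta)**(n - sum(x))
            for x in product((0, 1), repeat=n)}
    for theta in prior
}

info_full = mutual_information(prior, sample_dist)
info_sum = mutual_information(
    prior, {th: pushforward(sum, d) for th, d in sample_dist.items()})
info_first = mutual_information(
    prior, {th: pushforward(lambda x: x[0], d) for th, d in sample_dist.items()})
print(info_full, info_sum, info_first)
```

Under these assumptions the sum attains equality in the data processing inequality, while the first observation alone carries strictly less information about θ than the full sample.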

New proofs of factorization theorem: pointlessly heavy handed?

Is it just me, or are the new proofs on this page of the factorization theorem incredibly heavy handed? Am I missing something? They seem to *bury* what is a really rather simple result, rather than bring it out more clearly.

What this page used to do (eg 22 Dec 2006) was to define as sufficiency the property that

$\Pr(X=x|T(X)=t,\theta) = \Pr(X=x|T(X)=t), \,$

or in shorthand

$\Pr(x|t,\theta) = \Pr(x|t), \,$

from which the factorization follows directly:

$\begin{matrix} \Pr(x|\theta) = \Pr(x,t|\theta) & = & \Pr(x|t,\theta) \cdot \Pr(t|\theta) \\ \\ & = & \Pr(x|t) \cdot \Pr(t|\theta) \end{matrix}$

Since t=T(x), it is clear that

Pr(x|t) can be written as h(x), a function independent of θ; and that
Pr(t|θ) can be written as g_θ(T(x)), a function which depends on x only through t.

Isn't that a much more direct, much more straightforward, much more easily assimilated way to present this result? Jheald 17:09, 20 June 2007 (UTC)

example

User:Baccyak4H just edited the example I just added to be

As an example, the sample mean is sufficient for the mean μ of a Normal distribution with known variance. If one thus knows the sample mean in such a case, the distribution of the sample will not depend on the underlying mean of the original distribution.

Three points: (1) the sample mean is sufficient regardless of the variance because the two estimators are orthogonal; while the normal is a special case, I think it's worth mentioning. (2) It's not that the distribution of the sample doesn't depend on the underlying mean, it's that there is no more information to capture. The data doesn't have a distribution, it just has values. What could it possibly mean for the data to have a distribution? (3) I guess my second point suggests that the first paragraph of the entry should be updated too. Pdbailey 03:01, 10 September 2007 (UTC)

You seem confused. The concepts have definitions. No one ever said the distribution of the sample does not depend on the population mean. That would be incorrect. Rather, the CONDITIONAL distribution of the data GIVEN the sample mean, does not depend on the population mean. Now you ask: what could it mean to say the data have a distribution? Perhaps another word besides "data" would be better, but the definition is this: T(X) is sufficient for a family of possible distributions of X precisely if the conditional probability distribution of X given T(X) does not depend on which distribution is the right one.

All this is perfectly standard stuff---look in any textbook for a leisurely discussion. Michael Hardy 03:11, 10 September 2007 (UTC)

I concur with MH, with these comments: You could be right about the variance, I know something requires it known but my notes aren't here now and it's been awhile. :\ As for point 2, your counter statement is correct, but what I tried to say was a layman's translation of what the definition says. And the data does have a (conditional on sample mean) distribution. The definition is not conditioning on the sample, which is just the values, the case you describe, but rather just the sample mean. Different samples could yield the same sample mean. The definition in this case says: given this sample mean, the sampling distribution of this sample (yechh, but this may be the source of the confusion) does not depend on μ. It is true the interpretation and use is based on this issue of capturing all of the information, but that is really a consequence of the definition, not the def itself. I hope this helps. Baccyak4H (Yak!) 03:20, 10 September 2007 (UTC)
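The conditional statement discussed above can be checked numerically. The following is a rough sketch, assuming an i.i.d. normal sample with known variance and using hypothetical helper names `log_normal_pdf` and `log_ratio`. It takes the joint density of the sample and divides out the density of the sample mean (working with logarithms); what is left plays the role of the conditional density given the sample mean, up to measure-theoretic details that are glossed over here.

```python
from math import isclose, log, pi
from statistics import fmean

def log_normal_pdf(y, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * log(2 * pi * var) - (y - mean) ** 2 / (2 * var)

def log_ratio(sample, mu, var):
    """log f(x | mu, var) minus log of the density of the sample mean at xbar."""
    n = len(sample)
    xbar = fmean(sample)
    log_joint = sum(log_normal_pdf(x, mu, var) for x in sample)
    # The sample mean of n i.i.d. N(mu, var) draws is N(mu, var / n).
    return log_joint - log_normal_pdf(xbar, mu, var / n)

sample = [4.1, 5.3, 3.8, 6.0, 4.9]
# Variance held fixed, population mean varied: the ratio stays the same.
ratios_over_mu = [log_ratio(sample, mu, 1.0) for mu in (-2.0, 0.0, 5.0)]
# Mean held fixed, variance varied: the ratio changes.
ratios_over_var = [log_ratio(sample, 0.0, var) for var in (0.5, 1.0, 4.0)]
print(ratios_over_mu)
print(ratios_over_var)
```

With the variance held fixed the ratio does not move as μ changes, which is the "conditional distribution given the sample mean does not depend on μ" statement. It does change with the variance, which bears on the known/unknown variance question raised in this thread.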

Michael Hardy and Baccyak4H, the sample/data does not have a distribution, it is drawn from one. Thus the phrase, "the data's conditional probability distribution," is difficult to understand, and I claim just nonsense. I propose the following two paragraphs:

In statistics, a statistic is sufficient for the parameter θ, which indexes the distribution family of the system of interest, when the distribution of any other statistic--conditional on the sufficient statistic--is independent of θ.

This seems a good generalization of Fisher's quote, "For a given value of $\sigma_2$, the distribution of $\sigma_1$ is independent of $\sigma$" as quoted in the cited Stigler article, with Fisher's emphasis, where $\sigma_2$ is a sufficient statistic for $\sigma$. The idea then is that, for all samples that have the same value for the sufficient statistic, the distribution of any other statistic will not depend on θ. For the second, I propose

As an example, the sample mean is sufficient for the mean μ of a Normal distribution. If one knows the sample mean in such a case, knowledge of anything else from the data--including a full list of all collected data--will not give more information about the mean. Also, in the set samples that have the same mean, any other statistic, or summary of the data will not depend on μ.

This emphasizes Fisher's other quote, "The whole of the information respecting σ, which a sample provides, is summed up in the value of $\sigma_2$." Again, Stigler quoting Fisher. Pdbailey 04:36, 10 September 2007 (UTC)

I'll note that there is more to the Fisher quotes, and what the various σs are, but I think it suffices to say what I said. Pdbailey 04:37, 10 September 2007 (UTC)

If it is meaningless to speak of the distribution of the data, then it is likewise meaningless to speak of the distribution of any statistic. But on this one you're just belaboring a semantic point. There is an obvious interpretation of the word "data" according to which you are right, and another obvious one according to which you are just as clearly wrong. As to the sample mean being sufficient for the population mean: that is certainly right if the variance is known and the whole family of distributions is indexed only by the mean. But obviously your inference about μ given the sample mean would be quite different if the variance is small, from what it would be if the variance is big. Michael Hardy 04:46, 10 September 2007 (UTC)

Michael Hardy ~~(1) I don't know of the second obvious definition of which you speak, can you please give it to me?~~ (2) the distribution of a statistic suggests resampling, taking another draw, the missing part of the in situ definition. Maybe you're trying to suggest there is a definition of data that suggests resampling? If so, I'd be fine with any wording that does this instead of my above proposal. Here's another definition that uses the term data

In statistics, a statistic is sufficient for the parameter θ, which indexes the distribution family of the system of interest, when the distribution of any data with the same value of the sufficient statistic is independent of θ.

but I'd prefer to use the term "sample" (3) the sample mean being sufficient for the population mean in the case of the normal is a theorem (on the first page!) of this paper. I guess I don't understand why one of Fisher's definitions is privileged and the other is not as good. I think the article is best when it includes both clearly, similarly for the example. Pdbailey 05:30, 10 September 2007 (UTC)

I had another crack at the wording of the definition example. Is this any clearer?

For the lead sentence, which I think reads OK now, your second proposal is very similar. Between the two proposals, I prefer something similar in spirit to the second, as i) the definition refers to the (conditional) distribution of the data; ii) the statement about the (conditional) distribution of any statistic simply follows logically from the statement about the data. I would prefer to not change to this proposal, as the language "distribution of any data" seems somewhat awkward (even though I know what it is supposed to mean). But certainly let's discuss the wording.

For the example commentary, your proposal is true but it is not an example of the definition, rather it follows from it, it is a consequence of it. Something maybe I should state explicitly here: this commentary is in the definition section, so I think it should reflect the direct definition. Certainly what you wrote is true and sourceable, but I would argue it should go elsewhere in the article than as an example in the section on the definition. For the definition, I would strongly plead for the example to be of the form "data, given sufficient statistic, is independent of parameter", because that is what the definition states in a general sense. Baccyak4H (Yak!) 14:31, 10 September 2007 (UTC)

Baccyak4H, the point that I'm trying to make is that the concept of sampling/resampling should be in the definition to capture what I think you and Michael Hardy understand the concept of "the data" to be. As an applied statistician, I think of data as being something in a file on my hard drive, and I think this is a reasonable view of the concept when it is not further qualified. As such, it's silly to talk about its distribution. Plus, the concept of a sample is already out there, why not use it? I don't see that the current definition takes the lack of clarity that confused me into account. Furthermore, I think that the fact that Fisher not only talked about, but emphasized there being no additional information in the sample in the paper in which he first presents sufficiency suggests that it is a clear follow-on to the definition. The idea of writing an article is to explain the concepts clearly, to communicate them to another who might not already know them. Pdbailey 14:57, 10 September 2007 (UTC)

You have reasonable concerns. However, the definition is what it is. Fisher talked about the properties of sufficiency, and many theoretical results use them as well, but the article should not include that as part of the definition, or qualifying an example of the definition. But certainly elsewhere, so long as there is no mistaking that this property is not the definition. (Aside: did Fisher *define* sufficiency in terms of no additional information in the sample? I ask because all modern treatments I've seen do not do that, rather treating that property as sort of a theorem or corollary derived from the definition, or an inspiration for the whole concept (e.g., Casella & Berger). But I don't have your source, and if he does, well then I stand corrected.)

With respect to sampling, note that once you start talking about a distribution, you can always (in principle at least) think of sampling from the distribution. I would prefer to keep the first line of description in terms of distributions, but after that, further elaborating with a sampling argument might be fine. I think this approach makes for good articles: strict general description ("distribution" treatment), then elaboration ("sampling" treatment). Baccyak4H (Yak!) 15:23, 10 September 2007 (UTC)

Fisher's definition

Fisher's definition from the 1922 paper:

Sufficiency.---A statistic satisfies the criterion of sufficiency when no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter to be estimated.

Pdbailey 03:16, 11 September 2007 (UTC)

Per above, I stand corrected. I would suggest it not replace but rather augment the conditioning def.

Good work. Baccyak4H (Yak!) 03:43, 11 September 2007 (UTC)

OK. I was bold. I tried to incorporate the new material, while keeping the mathematical strict part too. The actual Fisher citation needs to be added where noted. The readability could be improved, but I think it is almost there. Baccyak4H (Yak!) 02:27, 12 September 2007 (UTC)

Baccyak4H, none of my underlying concerns have been addressed. They are (1) why have the variance be known? (2a) "data" is a confusing and unnecessary way of talking about a sample. (2b) the "intuitive" definition is equivalent and should be featured as well as the most general definition. Let me break out these two. Pdbailey 15:20, 12 September 2007 (UTC)

issue 1, known/unknown variance

(1) It's fine to leave in the known variance, but I think it bears noting that the normal is special in either way it ends up being written. Alternately, we could use a separate distribution. Pdbailey 15:20, 12 September 2007 (UTC)

To be honest, I still vaguely recall that the variance needs to be known. But in particular, I have a ref for the known case: Casella and Berger.

I should add that when the variance is unknown, the sample mean and variance are sufficient for the population mean/variance pair. But intuitively, if you do not know the variance, having the sample mean does not seem like all the information about the pop mean. A larger variance seems to imply there would be a larger chance the pop mean is further away from the sample mean than in the case where it were smaller.

I would thus suggest keeping the known case since it can be sourced, and if a source comes up for the unknown case, let's revisit it. Baccyak4H (Yak!) 16:05, 12 September 2007 (UTC)

So you don't like the proof in this paper as linked above? Also, you may recall that $s^2$ and $\bar{x}$ are independent. Pdbailey 19:34, 12 September 2007 (UTC)

I'll look up the rest of the paper over the next couple of days, if necessary, but the second paragraph there I think suggests the paper supports my assertion above (2 stats for 2 params) , not the unknown var case. The third sentence implies the first two sample moments are sufficient for the first two population moments. The assertion we are questioning would be "xbar is sufficient for the location parameter in a location/scale family" which is not the standard interpretation if the family description would be omitted; "a location parameter", without any other qualification, suggests a location family. But I'd have to see the rest to be sure either way. Baccyak4H (Yak!) 02:38, 13 September 2007 (UTC)

I concur that the theorem in the paper is not the one I had hoped for. My intuition is as follows: the sample mean and $s^2$ are sufficient for all parameters, and are independent. As such, either alone should be sufficient for the parameter it estimates. I'll look at the Koopman ref. in the above linked paper, it looks like it might say. Pdbailey 13:07, 15 September 2007 (UTC)
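A sketch of the standard factorization may help with this question. It assumes an i.i.d. sample $x_1, \dots, x_n$ from a normal distribution with mean $\mu$ and variance $\sigma^2$, and writes $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$; the notation is chosen here for illustration and is not taken from the linked paper. Using $\sum_i (x_i - \mu)^2 = (n-1)s^2 + n(\bar{x} - \mu)^2$, the joint density is

$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right) \exp\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right).$

If $\sigma^2$ is known, the first two factors can serve as $h(x)$ and the last as $g_\mu(\bar{x})$, so $\bar{x}$ is sufficient for $\mu$. If $\sigma^2$ is unknown and the family is indexed by the pair $(\mu, \sigma^2)$, the first two factors depend on a parameter and cannot be absorbed into $h(x)$; the factorization then goes through with $T(x) = (\bar{x}, s^2)$ for the pair. On this reading, the independence of $\bar{x}$ and $s^2$ does not by itself make $\bar{x}$ sufficient on its own, since the conditional distribution of the sample given $\bar{x}$ still involves $\sigma^2$.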

issue 2, the definition

(2) Let's just use Fisher's definition in quotes and then state it in the most general way of stating it (which regards the conditional distribution of samples). When we do this, why not use the word, "sample" in preference to "data"? Pdbailey 15:20, 12 September 2007 (UTC)

I have no large preference for either "sample" or "data", although one or the other may be slightly better in a particular context. I am not exactly sure how you are proposing to use Fisher's material, but go ahead and add it in as you see fit, or perhaps even propose it here first. And use "sample" instead of "data" if you like; I want your concerns to be addressed before we start fine-tuning stuff like this. Baccyak4H (Yak!) 16:05, 12 September 2007 (UTC)

I'll propose here:

start proposal -----

According to R. A. Fisher, 'A statistic satisfies the criterion of sufficiency when no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter to be estimated.' This is equivalent to the more contemporary definition that the distribution of a sample is independent of the underlying parameter the statistic is sufficient for, conditional on the value of the sufficient statistic. Intuitively, a sufficient statistic for a parameter, θ, captures all the possible information about θ that is in a particular sample. Both the statistic and θ can be vectors.

Stigler notes that the concept has fallen out of favor in descriptive statistics because of the strong dependence on an assumption of the distributional form, but remains very important in theoretical work.

-- mathematical definition --

The concept is made rigorous and general as follows: a statistic T(X) is sufficient for θ precisely if the conditional probability distribution of the data X, given the statistic T(X), is independent of the parameter θ,^[1] i.e.

$\Pr(X=x|T(X)=t,\theta) = \Pr(X=x|T(X)=t), \,$

or in shorthand

$\Pr(x|t,\theta) = \Pr(x|t).\,$

--- example --- As an example, the sample mean is sufficient for the mean μ of a normal distribution with unknown variance. Once the sample mean is known, no further information about μ can be obtained from the sample itself.

end proposal -----

so --- would be replaced by three equals, and the same for --. Also, the references and wikilinks would need to be added back in. Pdbailey 19:47, 12 September 2007 (UTC)

A good stab. Three points: let's stick to the known variance case as we have consensus there; if this is to be the top of the article, let's fully spell out Fisher's name; I prefer not to explicitly mention Stigler's name in this way. Some wording tweaks here and there may suggest themselves after it goes in, but for organization it will work. Baccyak4H (Yak!) 02:44, 13 September 2007 (UTC)

Misprints in proof for the continuous case

Dear colleagues: In the section "Proof for the continuous case", all of the capital X's below the following passage should be replaced by lowercase x's: "Due to Hogg and Craig .... if and only if, for some function H".

In particular, the subsequent equation on the page should read

$\prod_{i=1}^{n} f(x_i; \theta) = g \left[u(x_1, x_2, \dots, x_n); \theta \right] H(x_1, x_2, \dots, x_n). \,\!$

Yours, Marvin Ortel (2007December6) (Article on Sufficient Statistics)