Talk:P-value


This article is within the scope of WikiProject Statistics, which collaborates to improve Wikipedia's coverage of statistics. If you would like to participate, please visit the project page.

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start-Class, Mid-Priority. Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.

Does anyone actually know how to figure out P, or is it all made up?

Certainly all the textbooks explain how to calculate it in various settings. This article, as now written, implies the answer but is not very explicit. Certainly more could be added to it. Michael Hardy 20:23, 5 May 2005 (UTC)


Ahhh, I see, thank you. I still haven't managed to find out how to work out the p-value for a correlational study using Pearson's parametric test... guess I must be looking in the wrong textbooks!

I have difficulties understanding this article as a layperson. Maybe an example would be good ...

Frequent misunderstandings, part b in the comment: there is a numerical mistake. %29 should be %5.

I was adding a numerical example to the p-value article, as requested above, but it's all been deleted. I've no idea why. --Robma 00:17, 11 December 2005 (UTC)


Michael Hardy, who modified the previous statement "If the p-value is 0.1, you have a 10% chance of being wrong if you reject the null hypothesis", should explain this.

That statement is clearly not true. Michael Hardy 20:15, 28 August 2006 (UTC)


Transferred comments of User:Xiaowei JIANG

If the p-value is 0.1, you have a 10% chance of rejecting the null hypothesis if it is true. ("The current statement is confusing"; Michael Hardy, who modified the previous statement "If the p-value is 0.1, you have a 10% chance of being wrong if you reject the null hypothesis", should explain this.) Note that, in the Bayesian context, the p-value has a quite different meaning than in the frequentist context!


Um, I can't make any sense of this page. Can we have a rewrite? -- Arthur
I agree that this article isn't as clear as it could be - or needs to be (and, as a contributor to it, I take some responsibility for that). The intro, at the very least, needs redoing. Now back to the day-job.... Robma 12:27, 5 June 2006 (UTC)

I agree that we should rewrite this page, which might include how to calculate different p-values under various conditions. A good start might come from calculating the p-values for randomized experiments. --Xiaowei JIANG 00:19, 20 October 2006 (UTC)

Shouldn't the P value in the coin example be .115?

Since the null hypothesis is that the coin is not biased at all, shouldn't the universe of events that are as favorable or less favorable to the hypothesis include too many heads AND too many tails? For example, the coin coming up all tails would be more unfavorable to the "fair coin" hypothesis than 14 heads. If the null hypothesis were "the coin is not biased towards heads (but it may be towards tails)", .058 would be correct.

But as given, the null hypothesis is not that the coin is fair, but rather that the coin is not unfairly biased toward "heads". If the coin is unfairly biased toward "tails", then the null hypothesis is true. Michael Hardy 20:13, 28 August 2006 (UTC)
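For readers following this exchange, here is a small sketch (Python, standard library only) that computes both tail conventions for 14 heads in 20 flips; the one-sided value comes out near 0.058 and the two-sided value near 0.115, matching the figures discussed above.

```python
from math import comb

n, k = 20, 14  # 20 flips, 14 heads observed

def pmf(i):
    # P(X = i) for a fair coin, X ~ Binomial(n, 0.5)
    return comb(n, i) * 0.5 ** n

# One-sided p-value: P(X >= 14), for the null "the coin is not biased toward heads"
p_one_sided = sum(pmf(i) for i in range(k, n + 1))

# Two-sided p-value: results at least as extreme in either direction
# (>= 14 heads or <= 6 heads), for the null "the coin is fair"
p_two_sided = p_one_sided + sum(pmf(i) for i in range(0, n - k + 1))

print(round(p_one_sided, 4))  # 0.0577
print(round(p_two_sided, 4))  # 0.1153
```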


Show some calculations

It would be nice to see some equations or calculations made in the article, that way people are not left standing in the dark wondering where the numbers came from. I know I could follow the article and the example because I have a little experience with probability and statistics. If someone would like, I can post the equations and some other information I feel would be useful.

Yes, this definitely needs some f-ing formulas. I know what p-values are, but came here in the hopes of getting a formula I could use to calculate them. --Belg4mit 02:44, 31 January 2007 (UTC)
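For what it's worth, this is the kind of formula such an addition might include (a sketch, stated for a generic test statistic T with observed value t, where "extreme" means large values of T):

<math>p = \Pr(T \ge t \mid H_0),</math>

and, for the coin example (X = number of heads in 20 flips of a fair coin, 14 heads observed),

<math>p = \Pr(X \ge 14 \mid H_0) = \sum_{k=14}^{20} \binom{20}{k}\left(\tfrac{1}{2}\right)^{20} \approx 0.058.</math>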

Needs improvement

I feel the explanations offered here are too brief to be of use to anyone who doesn't already know what p-values are. What exactly is 'impressive' supposed to mean in this context to a lay person? General improvement in the clarity of language used, number of examples, and adding calculations would benefit this page.

Agreed -- this suffers from the same disease as most Wiki math-topics pages, i.e. they take a reasonably straightforward concept and immediately bury it in jargon. I'll take a stab at wikifying some of the introduction to this and see if I can include some parenthetical translations for people who are looking to learn something they didn't already know. JSoules (talk) 19:12, 27 March 2008 (UTC) -- ETA: Would " -- that is, the chance that a given data result would occur even if the hypothesis in question were false. Thus, it can be considered a measure of the predictive power or relevance of the theory under consideration (lower p-values indicating greater relevance), by indicating the likelihood that a result is chance" be a fair addition after the initial sentence? Am I understanding this concept correctly? This is how I've seen it used rhetorically, but that was from a semi-trustworthy source only...

Interpretation

A recent editor added:

The p-value shows the probability that differences between two parameters are by chance deviation.

This is directly (and correctly) contradicted by Point #1 in the next section. It is always the case that the difference between the two parameters is due to chance, thus the probability of this is 1! Indeed, data-dependent p-values do not have a ready interpretation in terms of probabilities, as Berger and Delampady, and Berger and Sellke, have pointed out. From a conditional frequentist point of view, p-values are not the Type I error probabilities for those experiments that are rejected at the observed p-value, for example. Bill Jefferys 21:41, 18 April 2007 (UTC)


Question

Shouldn't this: Generally, the smaller the p-value, the more people there are who would be willing to say that the results came from a biased coin.

instead read: Generally, the smaller the p-value, the less people would be willing to say that the results came from a biased coin. --68.196.242.64 11 June 2007

I thought so too, seeing that was what drove me to this talk page. Anyone care to disagree? (I added bold emphasis to your sentences... as well as signing on your behalf...) --Hugovdm 17:56, 14 July 2007 (UTC)

A small p-value represents a low probability that the result is happening by chance. An unbiased coin is supposed to work by chance alone! So if we have a low probability that we are getting these results by chance, we *might want to consider* that we are getting them by using a biased coin. Conversely, a high p-value *suggests* that we have an unbiased coin that is giving us our results based on chance. It is important to remember that the p-value is just another piece of information designed to help you make up your mind about what just happened, but it doesn't by itself confirm or deny the null hypothesis. You could be having a crazy day with a fair coin, but then again, you could be having a normal day with a biased coin. The p-value is just a tool to help you decide which is more likely: is it the universe messing with you, or is it the coin?

All that being said, I think that this is a terrible example to use to introduce someone to the concept of a p-value! 206.47.252.66 15:03, 4 August 2007 (UTC)

p-values in hypothesis testing

I've provided a few pages illustrating an approach to using p-values in statistical testing that I have used in the past. I'm no wizard at providing comments on these talk pages, but the file has been uploaded. The file name is P-value_discussion.pdf. Hope this helps.

Contradiction?

The coin flipping example computes the probability that 20 coin flips of a fair coin would result in 14 heads. Then the interpretation section treats that probability as a p-value. This seems to say that the p-value is the probability of the null hypothesis. But the first "frequent misunderstanding" says "The p-value is not the probability that the null hypothesis is true". Huh? That sounds like a contradiction to me.

Yes, the example calculates the probability that a *theoretically fair coin* would give the results that we got (14 heads in 20 flips). The example does not calculate the probability of *our coin* producing these results. (Not possible, since we don't know if the coin is fair or unfair, and if it is unfair, exactly how unfair, how often, etc.) There are two possibilities: the coin is fair or unfair. The p-value only gives you information about one of these possibilities (the fair coin). Knowing how a fair coin would behave helps you to make a guess about your coin. So although the p-value can describe a theoretical coin, it cannot directly describe the coin in your hand. It is up to you to make the comparison.
It seems like a semantic distinction, but there is a fundamental difference. Imagine that an evil-overlord type secretly replaced some of the world's fair coins with unfair ones. What would happen to the probability of a theoretically fair coin producing our results? Nothing would change - the probability of a fair coin giving 14/20 heads would remain exactly the same! But what would happen to the probability that our results were due to chance? It would change, wouldn't it? So once again, the p-value only gives us the probability of getting our results 'IF' the null hypothesis were true. That 'IF' can get pretty big, pretty fast. Without knowing exactly how many coins the evil overlord replaced, you are really left guessing. The p-value cannot tell us the probability that the null hypothesis is true - only the evil overlord knows for sure! Hope this helps. 206.47.252.66 15:52, 4 August 2007 (UTC)
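A rough simulation of the "evil overlord" point above (a sketch only; the 75%-heads bias and the mixture fractions are made-up illustration values): the chance that a fair coin gives 14+ heads stays fixed, but the chance that a coin showing 14+ heads is actually fair depends entirely on how many coins were replaced.

```python
import random

def heads_in_20(p_heads):
    # number of heads in 20 flips of a coin with P(heads) = p_heads
    return sum(random.random() < p_heads for _ in range(20))

def fraction_fair_given_extreme(fraction_replaced, trials=200_000):
    """Among coins that show >= 14 heads, what fraction were actually fair?"""
    fair_hits = biased_hits = 0
    for _ in range(trials):
        if random.random() < fraction_replaced:
            biased_hits += heads_in_20(0.75) >= 14  # a replaced coin (assumed 75% heads)
        else:
            fair_hits += heads_in_20(0.5) >= 14     # a fair coin
    return fair_hits / (fair_hits + biased_hits)

# P(>=14 heads | fair coin) is about 0.058 in both scenarios,
# but P(fair coin | >=14 heads) changes with the overlord's meddling:
print(fraction_fair_given_extreme(0.01))  # few coins replaced: extreme results still mostly from fair coins
print(fraction_fair_given_extreme(0.50))  # half replaced: extreme results mostly from biased coins
```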

Frequent misunderstandings

I think this section is excellent. However, I would delete or substantially modify the 6th misunderstanding since it assumes the Neyman-Pearson approach which is far from universally adopted today. Fisher's interpretation of the p value as one indicator of the evidence against the null hypothesis rather than as an intermediate step in a binary decision process is more widely accepted.

I have seen a great deal of criticism of p-values in the statistical literature. I think this section is very crucial to the content of this page, but I think it is only a starting point--the article needs to discuss criticisms that go beyond just the potential for misinterpretation. Any basic text on Bayesian statistics talks about these issues (the J.O. Berger text on decision theory & Bayesian analysis comes to mind--it has a very rich discussion of these issues). I'll add some stuff when I get time but I would also really appreciate if other people could explore this too! Cazort 19:01, 3 December 2007 (UTC)


Why is the p-value not the probability of falsely rejecting the null hypothesis?

The p-value is not the probability of falsely rejecting the null hypothesis. Why? "The p-value is the probability of obtaining a result at least as extreme as a given data point, under the null hypothesis." If I reject the null hypothesis when p is, say, lower than 5%, and repeat the test for many true null hypotheses, then I will reject true null hypotheses in approx. 5% of the tests. --NeoUrfahraner (talk) 05:57, 14 March 2008 (UTC)

Maybe you're mistaking p-values for significance levels. 5% is the significance level, not the p-value. If you get a p-value of 30% and your significance level is 5%, then you don't reject the null hypothesis. 5% would then be the probability of rejecting the null hypothesis, given that the null hypothesis is true, and 30% would be the p-value. Michael Hardy (talk) 19:35, 27 March 2008 (UTC)
To flesh this out a bit, in an article in Statistical Science some years ago, Berger and Delampady give the following example: Suppose you have two experiments each with a point null hypothesis, such that the null hypothesis is true under one of them (call it A) and false under the other (call it B). Suppose you select one of the experiments by the flip of a fair coin, and perform the experiment. Suppose the p-value that results is 0.05. Then, the probability that you actually selected experiment A is not 0.05, as you might think, but is actually no less than 0.3. Above, I pointed out that "From a conditional frequentist point of view, p-values are not the Type I error probabilities for those experiments that are rejected at the observed p-value, for example." This is what I was talking about. Bill Jefferys (talk) 20:55, 27 March 2008 (UTC)
I forgot to mention that Berger has a web page with information on understanding p-values; note in particular the applet available on this page that implements in software the example above. You can plug in various situations and determine the type I error rate. Bill Jefferys (talk) 21:48, 27 March 2008 (UTC)
Of course the situation is much more complicated in the case of multiple tests. Suppose that there is one fixed statistical test. Is the statement "The p-value is not the probability of falsely rejecting the null hypothesis" valid in the case of one single fixed statistical test and many objects to be tested for whether they satisfy the null hypothesis? --NeoUrfahraner (talk) 05:27, 28 March 2008 (UTC)
You're asking about the false discovery rate problem, which is a separate issue. But it remains true, when testing point null hypotheses, that the observed p-value is not the type I error rate for the test that was just conducted. The type I error rate, the probability of rejecting a true null hypothesis, can only be defined in the context of a predetermined significance level that is chosen before the data are observed. This is as true when considering multiple hypotheses (adjusted by some method for the false discovery rate problem) as it is in the case of a single hypothesis. Bill Jefferys (talk) 14:27, 28 March 2008 (UTC)
I still do not understand. Suppose that there is one fixed statistical test giving me a p-value when testing whether a coin is fair. Let's say I decide that I say the coin is not fair when the p-value is less than q. Now I test many coins. What percentage of the fair coins will I reject in the long run? --NeoUrfahraner (talk) 17:18, 28 March 2008 (UTC)
Ah, that's a different question. Here, you have fixed in advance a value q, such that if the p-value is less than q you will reject. This is a standard significance test, and the probability that you will reject a true point null hypothesis in this scenario is q. Note that you will reject when the p-value is any value that is less than q. So the type I error rate is q in this example.
But that's a different question from observing a p-value, and then saying that the type I error rate is the value of the observed p-value. That is wrong. Type I error rates are defined if the rejection level is chosen in advance. As the Berger-Delampady paper shows, if you look only at p-values that are very close to a particular value (say 0.05), then the conditional type I error rate for those experiments is bounded from below by 0.3, and can be much larger, in the case that (1) you are testing point null hypotheses and (2) the experiments are an equal mixture of true and false null hypotheses.
The Berger-Delampady paper is Testing Precise Hypotheses, by James O. Berger and Mohan Delampady, Statistical Science 2, pp. 317-335 (1987). If your institution subscribes to jstor.org, it may be accessed here. Even if you are not at a subscribing institution, you can at least read the abstract. Note that in common with many statistics journals, this paper is followed by a discussion by several distinguished statisticians and a response by the authors. Bill Jefferys (talk) 18:29, 28 March 2008 (UTC)
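To make the fixed-q point above concrete, here is a small simulation sketch. It uses a continuous toy test (a z-test on normal data with known sd) rather than the coin, purely so the discreteness of the binomial does not get in the way; the sample size and trial count are arbitrary illustration choices. With q chosen before seeing any data, the long-run rejection rate among true nulls comes out close to q, which is the sense in which q — not any individual observed p-value — is the type I error rate.

```python
import random
from statistics import NormalDist

Q = 0.05             # rejection threshold, fixed in advance of seeing data
N = 25               # sample size per experiment (illustrative)
norm = NormalDist()  # standard normal, for the z-test p-value

def two_sided_p(sample):
    """p-value of a z-test of H0: mean = 0, known sd = 1 (a toy continuous test)."""
    z = sum(sample) / len(sample) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))

rejections = 0
trials = 50_000
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(N)]  # H0 is true in every trial
    if two_sided_p(sample) < Q:
        rejections += 1

print(rejections / trials)  # hovers around 0.05: the prespecified q is the type I error rate
```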
I still do not understand. I did not say I reject if it is close to some value. I reject if the p-value is smaller than some fixed value q. You said the type I error rate is q in this example. The text, however, says that the p-value is not the probability of falsely rejecting the null hypothesis. What is correct in that specific example? --NeoUrfahraner (talk) 16:55, 29 March 2008 (UTC)
The example is correct because of the fact that q is chosen in advance of looking at the data. It is not a function of the data (as the p-value is). It is not correct to take the data, compute the p-value, and declare that to be the probability of falsely rejecting the null hypothesis, because (as the Berger-Delampady example shows) it is not the probability that you falsely rejected the particular null hypothesis that you were testing. Please read the Berger-Delampady paper, it's all clearly explained there. Go to Berger's p-value website and plug numbers into the applet. Maybe that will help you understand. Bill Jefferys (talk) 18:35, 29 March 2008 (UTC)
Let me try an intuitive approach. Obviously, if you specify q=0.05, then the probability that the p-value will be less than or equal to q is 0.05; that's the definition of the p-value. But, the (true) nulls that are being rejected will have p-values ranging between 0 and 0.05. The rejection is an average over that entire range. Those p-values that are smaller would be more likely to be rejected (regardless of what value of q you choose) and therefore in our judgment would be more likely to be false nulls. Conversely, those p-values that are larger are less likely to be false nulls than the average over the entire [0,0.05] range. The closer the p-value is to 0.05, the more likely it is that we are rejecting a true null hypothesis. Since the rejection criterion is averaged over the entire range, it follows that those values close to or equal to 0.05 are more likely to be rejections of a true null hypothesis than are those for more extreme p-values. But since the average is 0.05, those closer to the upper end are more likely to be false rejections than the average.
The mistake therefore is in identifying the particular p-value you observe as the probability of falsely rejecting a true null. The average over the entire interval from 0 to the observed p-value would of course be equal to the p-value you observed. But once you observe that p-value, you are no longer averaging over the entire interval, you are now fixed on a p-value that is exactly at the upper end of the interval, and is therefore more likely to be a false rejection than the average over the interval. Bill Jefferys (talk) 23:17, 29 March 2008 (UTC)
To understand the Wikipedia article, one has first to read the Berger-Delampady paper? --NeoUrfahraner (talk) 18:06, 30 March 2008 (UTC)
No, but you do need to pay attention. I think I've said everything that needs to be said. It's all in this section, and if you still don't understand, I can't help you further. 18:56, 30 March 2008 (UTC) —Preceding unsigned comment added by Billjefferys (talkcontribs)
"Use of the applet demonstrates results such as: if, in this long series of tests, half of the null hypotheses are initially true, then, among the subset of tests for which the p-value is near 0.05, at least 22% (and typically over 50%) of the corresponding null hypotheses will be true." ( http://www.stat.duke.edu/~berger/papers/02-01.ps ).
Actually this means "The p-value is not the probability of falsely rejecting the null hypothesis under the condition that a hypothesis has been rejected." Here I agree; this is indeed a version of the prosecutor's fallacy. Nevertheless, this does not say that the p-value is not the probability of falsely rejecting the null hypothesis under the condition that the null hypothesis is true. --NeoUrfahraner (talk) 10:40, 2 April 2008 (UTC)
If the null hypothesis is true, and if you select a rejection level q prospectively, in advance of doing a significance test (as you are supposed to do), then the probability of rejecting is equal to q. But that isn't the p-value. You are not allowed to do the test, compute the p-value and then claim that the probability that you falsely rejected the null hypothesis is equal to the p-value that you computed. That is wrong. In other words, you are not allowed to "up the ante" by choosing q retrospectively based on the p-value you happened to compute. This is not a legitimate significance test. Bill Jefferys (talk) 14:49, 2 April 2008 (UTC)
4.5 Rejoinder 5: P-Values Have a Valid Frequentist Interpretation
This rejoinder is simply not true. P-values are not a repetitive error rate, at least in any real sense. A Neyman-Pearson error probability α has the actual frequentist interpretation that a long series of α level tests will reject no more than 100α% of true H0, but the data-dependent P-values have no such interpretation. P-values do not even fit easily into any of the conditional frequentist paradigms. (Berger and Delampady 1987, p. 329)
This quotation from Berger and Delampady's paper specifically contradicts the notion that an observed (data-dependent) p-value can be interpreted as the probability of falsely rejecting a true null hypothesis. If you want to reject a hypothesis with a specified probability, conduct a standard significance test by selecting the rejection level in advance of looking at the data. Bill Jefferys (talk) 15:13, 2 April 2008 (UTC)
Why is a prospectively chosen p-value not a p-value? --NeoUrfahraner (talk) 18:22, 2 April 2008 (UTC)
I didn't say that. It's a p-value all right, but a data-dependent p-value does not have a valid frequentist interpretation in the Neyman-Pearson sense, as Delampady and Berger say. Remember, frequentist theory is all about the "long run" behavior of a large sequence of hypothetical replications of an experiment. In the case of hypothesis testing, in the long run, if you conduct a large number of tests where the null hypothesis is true, and reject at a predetermined level q, then a proportion q of those tests will be falsely rejected. No problem. That is the way to do significance testing.
But when you observe a particular instance and it has a particular (data-dependent) p-value, there's no long-run sequence of tests, there's only the one test you've conducted, and so you can't give it a frequentist interpretation. Frequentist theory applies only to an ensemble of a large number of hypothetical replications of the experiment. But frequentist theory does not have a probability interpretation for a single one of those replications. The best you can say is that the probability that you obtained that particular p-value in that particular replication of the experiment is 1, because you observed it; probability is no longer an issue.
Similarly, in the case of confidence intervals, if you construct the intervals validly for a large number of replications of the experiment, (say for a 95% interval), then 95% of the intervals so constructed will contain the unknown value; but you can't talk about the probability that a particular one of those intervals contains the unknown value. Again, that's because a particular interval is not a large sequence of intervals, so there's no way you can give it a frequentist interpretation.
If you want to talk Bayesian, then you can talk about the probability of unique events, but if you are talking frequentist interpretations, no such interpretation exists. Bill Jefferys (talk) 19:32, 2 April 2008 (UTC)
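In the spirit of the applet result quoted earlier in this thread (among tests with p-values near 0.05 and half the nulls true, at least 22% and typically over 50% of those nulls are true), here is a rough simulation sketch. It is not the applet itself: the z-test setup, the 0.5 effect size for false nulls, and the window around 0.05 are made-up illustration choices; the qualitative conclusion is what matters.

```python
import random
from statistics import NormalDist

norm = NormalDist()
N = 25                   # sample size per experiment (illustrative)
EFFECT = 0.5             # true mean when the null is false (an assumed value)
WINDOW = (0.045, 0.055)  # "p-value near 0.05"

def two_sided_p(sample):
    # z-test of H0: mean = 0, known sd = 1
    z = sum(sample) / len(sample) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))

near_05_true_null = near_05_total = 0
for _ in range(200_000):
    null_is_true = random.random() < 0.5          # half of all nulls are true
    mean = 0.0 if null_is_true else EFFECT
    sample = [random.gauss(mean, 1) for _ in range(N)]
    if WINDOW[0] < two_sided_p(sample) < WINDOW[1]:
        near_05_total += 1
        near_05_true_null += null_is_true

# Fraction of true nulls among tests landing near p = 0.05:
print(near_05_true_null / near_05_total)  # far above 0.05; a sizeable fraction of these nulls are true
```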

Probability of 14/20 on a fair coin = ?

Shouldn't the correct probability (as determined by the PMF of a binomial dist) be .03696? —Preceding unsigned comment added by 70.185.120.218 (talk) 14:03, 28 March 2008 (UTC)
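For reference, a quick check (Python standard library only): the binomial PMF at exactly 14 heads is indeed about 0.03696, while the 0.058 figure discussed elsewhere on this page is the one-sided tail sum P(X ≥ 14), not the PMF at 14.

```python
from math import comb

# P(exactly 14 heads in 20 flips of a fair coin)
print(comb(20, 14) * 0.5 ** 20)                              # about 0.03696

# P(14 or more heads) -- the one-sided tail sum used as the p-value
print(sum(comb(20, i) for i in range(14, 21)) * 0.5 ** 20)   # about 0.0577
```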

significance

Can someone please clearly state which is right: the larger the p-value, the "more" significant the coefficient, or the reverse: the larger the p-value, the "less" significant the coefficient? I think the former is right. I am confused very often. Jackzhp (talk) 23:53, 18 April 2008 (UTC)