Talk:Binomial distribution

From Wikipedia, the free encyclopedia

If you go to the previous versions and look at the first one, dated 02/15/2001 (which is yours?), you will see:


1). q (1-p), maybe a typo?


2). And the formula for the number of ways of picking X items out of N items was: N!/X!/(N-X)!. This is plain wrong. Yes, after requesting a change for a week, I changed it.

3). There were also wording problems. RoseParks.


I see the problem now. (1-p) was intended as a parenthetical definition. I guess N!/X!/(N-X)! worked in my program code, so I couldn't see the ambiguity. How would you calculate N!/X!/(N-X)!? From right to left? On the other hand, today is only 02/20/2001 by my calendar, so I think your "requesting a change for a week" is a bit off. In any case, the criticism has led to something better. Dick Beldin

In answer to your question on how you evaluate N!/X!/(N-X)!: this is ambiguous. An easy example:


2/4/12 is ambiguous since

  • (2/4)/12= 2/48=1/24 while
  • 2/(4/12)= 24/4= 6.

Multiplication is associative over the reals. If you look at division as the inverse operation of multiplication, i.e. 2/4/12 = 2*4^-1*12^-1 = 1/24, you are okay. If you look at division in the ordinary sense, you must specify the order of operations. RoseParks


I agree that an expression with successive divisions appears ambiguous. Most mathematicians I know do indeed consider division as the inverse of multiplication, and many programming languages explicitly specify that multiplications and divisions are performed left to right. You are correct, it is not a universal convention. In addition, the vertical placement of numerator and denominator is clearer. Dick Beldin
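To make the grouping question concrete, here is a minimal Python sketch that evaluates N!/X!/(N-X)! under both readings with exact rational arithmetic. The function names are made up for illustration. Python, like many languages, applies `/` left to right, so the ungrouped expression agrees with the binomial coefficient, while forcing the right-to-left grouping gives a different number.

```python
from fractions import Fraction
from math import comb, factorial


def left_to_right(n, x):
    """(n! / x!) / (n - x)!, the grouping most languages use for a / b / c."""
    return Fraction(factorial(n)) / factorial(x) / factorial(n - x)


def right_to_left(n, x):
    """n! / (x! / (n - x)!), the other reading of the same string."""
    return Fraction(factorial(n)) / (Fraction(factorial(x)) / factorial(n - x))


if __name__ == "__main__":
    print(Fraction(2) / 4 / 12)              # evaluated as (2/4)/12
    print(Fraction(2) / (Fraction(4) / 12))  # explicit grouping 2/(4/12)
    print(left_to_right(5, 2), comb(5, 2), right_to_left(5, 2))
```

Exact fractions are used here only so that the comparison is not blurred by floating-point rounding.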

Confidence Interval?

I was looking for information about confidence intervals on a binomial distribution, but was surprised not to find it here. I know this case isn't quite as simple as for normal distributions, but it would be nice to have here, if somebody would like to contribute the information.

You mean a CI for p, the success probability, as estimated from the data. With 70 successes in 100 trials, p_est = 0.7, and your question is what the standard deviation of p_est is. It is sqrt(p_est*(1-p_est)/n_trials). The 95% confidence interval is roughly p_est +/- 2 standard deviations. My question is what happens if the CI extends outside the allowed 0 to 1 range for a probability. This can happen if p_est is ~1 or ~0. The CI has to be asymmetric. Any ideas?

In the case where the confidence interval gets close to 0 or 1, the normal approximation of the binomial distribution is not accurate and rules like your "2 standard deviations" that are derived from the normal distribution are not accurate either. Depending on the circumstances, one can use a different approximation (such as the Poisson distribution) or the exact values of the binomial distribution. McKay 06:36, 27 October 2006 (UTC)
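As a rough illustration of the point above, the sketch below compares the "+/- 2 standard deviations" rule (often called the Wald interval) with an interval built from the exact binomial distribution by bisection. That second construction is usually called the Clopper-Pearson interval; choosing it here is an assumption, since nobody in this thread names a specific method. The function names and the 60-step bisection are illustrative choices, z = 1.96 is used in place of 2, and `math.comb` needs Python 3.8 or later.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))


def wald_interval(k, n, z=1.96):
    """Normal-approximation interval; it can leave [0, 1] when k/n is near 0 or 1."""
    p_hat = k / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def exact_interval(k, n, alpha=0.05):
    """Interval from exact binomial tail probabilities (Clopper-Pearson style)."""

    def solve(g):
        # g is increasing on [0, 1]; bisect for the point where g = alpha / 2.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if g(mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p))
    # Upper limit: the p at which P(X <= k) equals alpha / 2 (solved in q = 1 - p).
    upper = 1.0 if k == n else 1 - solve(lambda q: binom_cdf(k, n, 1 - q))
    return lower, upper


if __name__ == "__main__":
    for k in (70, 99):
        print(k, wald_interval(k, 100), exact_interval(k, 100))
```

With 99 successes in 100 trials the Wald upper limit exceeds 1, while the exact-tail interval stays inside [0, 1] and is asymmetric about p_est.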

Simulation?

I was looking for a pointer to quickly simulate a Binomial trial. That is, given a p and an n, I want to randomly select a result with a Binomial distribution. I know I can approximate this with a normal distribution, but I would prefer an exact result if it can be calculated quickly for n < 10,000. I'm sure others have come here looking as well. Thanks.

I added two references to the article which describe binomial random variate generation. A modern C implementation of Kachitvichyanukul and Schmeiser's BTPE algorithm is available as part of the GNU Scientific Library. --MarkSweep 04:08, 8 October 2005 (UTC)
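For n below 10,000, one exact (though not the fastest) approach is simply to simulate the n Bernoulli trials and count the successes. The sketch below does that in Python. It is not the BTPE algorithm mentioned above, and the function name `binomial_draw` is made up for illustration.

```python
import random


def binomial_draw(n, p, rng=random):
    """Return one Binomial(n, p) variate by counting successes in n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


if __name__ == "__main__":
    rng = random.Random(2005)  # fixed seed only to make the demo repeatable
    draws = [binomial_draw(500, 0.05, rng) for _ in range(1000)]
    print(sum(draws) / len(draws))  # sample mean, to compare with n*p = 25
```

The cost is proportional to n per draw, which is why dedicated algorithms such as BTPE are preferred when n is large or many variates are needed.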

HIV positive?

Is it me, or should the "A typical example is the following: assume 5% of the population is HIV-positive." part in the second paragraph be changed to something a little less... you know... The HIV part is just not encyclopedia-ish...

That might depend on which population. Michael Hardy 19:42, 22 October 2005 (UTC)
Spot on. I thought exactly the same and immediately looked at the discussion. All political correctness aside, I just don't think anyone would feel harassed if we wrote "assume 5% of the population carry a certain gene" or "are infected with a certain disease", while I am very sure that everyone with an HIV infection, or anyone close to someone who is infected, will at least feel strange on reading this paragraph. I am all against political correctness for its own sake, but if there's no need whatsoever to use a certain formulation that might be considered inappropriate, why use it?

Probability mass function?

Okay, maybe this is standard jargon somewhere, but I've never come across it until today. I guess "mass" makes sense by the physical analogy to density. Honestly, I think it's stupid language. Should we also speak of cumulative mass distribution functions? Be consistent! I'm not going to change it, but a mathematician should. At the very least link it to the pmf page.

pmf is fairly standard. It is linked there now. No, cumulative mass distribution function is not a phrase I have heard. --Richard Clegg 08:25, 6 February 2006 (UTC)

CDF Example Request

The article gives the following example: "A typical example is the following: assume 5% of the population is green-eyed. You pick 500 people randomly. How likely is it that you get 30 or more green-eyed people?".

This is a CDF example. Unfortunately, the expression given for CDF is not very clear to me. How about giving a worked example with the green-eyed people given in the article as a good example, please? --New Thought 15:12, 8 May 2006 (UTC)

I think the given CDF is really merely an introduction of notation. Perhaps there is no simple closed-form expression for the CDF, although there is an obvious algorithm for computing its values (just add up the appropriate values of the mass function). Michael Hardy 18:27, 8 May 2006 (UTC)
aha - that's the answer I was looking for! In that case, why not say something like, "The value can be computed with..."
cdf(k;n,p) = \sum_{k=1}^n {n\choose k}p^k(1-p)^{n-k}\, --New Thought 09:14, 9 May 2006 (UTC)
Actually, in this case the CDF is
F(k;n,p) = \sum_{j=0}^k {n\choose j}p^j(1-p)^{n-j}
--MarkSweep (call me collect) 10:43, 9 May 2006 (UTC)
Good correction - I have added this expression to the article! --New Thought 13:04, 9 May 2006 (UTC)
That is correct only when k is an integer, and only when 0 ≤ k ≤ n. Michael Hardy 21:29, 9 May 2006 (UTC)
This is the binomial distribution - how can k not be either 0 or a positive integer? --New Thought 08:30, 10 May 2006 (UTC)
Just wanted to add - thanks for your help in getting to this article improvement! --New Thought 16:34, 10 May 2006 (UTC)
In that specific example, we have
\Pr[X \geq 30] = 1 - \Pr[X \leq 29] = 1 - F(29; 500, 0.05)
= 1 - I_{0.95}(471,30) = I_{0.05}(30,471) \approx 17.647\%.
You can compute this in terms of the incomplete Beta function, as indicated in the article, using your favorite numerical software. For example, in Mathematica this becomes BetaRegularized[0.05, 30, 471]. Direct summation is likely going to be less numerically stable than a carefully designed subroutine for evaluating the incomplete Beta function. --MarkSweep (call me collect) 06:11, 9 May 2006 (UTC)
Thanks very much for your response. I agree with you - and as it happens, I do use Maxima, which has a shed-load of distribution functions (load(distrib); followed by functions; will show them) - but I wanted to write the functions in Javascript for a web page. I went ahead and wrote the web page using the Poisson distribution - but I still think that this article should give expressions that people can use in normal languages and spreadsheets! I feel I've done my bit for Wikipedia maths clarity - in the Lottery_Mathematics article, mostly written by me, I did my best to make it clear exactly how to do each calculation! --New Thought 09:14, 9 May 2006 (UTC)
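Since the request was for something usable in an ordinary language, here is a minimal Python sketch of the direct summation for the green-eyed example. It evaluates each term in log space with `math.lgamma` so that the binomial coefficient cannot overflow. It assumes 0 < p < 1 and an integer k with 0 <= k <= n, and the function names are illustrative. For large n a library routine for the regularized incomplete Beta function is likely the better choice, as noted above.

```python
import math


def binom_pmf(j, n, p):
    """C(n, j) p^j (1-p)^(n-j), computed via logarithms; requires 0 < p < 1."""
    log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                + j * math.log(p) + (n - j) * math.log(1 - p))
    return math.exp(log_term)


def binom_cdf(k, n, p):
    """F(k; n, p) = sum over j = 0..k of the mass function, for integer 0 <= k <= n."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


if __name__ == "__main__":
    # Pr[X >= 30] = 1 - F(29; 500, 0.05)
    print(1 - binom_cdf(29, 500, 0.05))
```

The result can be compared with the value of about 17.6% quoted above from the incomplete Beta function.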

"nmemonic" section

I really dislike the "nmemonic" section. If anyone else agrees, please delete it. McKay 14:55, 11 June 2006 (UTC)

I agree. The mnemonic section is laughable. I'm deleting it. Rjmorris 14:44, 18 June 2006 (UTC)

Relationship to Bezier curves?

The article currently states: The formula for Bézier curves was inspired by the binomial distribution.

Would someone care to source that statement? It seems rather dubious to me, but if it's true it's worthy of a proper explanation and not the vague description of being "inspired by". Certainly the Bernstein polynomials, which constitute the basis functions for Béziers, contain a Binomial coefficient. But binomial coefficients exist all over the place. It doesn't necessarily imply that they have much at all to do with the Binomial distribution.

From reading about Bézier curves I've always had the impression that the decision to use Bernstein polynomials as their parametrization wasn't 'inspired' by anything, but that they were merely chosen from a group of candidates on the merit of their desirable properties. (Such properties include the fact that the curve is guaranteed to be contained within the convex hull of the control points, that reversing the control points does not change the curve, that the tangents at the endpoints lie along the line between the endpoint and the neighboring control point, etc.) --130.237.179.166 14:48, 3 September 2006 (UTC)

I'm deleting this since no justification has been offered. Zillions of things are "inspired" by the binomial distribution anyway and I don't see why this one is important enough to single out even if it is true. McKay 04:31, 28 October 2006 (UTC)

Better Example

I feel like there could be a better example than picking 500 people out of a population "with replacement" and seeing how many were green-eyed. Perhaps a more sensible and applicable example could be: out of 50 web servers, each of which has a 1% chance of failing by the end of the day, how many failed servers do you have at the end of the day?

The preceding unsigned comment was added by 18.216.0.100.

I agree. The current example suffers from the need to do sampling with replacement, which will seem unnatural to people unaccustomed to sampling theory. --McKay 05:52, 29 November 2006 (UTC)
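For the proposed web-server example, the numbers are easy to work out. Here is a small Python sketch, assuming the 50 servers fail independently of one another (an assumption the example needs but does not state) and using an illustrative function name:

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k failures among n independent servers."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n, p = 50, 0.01
expected_failures = n * p                 # mean of Binomial(n, p)
p_at_least_one = 1 - binom_pmf(0, n, p)   # same as 1 - 0.99**50

if __name__ == "__main__":
    for k in range(4):
        print(k, binom_pmf(k, n, p))
    print(expected_failures, p_at_least_one)
```

Under that independence assumption the expected number of failed servers is 0.5, and the chance of at least one failure by the end of the day is about 39.5%.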