Talk:Benford's law
Deletion explained
I deleted this statement (in the restricted data set analysis): " In the case of the villages it could be applied, but expecting only first numbers to be 3, 4, 5, 6, 7, 8, 9 each getting the same relative probabilities as in the general case."
This is incorrect. In this case, the first digit is the one-digit approximation to the distribution over the restricted range. The first digit won't follow Benford's law, but instead will follow the underlying distribution, whatever it is -- for example, for the villages case, it might be a normal distribution (exp(-x^2)).
Benford's law does not apply to a set of numbers that spans less than an order of magnitude. For example, the first digit of the height of adult male humans in meters or feet does not follow Benford's law, although the first digit of height in mm does.
- 'Fraid not. The first digit of height in mm is the same as the first (non-zero) digit of the height in meters.--Henrygb 21:05, 18 October 2006 (UTC)
First digit or all digits
I thought this law only applies to the first digit of numbers, but the article seems to suggest it works for all digits. Is this correct? AxelBoldt 11:32, 4 February 2002 (UTC)
- Well, it suggests that 13 would be more common than 15, for instance, yes. That's just the same law applied to base 100. —Simetrical (talk • contribs) 02:18, 14 February 2006 (UTC)
- It applies to the second digit, etc., but with a much smaller deviation from a uniform distribution. The deviation from uniform of the first digit is so profound that it is easy to statistically detect it in real data. I think you would be hard pressed to statistically detect it in the second digit in real data. Bubba73 (talk), 02:23, 14 February 2006 (UTC)
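A quick sketch of the sizes involved, using the standard extension of the law to later digits (the code below is only an illustration, not part of the original discussion):

    import math

    # First-digit probability under Benford's law: log10(1 + 1/d)
    first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

    # Second-digit probability: sum over every possible first digit d1
    second = {d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
              for d2 in range(10)}

    print(first[1], first[9])    # about 0.301 vs 0.046 -- far from uniform (1/9)
    print(second[0], second[9])  # about 0.120 vs 0.085 -- close to uniform (1/10)

So the first digit ranges from about 30% down to 4.6%, while the second digit only ranges from about 12% down to 8.5%, which is why the effect is so much harder to detect in the second digit of real data.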
Links wanted
I am pleased to see how my initial, very basic and imprecise article has in a few days evolved into something quite decent. But I think this law deserves to be much more widely known, so it would be nice if there were more links to it from other pages. At the moment there is just one! Maybe the Wikipedians who have delved more deeply into the articles about statistics and probability have an idea about where suitable links could be placed? Calypso 15:51, 25 February 2002 (UTC)
Proof vs. demonstration
- "This can be proven mathematically: if one repeatedly "randomly" choses a probability distribution and then randomly choses a number according to that distribution, the resulting list of numbers will obey Benford's law."
- is that proof or a demonstration? -- Tarquin 01:44, 26 June 2002
Explanations
The first two 'explanations' are patently absurd. The second shows that Benford's law is a limiting form of the zeta distribution but doesn't say why it works. The first never gives Benford's law. More precisely, as you count, the proportion of 1s increases, then the proportion of 2s until it equals the 1s, then the proportion of 3s, and so on. At each stage there are at most 3 different proportions.
As far as I can see, Hill's explanation is the most likely.--Terry Moore (218.101.112.28 (talk • contribs)) 10:17, 6 March 2004 (UTC)
- Right. So if you had an infinite run of digits (1 to infinity), then Benford's law wouldn't hold - each initial digit would occur an equal number of times. But no real-life data set is infinite, so all real-life data sets have a highest (maximum) figure. Suppose that the maximum is set at random; then, because of the way the proportions work (first mostly 1s, then 2s catching up, then 3s catching up), whatever range you end up with will tend to favour 1s over 2s, 2s over 3s and so on. That's why it works with real-life data, but not, say, with the phone book. Toby W 10:25, 6 Mar 2004 (UTC)
Not quite--there are only three possible proportions at each stage. However, you get Benford's law if you count geometrically. Suppose you invest $1 at 7% compound interest. Then your investment doubles roughly every 10 years. For the first 10 years your monthly statement will show a first digit of 1, but it goes through 2 and 3 over the next 10 years. When it reaches 5 it only takes 10 years to cycle through 5, 6, 7, 8 and 9 before getting to 1 again.--Terry Moore (192.195.12.4 (talk • contribs)) 00:08, 18 May 2004 (UTC)
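A rough simulation of that compound-interest picture (the 7% rate comes from the comment above; the monthly compounding and the 300-year horizon are only assumptions made to get enough data points):

    import math
    from collections import Counter

    balance, counts = 1.0, Counter()
    for month in range(300 * 12):               # 300 years of monthly statements
        balance *= 1.07 ** (1 / 12)             # 7% per year, compounded monthly
        counts[int(10 ** (math.log10(balance) % 1))] += 1   # leading digit

    total = sum(counts.values())
    for d in range(1, 10):
        # observed frequency vs Benford's log10(1 + 1/d)
        print(d, round(counts[d] / total, 3), round(math.log10(1 + 1 / d), 3))

The observed frequencies should track log10(1 + 1/d) closely, because geometric growth makes the fractional part of the logarithm sweep its range evenly.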
Sequential access
Shyamal's recent addition: Benford was an astronomer, and it is generally believed that the law was discovered when he noticed that the early pages of the book of logarithms were more used than the later ones. However, it has been argued that any book that is sequentially accessed would show more wear and tear on the earlier pages. This story might thus be apocryphal, just like Newton's supposed discovery of gravity from observation of a falling apple.
- True, any book that is sequentially accessed would show more wear and tear on the earlier pages. But isn't that the point? Logarithm tables aren't sequentially accessed, are they? You turn to the page which has the number whose log you want to know (like a dictionary); you don't read it from start to finish (like a novel). So, if you more frequently want to know log 1 or 10 or 100 than you do 9 or 99 or 999, the earlier pages will show more wear and tear than the later ones. Besides, isn't the historical question of whether the law occurred to Benford when he looked at the wear and tear on a logarithm book independent of the question of whether he was right to jump to the conclusion he did in fact jump to? Toby W 09:40, 25 Mar 2004 (UTC)
Incorrect statement removed
I have taken out The product of n uniform random numbers will conform to Benford's law. (The sum of n uniform random numbers tends towards a normal distribution.) since it is not really true (try uniform on [0,1] and n=2, or uniform on [10,11] and n=7). In fact it tends to a log-normal distribution, and only comes close to Benford's law when the variance is large, i.e. for large enough n. --Henrygb 16:25, 21 Jun 2004 (UTC)
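The two counterexamples are easy to check by simulation, together with a case where the product has passed through enough factors to come close (the sample sizes and the n = 20 comparison case are my own assumptions):

    import math, random
    from collections import Counter

    def leading_digit(x):
        return int(10 ** (math.log10(x) % 1))

    def first_digit_freqs(low, high, n, samples=100_000):
        counts = Counter()
        for _ in range(samples):
            product = 1.0
            for _ in range(n):
                product *= random.uniform(low, high)
            counts[leading_digit(product)] += 1
        return [round(counts[d] / samples, 3) for d in range(1, 10)]

    random.seed(0)
    print("Benford       ", [round(math.log10(1 + 1 / d), 3) for d in range(1, 10)])
    print("U(0,1),   n=2 ", first_digit_freqs(0, 1, 2))    # visibly off
    print("U(10,11), n=7 ", first_digit_freqs(10, 11, 7))  # every product starts with 1
    print("U(0,1),   n=20", first_digit_freqs(0, 1, 20))   # much closer to Benford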
Unclear statement
The following statement is unclear:
"The precise form of Benford's law can be explained if one assumes that the logarithms of the numbers are uniformly distributed."
To be uniformly distributed, a variable must have a largest and smallest possible value. These are not specified in the discussion.
This objection is correct. (One could also give a meaning to uniform distributions on unbounded intervals, but there is no need of such an assumption here). The sentence above should be made precise this way:
"The precise form of Benford's law can be explained if one assumes that the MANTISSAS of the numbers (that is, the fractional parts of logarithms) are uniformly distributed in the unit interval." PMajer 10:33, 28 December 2006 (UTC)
The explanation based on scale-invariance is unclear as well. —The preceding unsigned comment was added by Pvh (talk • contribs) 20:25, 10 May 2005 (UTC).
Right, the product of two uniformly-distributed numbers is not a Benford distribution, but it is approaching it. The product of n independent numbers approaches Benford's law, and n=3 is already pretty close (close enough to be a statistical match, IIRC). Bubba73 (Bubba73 (talk • contribs)) 18:20, 11 May 2005 (UTC)
Problems
If I understand this law correctly, if I go to the telephone book (White Pages) and start listing the first digit of each house number for sequential entries, skipping entries of the same surname at the same address, I will have a sequence of first digits that should be distributed according to Benford's law. It does not happen. Frankly I don't see why it should.
Newton's apple
Some version of Newton's apple story is probably true. See
- http://www.sfu.ca/physics/ugrad/courses/teaching_resources/demoindex/mechanics/mech1l/apple.html
- http://en.wikipedia.org/wiki/Isaac_Newton#Newton.27s_apple
So I removed mention of it here. --Mmm 05:35, 25 March 2006 (UTC)
Explanation (again)
Someone (above) mentioned their dissatisfaction with the explanation given for this "law". I am not happy with it either. I think I can improve on the reasoning, although I'm not all the way there yet. I'll present my partial result here, with the wish that it may stimulate discussion and eventual consensus on a convincing and (hopefully) easily understood explanation:
BL struck me immediately as sensible, although the explanations published in the article tend to miss the point as far as I'm concerned. Here are a couple of examples to show how BL works:
Street Numbers A suburban developer decides to number his streets (as opposed to naming them). He will start at 1st Street and continue on with some generally small sequence. Taking the totality of all developments of this type, it is obvious that 1st Street will occur much more frequently than 65th Street, as all such developments will have a 1st Street, whereas few will be large enough to have a 65th Street. Of course, this affects the distribution of first digits. E.g., 1 will occur more frequently than 6 because more developments will stop at, say, 15th Street than go up to 65th Street.
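A small simulation in this spirit (the distribution of development sizes below is purely an illustrative assumption, not something claimed in the argument):

    import math, random
    from collections import Counter

    random.seed(0)
    counts = Counter()
    for _ in range(10_000):                                       # many developments
        n_streets = min(int(random.expovariate(1 / 30)) + 1, 500)  # assumed sizes
        for street in range(1, n_streets + 1):
            counts[int(str(street)[0])] += 1                      # first digit

    total = sum(counts.values())
    for d in range(1, 10):
        print(d, round(counts[d] / total, 3), round(math.log10(1 + 1 / d), 3))

The counts lean heavily toward low digits, in the same direction as Benford's law, though the exact fit depends on the assumed size distribution.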
Billing It is frankly more difficult to convincingly account for the preponderance of lower digits occurring in day-to-day bills. Nonetheless, I do have some thoughts. Perhaps they may inspire another reader to arrive at a rigorous explanation.
I suspect that something similar to the 'street number argument' may account, in part at least, for a preponderance of smaller digits in the leading digit of household bills. To understand this it is useful to remember that the value of a unit of currency is not a random variable, but rather is chosen, from motives of convenience, to correspond roughly to the price of small, day-to-day purchases:
Imagine you were to find yourself at a grocery store checkout counter in an unknown country, having to purchase a single red apple. You have no idea of the currency's value. Still, it would be reasonable for you to expect the bill to come in at something like 1 currency unit, and you would probably be right to feel some suspicion if the cashier were to hand you a bill for 57,844 currency units.
Of course, the 57,000 CU apple does happen in places. But this is always an indication that the currency has departed severely from its original value (runaway inflation). Usually this is a temporary anomaly. At some point the government will introduce a new currency, thereby rescaling prices.
It is also true that certain private purchases (real estate, for instance) will require exchanges in the thousands or even millions of currency units. But the point is that these purchases are likely to be for infrequently purchased items. It seems reasonable to expect that the phone bill, the electricity bill, etc. will be a small number of currency units and, hence, disproportionately likely to begin with a small digit. Later digits in a bill amount may indeed be evenly distributed (or nearly so), but a preponderance of low digits in the leading figure is enough to influence the cumulative result.
In addition to this argument, I suspect another effect contributes to enhance the probability of low digits in household bills. This effect is an outcome of the fact that the standard deviation in billing amounts is generally proportional to the average bill amount, rather than being some constant amount.
To take an example, suppose that electricity bills typically cluster within about 25% of some average value. If this average happened to be, say, 15 CU, then the range from 11.25 CU to about 18.75 CU is common, so the majority of bills will begin with a 1. If, on the other hand, the average bill were 85 CU, then the leading digit would not be so highly concentrated at 8. Rather, there would be a range of high-frequency leading digits (6, 7, 8, 9 and 1). Of course, 8 will also occur as a leading digit with a fairly high frequency if the average bill were 75 CU or 95 CU as well. Nevertheless, as the spread of leading digits is greater for the higher average bill, the set of circumstances (average and standard deviation of bill amounts) giving rise to a higher leading digit is relatively narrower, so that lower digits are more likely to occur.
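A rough sketch of that comparison (modelling the bills as normally distributed, and reading the 25% clustering as one standard deviation, are my own assumptions about the example):

    import math, random
    from collections import Counter

    def leading_digit(x):
        return int(10 ** (math.log10(x) % 1))

    def first_digit_freqs(mean, rel_sd=0.25, samples=100_000):
        counts, kept = Counter(), 0
        for _ in range(samples):
            bill = random.gauss(mean, rel_sd * mean)   # bill spread around the mean
            if bill > 0:                               # discard (rare) negative draws
                counts[leading_digit(bill)] += 1
                kept += 1
        return [round(counts[d] / kept, 3) for d in range(1, 10)]

    random.seed(0)
    print("mean 15 CU:", first_digit_freqs(15))   # mass concentrated on digit 1
    print("mean 85 CU:", first_digit_freqs(85))   # mass spread over 6-9 and 1

As the paragraph argues, the low average concentrates nearly everything on a leading 1, while the high average spreads the leading digit over several values.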
--Philopedia 10:19, 8 May 2006 (UTC)
Naive explanation?
It is not clear to me why the explanation should "assume" that it is the log of the first numeral that should be evenly distributed rather than the numeral itself. However, couldn't the explanation be simply this: In most counts, quantities, or measurements, unless the datum is 0 (which can only occur in the units (10^0) place, and there are many populations where the datum must be >0), the first digit (regardless of which place) must be ≥ 1, but it needn't be ≥ 2; if it isn't 1, then it must be ≥ 2, but it needn't be ≥ 3; and so on up to 9. A logarithmic distribution would capture this perfectly.
By the way, in binary (or notches on a stick) the first numeral will always be 1 unless the datum is 0. While this result is obvious, it is consistent with Benford's law and a logarithmic distribution.
Finell (Talk) 22:30, 17 September 2006 (UTC)
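For what it is worth, the binary observation above is consistent with the general-base form of the law, which for base b predicts

    P(leading digit = d) = log_b(1 + 1/d),  for d = 1, ..., b−1;

in base 2 the only possible leading digit is 1, and log2(1 + 1/1) = log2(2) = 1, i.e. certainty.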
Number of digits in percentages
This is a response to a comment on my own talk page from User:Das my, who wondered why I reverted his three-decimal expansion of the percentage likelihood of each digit, turning it back into one decimal place. (The rollback button doesn't allow edit summaries - a longtime complaint.)
I did it because the extra decimals struck me as an example of unnecessary precision. It might be more accurate in a mathematical sense to say the probability of starting with 1 is 30.103% instead of 30.1%, but it doesn't make the issue any clearer to the reader (the difference, after all, is only one part in 10,000). It just throws in more digits that increase the chance of misreading a number. Since these are log calculations, you could expand it to 100 decimal places if you wanted, but so what?
The chart lets casual readers get a quick sense of how the likelihood decreases as digits increase. Keeping one decimal place is the coarsest we can be and still demonstrate how the probability declines with every digit - otherwise I'd suggest rounding to the nearest integer. - DavidWBrooks 20:02, 4 October 2006 (UTC)
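For reference, both roundings come straight from the formula; a throwaway snippet:

    import math

    for d in range(1, 10):
        p = 100 * math.log10(1 + 1 / d)
        print(d, f"{p:.1f}%", f"{p:.3f}%")   # e.g. 1 -> 30.1% vs 30.103%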