Talk:Zipf's law
Is it true that the word "the" does indeed occur about twice as often as the next most common English word? The rest of the article seems to allow for some proportionality constants. AxelBoldt
It's not true, so I replaced it with a statement about Shakespeare's plays. AxelBoldt
random typing
The main article claimed that
- the frequency distribution of words generated by random typing follows Zipf's law.
I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely and so on. Or am I missing something? Maybe we should perform a little perl experiment. AxelBoldt
I tried this Python code, and plotted the results in a log-log plot -- the early ranks are a bit stepped, but the overall pattern fits Zipf's law rather well. The Anome
import random
import string
import math

N = 10000
M = 100
words = {}
for j in range(M):
    str = []
    for i in range(N):
        str.append(random.choice('aaabccdefg '))
    str = string.join(str, '')
    str = string.split(str)
    for word in str:
        if words.has_key(word):
            words[word] += 1
        else:
            words[word] = 1
    print 'did string pass', j
vals = words.values()
vals.sort()
vals.reverse()
file = open('zipf_ranks.txt', 'w')
# Let's just have the first few ranks
useranks = min(1000, len(vals))
for i in range(useranks):
    rank = i + 1
    file.write("%d %d\n" % (rank, vals[rank-1]))
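(For anyone who wants to redo the log-log plot from the zipf_ranks.txt file that the script writes out, a minimal sketch along the following lines should work. It assumes numpy and matplotlib are available; it is not the exact plotting code The Anome used, and the fitted slope is just a crude least-squares estimate of the Zipf exponent.)

import numpy as np
import matplotlib.pyplot as plt

# Read the rank/frequency pairs written by the script above.
rank, freq = np.loadtxt('zipf_ranks.txt', unpack=True)
log_rank, log_freq = np.log10(rank), np.log10(freq)

# A Zipf distribution is a straight line on the log-log plot; the slope is -s.
slope, intercept = np.polyfit(log_rank, log_freq, 1)
print('fitted slope:', slope)

plt.plot(log_rank, log_freq, '.', label='random typing')
plt.plot(log_rank, slope * log_rank + intercept, label='fit (slope %.2f)' % slope)
plt.xlabel('log10 rank')
plt.ylabel('log10 frequency')
plt.legend()
plt.show()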
Could you repeat the experiment with all letters and the space getting the same probability? That's at least what I thought off when I heard "random typing". AxelBoldt
- When I was told of the "random typing" experiment, the "typing" part was more important than the "random". If you sit at a keyboard and type randomly, you have a much higher chance of hitting certain keys than others because your fingers tend to like certain positions. Also, the human brain is a pattern-matching machine, so it works in patterns. If you look at what you type, you might notice you tend to type the same sequences of letters over and over and over.
- For a brief historical note, this theory was used in cracking one-time-pad cyphers. Humans typed the pads, so it was possible to guess at the probability of future cyphers by knowing past ones. It is like playing the lottery knowing that a 6 has an 80% chance of being the first number while a 2 has a 5% chance. If you knew the percentage chance for each number and each position, you would have a much greater chance of winning over time. Kainaw 13:59, 24 Sep 2004 (UTC)
I can't right now, but I'll give you the reason for the skewed probabilities -- the space is by far the most common character in English, and other chars have different probabilities -- I wanted to model that. The Anome
The main article claimed that
- the frequency distribution of words generated by random typing follows Zipf's law.
I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely and so on. Or am I missing something?
Yes, you're missing something.
It doesn't *exactly* match Zipf's law. But then, no real measurement exactly matches Zipf's law -- there's always "measurement noise". It does come pretty close. In most English text, about 1/5 of all the characters are space characters. If we randomly type the space and 4 other letters (with equal letter frequencies), then we expect words to have one of these discrete probabilities:
1/4 * 1/5: each of the 4 single-letter words
1/4 * 1/4 * 1/5: each of the 4*4 two-letter words
1/4 * 1/4 * 1/4 * 1/5: each of the 4*4*4 three-letter words
...
(1/5) * (1/4)^n: each of the 4^n n-letter words
This is a stair-step graph, as you pointed out. However, if we plot it on a log-log graph, we get
x = log(4^n / 2) = n*log(4) - log(2)
y = log((1/5) * (1/4)^n) = -n*log(4) - log(5)
which is pretty close to a straight line (and therefore a Zipf distribution), with slope
m = Δy / Δx = -1.
You get the same stair-step on top of a straight line no matter how many letters (plus space) you use, with equal letter frequencies. If the letter frequencies are unequal (but still memoryless), I think that rounds off the corners and makes things even closer to a Zipf distribution. --DavidCary 01:54, 12 Feb 2005 (UTC)
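(A quick numerical check of the above -- my own sketch, not DavidCary's code: four equally likely letters plus a space, each hit with probability 1/5, exactly the setup described. The least-squares slope over the log-log rank-frequency points should come out roughly constant and not far from -1, with the stair-steps and some sampling noise at the rare-word end superimposed.)

import math
import random
from collections import Counter

random.seed(0)
# "Random typing": each of a, b, c, d and the space has probability 1/5.
text = ''.join(random.choice('abcd ') for _ in range(10 ** 6))
counts = Counter(text.split())

# Rank-frequency points on log-log axes.
freqs = sorted(counts.values(), reverse=True)
pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]

# Crude least-squares slope through the points.
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
print('distinct words:', len(freqs), 'fitted slope:', slope)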
The problem here seems to arise from the ambiguity of the phrase "random typing". To avoid this, I'll replace the phrase with "random sampling of a text". Jekyll
It's already gone, never mind. Jekyll 14:45, 18 November 2005 (UTC)
- This is a pretty minor example of Zipf's law. A much more important one is that the frequencies of occurrence of words in natural language conform to Zipf's law. This is in the original 1949 reference. Callivert (talk) 11:29, 24 March 2008 (UTC)
Why?
Moved from main article:
- We need an explanation here: why do these distributions follow Zipf's law?
- No we don't. Zipf's law is empirical, not theoretical. We don't know why it works. But even without a theory, even the simplest experiments that try to model a society of independent actors consistently turn it up!--BJT
Well, empirical facts have to be explained too. It's not enough to simply state that the moon always shows us the same side; you have to give the reason if you try to understand the world. It's the same here. If Zipfian distributions show up in a variety of situations, then there must be some underlying principle which generates them. I doubt very much that "we don't know" that principle. AxelBoldt
- I agree - every theory begins with empirical evidence. The theory models an explanation to fit those facts. I'm sure someone has tried to come up with an explanation? 70.93.249.46
- We need an explanation here: why do these distributions follow Zipf's law?
An excellent question. However, I doubt there is a single cause that can explain every occurrence of Zipf's law. (For some distributions, such as the wealth distribution, the cause of the distribution is controversial.)
Well, empirical facts have to be explained too. It's not enough to simply state that the moon always shows us the same side; you have to give the reason if you try to understand the world.
Good point. However, sometimes we don't yet know the cause of some empirical facts -- we can't yet give a good explanation. In those cases, I would prefer the Wikipedia article to bluntly tell me "we don't know yet" rather than try to dance around that fact.
While it is true that Zipf's law is empirical, I agree with AxelBoldt that it is useful to have an interpretation of it. The most obvious place to look is the book that Zipf himself wrote in which he linked his observation to the Principle of Least-Effort, kind of an application of Conservation of Energy to human behavior.
I've just measured the Polish Wikipedia page access distribution using Apache logs (so they had to be heavily Perl-scripted) for about 2 weeks of late July, only for main namespace articles, and excluding the Main Page. For the most part it seems to follow Zipf's law with b about 0.5, except at both ends, where it behaves a bit oddly (which was to be expected). Now why did I get a constant so grossly different from the one stated for the English Wikipedia?
Some possibilities:
- Polish and English Wikipedias really have different Zipf's factors
- It was due to my perlscripting
- The measurement given here for the English Wikipedia is wrong for some reason, e.g. it measured only the top 100 articles rather than all of them.
Taw 01:38, 4 Aug 2003 (UTC)
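(For anyone wanting to repeat this kind of measurement, a generic sketch of the fitting step might look like the following. This is not Taw's script; hits.txt is a hypothetical file with one per-article view count per line, and the exponent b comes from a least-squares fit on the log-log plot with both badly behaved ends trimmed off.)

import numpy as np

# Hypothetical input: one page-view count per article, one integer per line.
counts = np.loadtxt('hits.txt')
counts = np.sort(counts[counts > 0])[::-1]
ranks = np.arange(1, len(counts) + 1)

# Trim both ends, where the data behaves oddly, before fitting.
lo, hi = int(0.01 * len(counts)), int(0.90 * len(counts))

# Straight-line fit on the log-log plot; the Zipf exponent b is minus the slope.
slope, _ = np.polyfit(np.log10(ranks[lo:hi]), np.log10(counts[lo:hi]), 1)
print('fitted b:', -slope)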
I think we may be fast approaching the point when merging this article with Zipf-Mandelbrot law would be appropriate, along the way doing some reorganizing of the article. Michael Hardy 22:57, 6 Dec 2003 (UTC)
Although the reason is not well understood, mechanisms that bring about the Zipf distribution have been suggested by physicists. Power laws tend to crop up in systems where the entities (words in this case) are not independent, but interact locally. The choice of a word isn't random, nor does it follow a mechanistic prescription - the choice of a word depends strongly on what other words have already been chosen in the same sentence/paragraph. I think these speculations should be mentioned in the article as a side note, for the sake of completeness. 137.222.40.132 12:45, 17 October 2005 (UTC)
It is meaningful to ask why, in general, a particular distribution is found in nature - the passage above is a good start; I'd like more clarification. For example, the normal distribution arises when the outcome is caused by a large number of minor factors, none of which predominates. The bimodal distribution arises when there are a large number of minor factors coupled with one predominant factor. The Poisson distribution arises when an event is the consequence of a large number of rare events converging. Etc. For Zipf's distribution, I would like to know: why does interdependence of events lead to it?
Consistency of variables in text
In the examples section, the variable quoted in each case is b. However, this variable is not used anywhere else. Some consistency throughout the article would be nice (and less confusing!). — 130.209.6.41 17:01, 1 Jun 2004 (UTC)
- I agree - please explain: what is b? \Mikez 10:00, 8 Jun 2004 (UTC)
- I was about to add something about this as well... is it what is called s in the discussion of formulas? It's not clear from the text. -- pne 14:01, 8 Jun 2004 (UTC)
- Well, I've seen Zipf's law stated as f_n = [constant] / n^b. So I'm pretty sure that s and b are the same thing. I'm changing b to s in the "Examples..." section. -- Aparajit 06:01, Jun 24, 2004 (UTC)
- Can someone check the values of b / s given in the examples? Especially the word frequency example. I took the data for word frequencies in Hamlet and fitted a line to the log-log plot. This gave a slope of more like 1.1, rather than the 0.5 figure quoted here. Taking the merged frequencies over the complete set of plays gives a value closer to 1.3. This would agree more with the origin of Zipf's law, which is that the frequency of the i-th word in a written text is proportional to 1/i. The value of 0.5 seems much too small to match this observation. Graham.
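(If anyone wants to repeat Graham's check, a rough sketch of the kind of fit involved is below. hamlet.txt is a hypothetical plain-text copy of the play, the tokenization is deliberately crude, and the exact value will depend on such details; Graham reports a slope of magnitude around 1.1 for Hamlet.)

import re
from collections import Counter
import numpy as np

# Crude tokenization of a hypothetical plain-text copy of the play.
words = re.findall(r"[a-z']+", open('hamlet.txt').read().lower())
freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Slope of the log-log rank-frequency plot (near -1 if Zipf's law holds).
slope, _ = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
print('fitted slope:', slope)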
Linked site doesn't exist
It seems that the [popular pages] special page no longer exists, but it's used as an example on this page.
Has this page simply moved or do we need to get a new example?
reported constants in Examples section
The examples section reports values of s < 1 as resulting from analysis of Wikipedia page view data. The earlier discussion correctly notes that such values do not yield a valid probability distribution. What gives? Perhaps (s - 1) is being reported?
- Or it could be that that value of s is right for a moderately large (hundreds?) finite number of pages. That seems to happen with some usenet posting statistics. Michael Hardy 22:07, 28 Jun 2004 (UTC)
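A side note on the normalization point: the constraint s > 1 only arises when the support is infinite, since

sum_{k=1..infinity} 1/k^s = zeta(s), which is finite only for s > 1,

whereas over a finite list of N pages

sum_{k=1..N} 1/k^s = H(N,s)

is finite for every s >= 0. So a fitted value like s = 0.5 can still describe a valid distribution over a finite set of pages; it just cannot be extended to an infinite one.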
I miss a reference to Zipf's (other) law: the principle of least effort.
s = 1?
I plotted Shakespeare's word frequency lists and top 5000 words in Wikipedia: [1] Where did the old value of s~0.5 come from? -- Nichtich 00:27, 23 Jun 2005 (UTC)
Sources needed for examples
Section 3 ("Examples of collections approximately obeying Zipf's law") has a bunch of examples with no further explanation or reference (Shakespeare excluded). That's highly undesirable, so let's get some sources. By the way, I find the final point (notes in a musical performance) very questionable. I imagine it would depend very much on the type of music. EldKatt (Talk) 13:20, 19 July 2005 (UTC)
- It might not be a bad idea to give examples from Zipf's book. Also see the article by Richard Perline in the February 2005 issue of Statistical Science. Michael Hardy 22:28, 19 July 2005 (UTC)
Too technical?
Rd232 added the "technical" template to the article. I've moved it here per template. Paul August ☎ 03:49, 27 November 2005 (UTC)
- There have been numerous edits to the article since the template was first added almost a year ago. Also, there has been no discussion of what about the article is too technical so I've removed the template. Feel free to put it back, but if you do, please leave some comments as to what you find is too technical and some suggestions as to how to improve the article. Lunch 02:25, 21 November 2006 (UTC)
Does Wikipedia traffic obey Zipf's law?
Yes, apparently, with an exponent of 0.5. See Wikipedia:Does Wikipedia traffic obey Zipf's law? for more. -- The Anome 22:45, 20 September 2006 (UTC)
Wikipedia's Zipf law
Just a plot of English Wikipedia word frequencies: http://oc-co.org/?p=79
- Is this plot available to Wikipedia - i.e. is it free content? It would look good in the article.
- Yes, it is released under LGPL by me, the author :) -- Victor Grishchenko
- I downloaded it, tagged it as LGPL with you as author, and put it into the article - please check it out and make corrections if needed. This is an excellent demonstration of Zipf's law (and its limitations). Thanks! PAR 15:00, 29 November 2006 (UTC)
[edit] On "Zipf, Power-laws, and Pareto - a ranking tutorial"
I've found two doubtful places in the tutorial by L. Adamic (external link no. 3). Probably I've misread something...
First, "(a = 1.17)" regarding to Fig.1b must be a typo; the slope is clearly -2 or so.
Second, it is not clear whether Fig. 2a is a cumulative or a disjoint histogram. To the best of my knowledge, a Zipf distribution binned logarithmically into disjoint bins must have slope -1 and not -2, i.e. if every bin catches the items whose popularity lies in the range [c^i, c^(i+1)). Just to verify it, I did a log2-log2 graph of log2-binned word frequencies compiled from Wikipedia, i.e. y is the log2 of the number of words mentioned between 2^x and 2^(x+1) - 1 times in the whole Wikipedia. Although the curve is not that simple, it shows slope -1, especially for the more frequent words.
Any thoughts? Gritzko
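(To make the binning concrete, here is a sketch of the histogram described above. word_counts is assumed to be a dict mapping each word to its number of occurrences, compiled however you like; for a Zipf-like vocabulary, plotting log2 of the bin sizes against the bin index should give a slope near -1, as described above.)

import math
from collections import Counter

def log2_histogram(word_counts):
    # Bin x holds the words seen between 2**x and 2**(x+1) - 1 times.
    bins = Counter()
    for c in word_counts.values():
        bins[int(math.log2(c))] += 1
    return bins

# Tiny made-up example, just to show the shape of the output.
demo = {'the': 1024, 'of': 700, 'and': 512, 'cat': 3, 'dog': 2, 'mat': 1}
for x, size in sorted(log2_histogram(demo).items()):
    print(x, size)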
- Yes, it looks like a typo; a = 1.17 certainly is not right. Regarding Fig. 2a, it is cumulative (the vertical axis is "proportion of sites", which will have an intercept at (1,1)). Zipf's law does not specify an exponent of -1, just that it is some negative constant. It happens to be close to -1 for word frequency, but maybe it's closer to -2 for the AOL user data. PAR 14:20, 1 January 2007 (UTC)
- "As demonstrated with the AOL data, in the case b = 1, the power-law exponent a = 2.", i.e. b is "close to unity" in the case of AOL user data. I had some doubts whether Fig 2a is cumulative because at x=1 y seems to be slightly less than 1. Probably, it is just a rendering glitch. Thanks! -- Gritzko
k = 0 in support?
I'm not sure if k = 0 should be included in the support... the pmf is not well defined there, since 1/k^s = 1/0 diverges to +inf. Krzkrz 08:56, 3 May 2007 (UTC)
biographical information
I was surprised that there wasn't even a brief note at the beginning of the article on who Zipf is (was?). I think it's sort of nice to see that before you get into the technical stuff. Jdrice8 05:38, 14 October 2007 (UTC)
- That was the brief second paragraph of the article. Now I've made it into a parenthesis in the first sentence, set off by commas. Michael Hardy 01:58, 15 October 2007 (UTC)