Wikipedia:Modelling Wikipedia's growth

Shortcut:
WP:GROWTH

This page analyses the article count data in Wikipedia:Size of Wikipedia and attempts to fit a simple numerical model of past and future growth to the observed article count and growth rate.

Growth of the article count

The following graph shows the number of articles on the English Wikipedia from its creation in 2001 up to the end of January 2008. It is constructed from the data at Wikipedia:Size of Wikipedia.

Several features of this graph are discussed at Wikipedia:Size of Wikipedia. Here, two models are presented to attempt to explain the observed general trends in article growth.

Is the growth in article count of Wikipedia logistic?

Figure: Number of articles on en.wikipedia.org and logistic extrapolations
Figure: Percentage growth per month

If Wikipedia's growth followed the exponential growth model, the average rate of growth would be proportional to the size of Wikipedia. The annual percentage growth rate would stay constant, as would the time taken for the number of articles to double. As can be seen here and on the second graph, this is not the case: the percentage growth is steadily falling.

Maybe Wikipedia's growth follows the logistic growth model better. This model is based on:

  • more content leads to more traffic, which in turn leads to more new content
  • however, more content also leads to less potential content, and hence less new content
  • the limit is the combined expertise of the possible participants.

Some characteristics of this model are:

  • there is a maximum to the number of articles. For Wikipedia this is hard to imagine, as there will always be new events and people to describe in the future. Compared to the large number of existing articles, though, this stream of new topics is very small.
  • in the end, the growth falls to zero.
  • at the pivot point (halfway to the maximum) the growth is at its peak. For the English Wikipedia this might have been in August 2006, with 60,000 new articles a month.

This model is related to the quantity (number of articles). The quality might still increase independently.
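The characteristics listed above can be checked with a small numerical sketch of a logistic curve. The ceiling K below is purely hypothetical (it is not taken from the data); only the peak of 60,000 articles a month comes from the text. The rate r is then fixed by the fact that a logistic curve grows fastest at half its ceiling, where the growth equals r*K/4.

```python
import math

K = 3_000_000      # hypothetical ceiling on the article count (assumption, not from the data)
PEAK = 60_000      # peak growth in articles per month, as quoted in the text
r = 4 * PEAK / K   # logistic rate per month, since the maximum growth is r*K/4


def articles(t):
    """Logistic article count; t is in months relative to the pivot point."""
    return K / (1 + math.exp(-r * t))


def monthly_growth(t):
    """Growth rate dN/dt = r*N*(1 - N/K), in articles per month."""
    n = articles(t)
    return r * n * (1 - n / K)


for months in (-48, -24, 0, 24, 48):
    print(months, round(articles(months)), round(monthly_growth(months)))
```

Under these assumptions the growth is symmetric around the pivot point and tends to zero as the count approaches K. A different choice of K changes the time scale but not the shape.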


Is the growth in article count of Wikipedia exponential?

One common model of Wikipedia growth is that:

  • more content leads to more traffic
  • which leads to more edits
  • which generate more content

Thus, the average rate of growth should be proportional to the size of the Wikipedia, that is, the growth should be exponential.

The graph of article count on the right is plotted on a logarithmic scale, so exponential growth should appear as a straight line. Between October 2002 and July 2006 the data fit the dotted line very well. From July 2006 onwards there is a noticeable fall-off from linear behaviour, and the behaviour before October 2002 is more complex.

The graph on the right below is a close-up of the data points that follow a linear trend; the best-fit line in red was computed using linear regression. From the slope of this best-fit line, the characteristic time τ of the exponential growth (the time over which the article count grows by a factor of e) can be found, giving:


N(t)=N(0)\ e^{t/\tau};\quad
\tau\approx 500\ \mathrm{days}

In prose, the previous expression means that the number of articles doubled once every 346 days (τ ln 2) from October 2002 to October 2006, to a very good approximation. Had Wikipedia kept up this trend, as shown on the graph, the number of articles would have been 1,900,000 by December 2006, 2,800,000 by June 2007 and 4,000,000 by December 2007. In fact the growth has slowed down, and Wikipedia has dropped out of exponential growth.
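The link between the characteristic time and the doubling time can be verified in a few lines. This is only a sketch: the value of 500 days is the rounded figure from the fit above, and the function name is illustrative.

```python
import math

tau_days = 500  # characteristic time from the fit, as quoted in the text

# N(t) = N(0) * exp(t / tau) doubles when exp(t / tau) = 2, i.e. t = tau * ln 2
doubling_days = tau_days * math.log(2)


def growth_factor(days):
    """Factor by which the article count grows over the given number of days."""
    return math.exp(days / tau_days)


print(round(doubling_days))          # about 346 days
print(growth_factor(doubling_days))  # a factor of 2, by construction
```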

Wikipedia growth and predictions from July 2006 to December 2008

The graph on the right is an exponential growth projection made in July 2006. The number of articles on the English Wikipedia up to July 2006 is shown in red, and this is extrapolated in blue using an exponential function (approximately 38000*exp(0.0017t) articles, where t is the number of days since January 1, 2001).

By the end of 2006, when there were 1.5 million articles, the projection was already overestimating the growth by 10-15%, and the prediction of over 3 million articles by the end of 2007 is significantly more than the actual figure of about 2.1 million articles.
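The July 2006 projection can be evaluated directly from the approximate formula 38000*exp(0.0017t). The sketch below uses the rounded coefficients quoted in the text, so its output only roughly matches the values read off the graph; the function name is illustrative.

```python
import math
from datetime import date

ORIGIN = date(2001, 1, 1)  # t = 0 in the formula


def projected_articles(day):
    """Article count projected by the rounded formula 38000*exp(0.0017*t)."""
    t = (day - ORIGIN).days  # days since January 1, 2001
    return 38000 * math.exp(0.0017 * t)


end_2006 = projected_articles(date(2006, 12, 31))
end_2007 = projected_articles(date(2007, 12, 31))
print(f"end of 2006: {end_2006:,.0f}")
print(f"end of 2007: {end_2007:,.0f}")
```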

It has been hypothesized that the growth rate of Wikipedia consists of a constant number of articles per day, submitted by "hard-core" wikipedians, with additional articles submitted by less enthusiastic wikipedians proportional to the current article count of Wikipedia. In this model the growth rate should be a linear function of the size of Wikipedia.
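Written as an equation, this hypothesis could look as follows. The symbols a (the constant daily contribution of the "hard-core" wikipedians) and b (the proportionality constant for everyone else) are introduced here for illustration and do not appear in the original discussion.

```latex
\frac{dN}{dt} = a + bN, \qquad N(0) = N_0
\quad\Longrightarrow\quad
N(t) = \left(N_0 + \frac{a}{b}\right) e^{bt} - \frac{a}{b}
```

While bN is small compared with a, the growth is roughly linear in time; once bN dominates, it becomes indistinguishable from pure exponential growth.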

Questions:

  • is this model even remotely valid?
  • how long can exponential growth go on, or is this really just the early part of a logistic curve?
  • what does this imply for server and traffic scaling?

Eventually there will probably be a point where the number of articles created each day begins to fall, due to a lack of things to write about. But it is probable that the amount of information in each article will begin to increase in lieu of an increase in the number of articles. Limitations of the (current) Wikipedia interface will cause a bottleneck of sorts, limiting the type (and by extension, the amount) of growth to vertical monolingual growth patterns, as opposed to lateral cross-lingual, interlanguage ones.

Note that from the beginning of December 2005 only registered users can create new pages.

Other measurements of article growth

Edits per Article

The following graph shows the mean number of edits per article, and is intended as a measure of the quality of the articles, assuming that editing improves the content.

The graph is plotted on a logarithmic scale, and these data also fit well with exponential growth starting from October 2002. Since then, the number of edits per article has doubled once every 504 days.

Relationship of Usenet cites to article growth

The relationship of Usenet cites of the word "Wikipedia" to the official article count for the en: Wikipedia appears to show a curve, rather than a linear relationship. (See Wikipedia:Awareness statistics for data). Or does it show a line broken into two parts, one before and one (horizontally shifted) after the Rambot-created articles? If so, this would suggest that the Rambot articles do not stimulate significant comment on Usenet, but that the linear relationship does in fact hold for all other articles. As ever, more data are needed.

Modelling growth of Wikipedia page views per million

Using the Alexa page views per million data from Wikipedia:Awareness statistics (see [1] for a graph) in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following approximate formula:

log_e(page_views_per_million) = -50 + 5e-08 * unix_epoch_of_date

for n = 21 points fitted

This implies a doubling period of (log_e(2) / 5e-08) / 86400 days, which is approximately 160 days, and an annual growth factor in page views per million of approximately exp(5e-08*365*86400), which is approximately 5.
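The fitting procedure described above can be sketched as follows. The data here are synthetic, generated from the fitted formula itself rather than taken from Alexa, and the 28-day filter is interpreted as "at least 28 days after the last point that was kept", which is an assumption about the original method. All function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86400


def thin(points, min_gap_days=28):
    """Keep only points at least min_gap_days after the previously kept point."""
    kept = []
    for t, v in sorted(points):
        if not kept or t - kept[-1][0] >= min_gap_days * SECONDS_PER_DAY:
            kept.append((t, v))
    return kept


def fit_log_linear(points):
    """Least-squares fit of ln(v) = a + b*t; returns (a, b)."""
    n = len(points)
    ts = [t for t, _ in points]
    ys = [math.log(v) for _, v in points]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx
    return y_mean - b * t_mean, b


def doubling_days(b):
    """Doubling period in days for a slope b given per second."""
    return math.log(2) / b / SECONDS_PER_DAY


# Synthetic weekly samples from 1 January 2003 (Unix time 1041379200), NOT real Alexa data
start = 1041379200
synthetic = [
    (start + d * SECONDS_PER_DAY, math.exp(-50 + 5e-08 * (start + d * SECONDS_PER_DAY)))
    for d in range(0, 980, 7)
]

sample = thin(synthetic)
a, b = fit_log_linear(sample)
annual_factor = math.exp(b * 365 * SECONDS_PER_DAY)
print(len(sample), a, b, doubling_days(b), annual_factor)
```

Because the synthetic points lie exactly on the fitted line, the fit simply recovers the coefficients it was given; with real, noisy data the same functions would return a best-fit slope instead.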

Playing around with different time periods and filter times, we get a range of results from which we can reasonably say that the doubling time of Wikipedia's estimated page views per million is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so is close to the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to both the long-term and short-term trends.

Modelling improvement in Wikipedia's Alexa traffic rank

Applying a similar linear regression fit to the log of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result, with a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, and a more recent (2005 data only) halving time of 114 days! Since the current page rank, as of September 2005, is roughly 40, this suggests, if taken to logical extremes, using the most cautious of the three figures and rounding it to 4.5 months, that Wikipedia will reach:

  • page rank 20 in 4.5 months
  • page rank 10 in 9 months
  • page rank 5 in 13.5 months
  • be fighting its way into the top 3 in 18 months, and
  • be fighting its way to the #1 spot in 22.5 months...

So, clearly this exponential growth has got to stop or slow down, or it's going to be a wild ride...
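The milestones above follow from repeatedly halving the starting rank of 40 every 4.5 months. A minimal sketch of that arithmetic, with illustrative names:

```python
HALVING_MONTHS = 4.5  # most cautious halving time, rounded, as in the text
START_RANK = 40       # approximate rank in September 2005


def projected_rank(months):
    """Rank extrapolated by halving every HALVING_MONTHS months."""
    return START_RANK / 2 ** (months / HALVING_MONTHS)


milestones = [(m, projected_rank(m)) for m in (4.5, 9, 13.5, 18, 22.5)]
for months, rank in milestones:
    print(f"{months:>4} months: rank {rank:g}")
```

The last two values, 2.5 and 1.25, are what the list describes as fighting into the top 3 and towards the #1 spot.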

November 2005: the daily page rank is averaging 34 and reached 31 in October.

January 2006: the daily page rank has been averaging 20 for about a week, in line with the original predictions above.

April 2006: averaging 16/17 this month, although in March it reached as high as rank 12, the current record.

July 2006: deviating from predictions; Wikipedia was supposed to have reached rank 10 by now, yet for the whole of June we hovered between 16/18.

September 2006: heavily deviating from predictions; by the end of October, Wikipedia was supposed to reach rank 5, yet it is still only making small gains, hovering between 14/16 now. The climb up the rankings has slowed down, but for now we are still climbing! Wikipedia has broken the "50,000 reach" barrier, meaning we reach as many people as youtube.com and even more than myspace.com!

November 2006: the Alexa weekly rank is now 12 and still climbing, with occasional daily blips up to 11. Wikipedia once made the daily top 10, on the 12th!

February 2007: 18 months after the predictions, I think it's safe to say the model is flawed. We should be ranked 3rd, but the current high is 8, with the average being 10/11. We're still gaining popularity, just not as fast as expected.

May 2008: swaying between 7 and 8 for the past few months, with 8 being slightly more common. The rise, though slow, continues.
