Wikipedia:Modelling Wikipedia's growth

This page analyses the article count data in Wikipedia:size of Wikipedia and attempts to fit a simple numerical model of past and future growth to the observed article counts and growth rates.

Is the growth in article count of Wikipedia exponential?

One common model of Wikipedia growth is that:

  • more content leads to more traffic
  • which leads to more edits
  • which generate more content

Thus, the average rate of growth should be proportional to the size of Wikipedia; that is, the growth should be exponential.

Here is a graph of article count for the English-language Wikipedia alone, based on Erik Zachte's statistics for as long as they were available (through July 2006), then supplemented with data collected by Andrea Allais, the creator of the graph. See Wikipedia:size of Wikipedia for more discussion.

The graph is plotted on a logarithmic scale, so exponential growth should appear as a straight line. Points after October 2002 do indeed fit a line very well, while earlier data follow a more complex pattern, probably due to artifacts. The following graph is a close-up of the points that follow a linear trend, with the best-fitting line plotted in red:

From the slope of the best-fitting line, the time constant of the exponential growth can be found, giving:

N(t)=N(0)\ e^{t/\tau};\quad \tau= 499.7\ \mathrm{days}

In words, this means that the number of articles doubled roughly once every 346 days (the time constant multiplied by ln 2) from October 2002 to October 2006, to a very good approximation. If Wikipedia keeps up with the trend shown on the graph, the number of articles will be 1,900,000 by December 2006, 2,800,000 by June 2007 and 4,000,000 by December 2007, though extrapolating exponential behavior is a dangerous process.
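The doubling time follows from the fitted time constant as T = τ ln 2. A minimal Python sketch of that conversion, plus a helper for extrapolating a count along the same trend, is given here. The function names are illustrative assumptions rather than part of the original analysis, and the extrapolation is only as good as the assumption that the exponential trend continues.

```python
import math

TAU_DAYS = 499.7  # time constant of the fitted exponential, as quoted above


def doubling_time(tau_days: float) -> float:
    """Days needed for N(t) = N(0) * exp(t / tau) to double."""
    return tau_days * math.log(2)


def extrapolate(n_ref: float, days_ahead: float, tau_days: float = TAU_DAYS) -> float:
    """Project a reference count forward, assuming the trend simply continues."""
    return n_ref * math.exp(days_ahead / tau_days)


print(f"doubling time: {doubling_time(TAU_DAYS):.1f} days")
```

By construction, extrapolating any reference count forward by one doubling time returns twice that count.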

During the last three or four months there has been a slight slowdown in growth. This may just be a fluctuation, like others that have happened before, or it could be the first sign of a change in behavior from exponential to logistic growth.

Edits per Article

The following graph shows the mean number of edits per article, and is intended as a measure of the quality of the articles, assuming that editing improves the content.

This graph is also plotted on a logarithmic scale, and these data also fit exponential growth well, starting from October 2002. Since then, the number of edits per article has doubled once every 504 days.

Speculative growth predictions, as of December 2003

It has been hypothesized that the growth rate of Wikipedia consists of a constant number of articles per day, submitted by "hard-core" Wikipedians, plus additional articles submitted by less fanatical Wikipedians at a rate proportional to the current article count of Wikipedia. If so, it should be possible to fit a straight line to the bulk of the "main-line" points in a scatter plot of growth rate against article count. Note that there are some outliers above the range of the current plot; these can be attributed to the Rambot data-dump of machine-generated gazetteer articles, and have been discounted from this analysis.

The graph below shows that, apart from outliers, the growth rate of the English-language Wikipedia was roughly proportional to its size as of December 2003.

The data below are based on Erik Zachte's dump analysis (see http://www.wikipedia.org/wikistats/EN/TablesWikipediaEN.htm ) and use the "official article count" criterion for the article count. Because of record-keeping differences, Erik's earlier data points may not correspond exactly to previous analyses, and growth rates are aggregated monthly in Erik's data. However, the overall import of the graph is very similar: there is an even clearer linear relationship between size and growth rate, and as a result growth can be expected to be dominated by an exponential trend in the short- and medium-term future.

Image:Growth_vs_article_count_to_dec_2003.png


Here's a by-eye fit of the data without outliers:

\frac {dy} {dt} = 40 + \frac {150} {110000} y


where y is the article count and t the time since January 10, 2001, measured in days. This is a first-order nonhomogeneous linear differential equation.

Note: no error bounds are provided: this is just a visual fit, and there is insufficient data for a better technique. However, this is likely to change over the next year, and there should be enough data by mid-2004 to resolve some of the questions posed. For now, it is interesting just to make a "straw man" prediction which can be tested in the future, rather than creating models retroactively.

Setting y = 183375 articles as of December 15, 2003 and integrating the formula above gives some very crude predictions for the en: Wikipedia article count at the end of each month, as follows:

Image:Historical_and_predicted_en_article_count_dec_2003_model.png
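The by-eye model dy/dt = a + b y has the closed-form solution y(t) = (y0 + a/b) e^(bt) - a/b, so the integration can be reproduced in a few lines. The Python sketch below assumes that solution. Its names are illustrative, and it counts days from December 15, 2003 rather than from January 10, 2001, which is harmless because the equation has no explicit time dependence.

```python
import math

A = 40.0               # constant articles per day (by-eye fit above)
B = 150.0 / 110000.0   # proportional growth rate per day (by-eye fit above)
Y0 = 183375.0          # article count on December 15, 2003


def article_count(days_since_start: float, y0: float = Y0) -> float:
    """Closed-form solution of dy/dt = A + B*y with y(0) = y0."""
    offset = A / B
    return (y0 + offset) * math.exp(B * days_since_start) - offset


def growth_rate(y: float) -> float:
    """Articles per day that the model predicts at article count y."""
    return A + B * y


# Projected counts 0, 30, 180 and 365 days after December 15, 2003.
for days in (0, 30, 180, 365):
    print(days, round(article_count(days)))
```

At the starting count the model gives a growth rate of about 290 articles per day, of which only 40 come from the constant term, which is why the exponential part dominates the projection.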


Questions:

  • is this model even remotely valid? (Time will tell).
  • how long can exponential growth go on, or is this just really the early part of a logistic curve?
  • what does this imply for server and traffic scaling?

Eventually the number of articles created each day will probably begin to fall, for lack of things to write about. It is likely, though, that the amount of information in each article will then increase in lieu of an increase in the number of articles. Limitations of the (current) Wikipedia interface will act as a bottleneck of sorts, restricting the type (and, by extension, the amount) of growth to vertical, monolingual growth rather than lateral, cross-lingual (interlanguage) growth.

November 2005 update: A glance at the updated article count graph at the top of the page shows that the December 2003 prediction lags some way behind the actual growth, which accelerated in early 2004 and has now resulted in a total article count of over 800,000, 45% more than predicted.


Note 1: Since the beginning of December 2005, only registered users can create new pages.

Relationship of Usenet cites to article growth

The relationship of Usenet cites of the word "Wikipedia" to the official article count for the en: Wikipedia appears to show a curve, rather than a linear relationship. (See Wikipedia:Awareness statistics for data). Or does it show a line broken into two parts, one before and one (horizontally shifted) after the Rambot-created articles? If so, this would suggest that the Rambot articles do not stimulate significant comment on Usenet, but that the linear relationship does in fact hold for all other articles. As ever, more data are needed.

Image:Usenet_cites_vs_article_count_dec_2003.png

Projections using new data

Wikipedia growth and predictions from July 2006 to December 2008

The graph projects the article count of the English Wikipedia from the data available as of July 2006. Its values for future dates are extrapolations and should not be taken as accurate.

The prediction follows the equation y(t) = 37936.2858*exp(0.00173376161187*t), where t is the number of days since January 1, 2001.
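As a rough illustration, the prediction equation can be evaluated for any calendar date, and its rate constant implies a doubling time of ln 2 / 0.00173376161187, or roughly 400 days. The Python sketch below uses the two fitted constants quoted above; the function and constant names are assumptions made for the example.

```python
import math
from datetime import date

Y_SCALE = 37936.2858             # fitted prefactor quoted above
RATE_PER_DAY = 0.00173376161187  # fitted exponential rate per day quoted above
EPOCH = date(2001, 1, 1)         # t = 0 for this model


def predicted_articles(on: date) -> float:
    """Evaluate y(t) = Y_SCALE * exp(RATE_PER_DAY * t), with t in days since EPOCH."""
    t = (on - EPOCH).days
    return Y_SCALE * math.exp(RATE_PER_DAY * t)


def doubling_time_days() -> float:
    """Doubling time implied by the fitted rate."""
    return math.log(2) / RATE_PER_DAY


print(round(predicted_articles(date(2006, 7, 1))))
print(f"{doubling_time_days():.1f} days")
```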

Modelling growth of Wikipedia page views per million

Using the Alexa page views per million data from Wikipedia:Awareness statistics (see [1] for a graph) in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following formula:

log_e(page_views_per_million) = -49.8177569301 + 5.02511420201e-08 * unix_epoch_of_date

for n = 21 points fitted

This implies a doubling period of (log_e(2) / 5.02511420201e-08) / 86400 days, or approximately 159.6 days, and an annual growth factor in page views per million of exp(5.02511420201e-08*365.25*86400) = 4.88.
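The doubling period is ln 2 divided by the fitted slope, converted from seconds to days, and the annual growth factor is the exponential of the slope multiplied by the number of seconds in a year. A short Python check of that arithmetic, with constant names chosen for the example:

```python
import math

SLOPE_PER_SECOND = 5.02511420201e-08  # fitted slope of ln(page views per million)
SECONDS_PER_DAY = 86400
DAYS_PER_YEAR = 365.25

# Time for the fitted exponential to double, in days.
doubling_days = math.log(2) / SLOPE_PER_SECOND / SECONDS_PER_DAY

# Multiplicative growth over one year.
annual_factor = math.exp(SLOPE_PER_SECOND * DAYS_PER_YEAR * SECONDS_PER_DAY)

print(f"doubling period: {doubling_days:.1f} days")
print(f"annual growth factor: {annual_factor:.2f}")
```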

Playing around with different time periods and filter times gives a range of results, from which we can reasonably say that the doubling time of Wikipedia's estimated page views per million is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so falls within the range of the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to both the long-term and short-term trends.
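The fitting procedure described in this section (thin out densely sampled points, then fit a line to the logarithm of the data) can be sketched in plain Python as follows. This is not the original analysis code. The samples are synthetic and made up purely for illustration, and "previous point" is interpreted here as the previous point that was kept, which is an assumption.

```python
import math

SECONDS_PER_DAY = 86400
MIN_GAP_SECONDS = 28 * SECONDS_PER_DAY  # the 28-day filter described above


def thin_samples(samples, min_gap=MIN_GAP_SECONDS):
    """Keep only samples at least min_gap seconds after the last kept sample."""
    kept = []
    for timestamp, value in sorted(samples):
        if not kept or timestamp - kept[-1][0] >= min_gap:
            kept.append((timestamp, value))
    return kept


def fit_log_linear(samples):
    """Least-squares fit of ln(value) = intercept + slope * timestamp."""
    n = len(samples)
    xs = [t for t, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def doubling_days(slope_per_second):
    """Convert a fitted slope (per second) into a doubling time in days."""
    return math.log(2) / slope_per_second / SECONDS_PER_DAY


# Synthetic weekly samples that double exactly every 150 days (NOT real Alexa data).
START = 1041379200  # Unix time of 2003-01-01 00:00:00 UTC
synthetic = [(START + d * SECONDS_PER_DAY, 10.0 * 2 ** (d / 150)) for d in range(0, 900, 7)]

thinned = thin_samples(synthetic)
intercept, slope = fit_log_linear(thinned)
print(len(thinned), round(doubling_days(slope), 1))
```

Changing MIN_GAP_SECONDS or the time window of the input samples corresponds to the experiments with filter times and periods mentioned above.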

Modelling improvement in Wikipedia's Alexa traffic rank

Applying a similar linear regression to the logarithm of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result: a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, and a more recent (2005 data only) halving time of 114 days! Since the traffic rank as of September 2005 is roughly 40, taking the most cautious of the three figures, rounding it to 4.5 months, and following the trend to its logical extreme suggests that Wikipedia will:

  • reach traffic rank 20 in 4.5 months
  • reach traffic rank 10 in 9 months
  • reach traffic rank 5 in 13.5 months
  • be fighting its way into the top 3 in 18 months, and
  • be fighting its way to the #1 spot in 22.5 months...

So, clearly this exponential growth has got to stop or slow down, or it's going to be a wild ride...
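The milestones listed above are repeated halvings of the September 2005 rank. A small Python sketch of that extrapolation, with illustrative names and the rounded 4.5-month halving time from the text:

```python
HALVING_MONTHS = 4.5  # most cautious halving time, rounded as in the text
START_RANK = 40.0     # approximate Alexa traffic rank in September 2005


def projected_rank(months_elapsed: float,
                   start_rank: float = START_RANK,
                   halving_months: float = HALVING_MONTHS) -> float:
    """Rank after months_elapsed, if it keeps halving every halving_months."""
    return start_rank * 0.5 ** (months_elapsed / halving_months)


for months in (4.5, 9, 13.5, 18, 22.5):
    print(months, projected_rank(months))
```

The last two values come out at 2.5 and 1.25, which is why the list speaks of "fighting its way" into the top 3 and towards the #1 spot rather than naming a whole-number rank.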

November 2005: the daily traffic rank is averaging 34 and reached 31 in October.

January 2006: the daily traffic rank has been averaging 20 for about a week, in line with the original predictions above.

April 2006: averaging 16 to 17 this month, although in March it reached as high as rank 12, the current record.

July 2006: deviating from predictions; Wikipedia was supposed to have reached rank 10 by now, yet for the whole of June we hovered between 16 and 18.

September 2006: heavily deviating from predictions; Wikipedia was supposed to reach rank 5 by the end of October, yet it is still making only small gains, currently hovering between 14 and 16. The climb up the rankings has slowed down, but for now we are still climbing! Wikipedia has broken the "50,000 reach" barrier, meaning we reach as many people as youtube.com and even more than myspace.com!

November 2006: the Alexa weekly rank is now 12 and still climbing, with occasional daily blips up to 11. Wikipedia made the daily top 10 once, on the 12th!

February 2007: 18 months after the predictions, I think it's safe to say the model is flawed. We should be ranked 3rd, but the current high is 8, with the average being 10 to 11. We're still gaining popularity, just not as fast as expected.
