Wikipedia:Modelling Wikipedia's growth
From Wikipedia, the free encyclopedia
This page analyses the article count data in Wikipedia:size of Wikipedia and attempts to fit a simple numerical model of past and future growth to the observed article counts and growth rates.
Is the growth in article count of Wikipedia exponential?
One common model of Wikipedia growth is that:
- more content leads to more traffic
- which leads to more edits
- which generate more content
Thus, the average rate of growth should be proportional to the size of Wikipedia; that is, the growth should be exponential.
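Written as a differential equation, the feedback loop above amounts to the following. The symbols N (article count), k (growth constant) and N_0 (article count at t = 0) are notation introduced here for illustration only; they are not fitted values from this page.

```latex
\frac{dN}{dt} = kN
\quad\Longrightarrow\quad
N(t) = N_0\, e^{kt},
\qquad
T_{\text{double}} = \frac{\ln 2}{k}
```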
Here is a graph of article count for the English-language Wikipedia alone, based on Erik Zachte's statistics for as long as they were available (up to July 2006) and supplemented thereafter with data collected by Andrea Allais, the creator of the graph. See Wikipedia:size of Wikipedia for more discussion.
The graph is plotted on a logarithmic scale, so exponential growth should appear as a straight line. Points after October 2002 do indeed fit a line very well, while earlier data follow a more complex pattern, probably due to artifacts. The following graph is a close-up of the points that follow a linear trend, with the best-fitting line plotted in red:
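As a rough illustration of how such a best-fitting line can be obtained, the sketch below runs an ordinary least-squares fit on the logarithm of the counts. The function name and the data series are made up for the example (the series is synthetic, built to double every 346 days); it is not the procedure or the data used for the graph.

```python
import math

def fit_log_linear(days, counts):
    """Least-squares line through (day, ln(count)); returns (intercept, slope)."""
    logs = [math.log(c) for c in counts]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic, made-up series that doubles every 346 days (NOT real Wikipedia data).
days = [0, 100, 200, 300, 400, 500]
counts = [100_000 * 2 ** (d / 346) for d in days]

intercept, slope = fit_log_linear(days, counts)
doubling_days = math.log(2) / slope  # slope is per day, so this is in days
print(f"estimated doubling time: {doubling_days:.1f} days")
```

On real data the points scatter around the line, so the recovered doubling time is only an estimate.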
From the slope of the best-fitting line, the proper time of the exponential growth can be found, giving:
In prose, the previous expression means that the number of articles doubled once every 346 days from October 2002 to October 2006, to a very good approximation. If Wikipedia keeps up with the trend, as shown on the graph, the number of articles will be 1,900,000 by December 2006, 2,800,000 by June 2007 and 4,000,000 by December 2007, though extrapolating exponential behaviour is a dangerous process.
During the last three or four months there has been a slight slowdown in growth. This may be just a fluctuation, like others that have happened before, or the first sign of the behaviour changing from exponential to logistic growth.
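The quoted projections can be checked against the 346-day doubling time with a few lines of arithmetic. This is only a sketch: the helper name is invented here, and treating each figure as an end-of-month value is an assumption.

```python
from datetime import date

DOUBLING_DAYS = 346  # doubling time quoted in the text

def project(count, start, end, doubling_days=DOUBLING_DAYS):
    """Extrapolate an article count from `start` to `end` assuming steady doubling."""
    elapsed = (end - start).days
    return count * 2 ** (elapsed / doubling_days)

# Start from the text's own December 2006 figure; end-of-month dates are an assumption.
dec_2006 = date(2006, 12, 31)
base = 1_900_000
jun_2007 = project(base, dec_2006, date(2007, 6, 30))
dec_2007 = project(base, dec_2006, date(2007, 12, 31))
print(f"June 2007: {jun_2007:,.0f}")
print(f"December 2007: {dec_2007:,.0f}")
```

The results should land close to the quoted 2,800,000 and 4,000,000. Small differences are expected because the starting figure is itself rounded.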
The following graph shows the mean number of edits per article, and is intended as a measure of the quality of the articles, assuming that editing improves the content.
This graph is also plotted on a logarithmic scale, and these data also fit exponential growth well from October 2002 onwards. Since then, the number of edits per article has doubled once every 504 days.
Speculative growth predictions, as of December 2003
Hypothesis: growth rate is a constant number of articles per day, submitted by "hard-core" wikipedians, with an extra number that is proportional to the article count of Wikipedia. Thus, it should be possible to fit a straight line to the bulk of the "main-line" points in the scatter plot. Note that there are some huge outliers that are above the range of the current plot: these can be attributed to the Rambot data-dump of machine-generated gazetteer articles, and are discounted from this analysis.
The graph below shows that, apart from outliers, the model of growth of the English-language Wikipedia as being roughly proportional to size still holds as of December 2003.
The data below are based on data from Erik Zachte's dump analysis, see http://www.wikipedia.org/wikistats/EN/TablesWikipediaEN.htm , and uses the "official article count" criterion for the article count. Because of record-keeping differences, Erik's earlier data points may not exactly correspond to previous analyses, and growth rates are aggregated monthly in Erik's data. However, the overall import of the graph is very similar: there is an even clearer linear relationship between the size and growth rate, and as a result growth can be expected to be dominated by an exponential trend in the short- and medium-term future.
Here's a by-eye fit of the data without outliers:
where y is the article count and t the time since January 10, 2001, measured in days. This is a first-order nonhomogeneous linear differential equation.
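For reference, an equation of this type has a closed-form solution. The sketch below writes the hypothesis as a constant term plus a term proportional to size; the symbols a, b, t_0 and y_0 are generic placeholders introduced here, not the fitted coefficients.

```latex
\frac{dy}{dt} = a + b\,y, \qquad y(t_0) = y_0
\quad\Longrightarrow\quad
y(t) = \left(y_0 + \frac{a}{b}\right) e^{\,b\,(t - t_0)} - \frac{a}{b}
```

For large t the exponential term dominates, which is why the constant contribution matters mainly in the early data.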
Note: no error bounds are provided: this is just a visual fit, and there is insufficient data for a better technique. However, this is likely to change over the next year, and there should be enough data by mid-2004 to resolve some of the questions posed. For now, it is interesting just to make a "straw man" prediction which can be tested in the future, rather than creating models retroactively.
Setting y = 183375 articles as of December 15 2003 and integrating the formula above gives some very crude predictions for the en: Wikipedia article count at the end of each month, as follows:
Questions:
- is this model even remotely valid? (Time will tell).
- how long can exponential growth go on, or is this just really the early part of a logistic curve?
- what does this imply for server and traffic scaling?
Eventually there will probably come a point where the number of articles created each day begins to fall, for lack of things to write about. The amount of information in each article, however, will probably begin to increase much faster. More to the point, limitations of the (current) Wikipedia interface will cause a bottleneck of sorts, limiting the type (and, by extension, the amount) of growth to vertical monolingual growth patterns, as opposed to lateral cross-lingual, interlanguage ones.
November 2005 update: A glance at the updated article count graph at the top of the page will show that these predictions lag some way behind the actual growth, which accelerated in early 2004 and has now resulted in a total article count of over 800,000, 45% more than predicted.
Note 1: From the beginning of Dec 05 only registered users can create new pages.
Relationship of Usenet cites to article growth
The relationship of Usenet cites of the word "Wikipedia" to the official article count for the en: Wikipedia appears to show a curve, rather than a linear relationship. (See Wikipedia:Awareness statistics for data). Or does it show a line broken into two parts, one before and one (horizontally shifted) after the Rambot-created articles? If so, this would suggest that the Rambot articles do not stimulate significant comment on Usenet, but that the linear relationship does in fact hold for all other articles. As ever, more data are needed.
Projections using new data
As of July 2006, the graph projects article counts for the English Wikipedia from July 2006 to December 2008. It is by no means accurate for future dates: even the more aggressive growth predictions given here have fallen short of Wikipedia's most recent growth.
The equation of the prediction corresponds to y(t) = 37936.2858*exp(0.00173376161187*t) where t is the number of days since January 1, 2001.
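The prediction equation is straightforward to evaluate for any date. The sketch below uses the two constants from the equation; the function name and the sample dates are chosen here purely for illustration.

```python
import math
from datetime import date

A = 37936.2858          # prefactor from the equation in the text
K = 0.00173376161187    # growth constant per day, from the equation in the text
EPOCH = date(2001, 1, 1)

def predicted_articles(day):
    """Evaluate y(t) = A * exp(K * t), with t in days since 1 January 2001."""
    t = (day - EPOCH).days
    return A * math.exp(K * t)

doubling_days = math.log(2) / K
for day in (date(2006, 7, 1), date(2007, 7, 1), date(2008, 12, 31)):
    print(day.isoformat(), f"{predicted_articles(day):,.0f}")
print(f"implied doubling time: {doubling_days:.0f} days")
```

The growth constant corresponds to a doubling time of roughly 400 days.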
Modelling growth of Wikipedia page views per million
Using the Alexa page views per million data from Wikipedia:Awareness statistics (see [1] for a graph) in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following formula:
- log_e(page_views_per_million) = -49.8177569301 + 5.02511420201e-08 * unix_epoch_of_date
for n = 21 points fitted
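The thinning step described above can be sketched as follows. This assumes that "previous point" means the previously retained point, which is one possible reading of the description; the function name and the sample data are made up for the example.

```python
DAY = 86400  # seconds per day

def thin_samples(samples, min_gap_days=28):
    """Keep a sample only if it is at least `min_gap_days` after the last kept one.

    `samples` is a list of (unix_time, value) pairs sorted by time.
    """
    kept = []
    for when, value in samples:
        if not kept or when - kept[-1][0] >= min_gap_days * DAY:
            kept.append((when, value))
    return kept

# Made-up samples at days 0, 10, 30, 40 and 60 (NOT real Alexa data).
samples = [(0, 10.0), (10 * DAY, 11.0), (30 * DAY, 12.0),
           (40 * DAY, 13.0), (60 * DAY, 15.0)]
thinned = thin_samples(samples)
print([when // DAY for when, _ in thinned])
```

The least-squares fit is then performed on the logarithm of the retained values against their Unix timestamps.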
This implies a doubling period of (log_e(2) / 5.02511420201e-08) / 86400 days, or about 159.6 days, and an annual growth factor in page views per million of exp(5.02511420201e-08*365.25*86400) = 4.88.
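As a cross-check of these figures, the sketch below derives both from the fitted slope. The doubling period is ln 2 divided by the slope, converted from seconds to days; the annual factor is the exponential of the slope times the number of seconds in a year.

```python
import math

SLOPE = 5.02511420201e-08   # fitted slope per second of Unix time (from the text)
SECONDS_PER_DAY = 86400

doubling_days = math.log(2) / SLOPE / SECONDS_PER_DAY
annual_factor = math.exp(SLOPE * 365.25 * SECONDS_PER_DAY)
print(f"doubling period: {doubling_days:.2f} days")
print(f"annual growth factor: {annual_factor:.2f}")
```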
Playing around with different time periods and filter times gives a range of results from which we can reasonably say that the doubling time of Wikipedia's estimated page views per million is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so falls within the range of the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to both the long-term and short-term trends.
Modelling improvement in Wikipedia's Alexa traffic rank
Applying a similar linear regression fit to the log of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result, with a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, and a more recent (2005 data only) halving time of 114 days! Since the current page rank, as of September 2005, is roughly 40, this suggests, if taken to logical extremes, using the most cautious of the three figures and rounding it to 4.5 months, that Wikipedia will reach:
- page rank 20 in 4.5 months
- page rank 10 in 9 months
- page rank 5 in 13.5 months
- be fighting its way into the top 3 in 18 months, and
- be fighting its way to the #1 spot in 22.5 months...
So, clearly:
- either this exponential growth has got to stop or slow down, or
- it's going to be a wild ride...
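The projection listed above is plain repeated halving, which can be reproduced with a short sketch. The function name is invented here; the starting rank of 40 and the 4.5-month halving period are the figures from the text.

```python
def projected_rank(months, start_rank=40.0, halving_months=4.5):
    """Rank after `months`, assuming it halves every `halving_months`."""
    return start_rank * 0.5 ** (months / halving_months)

for months in (4.5, 9, 13.5, 18, 22.5):
    print(f"{months:>4} months: rank {projected_rank(months):g}")
```

The last two values fall between whole ranks (2.5 and 1.25), which matches the "fighting its way" wording for the top 3 and the #1 spot.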
November 2005 update: Well, it's November, and Wikipedia has so far moved up only to 38th place, so it isn't quite keeping up with these predictions. However, the daily page rank is hovering around 34 and reached 31 in October, so it's doing OK...
January 2006 update (Wikipedia's 5th anniversary): The daily page rank has been hovering around 20 for about a week in line with the original predictions above.
April 2006 update: Currently in 17th place, hovering around 16/17 for the whole of this month, although in March it reached as high as rank 12, the current record. Two months to go until Wikipedia is due to reach rank 10, with 6 places still to climb.
July 2006 update: Deviating from predictions now; Wikipedia was supposed to have reached rank 10 by now, yet for the whole of June we hovered between 16/18. It appears the climb up the rankings has slowed down - perhaps even stopped.
September 2006 update: Heavily deviating from predictions; by the end of next month (October) Wikipedia was supposed to reach rank 5, yet it is still making only small gains, hovering between 14/16 now. The climb up the rankings has slowed down - but for now we are still climbing! As of now Wikipedia has broken the "50'000 reach" barrier, which means that we reach as many people as youtube.com and even more than myspace.com!
November 2006 update: Alexa weekly rank has now reached 12, and is still climbing, with occasional daily blips up to 11. Wikipedia once made the daily rank in the top 10 on November 12th, 2006!