Wikipedia talk:Size of Wikipedia


Older comments

It's a nice graph, but there's an awful lot of white space on it. Can somebody who knows how to do these things trim it? --Camembert
Additionally, the JPEG compression makes it a bit muddy. Would it be possible to resave the original as a PNG file? --Brion


New and improved graphs now in place. The Anome 09:35 Sep 21, 2002 (UTC)

Time to update the graphs? --Lightning 19:42 Oct 19, 2002 (UTC)

Anyone notice something funky about the following:

2002 Oct 20, 66372, mpacIII
2002 Oct 19, 61128, mpacIII
2002 Oct 17, 54339, mpacIII

Lightning 05:14 Oct 21, 2002 (UTC)

Not really - User:Ram-Man's bot was pumping in about 10-20 US cities a minute for much of the day. --mav

Yep, he's ramming them in there. I wonder if that's how he selected his user name? ;-) --Ed Poor
Alas, it comes from my real name. -- Ram-Man

I made a new graph. I'll try to keep it updated. I'm sorry if it doesn't look great, but I'm just pumping it out with a spreadsheet program. --Lightning 05:38 Oct 23, 2002 (UTC)

Looks good to me. --mav
Are you going to change the graph below (rate of increase) as well? -- WillSmith (Malaysia)
I want to wait until Ram-Man is done, because the bot massively inflates this number. Once the bot is done running, I'll take a week's worth of samples and do it. --Lightning 19:49 Oct 24, 2002 (UTC)
Fire away! I've for the most part finished it up (at least the large-scale automation, anyway!) -- Ram-Man

How about a graph showing the amount of data hosted by Wikipedia and the average size per page? Lir 05:56 Oct 23, 2002 (UTC)

No access to the db, so I can't make SQL queries to get these numbers. --Lightning 19:49 Oct 24, 2002 (UTC)
It would be interesting, though, to get the mean number of content bytes per article, and take it again a year later for comparison purposes. --Lightning 19:49 Oct 24, 2002 (UTC)

The new graph has an x-axis which is not evenly spaced in time. The slope is now dependent on the number of samples in any given period. This is a bit confusing, in my opinion. Erik Zachte


The best thing to do is to use an x/y scatter-plot setting for the graph tool. This will allow for the non-uniform sampling, which will otherwise distort the graph.


Illustration:

Image:Article_growth_chart.png

The graph above does not allow for the non-uniform sampling in time: compare with:

Image:Wikipedia article count graph to Oct 04 2002.png

I'll look into it --Lightning 19:49 Oct 24, 2002 (UTC)
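The distortion Erik describes can be shown numerically with the three sample counts quoted earlier in this discussion. Against the sample index, the apparent slope depends on how often samples were taken; dividing each change in count by the elapsed days, which is effectively what an x/y scatter plot of date versus count shows, recovers the true daily rate. A minimal sketch:

```python
from datetime import date

# The three samples quoted earlier in this discussion.
samples = [
    (date(2002, 10, 17), 54339),
    (date(2002, 10, 19), 61128),
    (date(2002, 10, 20), 66372),
]

# Per-index slope ignores elapsed time; per-day slope does not.
for (d0, c0), (d1, c1) in zip(samples, samples[1:]):
    per_index = c1 - c0
    per_day = (c1 - c0) / (d1 - d0).days
    print(f"{d1}: {per_index} per sample, {per_day:.1f} per day")
```

The first interval spans two days, the second one day, so the per-sample numbers exaggerate the jump on Oct 20 relative to the per-day rates.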

Is it possible to give the growth of Wikipedia without including the Ram-Man bot additions? The bot is adding around 1,000 articles a day (and seems to have around 30,000 in total left to add), and it would be interesting to see the rate of growth without this distortion.


Note: the article count feature is currently disabled, with the article counter stuck at 90679. -- 15 November 2002


Note: The article counter is incrementing again. -- 18 November 2002


Is the article counter fixed now? If not, there is very little point in continuing to update this data by hand. If much of the past mpacIII data is questionable, perhaps someone would be so kind as to regenerate the mpacIII data from the database dumps? The Anome

The count is still calculated stupidly (comma count?!!) but it's now fixed, yes. I see absolutely zero purpose in regenerating older counts, since A) the number is pure hype with limited value, B) we only have a limited number of dumps kept on hand at ~1 month intervals (keeping the old ones around at a higher rate would waste A LOT of disk space), and C) the margin of error from the drift is probably smaller than the margin of error of our crappy count system (comma count?!!), except for that one >100000 entry. --Brion 20:33 Dec 17, 2002 (UTC)
Speaking of which, what happened to the idea to redefine the count? I still think we shouldn't count anything below 500 bytes as an article. That, along with dropping the dreaded comma count, IMO, would give a more accurate measure of our true progress (~80,000 articles). My only concern with this plan, though, is what it might do to the morale of the non-English Wikis. Maybe we could have a Template:HEADLINEARTICLECOUNT that would display the more conservative article count (we could even up the ante by excluding anything below 1 kilobyte). --mav
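For concreteness, the counting heuristics being debated above (the "comma count", counting pages with an internal link, and a minimum-byte-size cutoff) can be sketched in a few lines. The function name and sample texts here are hypothetical, purely for illustration:

```python
def is_article(text, method="link", min_bytes=0):
    """Classify a page as an article under the heuristics discussed above.

    method="comma": the old "comma count" (text contains a comma).
    method="link":  text contains at least one internal [[link]].
    min_bytes:      optionally exclude stubs below a size threshold.
    """
    if len(text.encode("utf-8")) < min_bytes:
        return False
    if method == "comma":
        return "," in text
    return "[[" in text

pages = ["Short stub.", "A city in [[Texas]], United States."]
print(sum(is_article(p) for p in pages))                 # link-based count
print(sum(is_article(p, min_bytes=500) for p in pages))  # with 500-byte cutoff
```

Note how the 500-byte cutoff excludes both invented sample pages, while the link test counts one of them.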

If we are going to make graphs, and then analyze them, shouldn't we take RamBot's contributions into account? --Uncle Ed


The new article count system is now active on the English Wikipedia. (And the counter is no longer stuck. ;) If desired, I can go back through my backup dumps and run counts of the new algorithm on older databases for comparison purposes. --Brion 06:02 25 May 2003 (UTC)

Is it also up for the Dutch wiki? I find differences between the count on our main page (6901) and the count obtained by

SELECT count(*) FROM cur WHERE cur_namespace=0 AND cur_is_redirect=0 AND cur_text LIKE '%[[%'

(8427) TeunSpaans 12:41 18 Jun 2003 (UTC)
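One possible source of such a discrepancy is that the main-page counter used a different (or cached) countability test than the query above. The quoted query itself is easy to reproduce on a toy table; the table and column names below are taken from the query, but the rows are invented for illustration:

```python
import sqlite3

# Toy in-memory table shaped like the old `cur` schema used in the
# query quoted above; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cur (
    cur_namespace INTEGER, cur_is_redirect INTEGER, cur_text TEXT)""")
conn.executemany("INSERT INTO cur VALUES (?, ?, ?)", [
    (0, 0, "An article with a [[link]]."),   # counted
    (0, 0, "No internal links here."),       # not counted: no [[
    (0, 1, "#REDIRECT [[Target]]"),          # not counted: redirect
    (1, 0, "A talk page with a [[link]]."),  # not counted: wrong namespace
])
(count,) = conn.execute(
    "SELECT count(*) FROM cur WHERE cur_namespace=0 "
    "AND cur_is_redirect=0 AND cur_text LIKE '%[[%'").fetchone()
print(count)  # 1
```

Running the same query against a live database and comparing with the displayed counter would show which pages the counter is excluding.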


I have replaced Fonzy's analysis of growth with a new treatment, which produces a new growth model that tries to eliminate the effects of outliers, data dumps, recalibration, and slow-downs. It's a remarkably good (coincidental?) fit for the past, but who knows about the future? -- The Anome 16:58 11 Jun 2003 (UTC)
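For readers curious how a growth model of this general kind is fitted: a common approach is a least-squares fit of the logarithm of the article count against time, which yields an exponential model and a doubling time. The day/count pairs below are invented for illustration; this is a generic sketch, not The Anome's actual model:

```python
import math

# Invented (day index, article count) samples, for illustration only.
days = [0, 30, 60, 90]
counts = [10000, 14000, 19500, 27500]

# Least-squares fit of log(count) = a + b * day.
ys = [math.log(c) for c in counts]
xm = sum(days) / len(days)
ym = sum(ys) / len(ys)
b = (sum((x - xm) * (y - ym) for x, y in zip(days, ys))
     / sum((x - xm) ** 2 for x in days))
doubling_time = math.log(2) / b
print(f"fitted doubling time: {doubling_time:.0f} days")
```

A robust treatment like the one described above would additionally down-weight outliers and recalibration jumps before fitting.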


An HTML idiot writes: is there any way either that this page can be made a sensible width, or that I can view it (IE6) as a screen-width page? jimfbleak 17:30 11 Jun 2003 (UTC)

Update desperately needed!

This page hasn't been edited since April except to correct a spelling error, and the page linked to (here) hasn't been updated since May!!!!

I have made a few graphs, and scripts to update them. I am not sure how accurate they are (I didn't do the database query myself), but it looks reasonable. The details are on my user page. Perhaps this can be used here? Amaurea 14:48, 23 April 2006 (UTC)

Other kinds of growth

There are some other kinds of growth, like this: Image:Vandalism.png. I have made this diagram to compare the situation with the German Wikipedia (Image:Vandalismus.png). --Markus Schweiss 06:41, 9 December 2006 (UTC)

Size in GB

Could someone get and add information about how much space the text actually takes up? Or perhaps an estimate of how many printed pages all the text would take? There isn't anything here that really gives me a good idea of how BIG Wikipedia is compared to other information compendiums, which is all I wanted when I came to this page. 24.128.152.12 08:06, 19 December 2006 (UTC) greg

Ditto. Not much seems to be going on here, in terms of updates. ALTON .ıl 07:34, 11 May 2007 (UTC)

Never mind, I think I found it. According to the dump download page, the entire HTML of Wikipedia weighs in at 8042 MB, or about 7.9 GB. Surprising? ALTON .ıl 07:38, 11 May 2007 (UTC)
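That conversion checks out (using 1024 MB per GB):

```python
size_mb = 8042
print(f"{size_mb / 1024:.1f} GB")  # 7.9
```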

Not including the size of the images in the gigabyte figure doesn't make any sense to me. The images in the encyclopedia are just as important as the text. Also, to compare this to a "book" you would have to look at how much space the average wiki page devotes to images and include that as well. —Preceding unsigned comment added by 68.154.41.177 (talk) 05:04, August 29, 2007 (UTC)

ISO 8601 dates

We should use ISO 8601 dates (e.g. 2007-05-30) for all Wikipedia stuff, including graphs and charts. In this age of international commerce and communication, it seems foolish to use ambiguous dates, especially since English Wikipedia is edited and read by a large minority of English speakers outside the US. Anthony717 19:10, 30 May 2007 (UTC)

I agree that the date format is lacking. If this were in the article space I would have just fixed it by wiki-linking the dates and letting the servers format them on the fly; maybe that should be done here. ISO 8601 would be better than what is here now, but ISO 8601 is for the benefit of computers, is it not? I think most people would find "30 May 2007" more humanistic. --Charles Gaudette 09:18, 31 May 2007 (UTC)
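For reference, both of the formats being discussed are one-liners in most languages; a quick Python sketch using the example date from this thread (note that the month name from `strftime` is locale-dependent; the default C locale gives English):

```python
from datetime import date

d = date(2007, 5, 30)
print(d.isoformat())            # 2007-05-30  (ISO 8601)
print(d.strftime("%d %B %Y"))   # 30 May 2007 (the "humanistic" form)
```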
I'll change the date format in the "Wikipedia growth" plots during the next update. The plot in "Comparisons with other Wikipedias" was grabbed from Commons, so I'm not sure where the source data for it is. Maybe I can get the data from http://stats.wikimedia.org/EN/Sitemap.htm and generate a new plot (later, when I have more free time). --Seattle Skier (talk) 09:01, 2 June 2007 (UTC)

DONE. The "Wikipedia growth" plots have been updated. I also found a couple of hours to combine the data from http://stats.wikimedia.org/EN/TablesArticlesTotal.htm with the data from this page to generate two new plots in "Comparisons with other Wikipedias". --Seattle Skier (talk) 01:26, 4 June 2007 (UTC)

Scanty information

This page doesn't answer some of the obvious questions (as noted above), like how many gigabytes it is. Another question that comes to mind: how many gigabytes are the images used in articles (hosted here or on the Commons), since they are definitely part of Wikipedia as well? Also, how many servers are there currently, how many watts of electricity do they use, how much total RAM - all these are interesting questions. -- fourdee ᛇᚹᛟ 11:09, 7 August 2007 (UTC)

Reliable source

Wikipedia: proving the Web's freedom of space and How much paper would it take to print out Wikipedia? cite Nikola Smolenski, a contributor to this article, as a reliable source for the amount of paper it would take to print out Wikipedia. -- Jreferee (Talk) 15:05, 29 August 2007 (UTC)

A question about the statistics data

I am an MBA graduate student from Taiwan. My advisor, Professor Chu, and I are very interested in the diffusion phenomenon of the famous Wikipedia website. We have some questions about the diffusion data at the URL below:

http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia


We hope to apply a formal diffusion model from management science to explain the success story of Wikipedia.

At the bottom of that page there is a data set describing the growth of the English Wikipedia. I have two questions about it. First, I can hardly distinguish whether the numbers come from the auto-posting robot, Rambot, or from real people. Could you help me obtain data that separates those two processes (edits by programs versus edits by human beings)?

Second, I am confused that the spacing of the dates is irregular. Why does the data set appear that way? Is something happening behind those irregular data points? Could you provide further background or ideas that may help me figure it out?

Thank you for your response in advance. I hope to become acquainted with Wikipedia's statistics so that we can explore the nature of its diffusion.

Once again, thank you very much.

Best wishes, —Preceding unsigned comment added by Jackiewi (talk • contribs) 12:57, 10 December 2007 (UTC)

Hello!

People just post updates to the article count when they feel like it; it is not a robot doing that. Keep us updated on the models you are going to use... and also use Google. I have seen a couple of good articles studying how Wikipedia grows. Diego Torquemada (talk) 23:47, 10 December 2007 (UTC)

Hey, I saw your question and tried to come up with a better answer. I think the only way to distinguish human from automated editing is to check all editors against the bot category. You can find comments on unusual growth in some of the Category:Wikipedia statistics articles. Some spikes are also Slashdot or similar effects. Good luck. --Ben T/C 14:57, 18 December 2007 (UTC)
See also User:Dragons flight/Log analysis. You will find graphs there of edits by bots, registered users, and unregistered users. HenkvD (talk) 18:27, 29 December 2007 (UTC)
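The separation suggested above (checking editors against the set of known bot accounts) can be sketched in a few lines. The edit list and the non-bot username here are invented; in practice the bot set would be built from the bot user category:

```python
# Known bot accounts (illustrative; a real analysis would read these
# from the bot user category rather than hard-coding them).
BOT_ACCOUNTS = {"Rambot"}

# Hypothetical (username, article) edit records for illustration.
edits = [
    ("Rambot", "Autauga County, Alabama"),
    ("ExampleUser", "Photosynthesis"),
    ("Rambot", "Butler County, Alabama"),
]

bot_edits = [title for user, title in edits if user in BOT_ACCOUNTS]
human_edits = [title for user, title in edits if user not in BOT_ACCOUNTS]
print(len(bot_edits), "bot edits;", len(human_edits), "human edits")
```

Applying the same split to the growth data set would answer the first question above, at least for edits made under known bot accounts.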