Wikipedia talk:Database download

Please note that questions about the database download are more likely to be answered on the wikitech-l mailing list than on this talk page.

Direct access to textual content of Wiki Pages via MySQL?

I am conducting research that uses the content of Wikipedia. I can access the content of my wiki database dump via a web browser and Apache, but this does not suit the nature of my work. I would like to access the plain (or marked-up) text of pages directly through SQL queries. How can I do this? I imagine this requires two things: mapping a page title to an ID, and selecting the textual content associated with that ID. Can anyone please advise? -- Preceding unsigned comment added by 128.238.35.108 (talk) 02:05, 7 December 2007 (UTC)
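The two steps asked about here (title to ID, then ID to text) correspond to a join across the `page`, `revision` and `text` tables of the MediaWiki 1.5-era schema. Below is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for the MySQL import, models only the columns needed for the lookup, and the sample row and the helper name `fetch_wikitext` are made up for illustration. Column names should be checked against the schema of your own import.

```python
import sqlite3

# Simplified stand-in for the MediaWiki tables (assumption: only the columns
# needed for the title -> text lookup are modelled here).
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                   page_title TEXT, page_latest INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                       rev_text_id INTEGER);
CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
"""

# Title -> latest revision -> text row.
QUERY = """
SELECT t.old_text
FROM page p
JOIN revision r ON r.rev_id = p.page_latest
JOIN text t ON t.old_id = r.rev_text_id
WHERE p.page_namespace = ? AND p.page_title = ?
"""

def fetch_wikitext(conn, title, namespace=0):
    """Return the current wikitext of a page, or None if it does not exist."""
    title = title.replace(" ", "_")  # titles are stored with underscores
    row = conn.execute(QUERY, (namespace, title)).fetchone()
    return row[0] if row else None

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO text VALUES (10, 'Sample wikitext.')")
conn.execute("INSERT INTO revision VALUES (5, 1, 10)")
conn.execute("INSERT INTO page VALUES (1, 0, 'Sample_page', 5)")
print(fetch_wikitext(conn, "Sample page"))
```

Against a real MySQL import the same SELECT should apply, with the placeholder style of whichever MySQL driver is used.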


Unwieldy Images

Is the English Wikipedia image dump really up to 75.5 GB, from 15 GB in June? Could this huge tar file be broken into 10 GB batches or something? 75 GB is quite a download.

Rsync is the best solution for downloading the "upload.tar" file with images (which is 75 GB in size). It allows resuming a download and performs automatic error checking and fixing (separately for any small part of the file). And if you already have an old version of "upload.tar", it will download only the difference between the two. But this "updating" feature of Rsync is useful only for TAR archives (because they only collect data and don't compress it); it is useless for files compressed with gzip/bzip2/7-zip, because any small change in the data causes large changes in almost the whole file.
For more info look here: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Rsync
--Alexey Petrov 03:26, 9 April 2006 (UTC)
I thought that the images served by Wikipedia are already compressed, so their data is near random and thus gzip/bzip/7-zip/etc are useless on them? —Preceding unsigned comment added by 67.53.37.218 (talk) 03:21, 11 May 2008 (UTC)

freecache.org

Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be tried for files smaller than 1 GB and larger than 5 MB. -Alterego

    • Also, in relation to incremental updates, it seems that the only reason freecache wouldn't work is that old files aren't accessed often, so they aren't cached. If some method could be devised whereby everyone who needed incremental updates accessed the files within the same time period, perhaps via an automated client, you could make better use of the ISPs' bandwidth. I could be way off here. -Alterego

Mailing list for notification of new dumps?

It would be great if there was a mailing list to notify interested users when there were new database dumps available. (I couldn't see such a mailing list currently, so I'm assuming it doesn't exist). Ideally the notification email would include HTTP links to the archive files, and would be generated after all the dumps had been completed (I've seen sometimes that the download page has an "intermediate" stage, where the old dumps are no longer listed, but the new ones have not been fully created). This mailing list would presumably be very low-volume, and would be especially useful as there doesn't seem to be an easily predictable timetable for dumps to be released (sometimes it's less than a week, often it's once every two weeks, and sometimes such as now it's up to 3 weeks between updated dumps), and because (for some applications) getting the most current Wikipedia dump possible is quite desirable. -- All the best, Nickj (t) 00:36, 28 Jan 2005 (UTC)

I agree something like this would be useful. I'll try to cook up an RSS feed for that page. --Alterego 04:14, Jan 28, 2005 (UTC)

Perhaps we can share the wiki through BitTorrent and view it on our iPod.

Help: Namespace download

It seems to me that a lot of people download the database just for the "Help:" namespace, so it would seem logical to provide a separate download for it. This could be done via the Perl script that is provided in this very article. It would (1) save bandwidth, (2) save time, (3) make life easier for MediaWiki users :), and (4) not be very hard.

Otherwise, would anyone be able to point me in the direction of someone's own download of it?

I would like to second this comment. The idea of creating such a great tool and then forcing new administrators to have to download the entire database just to get the "Help:" pages is crazy Armistej

And one more "gotcha". You need the "Template:" namespace too. Otherwise a lot of your "Help:" pages are missing bits and pieces from them. Unfortunately, the "Template:" namespace contains templates from all parts of the MediaWiki, not just templates relating to "Help:". If I'm running an all-English MediaWiki on my Intranet, I couldn't give $0.02 about the Russian, Chinese, or other non-English templates which will never be referenced. We need some sort of script to find the "dangling" templates and zap them once and for all.

Exactly! I started a Help Desk query about this on 07/08/06, not having seen the above yet, nor having known to ask about the template. I'll repeat here that my motivation was that I do not currently have internet at home, so learning about wiki takes away from my short online opportunities. I bet others are in this boat.
I am appalled that there still appears to be no simple and straightforward way to download the "Help:" namespace and/or the "Template:" namespace. Either that, or it's too difficult to find. Come on, the comment just above is over a year old! Anyone have a simple point-and-click solution yet? --Wikitonic 18:40, 24 September 2007 (UTC)
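The "dangling template" cleanup suggested above could be sketched roughly as follows: start from the "Help:" pages, follow template transclusions (including templates used by other templates), and treat every "Template:" page that is never reached as a candidate for removal. This is only an illustration in Python. The regular expression is a simplification that ignores parser functions and other special syntax, and the page titles and the function names `referenced_templates` and `needed_templates` are hypothetical.

```python
import re

# Matches the template name in {{Name}} or {{Name|args}}. This is a
# simplification: names containing ':' or '#' (parser functions, other
# namespaces) are deliberately skipped.
TRANSCLUSION = re.compile(r"\{\{\s*([^{}|#:]+?)\s*(?:\||\}\})")

def referenced_templates(wikitext):
    """Return the set of 'Template:...' titles transcluded by some wikitext."""
    names = set()
    for match in TRANSCLUSION.finditer(wikitext):
        name = match.group(1).strip()
        # MediaWiki treats the first letter of a title as case-insensitive.
        names.add("Template:" + name[:1].upper() + name[1:])
    return names

def needed_templates(pages, roots):
    """Return all templates reachable from the root pages, following nesting."""
    needed, queue = set(), list(roots)
    while queue:
        title = queue.pop()
        for template in referenced_templates(pages.get(title, "")):
            if template not in needed:
                needed.add(template)
                queue.append(template)
    return needed

# Hypothetical miniature wiki: title -> wikitext.
pages = {
    "Help:Editing": "{{Helpbox}} Use {{tl|Foo}} here.",
    "Template:Helpbox": "{{Navbar|x}} box",
    "Template:Tl": "link",
    "Template:Navbar": "nav",
    "Template:Infobox ru": "unused",
}
roots = [title for title in pages if title.startswith("Help:")]
needed = needed_templates(pages, roots)
dangling = sorted(t for t in pages
                  if t.startswith("Template:") and t not in needed)
print(dangling)
```

On a real dump the `pages` mapping would have to be filled from the XML file, and the result checked by hand before deleting anything.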

More frequent cur table dumps?

I think it would be valuable to have more frequent dumps of the cur table. Full dumps take a lot of time, and not everyone needs them. As far as I can see, the current one was started yesterday and has not finished yet. What do you think about making small dumps once a week and full ones once a month?

Margospl 19:45, 10 Mar 2005 (UTC)

Saving user contributions

Is there any way to save all my user contributions? That is, when I click on "my contributions" from my user page, I want to have all the "diff" links saved. I wonder if there is a faster way to do this than clicking "save as" on each link. Q0 03:52, 7 May 2005 (UTC)

Look into a recursive wget with a recursion level of one. --maru (talk) contribs 03:44, 19 May 2006 (UTC)

OLD data

- I've installed the MediaWiki script

- downloaded and installed all tables except the old tables...

MediaWiki doesn't work! Do I absolutely need the old tables or not?

Thanks

Davide

Dump frequency and size

Recent mailing list posts [1] [2] [3] indicate developers have been busy with the MediaWiki 1.5 update recently, but that smaller, more specific dumps can now more easily be created. I'm hoping we will see an article-namespace-only dump. The problem with talk pages is that the storage space required for them will grow without bound, whereas we can actually try to reduce the storage space needed for articles, or at least slow its growth, by merging redundant articles, solving content problems, etc. Such a dump would also make analysis tools that look only at encyclopedic content (and there are a growing number of useful reports; see Wikipedia:Offline reports and Template:Active Wiki Fixup Projects) run faster and not take up ridiculously large amounts of hard drive space (which makes it more difficult to find computers capable of producing them).

User:Radiant! has been asking me about getting more frequent updates of category-related reports for a "Categorization Recent Changes Patrol" project. This is difficult to do without more frequent database dumps, and I'm sure there are a number of other reports that could benefit from being produced more frequently, or at least containing more up to date information. (And less human editor time would be wasted as a result, we hope.)

I'm actually surprised that the database dumps must be produced manually; I'm sure there's an engineering solution that could automate the process, reducing the burden on developers. I hope the developers will be able to deal with performance and other issues to be able to do this in the near future, so we can stop nagging them about database dump updates and get on with the work of fixing up the wiki. -- Beland 8 July 2005 03:48 (UTC)

New XML database dump format

  • Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.
  • Is there a pre-existing way that anyone knows of to load the XML file into MySQL without having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)
  • Shouldn't this generate no errors?
xmllint 20050909_pages_current.xml

Currently for me it generates errors like this:

20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296
[[got:&#xD800;&#xDF37;&#xD800;&#xDF3B;&#xD800;&#xDF30;&#xD800;&#xDF39;&#xD800;&
                                                                              ^
20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158
&#xD800;&#xDF37;&#xD800;&#xDF3B;&#xD800;&#xDF30;&#xD800;&#xDF39;&#xD800;&#xDF46
                                                                              ^

-- All the best, Nickj (t) 08:36, 14 September 2005 (UTC)
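The values xmllint rejects here, 55296 (0xD800) and 57158 (0xDF46), fall in the UTF-16 surrogate range, which XML does not allow as character references. Each pair such as `&#xD800;&#xDF37;` looks like one character outside the Basic Multilingual Plane that was written out as two surrogate halves. The Python sketch below shows one way such pairs could be folded back into a single valid reference before parsing. The function name `fix_surrogate_refs` is made up for this example, and the approach is a workaround sketch, not something the dump tools are known to do.

```python
import re

# A high-surrogate reference (D800-DBFF) immediately followed by a
# low-surrogate reference (DC00-DFFF), both in hexadecimal form.
PAIR = re.compile(r"&#x(D[89AB][0-9A-F]{2});&#x(D[C-F][0-9A-F]{2});",
                  re.IGNORECASE)

def combine(match):
    """Turn one surrogate pair into a single code point reference."""
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return "&#x{:X};".format(code_point)

def fix_surrogate_refs(text):
    """Replace every surrogate-pair reference in text; leave the rest alone."""
    return PAIR.sub(combine, text)

print(fix_surrogate_refs("[[got:&#xD800;&#xDF37;&#xD800;&#xDF3B;]]"))
# prints [[got:&#x10337;&#x1033B;]]
```

The resulting code points fall in the Gothic block (U+10330 to U+1034F), which fits the `got:` interwiki prefix in the failing line.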

Problem with importDump.php

I was trying to import the id-language Wikipedia, but importDump.php stops at article 467 (out of a total of around 12000+). Can anybody help me with this problem? Borgx(talk) 07:53, 16 September 2005 (UTC)

A small and fast XML dump file browser

Check WikiFilter.

It works with all current wiki project dump files in all languages. You do not need PHP or MySQL. All you need is a web server like Apache, and then you can view a wiki page through your web browser, either in normal HTML format or in the raw wikitext format.

Rebuilding an index database is also reasonably fast. For example, the 3 GB English Wikipedia takes about 10 minutes on a Pentium 4. Wanchun (Talk) 06:35, 20 September 2005.

How to import the files?

Is there any way to import the XML dump files successfully? importDump.php stops before completing, with no error. The SQL dump files are too old :( , and xml2sql-java from Filzstift only imports the "cur" table (I need all tables for statistical purposes. Thanks Wanchun, but I still need to see the tables). Borgx(talk) 00:44, 21 September 2005 (UTC)

  • I successfully imported using mwdumper, though it took all day. — brighterorange (talk) 04:20, 21 September 2005 (UTC)
  • I ran mwdumper on the 20051002_pages_articles.xml file using the following command:
 java -jar mwdumper.jar --format=sql:1.5 20051002_pages_articles.xml>20051002_pages_articles.sql

and received the following error concerning an invalid XML format in the download file:

1,000 pages (313.283/sec), 1,000 revs (313.283/sec)
...
230,999 pages (732.876/sec), 231,000 revs (732.88/sec)
231,000 pages (732.88/sec), 231,000 revs (732.88/sec)
Exception in thread "main" java.io.IOException: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)

ECHOOooo... (talk) 9:22, 8 October 2005 (UTC)
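A SAX error of this kind usually means the document ended before its root element was closed, so one plausible cause is a truncated download or an incomplete decompression. A quick sanity check before running a long import is to look at the tail of the file for the closing root tag. The sketch below assumes the dump's root element is `<mediawiki>`, and the helper name `ends_with_root_close` is made up for this example.

```python
import io

def ends_with_root_close(stream, tail_bytes=1024):
    """Return True if a binary stream ends with the closing </mediawiki> tag."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_bytes))  # read only the last few bytes
    tail = stream.read()
    return tail.rstrip().endswith(b"</mediawiki>")

# In-memory examples standing in for a complete and a truncated dump.
complete = io.BytesIO(b"<mediawiki><page><title>A</title></page></mediawiki>\n")
truncated = io.BytesIO(b"<mediawiki><page><title>A</tit")
print(ends_with_root_close(complete), ends_with_root_close(truncated))
```

For a file on disk, open it in binary mode (`open(path, "rb")`) and pass the file object in. A passing check does not prove the XML is valid, only that the file was not cut off at the end.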

Disable keys

It might be worth commenting that importing the data from the new SQL dumps will be quicker if the keys and indices are disabled when importing the SQL. --Salix alba (talk) 23:22, 30 January 2006 (UTC)

Arrgh. Gad, it's slow. I'm trying to import the new SQL format dumps into a local copy of MySQL. I've removed the key definitions, which sped things up, but it's still taking an age. I'm trying to import enwiki-20060125-pagelinks.sql.gz on quite a new machine, and so far it's taken over a day and only got to L. Does anyone have hints on how to speed this process up? --Salix alba (talk) 09:49, 1 February 2006 (UTC)

Check out meta:Talk:Data_dumps#HOWTO_quickly_import_pagelinks.sql.--Bkkbrad 15:51, 22 February 2007 (UTC)
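In the same spirit as disabling keys, MySQL has session settings (`autocommit`, `unique_checks`, `foreign_key_checks`) that can be switched off around a bulk load so that less checking is done per row. The Python sketch below wraps the lines of a dump in those statements. Whether this helps, and by how much, depends on the storage engine and the MySQL version, so treat it as an experiment rather than a guaranteed fix. The function name `wrap_for_bulk_import` is made up for this example.

```python
def wrap_for_bulk_import(sql_lines):
    """Yield the dump's statements surrounded by bulk-load session settings."""
    # Relax per-row checks for the duration of this session only.
    yield "SET autocommit=0;"
    yield "SET unique_checks=0;"
    yield "SET foreign_key_checks=0;"
    for line in sql_lines:
        yield line.rstrip("\n")
    # Restore the defaults and commit everything in one go.
    yield "SET foreign_key_checks=1;"
    yield "SET unique_checks=1;"
    yield "COMMIT;"

# Hypothetical one-line dump, for illustration only.
dump = ["INSERT INTO pagelinks VALUES (1,0,'Example');\n"]
for statement in wrap_for_bulk_import(dump):
    print(statement)
```

The output can be piped into the `mysql` client. For a gzipped dump, the lines could come from `gzip.open(path, "rt")` instead of a list.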


Trademark violation?

Can someone please explain the "trademark violation" to me? How exactly is the wikipedia's GFDL content rendered through the GPL MediaWiki software a trademark violation, suitable only for "private viewing in an intranet or desktop installation"? -- All the best, Nickj (t) 22:50, 1 February 2006 (UTC)

Take a look at sites like http://www.lookitup.co.za/n/e/t/Netherlands.html (a mirror violating our copyright terms) that have used the static dumps. Many lack the working link back to the article, the GFDL, etc. The wording itself is from http://static.wikipedia.org/. Cheers. -- WB 00:47, 2 February 2006 (UTC)
But http://static.wikipedia.org/ does not explain why it is a trademark violation, and neither do you. Can you please explain why it is a trademark violation? Is it because of the Wikipedia logo? If so, MediaWiki does not show the Wikipedia logo by default (certainly the article-only dumps don't), hence no violation. -- All the best, Nickj (t) 06:02, 2 February 2006 (UTC)
Perhaps trademark violation is not the proper term. It is my understanding that without the proper link back to the original article, the terms of the GFDL have not been met; because of this, without a working hyperlink to the original location, no permission is granted to use the subject material (the static dumps, in this case), which would make most uses illegal in most of the world. Triddle 07:06, 2 February 2006 (UTC)
Yeah, I didn't write the website. It should be called "copyright violation" instead. What I found was that most static mirrors/forks lack proper licensing (so do other mirrors/forks, though). -- WB 07:16, 2 February 2006 (UTC)
I wasn't aware of the linking back to the original requirement; live and learn! I take it this would be section 4. J. of the GFDL license text, listing the requirements to distribute modified copies? But what if it's not modified? (e.g. it uses an unmodified dump of the data, which presumably is what most static mirrors will do). In that situation, why would the requirement to link back to the previous versions still apply? -- All the best, Nickj (t) 01:09, 3 February 2006 (UTC)
As far as I know, you need a live link back to the original article in Wikipedia, attribution to Wikipedia, mention of the GFDL license, and a link to some copy of GFDL as well. It derives from section 2 of GFDL:
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
As well as Wikipedia's license:
Wikipedia content can be copied, modified, and redistributed so long as the new version grants the same freedoms to others and acknowledges the authors of the Wikipedia article used (a direct link back to the article satisfies our author credit requirement).
I hope that helped. I hope information can be added in DB download page somehow. If you have time, take a look at WP:MF. Cheers. -- WB 04:48, 3 February 2006 (UTC)
I will do! Thank you for clarifying. -- All the best, Nickj (t) 05:58, 3 February 2006 (UTC)
First of all, it's quite difficult to make a Verbatim Copy. Some would say it's almost impossible. More importantly, Wikipedia believes the link satisfies the history requirement for Verbatim Copies; this is important because creating a Verbatim Copy requires copying the history section along with the page. Superm401 - Talk 04:43, 4 February 2006 (UTC)

How do you retrieve previously deleted files?

Maybe it says somewhere, but I just don't see it. Gil the Grinch 16:09, 21 February 2006 (UTC)

  • You will not be able to retrieve deleted files from the dumps. If your contribution was deleted for some reason, an admin may be able to retrieve it; however, this is not guaranteed. Cheers! -- WB 07:35, 22 February 2006 (UTC)

Latest dumps

There were several links to the latest versions of dumps in this article, e.g.: http://download.wikimedia.org/enwiki/latest/

But using them may be dangerous, because many of the latest dumps (especially English) are broken. Worst of all, these links give no information about the completeness of each file, just a list of files that looks absolutely normal.
For example, the link http://download.wikimedia.org/enwiki/20060402/ shows warnings about all broken files; I wonder why the link http://download.wikimedia.org/enwiki/latest/ doesn't look the same way.

So I changed those links to the last complete dumps (these should be updated manually). E.g. the last complete dump for enwiki is http://download.wikimedia.org/enwiki/20060219/ --Alexey Petrov 03:59, 9 April 2006 (UTC)

I added the latest versions link. As the dumps are now happening on a regular basis, about weekly, a link to the last complete dump would need to be updated weekly. That is a considerable maintenance task for this page, which is often out of date, so "latest" is likely to be more current than a specific date on this page. As for which are broken, it generally seems to be only the one with complete history. The recommended version, pages-articles.xml.bz2 (this contains current versions of article content, and is the archive most mirror sites will probably want), is considerably more up to date (2006-03-26) than 2006-02-19.
Maybe the best thing is just to point people to http://download.wikimedia.org/enwiki/ and let the user browse from there. --Salix alba (talk) 15:06, 9 April 2006 (UTC)
Yes, that seems to be the best solution. I have changed links. --Alexey Petrov 14:36, 11 April 2006 (UTC)

Opening the content of a dump

I have downloaded and decompressed a dump into an XML file. Now what? I made the mistake of trying to open that, and it froze up my computer because it was so huge! What program do I open it with in order to view the Wikipedia content? What next?! J@red  01:08, 31 May 2006 (UTC)

Yes, I've had the same problem; I never managed to get a dump imported into SQL on my machine. It all depends on what you want to do. Wikipedia:Computer help desk/ParseMediaWikiDump describes a Perl library which can do some operations. I've avoided SQL entirely; instead I've created my own set of Perl scripts which extract linking information and write it out to text files. I can then use the standard Unix grep/sed/sort/uniq etc. tools to gather certain statistics; for example, see User:Pfafrich/Blahtex en.wikipedia fixup, which illustrates the use of grep to pull out all the mathematical equations. --Salix alba (talk) 06:50, 31 May 2006 (UTC)
Well I understand that there are several ways to parse it, but is there any feasible way of easily opening up information from a dump in a program like IE or FireFox for viewing, just like I'm viewing this page now? J@red  19:28, 1 June 2006 (UTC)
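The reason a text editor chokes on the dump is that it tries to load the whole file at once. A script can instead read it one page at a time, in the same spirit as the Perl scripts mentioned above. Here is a minimal Python sketch using the standard library's streaming parser. It assumes a current-revisions dump whose pages contain `title` and `text` elements, and the sample document and its namespace URI are stand-ins made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(stream):
    """Yield (title, wikitext) pairs one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # drop the finished page's children to save memory

# Tiny stand-in for a dump file (the namespace URI is an assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Help:Contents</title><revision><text>Help text</text></revision></page>
  <page><title>Apple</title><revision><text>A fruit.</text></revision></page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    if title.startswith("Help:"):  # e.g. keep only one namespace
        print(title, len(text))
```

For a real dump, pass `open(path, "rb")`, or `bz2.open(path, "rb")` to read the compressed file without unpacking it first. This only extracts raw wikitext; rendering it as in a browser still needs MediaWiki or a similar tool.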

Subsets of Image Dumps

Is there a good, approved way to download a relatively small number of images (say ~4000)? The tar download is kinda slow for this purpose. It'd be faster to web crawl (even with the 1-second delay) than to deal with the tar. But that seems disfavoured.

--BC Holmes 17:07, 12 June 2006 (EST)

Simple steps for wikipedia while you travel

I travel a lot and sometimes like to look up stuff on my laptop. I see that I can do it, but like a lot of people I am clueless about computers. I was hoping that someone would make a section with easy-to-follow steps to make this work; I figure that I'm not the only one who would like this. I see the size of the wiki with pictures is 90 GB, wow, a little too big, hehe. But how hard would it be to just get pictures for, say, featured articles and the rest text only? Britannica and Encarta are both 3-4 GB. Text only would be fine too. Ansolin 05:59, 14 June 2006 (UTC)

That's basically what I want, too. J@red  19:39, 15 June 2006 (UTC)
If you're running GNU/Linux you can give wik2dict a try. DICT files are compressed hypertext. wik2dict needs an upgrade though. I will work on that very soon. I have also written MaemoDict, to have Wikipedia on my Nokia 770, though I don't have any space on there to put a big Wikipedia on it (so one of the things I will add to wik2dict is to just convert some categories, instead of the whole thing).
There might also be some software to support dict stuff in Windows. Guaka 20:28, 15 June 2006 (UTC)

I did not follow most of that. I use Windows, by the way. I thought text only was only about a gig, which should fit, but as I said I have no clue about computers :). Ansolin 04:41, 16 June 2006 (UTC)

See, the problem I have is that I've downloaded the wiki from the downloads page and I've decompressed it, but it's just a big gargantuan xml file that you can't read. How would I now, offline, open/dump that information in a wiki of my own? Do I need mediawiki? This is all so confusing and it seems that nowhere is a proper explanation! J@red  12:36, 16 June 2006 (UTC)

Downloading (and uploading) blocked IP addys

Hi, I recently created a wiki for sharing code snippets and I've started to get some spam bots adding nonsense or spam to the articles. Rather than block them one at a time, I was thinking it'd be easier to preemptively use Wikipedia's IP block list and upload it into my wiki.

So, my question is, how do I download the block list -- and -- how do I upload it to my wiki? Thanks!

How to import image dumps into a local Wiki installation?

I've imported the text dump file into my local wiki using importDump.php. I've also downloaded the huge tar image dump and put the extracted files (directory structure intact) under the upload path specified in LocalSettings.php. But my wiki installation doesn't seem to recognize them. What do I need to do? I think it has something to do with RebuildLinks or RebuildImages or something else, but I want to know the specific procedure. Thanks! --Jean Xu 12:26, 2 August 2006 (UTC)

Using XML dumps directly

People seem to be having lots of issues with trying to get this working correctly without any detailed instructions, myself included. As an alternative, there is WikiFilter, which is a far simpler tool to set up and get working. I am not affiliated with WikiFilter; I have just used the app and found it quite successful. WizardFusion 08:59, 20 September 2006 (UTC)

Possibility of using Bittorrent to free up bandwidth

Is it possible for someone to create an official torrent file from the Wikipedia .bz2 dumps? This would encourage leeching to some degree, but it would make it easier for people to download the large files for offline use (I would like Wikipedia as a maths/science reference for offline use when travelling). 129.78.64.106 04:54, 22 August 2006 (UTC)

Downloading Wikipedia, step-by-step instructions

Can someone replace this paragraph with step-by-step instructions on how to download Wikipedia into a usable format for use offline?

  • What software programs do you need? (I've read that MySQL is no longer being supported. I checked Apache's website but have no idea which of their many programs is needed to make a usable offline Wikipedia.)
  • What are the web addresses of those programs?
  • What version of those programs do you need?
  • Is a dump what you need to download?
  • Can someone write a brief layman's translation of the different files at http://download.wikimedia.org/enwiki/latest/ (what exactly each file contains, or what you need to download if you are trying to do a certain thing, e.g. you want the latest version of the articles in text only)? You can get a general idea from the title, but it's still not very clear.
  • What program do you need to download the files on that page?
  • If you get Wikipedia downloaded and set up, how do you update it, or replace a single article with a newer version?

The instructions on m:Help:Running MediaWiki on Debian GNU/Linux tell you how to install a fresh mediawiki installation. From here, you will still need to get the xml dump, and run mwdumper.

Why is this all so complicated?

  • Not everybody who wants to have access to Wikipedia offline is a computer systems expert or a computer programmer.
  • What is with this bzip crap? Why not use WinZip? I've spent so long trying to figure out this bzip thing that I could have downloaded 10 WinZip files already.
  • Even if I get this to work, can I get an HTML end product that I can browse offline as if I were online?

-- 195.46.228.21 12 September 2006. Agreed: Wikipedia, a very useful site, has failed the user-friendliness test for offline browsing.

  1. Granted.
  2. The best reason not to use WinZip is probably that WinZip is not free. On the other hand, bzip2 is. It is also a better compressor than WinZip 8, although WinZip 10 does compress more. [4] And because the source code of bzip2 is freely available, if someone has a computer that cannot decompress bzip2 files, it is much easier to rectify this situation than in the proprietary case.
  3. It sounds like you want a Static HTML tree dump for mirroring.
  4. Remember: just the articles are about 1.5GB compressed (so maybe around 7-8GB uncompressed?). Current versions of articles, talk pages, user pages, and such are about 2.5GB compressed (so I'd estimate about 13GB uncompressed); Images (e.g. pictures) will run you another 75.5GB. If you also want the old revisions (and who wouldn't?) that's about 45.9GB (I guess around 238GB uncompressed).

The Storm Surfer 04:08, 7 October 2006 (UTC)

Try 7-Zip for a compression utility that supports many formats. Superm401 - Talk 09:00, 3 February 2007 (UTC)

Possible Solution

I have written a simple page for getting the basics working on Ubuntu Linux (server edition). This worked for me, but there are issues with special pages and templates not showing. If anyone can help with this, it would be great. It's located at http://meta.wikimedia.org/wiki/Help:Running_MediaWiki_on_Ubuntu. WizardFusion 23:31, 1 October 2006 (UTC)

old images

Hello, why are the image dumps from last year? Would it be possible to get the small thumbnails as an extra file? It should be possible to separate the fair-use images from the dump. de:Benutzer:Kolossos 09:14, 12 October 2006 (UTC)

Making A DVD

How could I convert a Wikisource/Wikibooks/Wikipedia XML dump to HTML? I'm using Windows XP. —The preceding unsigned comment was added by Uiop (talkcontribs) 12:03, 24 December 2006 (UTC).

You would have to download the Static Wikipedia, but it's something like 60GB in all--including Talk files, User pages, User talk pages, Help and Wikipedia namespaces. I suppose you could write a script to automatically delete those files, but even after that it would be something like 30GB.
A DVD today is limited to 4.7 GB. You'll have to compress it beyond recognition. Thegeorgebush
What about Blu-ray Disc (50 GB)? That would work, if one could only figure out how to get the damn static HTML pages. --Sebastian.Dietrich 21:13, 5 August 2007 (UTC)

data base size?

Huge article, lots of information, but not a single hint as to the approximate size of the Wikipedia database... Can anyone help me find that out? --Procrastinating@talk2me 13:13, 25 December 2006 (UTC)

The last en.wiki dump I downloaded, enwiki-20061130-pages-articles.xml, expands to 7.93 GB. Reedy Boy 14:39, 25 December 2006 (UTC)
I think the indexes need an additional 30GB or so. I'm not sure yet because for the latest dump rebuildall.php has not completed after several days. (SEWilco 00:35, 22 January 2007 (UTC))

Update

When is Wikipedia going to be dumped again? Salad Days 06:35, 30 December 2006 (UTC)

It seems to be about monthly, usually on the lower side. Reedy Boy 13:12, 30 December 2006 (UTC)
Where are the new dumps? The new place for dumps has also been orphaned for more than a week. We at Wikipedia-World could really use it. Thanks. Kolossos 17:54, 23 January 2007 (UTC)
The latest dump is also over a day beyond its estimated finish time. Does this mean it failed? Is there anyone who can comment on the actual status? -- The preceding unsigned comment was added by 216.84.45.194 (talk) 21:50, 1 February 2007 (UTC).
Looks like it did fail but a new one was started since Flamesplash 18:15, 10 February 2007 (UTC)

Info on gzip --rsyncable vs. bzip

The Rsync section seems to say that rsync will work at least as well with bzip2 as with gzip --rsyncable. I'm not sure this is true. bzip2 does compress using blocks, but (1) the bzip2 manpage says blocks are at least 100 KB long, and (2) the blocks are on even 100 KB boundaries, so any change in the length of one article will affect the rest of the archive. gzip --rsyncable seems to ensure that less than 100 KB of compressed output after a change is affected (I'm basing that on the RSYNC_WIN constant of 4096 in the gzip patch; I figure that's just the order of magnitude). More importantly, gzip uses a rolling checksum to decide where block boundaries go, in such a way that inserting or deleting bytes doesn't affect the location of all future block boundaries. It's clever, and kind of out of my league to explain it clearly.

If CPU time on the wikimedia servers isn't an issue, my guess is that rsync -z on an uncompressed file is the lowest-bandwidth way to do incremental transfers, because changes are more localized than under other approaches and the network stream is still gzip-compressed (possibly at a lower compression level than gzip uses on files). Would require more disk space, too.

Of course, really it comes down to 1) what differences between these approaches actually turn out to be in testing, and 2) whether there's demand for less bandwidth-heavy updates given the costs (or other things like how busy the wiki sysadmins are). I'm not up to date on what discussions have already happened. —The preceding unsigned comment was added by 67.180.140.96 (talk) 00:32, 14 January 2007 (UTC).
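The rolling-checksum idea described above can be illustrated with a toy example: cut the data wherever a checksum over the last few bytes hits a chosen value, so that boundaries depend only on nearby content and not on absolute position. The Python sketch below is only an illustration of that principle with made-up parameters. It is not the algorithm or the constants used by the gzip patch.

```python
import random

WINDOW = 16    # bytes covered by the rolling sum (illustrative value)
MODULUS = 64   # a boundary is declared when the sum is a multiple of this

def chunks(data):
    """Split data after each byte where the rolling sum is 0 mod MODULUS."""
    out, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if i + 1 >= WINDOW and rolling % MODULUS == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # whatever is left after the last boundary
    return out

# Deterministic pseudo-random test data, then the same data with three
# bytes inserted near the start.
rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(20000))
new = old[:100] + b"xyz" + old[100:]

old_chunks, new_chunks = chunks(old), chunks(new)
print(len(old_chunks), len(new_chunks))
print(old_chunks[-10:] == new_chunks[-10:])
```

With fixed-size blocks, the three inserted bytes would shift every later block. Here the boundaries resynchronise shortly after the insertion, so the later chunks are byte-for-byte the same in both versions, which is what lets rsync skip them.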

Followup on rsync/incremental transfer

For whatever it's worth, diffing XML dumps using rsync --only-write-batch -B300 and compressing the result with bzip2 seems to produce a file that's substantially smaller than the monthly dump. (This is based on testing with ruwiki-20061108-pages-articles.xml and ruwiki-20061207-pages-articles.xml: rsync-batch.bz2 was 16.6 MB while the 20061207 dump was 79.9 MB bzipped.) Producing dumps with only new and changed articles (and tools to process them) might also be useful. Again, wikifolks may not have the time or the need for either. Bigger gains may come from using smaller/larger rsync block sizes. 67.180.140.96 03:35, 14 January 2007 (UTC)

1-24-2007 dump frozen

The 1-24-2007 dump looks to be stalled/broken. It's several days past its ETA, and in the past it has only taken a day or two. Can someone comment on the actual status, or on whether I should be notifying someone else through a different mechanism so that this can be restarted or the next dump begun? Flamesplash 17:02, 5 February 2007 (UTC)

Downloading templates

Is there somewhere where I can download some of the templates that are in use here on wikipedia?

—Preceding unsigned comment added by 74.135.40.211 (talk)

You probably want the pages-articles.xml.bz2. It has Articles, templates, image descriptions, and primary meta-pages. For example Reedy Boy 14:55, 13 February 2007 (UTC)
The above link returns 404. Is there somewhere else I can download all the templates used on Wikipedia? --Magick93 11:22, 1 August 2007 (UTC)


here Wowlookitsjoe 18:40, 25 August 2007 (UTC)

Judging by the file name, this is all Wikipedia articles, and not just the templates - is that correct Wowlookitsjoe? --Magick93 15:38, 1 September 2007 (UTC)

I am looking to download Wiki templates as well. What else will I need to work with it? FrontPage? Dreamweaver? —Preceding unsigned comment added by 84.59.134.169 (talk) 19:31, August 27, 2007 (UTC)
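For what it's worth, no web editor is needed: templates are ordinary wiki pages whose titles start with "Template:", so they can be pulled out of pages-articles.xml with a short script. The following is a rough sketch in Python, assuming the usual export layout of page elements containing title and text elements. The function name `iter_templates` and the sample data are invented for the example, and a real dump would be opened with something like bz2.open rather than an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET


def iter_templates(stream, prefix="Template:"):
    """Yield (title, wikitext) for every page whose title starts with `prefix`."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop the XML namespace, e.g. "{http://...}page" -> "page".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            if title is not None and title.startswith(prefix):
                yield title, text or ""
            title, text = None, None
            elem.clear()  # free memory; full dumps are far too big to keep


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki>
  <page><title>Apple</title><revision><text>A fruit.</text></revision></page>
  <page><title>Template:Stub</title><revision><text>{{ambox}}</text></revision></page>
</mediawiki>"""

for template_title, wikitext in iter_templates(io.BytesIO(SAMPLE)):
    print(template_title, len(wikitext))
```

Templates often call other templates, so importing a single one into another wiki usually means fetching everything it transcludes as well.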

[edit] Image dumps getting old

The image dumps are over a year old now (last modified 2005-Nov-27), is there any plan to update them? Bryan Derksen 10:23, 15 February 2007 (UTC)

What are you talking about...? Look Here 2007-02-07... Reedy Boy 10:57, 15 February 2007 (UTC)
There are only database dumps for the various SQL tables at that URL. Image file dumps are located at http://download.wikimedia.org/images/wikipedia/en/ and were last updated in November 2005, as Bryan wrote. I'd also be interested in a newer version.--134.130.4.46 23:54, 22 February 2007 (UTC)
The image metadata database does get dumped so if all else failed I suppose one could rig up a script to download the matching images directly from en.wikipedia.org. I imagine that would take longer and put more of a load on the servers than an image dump would, but if that's the only place the images are available then that's all that I can think of doing. Bryan Derksen 01:26, 1 March 2007 (UTC)
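As a rough sketch of the first step such a script would need, the Python below turns a file name from the image metadata dump into a download URL. It assumes MediaWiki's hashed upload directory layout, where the path is derived from the MD5 of the file name with spaces replaced by underscores. The base URL is an assumption, and the function name `hashed_upload_url` is made up for the example. Many files used on en.wikipedia.org are actually hosted on Commons, so a real script would need a fallback.

```python
import hashlib
from urllib.parse import quote


def hashed_upload_url(file_name,
                      base="https://upload.wikimedia.org/wikipedia/en"):
    """Build the upload URL for `file_name` under the hashed directory scheme."""
    # File names are stored with underscores and an upper-case first letter.
    name = file_name.strip().replace(" ", "_")
    name = name[:1].upper() + name[1:]
    # The first one and two hex digits of the MD5 give the two directory levels.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{base}/{digest[0]}/{digest[:2]}/{quote(name)}"


print(hashed_upload_url("example image.jpg"))
```

Any script built on this should fetch slowly and one file at a time, for exactly the server load reason mentioned above.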

[edit] Link to HTML dumps not updated

Not sure where to put this; I kept looking for a mail address for some tech staff. Anyway, according to the logfile on http://download.wikipedia.org, the December dump is in progress, and the link points to the November one. However, the URL http://static.wikipedia.org/downloads/December_2006 works perfectly, so I guess all that needs doing is updating the link? (http://houshuang.org/blog - pre-alpha tool to view html-dumps without unzipping) Houshuang 06:13, 21 February 2007 (UTC)

[edit] enwiki dumps failing?

There is an enwiki dump currently processing at the moment, but the previous one failed a number of jobs: [5]

In particular:

2007-04-02 14:05:27 failed Articles, templates, image descriptions, and primary meta-pages. This contains current versions of article content, and is the archive most mirror sites will probably want. pages-articles.xml.bz2

I can't seem to find any older dumps either...

67.183.26.86 17:56, 5 April 2007 (UTC)

I'm also trying to make a local mirror for use while traveling, but can't find a functioning older dump. Any word on when the next dump will be out?

Looks like with the failures, they've done another a day later here. That is 02-04-2007, whereas the other was 01-04-2007. The pages-articles.xml.bz2 is done 2007-04-07 12:59:23 done Articles, templates, image descriptions, and primary meta-pages link. Reedy Boy 15:27, 11 April 2007 (UTC)

I'm a little worried: I am a PhD student whose research is completely dependent on this data. With the last 2 dumps failing, and others being removed, there isn't a single complete dump of the English Wikipedia data left available for download, and there hasn't been a new dump for almost a month. Has something happened? --130.217.240.32 04:47, 27 April 2007 (UTC)

A full, 84.6GB (compressed) dump has finally completed [6] (PLEASE, whoever runs the dumps, keep this one around at least until a newer version is complete, so that there's always at least one good, full dump). Regarding research, it's been suggested in the requests for dumps on the meta site to dump random subsets, but nothing has happened on that as of yet AFAICT.--81.86.106.14 22:38, 8 May 2007 (UTC)

[edit] Wikisign.org alternative

Wikisign seems to be down. Is there an alternative? MahangaTalk 02:53, 12 April 2007 (UTC)

[edit] Image dumps are defective or inaccessible, Images torrent tracker doesn't work either

All links to image dumps in this article are bad, and none of the BitTorrent downloads are working. As of now there is no way to download the images other than scraping the article pages. This will choke wiki bandwidth, but it looks like people have no other choice. I think sysops should take a look at this, unless of course this is intentional... is it?

[edit] How do HTML dumps work?

The page says, that beginning with V1.5 there are routines to dump a wiki to html. How does this work? How can I use this on my own mediawiki? --Sebastian Dietrich 10:32, 19 May 2007 (UTC)

[edit] Image Dumps Stolen

Why are all image dumps gone??? Thousands of users provided this information with the intent that it be freely available, but now ONLY the Wikipedia site can provide this information in a drip format (HTML). This looks a lot like what happened to the CDDB album database. It was collected as free info, but now it's been stolen and is handed out piecemeal so only Wikipedia can provide the info, everyone else must beg. Of course if you actually try to get all the images via a spider you will be banned. This is quite a corruption of how most people (including myself) who contributed to Wikipedia envisioned the information being used. —Preceding unsigned comment added by 63.100.100.5 (talk) 20:19, 31 May 2007

It's probably not as helpful as you'd hope to fling around conspiracy accusations. I don't understand what you mean by saying that the image dumps are "gone" or "stolen"; there are full database dumps available at download.wikimedia.org/enwiki, just as the page says. The current dump, started on May 27, is still in progress, though the dump containing the articles, templates, image descriptions and some metadata pages is complete. What are you looking for that you can't find? grendel|khan 06:12, 1 June 2007 (UTC)
About the only thing you won't be able to get your hands on would be the private user data. Reedy Boy 07:59, 1 June 2007 (UTC)
Maybe he's referring to the claim which is also in the article, in the section Currently Wikipedia does not allow or provide facilities to download all Images. I doubt that information is true, but the article currently says image dumps are and are not available. (SEWilco 13:27, 1 June 2007 (UTC))
No, as I stated before, the images are all gone. Please look at the URL you sent: download.wikimedia.org/enwiki, there are NO images present. If you can find even one image, please state how you found it. All images are gone, all images are stolen. Maybe not a conspiracy, but certainly a very disappointing "change in operating policy", corporate speak for we are *blanking* you and you can't do anything about it.
Still no word on image dump file availability. --66.74.75.39 01:51, 25 July 2007 (UTC)
Still no word on image dumps as of December of 2007. The notice of "Check back mid-2007" has obviously been removed. What gives? Dchristle (talk) 21:43, 15 December 2007 (UTC)

[edit] Static HTML dumps page always down !

http://static.wikipedia.org/
Hello, the Static HTML dumps download page is never working. Is there a non-wikipedian mirror where we can download the complete HTML version of en.wiki (and fr:wiki as well)? Or maybe a torrent? 13:16, 5 September 2007 (UTC)

[edit] Database backups of enwiki keep getting cancelled

The enwiki database dumps seem to either get cancelled or fail. Considering the importance of English Wikipedia, this seems like a critical problem. Is anyone working on fixing it? It looks like there hasn't been a proper backup of enwiki for over 2 months. This needs escalating, but there are no obvious ways of doing so. --Alun Liggins 19:51, 5 October 2007 (UTC)

I spoke to someone on the wikitech channel on IRC and they told me that the problem is on someone's ToDo list, but that it's a 'highly complicated' problem. I don't know what this means in terms of its prospects for getting done. Maybe we can find out who is responsible for moving forward with this? I wonder if there is something we can do to help in the meantime. Bestchai 04:44, 6 October 2007 (UTC)
Thanks Bestchai, Looking at the backups, there are other backups after the "every edit ever" backup that just includes the current pages, surely it would be better to skip to those, so at least we'd have the last edit rather than nothing at all in the event of a disaster. Reading Wikipedia Weekly episode 31 it appears that the dumps just mysteriously fail after a while. Maybe these aren't the only backups they do, and they do tape/disk copies of the main datafiles too? Anyone know? If this is the only backup then this should be the absolute number 1 priority for the Wikipedia foundation to get this fixed today. --Alun Liggins 13:57, 6 October 2007 (UTC)
Complete absence of any feedback from the administrators, perhaps I've just not looked in the correct place, or they are avoiding the issue? --Alun Liggins 17:40, 12 October 2007 (UTC)
I went on IRC and asked in #wikimedia-tech, here's a log of what I got: http://pastebin.ca/734865 and http://leuksman.com/log/2007/10/02/wiki-data-dumps/ 69.157.3.164 02:40, 13 October 2007 (UTC)
Database dumps now look totally broken/stalled. I've not been able to determine (anywhere) if this is the sole method of backing up the databases. --Alun Liggins (talk) 21:20, 13 December 2007 (UTC)
Now the enwiki ones have failed for pages-articles, and no one seems concerned or wants to talk about it. The two messages I posted to wikitech-l asking about the backups have been deleted. I would have thought that backups take precedence over other fluffier things. --Alun Liggins (talk) 19:47, 21 December 2007 (UTC)
I've not been able to get a viable pages-meta-history.xml.7z file for at least 4 months. I think this is a reasonably serious issue. Is anyone out there maintaining an archive of successful full backups, or at least a list of the successful backups for enwiki? Is anyone actually looking at this problem? Aidepkiwi (talk) 17:52, 10 January 2008 (UTC)

[edit] Where can I download an enwiki XML dump ?

Hello, could someone who knows about this please post the link to the exact file, pages-articles.xml.bz2, for me please. The more recent the better. I did read the explanations but when I go to http://download.wikimedia.org/enwiki/ I can't download anything useful. When I get to http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml I then click http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2 but I get error 404. Are there any older versions I can download ? Thanks in advance. Jackaranga 01:00, 20 October 2007 (UTC)

Ok, no error 404 anymore problem solved. Jackaranga 13:40, 23 October 2007 (UTC)

[edit] a simple table!

Hi, I'm editing the article MS. Actually, it's more of a disambiguation page or a list. There is a little debate there about how we could format the list. I couldn't help but think of my good old database programming class. There we could pick and choose how we wanted the information to be displayed. I think it would be really handy to be able to make a table, sort it in various ways, and have it display on Wikipedia the way that the final user would like. For example, the article could sort the list by MS, mS, Ms, M.S., etc., or by category (medical, aviation, etc.), then alphabetically, and so on. I can't put two and two together on how SQL and regular articles (Wikipedia's database) could be made to work together. --CyclePat (talk) 03:28, 26 November 2007 (UTC)

[edit] Main namespace dump

Are there any dumps of only the main namespace? It would be simple to do on my own, but it would be time consuming and memory intensive, and it seems like this would be something useful for other users. My parser is getting bogged down on old archived Wikipedia namespace pages which aren't particularly useful for my purposes, so it would be nice to have only the actual articles. Thanks. Pkalmar (talk) 01:58, 21 December 2007 (UTC)
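In the absence of such a dump, a streaming filter keeps memory use low because it never holds more than one page at a time. The sketch below is in Python and assumes each page element carries an ns child element holding the namespace number, which newer export schemas provide. For dumps without it, the title prefix (for example "Wikipedia:") would have to be checked instead. The function name `main_namespace_titles` and the sample data are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: "{http://...}page" -> "page"."""
    return tag.rsplit("}", 1)[-1]


def main_namespace_titles(stream):
    """Yield the titles of pages whose <ns> value is 0 (the article namespace)."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Collect the direct children of <page>, e.g. title, ns, id, revision.
        fields = {local_name(child.tag): (child.text or "") for child in elem}
        if fields.get("ns") == "0":
            yield fields.get("title", "")
        elem.clear()  # discard the finished page so memory stays flat


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Apple</title><ns>0</ns><id>1</id></page>
  <page><title>Wikipedia:Archive 3</title><ns>4</ns><id>2</id></page>
  <page><title>Template:Stub</title><ns>10</ns><id>3</id></page>
</mediawiki>"""

print(list(main_namespace_titles(io.BytesIO(SAMPLE))))
```

The same loop can write the matching page elements back out to a smaller file, which then only has to be produced once per dump.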

[edit] Arrrrgh, I can't read/find current dumps of the file

I can't seem to be able to read the file on my computer. Any help? And where is the current dump? I accidentally downloaded an old one. (TheFauxScholar (talk) 03:27, 4 April 2008 (UTC))

[edit] Inability to look after the database dumps

You would think the Wikimedia Foundation, with all the funding it gets, would be able to actually deliver on the part of the open source license that dictates that the "source" (i.e. dumps of the database) be made available. Currently they constantly violate this, then shout and scream (as is the Wikipedia way) at people who ask why there are no recent dumps. Hopefully someone will make a fork and run it properly, oh, but hang on, the Wikimedia Foundation "seem" to almost deliberately restrict access to the pictures.... so no fork of Wikipedia then....! —Preceding unsigned comment added by 77.96.111.181 (talk) 18:33, 18 May 2008 (UTC)

[edit] Static HTML dump

Does anyone know why no static HTML dump is available? I also asked this question here. Bovlb (talk) 23:11, 22 May 2008 (UTC)

Big silence. I see that people have also been asking on wikitech, but there's been no comment since 2008-03-06, when it was a week away. Is there somewhere else I should be asking? Bovlb (talk) 23:21, 30 May 2008 (UTC)