Wikipedia talk:Database download
From Wikipedia, the free encyclopedia
- Please note that questions about the database download are more likely to be answered on the wikitech-l mailing list than on this talk page.
Archives
Unwieldy Images
Are the English Wikipedia image dumps actually up to 75.5 GB, from 15 GB in June? Could this huge tar file be broken into 10 GB batches or something? 75 GB is quite a download.
- Rsync is the best solution for downloading the "upload.tar" image file (which is 75 GB in size). It allows resuming downloads and automatic error checking and repair (separately on each small part of the file). And if you already have an old version of "upload.tar", it will download only the differences between the two. But this "updating" feature of rsync is useful only for TAR archives (because they only collect data without compressing it), and useless for files compressed with gzip/bzip2/7-zip, because any small change in the data causes large changes throughout almost the whole compressed file.
- For more info look here: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Rsync
- --Alexey Petrov 03:26, 9 April 2006 (UTC)
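The point about compression defeating rsync's delta transfer is easy to demonstrate: flip one bit in the middle of some data and compare how far the raw and bzip2-compressed versions stay identical. A minimal sketch (Python used purely for illustration; it is not part of any download tooling):

```python
import bz2

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

text = b"Some article text that repeats over and over.\n" * 4000
modified = bytearray(text)
modified[len(text) // 2] ^= 0xFF  # flip one bit in the middle
modified = bytes(modified)

# Raw files: identical up to the midpoint, so rsync resends only the
# blocks covering the change.
print(common_prefix_len(text, modified))  # exactly len(text) // 2

# bzip2 output: the change ripples through the whole compressed block,
# so rsync would end up resending nearly the entire file.
print(common_prefix_len(bz2.compress(text), bz2.compress(modified)))
```

The raw versions agree on everything before the flipped bit, while the compressed versions diverge within the first few header bytes.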
freecache.org
Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be experimented with for files < 1 GB and > 5 MB. -Alterego
- Also, in relation to incremental updates, it seems that the only reason freecache wouldn't work is because old files aren't accessed often, so they aren't cached. If some method could be devised whereby everyone who needed incremental updates accessed the files within the same time period, perhaps via an automated client, you could better utilize the ISP's bandwidth. I could be way off here. -Alterego
Mailing list for notification of new dumps?
It would be great if there was a mailing list to notify interested users when there were new database dumps available. (I couldn't see such a mailing list currently, so I'm assuming it doesn't exist). Ideally the notification email would include HTTP links to the archive files, and would be generated after all the dumps had been completed (I've seen sometimes that the download page has an "intermediate" stage, where the old dumps are no longer listed, but the new ones have not been fully created). This mailing list would presumably be very low-volume, and would be especially useful as there doesn't seem to be an easily predictable timetable for dumps to be released (sometimes it's less than a week, often it's once every two weeks, and sometimes such as now it's up to 3 weeks between updated dumps), and because (for some applications) getting the most current Wikipedia dump possible is quite desirable. -- All the best, Nickj (t) 00:36, 28 Jan 2005 (UTC)
- I agree something like this would be useful. I'll try to cook up an RSS feed for that page. --Alterego 04:14, Jan 28, 2005 (UTC)
Perhaps we can share Wikipedia in BitTorrent form and view it on our iPod
Help: Namespace download
It seems to me that a lot of people download the database just for the "Help:" namespace, so it would seem logical to provide a separate download for it. This could be done via the Perl script that is provided in this very article. It would: 1) save bandwidth, 2) save time, 3) make life easier for MediaWiki users :), and 4) not be very hard.
Otherwise, would anyone be able to point me in the right direction of someone's own download of it?
I would like to second this comment. The idea of creating such a great tool and then forcing new administrators to download the entire database just to get the "Help:" pages is crazy. Armistej
And one more "gotcha". You need the "Template:" namespace too. Otherwise a lot of your "Help:" pages are missing bits and pieces from them. Unfortunately, the "Template:" namespace contains templates from all parts of the MediaWiki, not just templates relating to "Help:". If I'm running an all-English MediaWiki on my Intranet, I couldn't give $0.02 about the Russian, Chinese, or other non-English templates which will never be referenced. We need some sort of script to find the "dangling" templates and zap them once and for all.
- Exactly! I started a Help Desk query about this on 07/08/06, not having seen the above yet, nor having known to ask about the template. I'll repeat here that my motivation was that I do not currently have internet at home, so learning about wiki takes away from my short online opportunities. I bet others are in this boat.
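A script along the lines suggested above is not hard to sketch. Assuming you can already iterate over (title, wikitext) pairs from a dump (via any dump parser), collecting the templates actually referenced from the "Help:" namespace gives you the set to keep; everything else is a "dangling" candidate. The page data below is made up for illustration, and the simple regex ignores nested templates and parameters:

```python
import re

# Matches the name right after "{{", stopping at "|" or "}".
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)")

def templates_used(pages, prefix="Help:"):
    """Collect template names referenced from pages whose title
    starts with the given namespace prefix."""
    used = set()
    for title, text in pages:
        if title.startswith(prefix):
            for match in TEMPLATE_RE.finditer(text):
                used.add(match.group(1).strip())
    return used

pages = [
    ("Help:Editing", "See {{Tip|use previews}} and {{Note}}."),
    ("Template:Tip", "{{Note}} styling here."),
    ("Ru-something", "{{Unrelated-russian-template}}"),
]
print(templates_used(pages))  # only the templates Help: pages reference
```

A real script would also need to follow template-to-template references transitively (templates used inside kept templates), but the idea is the same.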
More frequent cur table dumps?
I think it would be valuable if you did more frequent dumps of the cur table. Full dumps take a lot of time, and not everyone needs them. As I see, one was started yesterday and is not finished yet. What do you think about making small dumps once a week and full ones once a month?
Margospl 19:45, 10 Mar 2005 (UTC)
Saving user contributions
Is there any way to save all my user contributions? That is, when I click on "my contributions" from my user page, I want to have all the links to "diff" saved. I wonder if there is a faster way to do this than click "save as" on each link. Q0 03:52, 7 May 2005 (UTC)
- Look into a recursive wget with a recursion level of one. --maru (talk) contribs 03:44, 19 May 2006 (UTC)
OLD data
- I've installed the MediaWiki script
- downloaded and installed all tables with the exclusion of the OLD tables...
MediaWiki doesn't work! Do I absolutely need the old tables or not?
Thanks
Davide
Dump frequency and size
Recent mailing list posts [1] [2] [3] indicate developers have been busy with the MediaWiki 1.5 update recently, but that smaller, more specific dumps can now more easily be created. I'm hoping we'll see an article-namespace-only dump. The problem with talk pages is that the storage space required for them will grow without bound, whereas we can actually try to reduce the storage space needed for articles - or at least slow growth - by merging redundant articles, solving content problems, etc. Such a dump would also make analysis tools that look only at encyclopedic content (and there are a growing number of useful reports - see Wikipedia:Offline reports and Template:Active Wiki Fixup Projects) run faster and not take up ridiculously large amounts of hard drive space (which makes it more difficult to find computers capable of producing them).
User:Radiant! has been asking me about getting more frequent updates of category-related reports for a "Categorization Recent Changes Patrol" project. This is difficult to do without more frequent database dumps, and I'm sure there are a number of other reports that could benefit from being produced more frequently, or at least containing more up to date information. (And less human editor time would be wasted as a result, we hope.)
I'm actually surprised that the database dumps must be produced manually; I'm sure there's an engineering solution that could automate the process, reducing the burden on developers. I hope the developers will be able to deal with performance and other issues to be able to do this in the near future, so we can stop nagging them about database dump updates and get on with the work of fixing up the wiki. -- Beland 8 July 2005 03:48 (UTC)
New XML database dump format
- Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.
- Is there a pre-existing way that anyone knows of to load the XML file into MySQL without having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)
- Shouldn't this generate no errors?
xmllint 20050909_pages_current.xml
Currently for me it generates errors like this:
20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296   [[got:...
20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158
-- All the best, Nickj (t) 08:36, 14 September 2005 (UTC)
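For what it's worth, 55296 is 0xD800, the start of the UTF-16 surrogate range (U+D800-U+DFFF), and surrogates are not legal XML characters. The dump apparently wrote one astral-plane character (here from the Gothic [[got:...]] interwiki text) as two separate numeric character references, one per surrogate half, which xmllint rightly rejects. Combining the pair recovers the intended code point:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Fold a UTF-16 surrogate pair back into a single code point."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The two values from the xmllint errors above:
code_point = combine_surrogates(55296, 57158)
print(hex(code_point))  # 0x10346, which falls in the Gothic alphabet block
```

So one workaround is to rewrite such adjacent reference pairs into a single valid character reference before validating or importing the file.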
Problem with importDump.php
I was trying to import the id language Wikipedia, but importDump.php stops at article 467 (out of a total of around 12,000+). Can anybody help me with this problem? Borgx(talk) 07:53, 16 September 2005 (UTC)
A small and fast XML dump file browser
Check WikiFilter.
It works with all current wiki project dump files in all languages. You do not need PHP or Mysql. All you need is a web server like Apache, and then you can view a wiki page through your web browser, either in normal html format, or in the raw wikitext format.
Rebuilding an index database is also reasonably fast. For example, the 3-GB English Wikipedia takes about 10 minutes on a Pentium 4. Wanchun (Talk). 06:35, 20 September 2005.
How to import the files?
Is there any way to import the XML dump files successfully? importDump.php stops incomplete with no error. The SQL dump files are too old :(, and xml2sql-java from Filzstift only imports the "cur" table (I need all the tables for statistical needs. Thanks Wanchun, but I still need to see the tables). Borgx(talk) 00:44, 21 September 2005 (UTC)
- I successfully imported using mwdumper, though it took all day. — brighterorange (talk) 04:20, 21 September 2005 (UTC)
- I ran mwdumper on the 20051002_pages_articles.xml file using the following command:
java -jar mwdumper.jar --format=sql:1.5 20051002_pages_articles.xml>20051002_pages_articles.sql
and received the following error concerning an invalid XML format in the download file:
1,000 pages (313.283/sec), 1,000 revs (313.283/sec)
...
230,999 pages (732.876/sec), 231,000 revs (732.88/sec)
231,000 pages (732.88/sec), 231,000 revs (732.88/sec)
Exception in thread "main" java.io.IOException: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
ECHOOooo... (talk) 9:22, 8 October 2005 (UTC)
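A truncated or corrupted download is one common cause of "XML document structures must start and end within the same entity", so it can be worth stream-checking well-formedness before committing to an hours-long import. A minimal sketch (it assumes the dump is already decompressed, and discards element content as it goes rather than building the whole tree):

```python
import io
from xml.etree.ElementTree import iterparse, ParseError

def well_formed(stream) -> bool:
    """Stream-parse XML, freeing element content as each element completes."""
    try:
        for _event, elem in iterparse(stream):
            elem.clear()  # avoid holding multi-gigabyte dumps in memory
        return True
    except ParseError:
        return False

good = io.StringIO("<mediawiki><page><title>A</title></page></mediawiki>")
truncated = io.StringIO("<mediawiki><page><title>A</title>")
print(well_formed(good))       # True
print(well_formed(truncated))  # False
```

For a real dump you would pass `open("dump.xml", "rb")` (file name hypothetical) instead of the in-memory examples.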
Disable keys
It might be worth commenting that importing the data from the new SQL dumps will be quicker if the keys and indices are disabled when importing the SQL. --Salix alba (talk) 23:22, 30 January 2006 (UTC)
Arrgh. Gad, it's slow. I'm trying to import the new SQL format dumps into a local copy of MySQL. I've removed the key definitions, which sped things up, but it's still taking an age. I'm trying to import enwiki-20060125-pagelinks.sql.gz on quite a new machine, and so far it's taken over a day and only got to L. Does anyone have hints on how to speed this process up? --Salix alba (talk) 09:49, 1 February 2006 (UTC)
older dumps
Do we need to have the link to the older dumps way up in the page? We want to encourage people to use the latest dumps, and a link at the bottom seems sufficient to me. -- WB 10:47, 1 February 2006 (UTC)
Fine by me, go to it. Page does need a major rework as much of it is directed to XML dumps whilst the new dumps are in SQL. --Salix alba (talk) 11:15, 1 February 2006 (UTC)
Trademark violation?
Can someone please explain the "trademark violation" to me? How exactly is the wikipedia's GFDL content rendered through the GPL MediaWiki software a trademark violation, suitable only for "private viewing in an intranet or desktop installation"? -- All the best, Nickj (t) 22:50, 1 February 2006 (UTC)
- Take a look at sites like http://www.lookitup.co.za/n/e/t/Netherlands.html (a mirror violating our copyrights terms) that have used the static dumps. Many lack the working link back to the article and GFDL, etc. The wording itself is from http://static.wikipedia.org/. Cheers. -- WB 00:47, 2 February 2006 (UTC)
- But http://static.wikipedia.org/ does not explain why it is a trademark violation, and neither do you. Can you please explain why it is trademark violation? Is because of the Wikipedia logo? If so, MediaWiki does not show the Wikipedia logo by default (certainly the article only dumps don't), hence no violation. -- All the best, Nickj (t) 06:02, 2 February 2006 (UTC)
- Perhaps trademark violation is not the proper term. It is my understanding that without the proper link back to the original article the terms of the GFDL have not been met; because of this, without a working hyperlink to the original location, no permission is granted to use the subject material (the static dumps, in this case), which would make most uses illegal in most of the world. Triddle 07:06, 2 February 2006 (UTC)
- Yeah, I didn't write the website. It should be called "copyright violation" instead. What I found was that most static mirrors/forks lack proper licensing. (So do other mirrors/forks, though.) -- WB 07:16, 2 February 2006 (UTC)
- I wasn't aware of the linking back to the original requirement; live and learn! I take it this would be section 4. J. of the GFDL license text, listing the requirements to distribute modified copies? But what if it's not modified? (e.g. it uses an unmodified dump of the data, which presumably is what most static mirrors will do). In that situation, why would the requirement to link back to the previous versions still apply? -- All the best, Nickj (t) 01:09, 3 February 2006 (UTC)
- As far as I know, you need a live link back to the original article in Wikipedia, attribution to Wikipedia, mention of the GFDL license, and a link to some copy of GFDL as well. It derives from section 2 of GFDL:
- You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
- As well as Wikipedia's license:
- Wikipedia content can be copied, modified, and redistributed so long as the new version grants the same freedoms to others and acknowledges the authors of the Wikipedia article used (a direct link back to the article satisfies our author credit requirement).
- I hope that helped. I hope information can be added in DB download page somehow. If you have time, take a look at WP:MF. Cheers. -- WB 04:48, 3 February 2006 (UTC)
- I will do! Thank you for clarifying. -- All the best, Nickj (t) 05:58, 3 February 2006 (UTC)
- First of all, it's quite difficult to make a Verbatim Copy. Some would say it's almost impossible. More importantly, Wikipedia believes the link satisfies the history requirement for Verbatim Copies; this is important because creating a Verbatim Copy requires copying the history section along with the page. Superm401 - Talk 04:43, 4 February 2006 (UTC)
How do you retrieve previously deleted files?
Maybe it says somewhere, but I just don't see it. Gil the Grinch 16:09, 21 February 2006 (UTC)
- You will not be able to retrieve deleted files from the dumps. If your contribution was deleted for some reason, an admin may be able to retrieve it; however, this is not guaranteed. Cheers! -- WB 07:35, 22 February 2006 (UTC)
Latest dumps
There were several links to the latest versions of dumps in this article, e.g.: http://download.wikimedia.org/enwiki/latest/
But using them may be risky, because many of the latest dumps (especially the English ones) are broken. And worst of all, at these links there is no information about the completeness of each file - just a list of files, which looks absolutely normal.
For example, the link http://download.wikimedia.org/enwiki/20060402/ shows warnings about all the broken files - I wonder why the link http://download.wikimedia.org/enwiki/latest/ doesn't look the same way.
So I changed those links to the last complete dumps (these will need to be updated manually). E.g. the last complete dump for enwiki is http://download.wikimedia.org/enwiki/20060219/ --Alexey Petrov 03:59, 9 April 2006 (UTC)
- I added the latest versions link. As the dumps are now happening on a regular basis, about weekly, linking to the last complete dump would need to be updated on a weekly basis. This makes for a considerable maintenance task to update this page, which is often out of date, so latest is likely to be more current than a specific date on this page. As for which are broken, it generally seems to be only the one with complete history. The recommended version, pages-articles.xml.bz2 (this contains current versions of article content, and is the archive most mirror sites will probably want), is considerably more up to date (2006-03-26) than 2006-02-19.
- Maybe the best thing is just to point people to http://download.wikimedia.org/enwiki/ and let the user browse from there. --Salix alba (talk) 15:06, 9 April 2006 (UTC)
- Yes, that seems to be the best solution. I have changed links. --Alexey Petrov 14:36, 11 April 2006 (UTC)
Opening the content of a dump
I have downloaded and decompressed a dump into an XML file. Now what? I made the mistake of trying to open that and it froze up my computer it was so huge! What program do I open it with in order to view the Wikipedia content? What next?! → J@red 01:08, 31 May 2006 (UTC)
- Yes, I've had the same problem; I never managed to successfully get a dump imported into SQL on my machine. It all depends on what you want to do. Wikipedia:Computer help desk/ParseMediaWikiDump describes a Perl library which can do some operations. I've avoided SQL entirely; instead I've created my own set of Perl scripts which extract linking information and write it out to text files, and I can then use the standard Unix grep/sed/sort/uniq etc. tools to gather certain statistics. For example, see User:Pfafrich/Blahtex en.wikipedia fixup, which illustrates the use of grep to pull out all the mathematical equations. --Salix alba (talk) 06:50, 31 May 2006 (UTC)
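This grep-style approach works because most wikitext markup is easy to match with regular expressions once you have the page text. As an illustration, pulling equations out of page text might look like this (a deliberately simple pattern; real <math> tags can carry attributes, which this ignores):

```python
import re

# Non-greedy match on the contents of <math>...</math> tags.
MATH_RE = re.compile(r"<math>(.*?)</math>", re.DOTALL)

def extract_math(wikitext: str):
    """Pull out the contents of every <math>...</math> tag."""
    return MATH_RE.findall(wikitext)

sample = "Euler's identity is <math>e^{i\\pi} + 1 = 0</math>, see [[Euler's formula]]."
print(extract_math(sample))  # ['e^{i\\pi} + 1 = 0']
```

The same pattern-per-line idea extends to [[wikilinks]], {{templates}}, categories, and so on, which is all the statistics scripts described above really need.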
Subsets of Image Dumps
Is there a good, approved way to download a relatively small number of images (say ~4000)? The tar download is kinda slow for this purpose. It'd be faster to web crawl (even with the 1-second delay) than to deal with the tar. But that seems disfavoured.
- --BC Holmes 17:07, 12 June 2006 (EST)
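One workaround, if the full tar is available locally or on a nearby machine: tar is a simple sequential format, so a short script can stream through it and extract only the files you want without unpacking the other 75 GB. A sketch (archive and file names are hypothetical):

```python
import tarfile

def extract_subset(tar_path: str, wanted, dest: str = "images"):
    """Stream through a tar archive, extracting only the named members."""
    remaining = set(wanted)
    with tarfile.open(tar_path, mode="r|") as tf:  # "r|" = sequential stream
        for member in tf:
            if member.name in remaining:
                tf.extract(member, path=dest)
                remaining.discard(member.name)
                if not remaining:
                    break  # stop reading once everything is found
    return remaining  # any names not found in the archive

# e.g. extract_subset("upload.tar", ["en/a/ab/Example.jpg"])
```

Because the archive is read sequentially and the loop stops early, this never needs to seek through or unpack the whole file.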
Simple steps for Wikipedia while you travel
I travel a lot and sometimes like to look stuff up on my laptop. I see that I can do this, but like a lot of people I am clueless about computers, and I was hoping that someone would make a section with easy-to-follow steps to make this work; I figure I'm not the only one who would like this. I see the size of the wiki with pictures is 90 GB - wow, a little too big hehe - but how hard would it be to just get the pictures for, say, featured articles and the rest text only? Britannica and Encarta are both 3-4 gigs. Text only would be fine. Ansolin 05:59, 14 June 2006 (UTC)
- If you're running GNU/Linux you can give wik2dict a try. DICT files are compressed hypertext. wik2dict needs an upgrade though. I will work on that very soon. I have also written MaemoDict, to have Wikipedia on my Nokia 770, though I don't have any space on there to put a big Wikipedia on it (so one of the things I will add to wik2dict is to just convert some categories, instead of the whole thing).
There might also be some software to support dict stuff in Windows. Guaka 20:28, 15 June 2006 (UTC)
I did not follow most of that. I use Windows, btw. I thought text only was only about a gig, so it should fit, but as I said I have no clue about computers :). Ansolin 04:41, 16 June 2006 (UTC)
- See, the problem I have is that I've downloaded the wiki from the downloads page and decompressed it, but it's just one gargantuan XML file that you can't read. How would I now, offline, open/dump that information into a wiki of my own? Do I need MediaWiki? This is all so confusing, and there seems to be no proper explanation anywhere! → J@red 12:36, 16 June 2006 (UTC)
Downloading (and uploading) blocked IP addys
Hi, I recently created a wiki for sharing code snippets and I've started to get some spam bots adding nonsense or spam to the articles. Rather than block them one at a time, I was thinking it'd be easier to preemptively use Wikipedia's IP block list and upload it into my wiki.
So, my question is, how do I download the block list -- and -- how do I upload it to my wiki? Thanks!
How to import image dumps into a local Wiki installation?
I've imported the text dump file into my local wiki using importDump.php. And I've downloaded the huge tar image dump and put the extracted files (directory structure intact) under the upload path specified in LocalSettings.php. But my wiki installation doesn't seem to recognize them. What do I need to do? I think it has something to do with RebuildLinks or RebuildImages or something else, but I want to know the specific procedure. Thanks! ----Jean Xu 12:26, 2 August 2006 (UTC)
Using XML dumps directly
People seem to be having lots of issues trying to get this working correctly without any detailed instructions, myself included. As an alternative, there is WikiFilter, which is a far simpler tool to set up and get working. I am not affiliated with WikiFilter; I have just used the app and found it quite successful. WizardFusion 08:59, 20 September 2006 (UTC)
Possibility of using BitTorrent to free up bandwidth
Is it possible for someone to create an official torrent file from the Wikipedia .bz2 dumps? This would encourage leeching to some degree, but it would make it easier for people to download the large files for offline use (I would like Wikipedia as a maths/science reference for offline use when travelling). 129.78.64.106 04:54, 22 August 2006 (UTC)
Downloading Wikipedia, step-by-step instructions
Can someone replace this paragraph with step-by-step instructions on how to download Wikipedia into a usable format for use offline?
- What software programs do you need? (I've read that MySQL is no longer being supported; I checked Apache's website but have no idea which of their many programs is needed to make a usable offline Wikipedia.)
- What are the web addresses of those programs?
- What version of those programs do you need?
- Is a dump what you need to download?
- Can someone write a brief layman's translation of the different files at http://download.wikimedia.org/enwiki/latest/ (what exactly each file contains, or what you need to download if you are trying to do a certain thing, e.g. you want the latest version of the articles, text only)? You can get a general idea from the title, but it's still not very clear.
- What program do you need to download the files on that page?
- If you get Wikipedia downloaded and set up, how do you update it, or replace a single article with a newer version?
The instructions on m:Help:Running MediaWiki on Debian GNU/Linux tell you how to install a fresh mediawiki installation. From here, you will still need to get the xml dump, and run mwdumper.
Why is this all so complicated?
- Not everybody who wants to have access to Wikipedia offline is a computer systems expert or a computer programmer.
- What is with this bzip crap? Why not use WinZip? I've spent so long trying to figure out this bzip thing that I could have downloaded 10 WinZip files already.
- Even if i get this to work, can I get a html end product that I browse offline as if I was online?
— 195.46.228.21 12 September 2006.
- Granted.
- The best reason not to use WinZip is probably that WinZip is not free. On the other hand, bzip2 is. It is also a better compressor than WinZip 8, although WinZip 10 does compress more. [4] And because the source code of bzip2 is freely available, if someone has a computer that cannot decompress bzip2 files, it is much easier to rectify this situation than in the proprietary case.
- It sounds like you want a Static HTML tree dump for mirroring.
- Remember: just the articles are about 1.5GB compressed (so maybe around 7-8GB uncompressed?). Current versions of articles, talk pages, user pages, and such are about 2.5GB compressed (so I'd estimate about 13GB uncompressed); Images (e.g. pictures) will run you another 75.5GB. If you also want the old revisions (and who wouldn't?) that's about 45.9GB (I guess around 238GB uncompressed).
— The Storm Surfer 04:08, 7 October 2006 (UTC)
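On the "bzip crap" point: besides graphical tools that handle the format (7-Zip on Windows opens .bz2 files), the archive is also easy to read directly from a script, without ever materializing the multi-gigabyte XML on disk. A sketch using only Python's standard library (the dump file name is hypothetical):

```python
import bz2

def head(path: str, n: int = 5):
    """Return the first n lines of a bzip2-compressed file,
    decompressing on the fly instead of unpacking it all first."""
    lines = []
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines

# e.g. head("pages_articles.xml.bz2") would show the opening
# <mediawiki ...> header of the dump without a full decompression.
```

The same `bz2.open` handle can be fed line by line (or as a stream) into whatever parser or converter you use next.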
Possible Solution
I have written a simple page for getting the basics working on Ubuntu Linux (server edition). This worked for me, but there are issues with special pages and templates not showing. If anyone can help with this it would be great. It's located at http://meta.wikimedia.org/wiki/Help:Running_MediaWiki_on_Ubuntu. WizardFusion 23:31, 1 October 2006 (UTC)
old images
Hello, why are the image dumps from last year? Would it be possible to get the small thumbnails as an extra file? It should be possible to separate the fair-use images from the dump. de:Benutzer:Kolossos 09:14, 12 October 2006 (UTC)