Wikipedia talk:Database download/archive1


This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.


Wiki2static ... or?

What is currently the best way to create a static version of Wikipedia? Does wiki2static have any problems? I would like to put a national Wikipedia on a CD. Thanks. -Juraj

Wiki2static down - Mirror?

It seems the original page of Wiki2static (http://www.tommasoconforti.com/wiki/) is down. All I get is an ad from some ISP, nothing more. It would be nice if either User:Alfio would put the page back up, or anyone who still has the file would put up a mirror. If it's just a problem with the host, I can provide webspace with a good connection.

- Dario


Wikipedia as XML? / Download one article at time

I want to write software that displays information from Wikipedia. I don't want to make a huge database download; I just want it to have access to a small, up-to-date piece of Wikipedia (an article in wiki source), more like a browser. How can I? --Alexandre Van de Sande 01:09, 2 Aug 2004 (UTC)

Wouldn't it be nice to have a download of Wikipedia based on an XML language? ~cpb 2004-04-26

Maybe Special:Export would help? Angela. 16:51, Aug 12, 2004 (UTC)

To access any article in xml, one at a time, link:

-- Blinklmc
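A minimal sketch of the Special:Export suggestion above, for fetching one article at a time. It assumes the URL pattern http://en.wikipedia.org/wiki/Special:Export/ followed by the title, as quoted further down this page. The function names are made up for illustration.

```python
import urllib.parse
import urllib.request

# Assumed base URL, taken from the Special:Export address quoted on this page.
EXPORT_BASE = "http://en.wikipedia.org/wiki/Special:Export/"


def export_url(title):
    """Build the Special:Export URL for one article title (hypothetical helper)."""
    # Titles use underscores in place of spaces; percent-encode everything else.
    return EXPORT_BASE + urllib.parse.quote(title.replace(" ", "_"))


def fetch_article_xml(title):
    """Download the XML export of a single article and return it as text."""
    with urllib.request.urlopen(export_url(title)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_article_xml("Andorra") should return the XML wrapper with the wiki source inside it, provided the server still answers at that address.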

Experimental Mirror of Wikipedia

I've been attempting to set up a copy of Wikipedia on one of my servers for experimenting and testing. I want to use the real data to experiment with the Wikipedia code and be able to examine the data structure more closely. I have in mind potentially altering the code to use in another project that is something of a "People Data Store". Example: "Quotes" are made by people; people have a biography that relates them to other people, places, and events in time. This could also apply to many other works of people such as "Lyrics", "Books", "Articles", "Film", "Programming code". Lots of possibilities. In many ways it is much like an encyclopedia, just more (for lack of a better way of expressing it) factual and concrete. 8-)

My problem: for a couple of days now I've been attempting to download the data dump of the encyclopedia and history from the download page. Unfortunately, instead of a gzip, tar, or zip file, all I get is the text data dumped to my browser. Is there some other way that I can get the current data and history files? I don't care about the size, but a file is much more useful than a text listing of many MB. Also, is there some method of data replication that is used to keep other copies current? Any help with this would be much appreciated. Thanks (albrown AT chook DOT com or al AT thetinfoilhat DOT com)

Your browser appears to be helpfully un-gzipping the data for you. If this is a problem (ie, you don't want to take up that much hard disk space just for the dump), try a less intelligent program. ;) "wget" is a nice command-line web/ftp file fetcher; I think there's a version compiled for windows. (Google it.) Keep in mind that the SQL dump will be equally effective zipped or unzipped; you have to read it back into the database or write your own program to suck the data out of the SQL commands. --Brion 19:01 Oct 2, 2002 (UTC)

There are times when I actively hate the latest IE. Is ftp an option then? I tried to connect to ftp.wikipedia.com and didn't get very far as "anon".

Get Mozilla!
Try this instead: make a link to the file you want to download by putting it in brackets, e.g. The Internet Movie Database ([http://www.imdb.com The Internet Movie Database]), then right click on the link and "save target as."
IE for me has been particularly contrary and addlebrained; I can only assume that that's what you're using too. Best, --KQ 22:20 Oct 2, 2002 (UTC)

I was able to get the files by using wget. Thanks for all the help. I will remember the trick about making a link in brackets. This is odd, though: it is the first and only time that both IE and Netscape 4.7 (I even tried to download with that brain-dead clunky ......) have unzipped a file into the page. Normally I can click on any download link and get a message asking whether I want to save it or not. Oh well, got the files, and thanks much again for the help. Al Brown 23:21 CST Oct 2, 2002.
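For anyone without wget to hand, the same "save the bytes, don't unpack them" behaviour can be sketched in Python. This is only an illustration: save_raw is a made-up name, and the URL you pass in is whatever dump link you are after. As far as I know, urllib does not decompress a response on its own, so the file should arrive still gzipped.

```python
import shutil
import urllib.request


def save_raw(url, destination, chunk_size=1024 * 1024):
    """Copy the bytes at `url` to `destination` exactly as the server sends them."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # copyfileobj streams in chunks, so the whole dump is never held in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return destination
```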

Daily tarballs of older Non-English Wikipedias

These have not yet been upgraded and are running on UseMod-wiki. The software and data are included together in a single tarball.

<-- this is broken, I think The German, Polish, and Esperanto wiki trees are also --> <-- available for live update via rsync. -->

I removed the above section as it is no longer true and the links are dead. Angela. 14:49, Feb 21, 2004 (UTC)

wiki-table to html-table problems in script wiki2static

Hi, is there an update of wiki2static script which converts the sql-dump to a html file structure? This script has problems with conversion of the wiki table definitions to html. For this reason, articles containing tables aren't readable.

Hi, I haven't worked on the script for a while, so I didn't put the new table syntax into wiki2static. I'll see if I can do it in the following days. Alfio 14:58, 16 Mar 2004 (UTC)

Compressed text?

I have the Spanish version in a SQL database and the old_text entries are a bunch of symbolic nonsense. For instance, the entry for Andorra is "UAN1E÷•¸ƒ×¨*g@€`›Š]Õ…;q[£ÄŽìÑ#³D\O;@Ê"þqÞÿÿ¾J#(}8*4¦RªîŒàÈž1á2BQiø¥Žp+IÍBun=ž²:T6R_ÂaPq:tN ]Dî2A5õJÆ $òD#z\”à:7¢HbÌ€Ðß?=3´nìEWðŒù!a.ZGCŒåÑ /U¸>œËE¢XðÔÃçâºÇÜÎw.µ“7õÕÕbçz³¹yâŒq»„GÒ€íÿ)þtÇd¢³ð0e›…;b<àÀ*ä³ü½ªýa½pæ†6ÏkÊÓ/ ¨)oQþBø"

I don't know if this is related, but when I loaded up the .sql file I got the following error: "ERROR 1064 at line 269: You have an error in your SQL syntax..."

It's compressed text, I would guess. See the section "Format Change" at http://download.wikimedia.org/ Mr. Jones 12:55, 9 Aug 2004 (UTC)

Problems with Python

Hmmm. I'm having some trouble with this using python (specifically, ipython).

import MySQLdb as m
conn=m.connect("localhost", "user", "password")
conn.select_db("wikipedia")
conn.query("SELECT * FROM old WHERE old_user=333")
r=conn.store_result()
c=[0]
while (c[len(c) - 1] != ()):
    c.append(r.fetch_row())

z=c[len(c) - 2][0][3]

import gzip
gzip.zlib.decompress(z)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)

/home/mrjones/<console>

error: Error -3 while decompressing data: incorrect header check

f=file("temp.gz", "w")
f.write(z)
f.close()
del f, z, c, r, conn, m
^D
Do you really want to exit ([y]/n)? y
host:~$gunzip temp.gz

gunzip: temp.gz: not in gzip format
host:~$

Mr. Jones 14:24, 16 Dec 2004 (UTC)

Ah, this seems relevant: Old entries marked with old_flags="gzip" have their old_text compressed with zlib's deflate algorithm, with no header bytes. PHP's gzinflate() will accept this text plainly; in Perl etc set the window size to -MAX_WSIZE to disable the header bytes.

However, it's still not working:

gzip.zlib.decompress(z, gzip.zlib.MAX_WBITS)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)

/home/mrjones/<console>

error: Error -3 while decompressing data: incorrect header check

Mr. Jones 14:30, 16 Dec 2004 (UTC)

meta:Compression looks relevant. Mr. Jones 14:44, 16 Dec 2004 (UTC)
meta:Old_table gave me the clue I needed. -MAX_WSIZE is not an option; it means (0 - MAX_WSIZE). Now working :-D Will come back and clarify docs a bit later. Mr. Jones 14:49, 16 Dec 2004 (UTC)
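For the record, a minimal sketch of the working call in Python. The window size has to be passed as a negative number, which tells zlib to expect raw deflate data with no header bytes. deflate_raw exists only to produce test input in the same headerless form; it is not part of any Wikipedia tooling.

```python
import zlib


def inflate_old_text(blob):
    """Decompress old_text stored as raw deflate data (no zlib or gzip header)."""
    # A negative window size means "no header and no checksum to check".
    return zlib.decompress(blob, -zlib.MAX_WBITS)


def deflate_raw(data):
    """Produce headerless deflate data, as a stand-in for a row from the old table."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()
```

Passing a positive MAX_WBITS, as in the failed attempt above, makes zlib look for a header that is not there, hence the "incorrect header check" error.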


Lost Connection Error using MySQL

I tried importing the en.sql db into MySQL 4.1.2-alpha and got this error:

$ mysql wiki -u root -p < en.sql
Enter password: *******
ERROR 2013 at line 13147: Lost connection to MySQL server during query

Ran it again, got the same error after about 20 GB had been processed.

--

If I remember correctly, these errors occur because the data that you are importing is sent over the connection in packets. Typically, in the server, you configure a limit on the size of these packets to avoid denial-of-service conditions. However, by doing that, you are also limiting the amount of data that can be put in one field, because each field has to be sent in one packet. It could be that one wiki page contains so much data that the server thinks you are trying to cause a denial of service, and therefore kicks you off.

The same thing happens when you are running mozilla's bugzilla; the packet size limits the size of attachments that users can insert into the database.

Tinus 22:19, 10 Aug 2004 (UTC)
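The server limit described above is, I believe, MySQL's max_allowed_packet setting (the name is not given in this thread, so treat that as an assumption). To see how large it would need to be, one can measure the longest line in the dump, on the assumption that each INSERT statement sits on a single line. A rough sketch, with a made-up function name:

```python
def longest_statement_bytes(dump_path):
    """Return the size in bytes of the longest line in an SQL dump file."""
    longest = 0
    with open(dump_path, "rb") as dump:
        # Iterating a binary file yields one line at a time, so memory use
        # is bounded by the longest line rather than by the whole dump.
        for line in dump:
            longest = max(longest, len(line))
    return longest
```

If the result is larger than the server's configured packet limit, that would be consistent with the explanation given above.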

meta?

Is this information on Meta somewhere? It looks as though this page was once at m:Meta:Database download (back when the Meta: namespace there was called Wikipedia: also)... but it's gone now. +sj+ 05:38, 16 Jul 2004 (UTC)

I don't think it was on Meta. Links on Meta to Wikipedia:Database download will lead to this page via a redirect from Database download. Angela. 07:20, 16 Jul 2004 (UTC)

Sample blocked crawler email

For some odd reason the arrow on the link http://en.wikipedia.org/wiki/Wikipedia:Database_download in the page is not showing properly. I think it has something to do with the <i> and the wiki stylesheet, or maybe some extra bug in Internet Explorer... Vbs 09:03, 23 Jul 2004 (UTC)

Joining SQL Download Files

What do I use to join the split SQL dump files?

See the Joining SQL Dump Files thread on wikitech-l. Angela. 20:23, Aug 2, 2004 (UTC)
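In case the thread is hard to find: a later comment on this page describes the join as plain concatenation of the pieces. A minimal sketch of that, assuming the parts are passed in the right order. The function name and the file names in the note below are hypothetical.

```python
import shutil


def join_parts(part_paths, output_path):
    """Concatenate split dump pieces, in the order given, into one file."""
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as piece:
                # Stream each piece so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(piece, out)
    return output_path
```

For example, join_parts(sorted(glob.glob("old_table.part*")), "old_table.sql.bz2") would work if the part names happen to sort in the correct order.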

Titles only download

Is there a possibility to download all article titles, as single compressed file? Pjacobi 17:53, 11 Aug 2004 (UTC)

The article titles of the English Wikipedia are available. Download allentitlesinns0.gz from download.wikimedia.org/archives/en/. It's possible other languages will be added in future. Angela. 00:28, Oct 5, 2004 (UTC)
This is now moved to all_titles_in_ns0.gz which is linked to from download.wikimedia.org/wikipedia/en/. Angela. 04:14, May 2, 2005 (UTC)
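A small sketch for reading the titles file once downloaded. It assumes the file holds one title per line in UTF-8, which I have not verified against the actual file. iter_titles is a made-up name.

```python
import gzip


def iter_titles(path):
    """Yield article titles, one per line, from a gzip-compressed title list."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            title = line.rstrip("\n")
            if title:  # skip blank lines
                yield title
```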

How big is the uncompressed wikipedia?

How big is the uncompressed 20040811_old_table.sql.bz2 from en? Less than 30 GB, I would guess, given that it was reported to be 18 GB fairly recently, and starts out at about 9 GB. What is the compression ratio? It seems that the suggested bzip2.exe for Windows (used with XP Pro) does not work: it is much slower than the unxutils one, which does the same thing, i.e. produces a file of over 30 GB without stopping, only faster. I'll have to see if I can do it under Debian. Mr. Jones 21:56, 13 Aug 2004 (UTC)

OK, so the 18 GB is the size of all database files when compressed. I'll clarify the text. Still, the question remains. Mr. Jones 04:33, 14 Aug 2004 (UTC)

The answer is, for the record, that the size of the decompressed old table for the en database of 8-8-2004 is about 40 GB. Mr. Jones 14:13, 14 Aug 2004 (UTC)
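For anyone who wants the figure without needing 40 GB of free disk, the decompressed size can be counted on the fly. A minimal sketch using Python's bz2 module; the function name is invented for the example.

```python
import bz2


def decompressed_size(path, chunk_size=1024 * 1024):
    """Count the bytes a .bz2 file expands to, without writing them anywhere."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Dividing the result by os.path.getsize(path) gives the compression ratio asked about above.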

Database Dump Compression Format

Is there anywhere I can download the dump files that have been compressed with something other than bz2!? maybe gzip or zip?

Why would you want to do that? (Please sign your posts with ~~~~) Mr. Jones 20:48, 16 Dec 2004 (UTC)

Using Special Export

Is there any way to use http://en.wikipedia.org/wiki/Special:Export/ to return a nearest match or Wikipedia's search results page?

Current size of database?

How much disk space is currently needed for the en.wikipedia.org DB when it's imported? Currently the SQL file is 52 GB, but on import the InnoDB database ends up exceeding 70 GB. Are there also any suggested MySQL settings for handling a DB this large?

  • The .sql file you download contains only the information contained in Wikipedia. When you import this information into a MySQL InnoDB database, a number of extra bits of information are calculated, normally to help the database keep track of the data and access it quickly. This extra data (indexes, page alignment, padding) accounts for the difference you see.
  • MySQL can handle databases of Wikipedia's size (which are, in database terms, quite modest) with the default settings. If running complex or repetitive queries, you may want to adjust the innodb_buffer_pool_size variable in your my.ini file to about two thirds of the physical memory on your PC, for example innodb_buffer_pool_size=640M on a PC with 1 GB of RAM.
  - TB 15:48, 2005 Feb 8 (UTC)

Wikinews database?

If someone gets to this, please include the different language wikinews databases in the DB dumps. -- Ilya 07:51, 23 Dec 2004 (UTC)

bunzip2 in WinXP

Using bunzip for WindowsXP I am unable to unzip the current DB for en. I have had the problem for around three months and was wondering if anyone else has the same issue, or knows a solution. I've tried other programs which handle bunzip files such as WinRAR and I get the same error. I also get an incorrect MD5 sum. The correct one should be "7a70559f2089155f441c322f6c565cc5" and mine is "1d423915d294592237f4450ded3b386b"  :


C:\Documents and Settings\*\Desktop>bunzip2 2004*

bunzip2: Caught a SIGSEGV or SIGBUS whilst decompressing, which probably indicates that the compressed data is corrupted. Input file = 20041023_cur_table.sql.bz2, output file = 20041023_cur_table.sql

It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to *attempt* to recover data from undamaged sections of corrupted files.

bunzip2: Deleting output file 20041023_cur_table.sql, if it exists.


Thanks. - Alterego @ 1:29AM on 10-29-04

Which version of bunzip2.exe are you using? Some versions (e.g. that at http://unxutils.sf.net last time I looked) can't handle files > 2GB. Mr. Jones 13:25, 5 Dec 2004 (UTC)
I am using "Version 0.1p12 29 Aug 1997". Admittedly a bit old lol. Do you know a better one? --Alterego 9:52 12/5/04
Well, that's not going to work, is it? :-) The latest version is 1.0.2. Try the link on the article page (press Ctrl+F and type "bzip2"). Let us know how you get on. Mr. Jones 11:42, 8 Dec 2004 (UTC)
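Since the MD5 sums above do not match, it may be worth re-checking the download itself before blaming the decompressor. A minimal sketch that computes the sum in chunks, so it copes with multi-gigabyte files. The function name is made up.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the result differs from the published sum, the file was probably damaged or truncated in transit, and no version of bunzip2 would be expected to unpack it cleanly.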

Dumps disabled?

Is there some reason that the weekly dumps have not been updated in three weeks? Michael L. Kaufman 16:06, Jan 26, 2005 (UTC)

They are doing a database conversion I believe. It has been going on for quite some time. --Alterego 17:44, Jan 26, 2005 (UTC)
There are new dumps up now from 2005-02-09, but unfortunately they seem to exclude "de" and "en" (and possibly others that I didn't notice). So if you're looking for those languages, it looks like you're still out of luck. -- All the best, Nickj (t) 01:04, 11 Feb 2005 (UTC)
OK, I'm confused ... if you look here, there is a current en dump. It's not listed on the download page though. Does that mean that they should be listed and it's just a mistake that it's not, or does it mean that there's something wrong with that dump and that it has been excluded for a reason? The same thing applies to de. -- All the best, Nickj (t) 01:14, 11 Feb 2005 (UTC)


Are there plans to update downloads.wikimedia.org in 2005?

Hi, I've been wondering why the wiki DB dumps were updated only at the start of 2005 and have not been updated for 3 weeks since. Are they no longer supported? I need just a single (Lithuanian) cur table dump for some statistical analysis of Lithuanian articles, so it would be nice to know if and when there are plans to update. Thanks. Knutux 10:28, 2005 Feb 2 (UTC)

Is there any way to get an older Dump?

I need a copy of any dump from before the current one (1/7/05). Preferably the one from just before it, but I will take whatever I can get. Any help would be appreciated. Thanks, Michael L. Kaufman 05:15, Feb 4, 2005 (UTC)

Have you tried rsync ([rsync://download2.wikimedia.org/dumps/])? I guess it may be possible to retrieve older revisions using it. Knutux 07:58, 2005 Feb 4 (UTC)
I'm not sure about that. The links with dates are actually just symlinks(sp) to download.wikimedia.org/archives<insert wiki>/<insert lang>/cur_table.sql.bz2 or the appropriate filenames. --Alterego 17:03, Feb 4, 2005 (UTC)

Problems importing the compressed data

I'm trying to import the OLD data of the Hebrew Wikipedia (he), but it seems that because of the compression there are some special characters in the text (such as - ' etc.) that confuse the SQL, which returns tons of errors. It seems to me that someone forgot to format the text (adding / before special characters and so on) while exporting. (I imported the data a few months ago on the same system with no problem, and the CUR data imports without any problem.)

  • Has anyone else experienced the same problem?
  • Can anyone check whether it's just my local problem?
  • Any idea how to solve it?

thnx, Costello 19:44, 24 Feb 2005 (UTC).

  I had no problems doing just that. You can contact me directly if you require assistance. --LevMuchnik 16:52, May 20, 2005 (UTC)


imagelinks table

Any chance of adding the imagelinks table to the dump script, too? It's a quite small table, and since links, categorylinks, brokenlinks etc. are getting dumped already, imagelinks is the only one still missing from a complete dump set.

Yes, I know that refreshlinks.php can be used to generate imagelinks - but it takes an *extremely* long time to run and also tends to bomb out on my installation eventually.

I've also tried to rework the obsolete rebuildlinks.php script for imagelinks only, but couldn't get it to work for images included by templates etc. and finally had to give up on this. I need imagelinks to decide which images to include in a notebook installation - more precisely, to drop (or not to mirror) orphaned images which are not referenced by a NS_MAIN or NS_TALK article (a considerable percentage BTW). --DerHund 14:14, 27 Feb 2005 (UTC)

I've added this to the script, so it'll be in the next public dump. I'm expecting to start that within the next few hours. Any other table requests from anyone? Jamesday 13:21, 8 Mar 2005 (UTC)
Dump started. We may have to abort it or part of it if it kills the site excessively - still working on making it less disruptive. It's sharing the image server which is experiencing very high disk load. Expect it to take 12 or more hours to run. Jamesday 07:56, 9 Mar 2005 (UTC)
Suspended while processing de wikipedia because it was hurting the site too much. Will resume after peak time today. Jamesday 14:04, 9 Mar 2005 (UTC)

Thanks much for including the imagelinks table in the DB dump, you've been very helpful. --DerHund 22:32, 15 Mar 2005 (UTC)

Image tarball dumps

Production of new image tarball dumps has been temporarily suspended while we work on preventing the production of them from taking the whole site down. Will probably be back again within a few weeks. Jamesday 13:21, 8 Mar 2005 (UTC)

Obviously the use of "a few weeks" is a bit liberal here. :-) 74.166.95.223 01:46, 9 October 2007 (UTC)

No SQL Query

Whenever I try to import the sql dumps using phpmyadmin, it says No SQL Query

AxyJo 23:59, 17 Mar 2005 (UTC)

I may be wrong, having only used phpmyadmin a few times, but I don't believe you will be successful importing a large dump through it. --Alterego

How would I then import the dumps? 70.49.148.112 21:03, 19 Mar 2005 (UTC)

PhpmyAdmin

How can I import large dumps without phpmyadmin? 69.156.100.44 03:19, 20 Mar 2005 (UTC)

http://dev.mysql.com/doc/mysql/en/mysql.html --Brion 03:27, Mar 20, 2005 (UTC)
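To spell the link out: it points at the mysql command-line client, which reads the dump from standard input, in the same form as the "mysql wiki -u root -p < en.sql" command quoted elsewhere on this page. A sketch of driving it from Python follows. The function names and the default user are placeholders, and it assumes the mysql client is on your PATH.

```python
import subprocess


def mysql_command(database, user="root"):
    """Build the argument list for: mysql -u USER -p DATABASE"""
    return ["mysql", "-u", user, "-p", database]


def import_dump(dump_path, database, user="root"):
    """Feed an SQL dump to the mysql command-line client on standard input."""
    with open(dump_path, "rb") as dump:
        # check=True raises CalledProcessError if mysql exits with an error.
        return subprocess.run(mysql_command(database, user), stdin=dump, check=True)
```

The dump is streamed straight from disk, so its size is limited only by what the server will accept, not by a web upload form.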


Actual size available for download differs from reported

Across 200 wikis

cur_table.sql.bz2:  1531747056 (exactly equal to reported)
       upload.tar: 42964259653
old_table.sql.bz2: 20287081657
                   ___________
total            : 64783088366 bytes

                         61782 actual megabytes
                         50503 reported megabytes
                         _____
                         11279 difference in megabytes

--Alterego 21:32, Mar 20, 2005 (UTC)
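The arithmetic above can be re-checked in a few lines. This assumes a "megabyte" here means 1024 × 1024 bytes, which is the reading that reproduces the 61782 figure.

```python
# Byte counts copied from the table above.
sizes = {
    "cur_table.sql.bz2": 1531747056,
    "upload.tar": 42964259653,
    "old_table.sql.bz2": 20287081657,
}
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes
REPORTED_MB = 50503

total_bytes = sum(sizes.values())
actual_mb = round(total_bytes / BYTES_PER_MB)
difference_mb = actual_mb - REPORTED_MB
```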

The archive link (http://download.wikimedia.org/archives/en/) does not work from http://en.wikipedia.org/wiki/Wikipedia:Database_download. Is it a temporary problem? I'd like access to the previous (March 2005) MySQL dump file.

Archives were moved but the link was not updated. Fixed. It's http://download.wikimedia.org/wikipedia/en/ now. JRM · Talk 02:45, 2005 May 6 (UTC)

Compression Format Change and Size Issues

The concatenation of the dump files (English version of Wikipedia) produced a file of around 32 GB. Apparently the compression format has changed, because bzip2 does not recognize the resulting file as a bz2 file, but gunzip is able to uncompress it (after renaming the compressed file to old_table.sql.gz). Can anyone officially confirm the change in the compression format? Moreover, the uncompressed file has a size of only 34,201,462 KB, which is not much bigger than the compressed file. Is that normal? Nonetheless, the resulting SQL file seems to be readable, since it is possible to import the 'old' table from it. But I don't know whether the file is complete, or whether the old table that I got is missing any records.

Does anyone have similar problems?

--Kevouze 14:44, Apr 25, 2005 (UTC)

As of time of this post, the dump files now have the extension .sql.gz. Does that mean that gzip should now be used to decompress them, and that the section on bzip2 is irrelevant? Tsointsoin 00:47, 30 July 2005 (UTC)
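One way to settle which tool to use, whatever the extension says, is to look at the first bytes of the file: gzip files start with the bytes 1f 8b, and bzip2 files start with the letters "BZh". A minimal sketch, with a made-up function name:

```python
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(path):
    """Guess the compression format of a file from its leading magic bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "unknown"
```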


Database dumps old, image dumps gone

The database dump hasn't been updated in a month now (it says it's done twice a week on the meta-wiki).

Image dumps have been broken for 2 weeks' time now. Image dumps use some strange compression that apparently can only be uncompressed using the right version of the right set of programs on the right platform (in other words, anything but standard platforms).

Is there any effort to standardize these practices, or is it always going to be "Whenever someone gets around to it or feels like doing it?" Is there anything Joe User can do to help out??

-- Q2 03:52, 16 Jun 2005 (UTC)

Downloading Large Files on Linux

I'm running stock Redhat 9 and can't seem to get the 16 gig image dump to download. I've tried both wget and curl but they both appear to have a 2gig file limit.

Has anyone had any success getting these to download on linux? If so what did you use?

Thanks.

TheLoneCoder 02:30, 1 August 2005 (UTC)

Nevermind. I figured it out by using lynx -dump url | tar xv
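The pipe trick presumably works because the archive is unpacked as it arrives and never lands on disk as one huge file. The same idea can be sketched in Python with tarfile's streaming mode. The function name is invented, and this version deliberately flattens directories so that nothing can be written outside the destination folder.

```python
import os
import shutil
import tarfile


def extract_tar_stream(stream, destination):
    """Unpack regular files from a non-seekable tar stream into `destination`."""
    written = []
    # Mode "r|" reads the archive strictly front to back, which is what
    # makes it usable on a pipe or an HTTP response.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Keep only the file name: this drops the directory layout, and
            # files with the same name in different folders overwrite each other.
            name = os.path.basename(member.name)
            with open(os.path.join(destination, name), "wb") as out:
                shutil.copyfileobj(archive.extractfile(member), out)
            written.append(name)
    return written
```

Passing the object returned by urllib.request.urlopen(url) as `stream` gives roughly the same effect as the lynx pipeline.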


Error "Duplicate entry '8-VfD-Q+ç?' for key 2" on import

I've downloaded and ungzipped the latest available dump at the time of this posting (20050623_cur_table.sql). Upon importing it using "mysql -p -u root wikipedia < 20050623_cur_table.sql", I got an error and the import was not completed correctly: ERROR 1062 (23000) at line 1488: Duplicate entry '8-VfD-Q+ç?' for key 2 According to "select count(*) from cur;", I only got 606,328 entries in the table. I was able to work around the problem by editing the text file 20050623_cur_table.sql by hand, and removing the "UNIQUE" in front of the key 'name_title_dup_prevention' (for info HexToolbox is the only text editor I found capable of editing such a large file). The import then gave me 1,811,554 entries. Am I the only one getting this error? Is there any better solution than this workaround? I am on Windows XP using MySQL 14.9 Distrib 5.0.3-beta Tsointsoin 17:09, 2 August 2005 (UTC)
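The same workaround can be applied without opening the file in an editor, by streaming the dump through a small filter. This is only a sketch: the exact spelling of the key definition in the dump is an assumption (the pattern allows optional backticks), and the function name is made up. Lines that start with INSERT are left alone so that article text is never rewritten.

```python
import re

# Assumed spelling of the key definition in the dump; backticks are optional.
UNIQUE_KEY = re.compile(rb"UNIQUE\s+(KEY\s+`?name_title_dup_prevention`?)")


def drop_unique_constraint(source_path, target_path):
    """Copy an SQL dump, turning the UNIQUE key into an ordinary key."""
    replaced = 0
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        for line in source:
            if not line.startswith(b"INSERT"):
                # Only schema lines are candidates; keep the "KEY name" part.
                line, count = UNIQUE_KEY.subn(rb"\1", line)
                replaced += count
            target.write(line)
    return replaced
```

A return value of 0 means the pattern did not match and the copy is identical to the original.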


XML import

There was a new dump of the en.wikipedia cur table this weekend, and I'm itching to get my hands on it. Unfortunately for me and my software, the dump is in the new XML format rather than an SQL query. Is there a tool, perhaps, for importing the XML dump into mysql? — brighterorange (talk) 01:50, 7 September 2005 (UTC)
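Until someone points to a proper tool, a streaming reader is not hard to sketch. The code below assumes the dump uses page, title and text elements, and it ignores the XML namespace because I am not sure which version string the dump carries. Where a page has several revisions, it keeps the text of the last one. Inserting the results into MySQL is left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) pairs from an XML dump, one page at a time."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if local_name(element.tag) != "page":
            continue
        title, text = None, ""
        for node in element.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        yield title, text
        element.clear()  # free the finished page so memory use stays low
```

`source` can be a file name or an open binary file, so a compressed dump can be read through gzip.open or bz2.open without unpacking it first.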


Wikipedia in DICT format

Moved from Wikipedia:Village pump on Thursday, July 10th, 02003.

I wonder if someone thought about making dict files of the Wikipedia. It would be cool to have the Wikipedia wherever I am, independent of an internet connection. (Okay, I still need my laptop for this...) dict seems a good way to achieve this. I'm willing to spend some time hacking a Python script that can create the dict files from the SQL stuff. But I'd like to know if other people are interested in this as well, or maybe there's someone who already did this job... :) --Guaka 22:38 5 Jul 2003 (UTC)

Doesn't Tombraider achieve this? CGS 22:40 5 Jul 2003 (UTC).

Dear Wikipedians! I very much enjoy my TomeRaider Wikipedia edition from December 2003, and I dream of downloading a current version. At that time it was 180 MB with 180000 articles. Now there are 360000. Please !!!! Thousands of PDA friends will be grateful to you! The German Wikipedia for TomeRaider has been available for download since 1 September 2004, at 217 MB and 180000 articles. Vlad

You mean Tomeraider? No... First of all, tomeraider is shareware. And AFAICS it is totally not meant to convert the wikipedia into the dict format. Guaka 02:37 6 Jul 2003 (UTC)
Ha ha :) I know it's not meant to convert files to dict format, but it does what you want - view files on the go without a net connection. CGS 20:28 6 Jul 2003 (UTC)
Another thing is... Tomeraider is non-free software. This is already enough reason not to use it. But even if I wanted to, I couldn't because I run GNU/Linux. Guaka 16:06 7 Jul 2003 (UTC)
If it's the right tool for the job, swallow your pride and run it through Wine. CGS 22:15 7 Jul 2003 (UTC).
I guess you paid for the PDA hardware. So why is $20 for good software a no-go? I chose TomeRaider because it was the best option at the time (and it may still be, not sure). Some people write software for a living, and if they are good at it, I hope they continue doing so. Just because they earn some money they are not necessarily a second Bill. Not that I wouldn't prefer GNU software which is equally cross-platform, fast and economical with PDA space; it just doesn't seem a matter of higher principle to me. Erik Zachte 22:54, 18 Mar 2004 (UTC)

Erm, excuse me if I'm missing something, but wouldn't it be silly to view Wikipedia on non-free software after we go through so much trouble to make sure that the content is under the GFDL? If the content is free but the medium is not, then the company that produces it controls the content, albeit in an indirect fashion. The company could go out of business and render Tomeraider files useless, etc. At any rate, I would be interested in seeing a GPL'ed Python script that could accomplish this task, especially since I'm a beginning programmer and I'm interested in learning Python. And I'm a beginning Linux user who doesn't have a clue how to use Wine, fix problems with a program running in Wine, or anything particularly complex at all. --Nelson 23:41 8 Jul 2003 (UTC)

I fully agree with that Nelson. We just need to have a name now, so that we have a page for it. Or maybe this project would better fit on the Meta Wikipedia? Guaka 00:10 10 Jul 2003 (UTC)

Try the following Perl script for generating the Dict database. Change the DBI->connect to have the correct values for username and password instead of dbuser and dbpass.

#!/usr/bin/perl -w

use strict;
use DBI();

sub article2dict {
  my ($title, $text) = @_;

  $title =~ s/_/ /g;
  $text =~ s/\r//g;
  $text =~ s/^/  /mg;

  print "$title\n";
  print $text;
  print "\n\n";
}

# Connect to the database.
my $dbh = DBI->connect("DBI:mysql:database=wikipedia;host=localhost",
                       "dbuser", "dbpass",
                       {'RaiseError' => 1});

# Now retrieve data from the table.
my $sth = $dbh->prepare("SELECT cur_title, cur_text FROM cur " .
                        "WHERE cur_namespace = 0 ORDER BY cur_title");
$sth->execute();
while (my $ref = $sth->fetchrow_hashref()) {
  article2dict($ref->{'cur_title'}, $ref->{'cur_text'});
}
$sth->finish();

# Disconnect from the database.
$dbh->disconnect();

wik2dict.py

I finally wrote something: wik2dict.py. It tries to create reasonably laid-out dict articles. It can also automatically fetch the database dumps. There are some requirements though. And currently it is only version 0.2. So beware.

I would appreciate it if someone (possibly someone at Wikimedia?) could run the script regularly and make the dict files available for everyone to download. Too bad they can't be included in Debian though ("GFDL is non-free"). However, the script itself could probably be included in contrib :) G-u-a-k-@ 17:50, 27 Jul 2004 (UTC)


Moved from Wikipedia:Village pump:


CVSup/Rsync, and why it would be useful

For those who would like to keep their local copy in sync, I would suggest setting up a CVSup server ( http://www.cvsup.org/ ). The snapshot can be made a number of times per day (say 4 times a day) and people can very efficiently synchronize to the latest version of Wikipedia. This is a lot faster and saves a lot of bandwidth compared to downloading a complete tarball every time. The server does not even have to run on the Wikipedia server itself, but that seems the logical choice. CVSup is very efficient. If the Wikipedia dumps can be tagged with a version number similar to RCS, the syncing will probably be blazingly fast. -- Tim Hemel

I don't think that would work very well with the dumps. The bzip2-compressed versions are not going to be cleanly diffable, and if I leave them as text (~380 megabytes for English current revisions, a few gigabytes for old revisions; I'm reluctant to have them sitting around uncompressed), they're still not going to work that well in CVS if I understand its storage system correctly. Each line of the dump is an SQL INSERT statement for about 500 pages, and the slightest change to any of them (including cache invalidation timestamps) would cause the whole line to be sucked out and replaced. --Brion 18:33 25 May 2003 (UTC)
I'm not sure how applicable cvsup would be, but I think rsync is worth considering. Rsync doesn't use diff for computing deltas, so I don't think the "long lines with small changes" problem applies. As for rsyncing compressed files, there are techniques to do that, such as http://svana.org/kleptog/rgzip.html or http://lists.samba.org/archive/rsync/2002-October/004035.html (merely two results from a brief Google search, I'm sure more research on the topic would be fruitful). If people feel this would be worth pursuing, let me know. Neilc 11:30, 7 Aug 2004 (UTC)
gzip does have an --rsyncable option (see http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff and gzip --help). bzip2 doesn't seem to. See http://www.debianplanet.org/node.php?id=524 for a discussion of pros and cons (for w:Debian). I don't know how relevant the server load problem is for d.wp .

See also http://lists.debian.org/debian-devel/2001/10/msg02187.html About efficiency in part (reportedly): http://rsync.samba.org/tech_report/

More later.

Mr. Jones 05:07, 14 Aug 2004 (UTC)


Offsite backup of dumps and mailing lists

I have some questions regarding downloading the database dumps. On the page it says last dump made July 13. Does that mean what I think it means (i.e. if I download the English and non-English tarballs I only have revisions up to the 13th?). Also, as I understand it, I would only have to download the cur tarballs from here on in (if I saved the old ones), is this correct? I figure having an extra backup of the database can't hurt...especially after last night :). Addendum: should I also download the mailing list archives (from what I gather, they're separate from the dumps)? Geez, another question: is it safe to assume images are not included with the dumps? -- Notheruser 15:42 28 Jul 2003 (UTC)

Ok, I think I've found most of the answers to my above questions; I'll list them here in case anyone else was curious. The database hadn't been backed up since July 13 at the time, but, currently, it is now updated until August 1. You have to download the cur and old files to completely backup the English Wikipedia (don't forget about otherlanguages.tar for a full backup). The mailing lists are archived offsite, so they seem safe and images are currently not backed up (about 1GB worth of files). -- Notheruser 18:53, 2 Aug 2003 (UTC)

If one downloads the old database for English and then imports it into a MySQL database, one finds that it is larger than the 4.1 GB limit most operating systems put on the table size. Of course, I can go in and edit the SQL myself, but it would be better if this was done at the source. Would it be possible to have the dump write out more than one table, each less than 4.1 GB in size? -- RayKiddy 20:01, 13 Sep 2003 (UTC)

If your OS still has a 4 GB file limit, you really need a new OS. :) Multiple tables doesn't make any sense, as it wouldn't be usable. I'd recommend (well, I'd really recommend getting yourself a modern Linux or FreeBSD or something) creating the table as InnoDB and making sure your configuration is set up to use <4GB files for the innodb space (as it can use multiple files). --Brion

Incremental wikipedia updates?

from village pump

Once the full Wikipedia is downloaded, can smaller periodic updates covering new stuff and changes be obtained and used to synch the local? --Ted Clayton 04:26, 13 Sep 2003 (UTC)

No, you can't. I've been thinking the same thing myself. I think we need to:
  • Allow incremental updates for all types of download
  • Allow bulk image downloads
  • Package a stripped-down version of the old table in with the cur dumps, where the revision history (users, times, comments etc.) is included, but the old text itself is not
  • Develop a method of compressing the old table so that the similarity between adjacent revisions can be used to full advantage
-- Tim Starling 04:38, Sep 13, 2003 (UTC)
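The last point in the list above is easy to demonstrate on a small scale. The sketch below compares compressing each revision on its own with compressing adjacent revisions as one block, using zlib purely as a stand-in (no particular scheme is proposed in this thread). Note that zlib only looks back 32 KB, so this toy version helps only with short articles.

```python
import zlib


def separate_size(revisions):
    """Total compressed size when every revision is compressed on its own."""
    return sum(len(zlib.compress(text, 9)) for text in revisions)


def combined_size(revisions):
    """Compressed size when adjacent revisions are compressed as one block."""
    # Later revisions mostly repeat earlier ones, so the compressor can
    # encode them as back-references instead of storing the text again.
    return len(zlib.compress(b"".join(revisions), 9))
```

With a handful of near-identical revisions, the combined size should come out at a fraction of the separate size.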

Would it be easier to have incremental updates on something like a subscription basis? The server packages dailies or weeklies and shoots them out to everyone on the list? During off hours, mass-mail fashion?

Can you suggest sources or search-terms for table manipulations treatments, as background for stripping and compressing? --Ted Clayton 03:14, 14 Sep 2003 (UTC)

I'm going to continue this on wikitech-l, because it's very much on-topic there. See Wikipedia:Mailing lists for more information. -- Tim Starling 12:48, Sep 14, 2003 (UTC)
Also Wikitech-l thread on incremental backup
Wouldn't it be a great idea to provide split database dumps, one package with only article, and one package with articles, talk, users and all. This would reduce the spreading of Wikipedia userpages to forks. — Sverdrup 08:24, 18 Mar 2004 (UTC)

BitTorrent

One idea is to use a distributed downloading system. In such a system, multiple computers running the client help each other download faster. I recommend BitTorrent, an open-source distributed downloading system. However, to be most effective, BitTorrent should be used on large files that are frequently downloaded. --Ixfd64 06:25, 2004 Aug 15 (UTC)

  • Are the dumps available as torrents? That would be both cool and beneficial, I think. Seriously consider.