Wikipedia:Computer help desk/ParseMediaWikiDump

From Wikipedia, the free encyclopedia

Parse::MediaWikiDump is a Perl module created by Triddle that makes accessing the information in a MediaWiki dump file easy.

Download

The latest version of Parse::MediaWikiDump is available at http://www.cpan.org/modules/by-authors/id/T/TR/TRIDDLE/.

Examples

Find uncategorized articles in the main namespace

  #!/usr/bin/perl -w
  
  use strict;
  use Parse::MediaWikiDump;
    
  my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
  my $pages = Parse::MediaWikiDump::Pages->new($file);
  my $page;
    
  while(defined($page = $pages->page)) {
    # main namespace only
    next unless $page->namespace eq '';

    # categories() returns undef when the page has no category links
    print $page->title, "\n" unless defined($page->categories);
  }

Find double redirects in the main namespace

This program does not follow the proper case-sensitivity rules for matching article titles; see the POD that ships with the module for a much more complete version of this program.

  #!/usr/bin/perl -w

  use strict;
  use Parse::MediaWikiDump;

  my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
  my $pages = Parse::MediaWikiDump::Pages->new($file);
  my $page;
  my %redirs;

  while(defined($page = $pages->page)) {
    next unless $page->namespace eq '';
    next unless defined($page->redirect);

    my $title = $page->title;

    $redirs{$title} = $page->redirect;
  }

  foreach my $key (keys(%redirs)) {
    my $redirect = $redirs{$key};
    if (defined($redirs{$redirect})) {
      print "$key\n";
    }
  }
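
The case-sensitivity caveat above can be addressed with first-letter normalization: in most namespaces MediaWiki treats only the first character of a title as case-insensitive (the "first-letter" case setting in the dump's siteinfo). A minimal sketch of that normalization, assuming the default "first-letter" configuration (the sub name is illustrative):

```perl
#!/usr/bin/perl

use strict;
use warnings;

# MediaWiki titles in "first-letter" namespaces differ only in the case
# of their first character, so redirect targets should be normalized
# before being used as hash keys.
sub normalize_title {
    my ($title) = @_;
    return ucfirst($title);
}

print normalize_title('foo Bar'), "\n";   # Foo Bar
print normalize_title('Foo Bar'), "\n";   # Foo Bar
```

In the script above, one would store `$redirs{normalize_title($title)}` and look up `$redirs{normalize_title($redirect)}` so that redirects differing only in first-letter case still match.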

Import only a certain category of pages

This skeleton selects pages in a given category for insertion into a MySQL database; the INSERT statement itself is left as a placeholder.

#!/usr/bin/perl

use strict;
use warnings;
use Parse::MediaWikiDump;
use DBI;

my $server   = "localhost";
my $name     = "dbname";
my $user     = "admin";
my $password = "pass";

my $dsn = "DBI:mysql:database=$name;host=$server;";
my $dbh = DBI->connect($dsn, $user, $password)
    or die "could not connect to database: $DBI::errstr";

my $source = 'pages_articles.xml';

# new() only reads the dump header; pages are parsed lazily in the loop below
my $pages = Parse::MediaWikiDump::Pages->new($source);

while (defined(my $page = $pages->page)) {
    my $c = $page->categories;
    next unless defined $c;

    # note: this pattern also matches categories that merely contain the
    # string, such as "History of Mathematics"
    if (grep /Mathematics/, @$c) {
        my $id    = $page->id;
        my $title = $page->title;
        my $text  = $page->text;

        #$dbh->do("insert ...");

        print "title '$title' id $id was inserted.\n";
    }
}
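
A plain `grep /Mathematics/` matches any category that merely contains the string, such as "History of mathematics". When only exact membership should count, the test can be factored into a small helper; a sketch (the sub name is illustrative):

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Exact category matching: returns the number of categories in the
# referenced array that are string-equal to $wanted (0 if none, or if
# the page has no categories at all).
sub in_category {
    my ($categories, $wanted) = @_;   # $categories is an array reference
    return 0 unless defined $categories;
    return scalar grep { $_ eq $wanted } @$categories;
}

my @cats = ('Mathematics', 'History of mathematics');
print in_category(\@cats, 'Mathematics'), "\n";     # 1
print in_category(\@cats, 'Number theory'), "\n";   # 0
```

In the loop above, `if (in_category($c, 'Mathematics'))` would then skip pages that are only in related categories.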

Extract articles linked to important Wikis but not to a specific one

The script checks whether an article contains interlanguage links to :de, :es, :it, :ja, and :nl but not to :fr. This is useful for linking "popular" articles to a specific wiki, and it can also hint at which articles should be translated first.

#!/usr/bin/perl -w

# Code : Dake
use strict;
use Parse::MediaWikiDump;
use utf8;
    
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
    
binmode STDOUT, ":utf8";

while(defined($page = $pages->page)) {
  # main namespace only
  next unless $page->namespace eq '';

  # text() returns a reference to the article wikitext
  my $text = $page->text;
  if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
      ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
      ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i)) {
     print $page->title, "\n";
  }
}
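
The chain of regex tests in the script above can be made data-driven, so that the language list is easy to change. A sketch with the same matching semantics (the sub name and sample text are illustrative):

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Returns true when the wikitext contains an interlanguage link for
# every code in @$required but none for $missing.
sub wants_translation {
    my ($text, $required, $missing) = @_;
    for my $lang (@$required) {
        return 0 unless $text =~ /\[\[\Q$lang\E:/i;
    }
    return 0 if $text =~ /\[\[\Q$missing\E:/i;
    return 1;
}

my $sample = "Some text [[de:X]] [[es:X]] [[it:X]] [[ja:X]] [[nl:X]]";
print wants_translation($sample, [qw(de es it ja nl)], 'fr'), "\n";   # 1
```

The main loop would then read `print $page->title, "\n" if wants_translation($$text, [qw(de es it ja nl)], 'fr');`.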