User:Rich Farmbrough/Disambig scripts

From Wikipedia, the free encyclopedia

Scripting

Download the database.

Note that these scripts work with the SQL database dumps, not with the XML dumps. Create this Perl script, dab.pl:



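#!/usr/bin/perl
# dab.pl: print the title of every main-namespace page tagged as a disambiguation page.
use strict;
use warnings;

while (<>) {
        # Each dump line is one INSERT statement; split it into one chunk per row.
        my @lines = split /INSERT INTO \`cur\` VALUES \(|\d\'\),\(|\d\'\);\n/;
        foreach my $line (@lines) {
                # Capture namespace, title and text. Skip chunks that do not match,
                # because a failed match does not set $1..$3.
                next unless $line =~ m/\d+,(\d+),'(.+?[^\\])','(.+?[^\\])','/;
                my ($space, $name, $text) = ($1, $2, $3);
                next unless $space == 0;    # main (article) namespace only
                # Accept both {{disambig}} and {{msg:disambig}}.
                if ($text =~ m/\{\{(?:msg:)?disambig\}\}/) {
                        print $name, "\n";
                }
        }
}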


Run it with:

perl dab.pl ddddddddd_cur_table.sql > dab.txt

where ddddddddd is the appropriate date. (It takes a few minutes; I didn't time it.)
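
To see what the two regular expressions in dab.pl do, here is a minimal sketch that runs them on a single made-up dump line. The sample row values, the variable $dump_line and the helper parse_rows are invented for illustration; the sketch assumes, as the script's row pattern does, that each row starts with a numeric id and a numeric namespace followed by the quoted title and the quoted text, and that each row ends in a quoted field whose last character is a digit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical dump line holding two rows; the column values are invented.
my $dump_line = "INSERT INTO `cur` VALUES "
    . "(12,0,'Mercury','{{disambig}} [[Mercury (planet)]]','new','79959595'),"
    . "(13,1,'Mercury','Talk text','note','79959596');\n";

# Split one INSERT line into rows and return [namespace, title, text] triples.
sub parse_rows {
    my ($line) = @_;
    my @rows;
    foreach my $chunk (split /INSERT INTO `cur` VALUES \(|\d'\),\(|\d'\);\n/, $line) {
        # Same row pattern as dab.pl; chunks that do not match are skipped.
        next unless $chunk =~ m/\d+,(\d+),'(.+?[^\\])','(.+?[^\\])','/;
        push @rows, [$1, $2, $3];
    }
    return @rows;
}

foreach my $row (parse_rows($dump_line)) {
    my ($space, $name, $text) = @$row;
    my $is_dab = ($space == 0 && $text =~ m/\{\{(?:msg:)?disambig\}\}/) ? "yes" : "no";
    print "namespace=$space title=$name disambig=$is_dab\n";
}
```

On this sample the first row (namespace 0, containing {{disambig}}) is reported as a disambiguation page, and the second row (namespace 1) is not.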

Then create countdab.pl:



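#!/usr/bin/perl
# countdab.pl: count links from articles to each disambiguation page listed in dab.txt.
use strict;
use warnings;

my %dab = ();
open(DAB, "dab.txt") or die "Cannot open dab.txt: $!";
while (<DAB>) {
        chomp;
        $dab{$_} = 0;
}
close(DAB);

my $i = 0;
while (<>) {
        my @lines = split /INSERT INTO \`cur\` VALUES \(|\d\'\),\(|\d\'\);\n/;
        foreach my $line (@lines) {
                # Skip chunks that do not match; a failed match does not set $1..$3.
                next unless $line =~ m/\d+,(\d+),'(.+?[^\\])','(.+?[^\\])','/;
                next unless $1 == 0;    # main (article) namespace only
                my $text = $3;
                # Capture the target of every [[target]] or [[target|label]] link.
                my @links = $text =~ /\[\[(.*?)(?:\||\]\])/g;
                foreach my $link (@links) {
                        $dab{$link}++ if exists $dab{$link};
                }
        }
        # Progress indicator: a dot and a running count per input line.
        print STDERR ".", ++$i;
}

$i = 0;
# List the pages in descending order of link count, stopping just after 200 entries.
foreach my $key (sort { $dab{$b} <=> $dab{$a} } keys %dab) {
        print "# [[", $key, "]] ([[Special:Whatlinkshere/", $key, "|links]] to ", $dab{$key}, " articles)\n";
        if ($i++ > 200) { last; }
}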


Run it with something like

perl countdab.pl dddddddddd_cur_table.sql > count.txt

where dddddddddd is the appropriate date. (This takes about twenty minutes.)

You now have your result. It's not perfect, because it also counts links inside nowiki sections, comments, etc., but for a disambiguation league table it's good enough.
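
If the nowiki and comment cases matter, one possible refinement is to strip those spans from the page text before extracting links. The following is only a sketch, not part of the scripts above: the helpers strip_ignored and extract_links and the sample text are hypothetical, and it assumes the markup appears literally in the dump text as <nowiki>...</nowiki> and <!-- ... -->.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove spans whose links should not be counted (assumed literal markup).
sub strip_ignored {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;                 # HTML comments
    $text =~ s/<nowiki>.*?<\/nowiki>//gis;     # nowiki sections
    return $text;
}

# Return link targets, using the same link pattern as countdab.pl.
sub extract_links {
    my ($text) = @_;
    my $clean = strip_ignored($text);
    my @links = $clean =~ /\[\[(.*?)(?:\||\]\])/g;
    return @links;
}

# Invented sample text: only Mercury and Venus are live links.
my $sample = 'See [[Mercury]], <nowiki>[[Apollo]]</nowiki> <!-- [[Mars|old]] --> and [[Venus|the planet]].';
print join(",", extract_links($sample)), "\n";    # prints "Mercury,Venus"
```

In countdab.pl this would amount to running the text through a function like strip_ignored before the link match, at the cost of two extra substitutions per page.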