Wikipedia:Scripts/mwlink

This Ruby program has two modes: it can run as a daemon or as a text processor. Daemon mode is preferred, since it's more efficient.

In text-scanning mode, it interprets its command line (or stdin, if no text is given on the command line) as text possibly containing [[wikilinks]]. It preserves the original text and adds a plain-text hyperlink after each link (the http: address enclosed in <> angle brackets).

In daemon mode, it receives HTTP requests like http://localhost:4242/mwlink?page=wiki-page-name and redirects to the appropriate Wikimedia page. It's convenient for scripts to just use that URL rather than constructing one themselves--all they have to do is URL-escape the text between [[ and ]].
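
A calling script only needs to URL-escape the link text and append it to the daemon's address. The following is a minimal Ruby sketch of that step, assuming the daemon listens on localhost port 4242 as in the example URL above. The helper name mwlink_url is made up for illustration, and URI.encode_www_form_component stands in for whatever escaping routine the caller prefers.

```ruby
require 'uri'

# Hypothetical helper, not part of mwlink: build the daemon URL for the text
# found between [[ and ]]. Host and port are assumptions taken from the
# example URL above.
def mwlink_url(link_text, host = 'localhost', port = 4242)
  "http://#{host}:#{port}/mwlink?page=#{URI.encode_www_form_component(link_text)}"
end

puts mwlink_url('Ashland (disambiguation)')
# prints http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
```

The daemon does the rest (namespace, language and wiki handling), so the caller never has to know how the target URL is laid out.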

   #!/usr/bin/ruby

   # This script is dual-licensed under the GPL version 2 or any later
   # version, at your option. See http://www.gnu.org/licenses/gpl.txt for more
   # details.

   =begin

   = NAME

   mwlink - Linkify mediawiki-style wikilinks in plain text

   = SYNOPSIS

      mwlink [options] [text-to-wikilink]
         --daemon[=port]     Run as HTTP daemon
         --encoding          Default character set encoding (utf-8)
         --default-wiki      Default wiki (wikipedia)
         --default-language  Default language (en)

   = DESCRIPTION

   In text-scanning mode (without the --daemon argument), the mwlink program scans
   its arguments (or its standard input, in the event of no arguments) for
   wikilinks of the form [[link]]. It expands such links into URLs and inserts
   them into the original text after the [[link]] in angle brackets ((({<})) and
   (({>}))). Options are provided for specifying a default wiki (the wiki to link
   to if no qualifier is given in the link) and a default language (the language
   to assume if no qualifier is given) as well as the character set encoding in
   use. The built-in defaults are ((*wikipedia*)), ((*en*)) and ((*utf-8*)),
   respectively.

   In daemon mode (now preferred), it receives HTTP requests of the form
   "http://.../mwlink?page=((*wikipedia page*))" (the ((*wikipedia page*)) name is
   what would appear within a [[wikilink]]). URL-escaping is required but no
   other processing, making it convenient to use from scripts.

   == Initialization File

   The names of namespaces vary between wikis (especially due to language).
   For example, "User:" in English is "Benutzer:" in German. You can specify
   lists of namespaces to use for particular languages in an initialization
   file (({~/.mwlinkrc})). Each entry is simply a line with the language, a
   colon, and a space-separated list of namespaces in that language. When
   interpreting links for that language (either because
   ((*--default-language*)) was specified or because there is a language
   qualifier in the link), mwlink will recognize those names as namespaces.
   All the namespaces must appear on one line--line continuation is not
   supported.

   Lines introduced with (({#})) (pound sign) are comments and are ignored,
   along with blank lines.

   Here is an example configuration containing (only) some namespaces from the
   German Wikipedia. ((*Note*)): To be kind to the wiki where this script is
   uploaded, I have broken the line, but it ((*must not be broken*)) in the
   actual file if it is to work with mwlink.

      de: Spezial Spezial_diskussion Diskussion Benutzer Benutzer_diskussion
      Bild Bild_diskussion Einordnung Einordnung_diskussion Wikipedia
      Wikipedia_talk WP Hilf Hilf_diskussion

   = WARNINGS

   * The program (like mediawiki) assumes links are not broken across line
     boundaries.
   * The mechanism for providing an alternate list of namespaces only works
     per-language; other wikis could have different namespaces, too.
   * The list of wikis and their abbreviations is doubtless incomplete.
   * The initialization file mechanism is not that useful for a shared daemon.
   * In command-line mode, it's very difficult to process ASCII em-dashes (--)
     correctly and still honor command-line options. mwlink gets it wrong, and
     that's one reason daemon mode is preferred.

   = AUTHOR

   Demi @ Wikipedia - http://en.wikipedia.org/wiki/User:Demi

   =end

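   require 'cgi'
   require 'iconv'       # removed from the standard library in Ruby 2.0 (available as a gem)
   require 'getoptlong'
   require 'webrick'     # a separate gem since Ruby 3.0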
   include WEBrick

   $opt = {
      'default-wiki' => 'wikipedia',
      'default-language' => 'en',
      'encoding' => 'utf-8'
   }

   class String

      def initcap()
         new = self.dup
         # Okay, I consider it dumb that a string subscripted produces an
         # integer --Demi
         new[0] = new[0].chr.upcase
         return new
      end

      def initcap!()
         self[0] = self[0].chr.upcase
         return self
      end

   end

   class Canon

      def initialize()
         @ns = { }
         @ns_array = %w(Media Special Talk User User_talk Project Project_talk
            Image Image_talk MediaWiki MediaWiki_talk Template Template_talk Help
            Help_talk Category Category_talk Wikipedia Wikipedia_talk WP)
         @ns['default'] = { }
         @ns_array.each { |nspc| @ns['default'][nspc] = nspc }

         if File::readable?(ENV['HOME'] + '/.mwlinkrc')
            IO::foreach(ENV['HOME'] + '/.mwlinkrc') { |line|
               next if line =~ /^\s*\#/
               next if line =~ /^\s*$/
               line.chomp!
               if m = line.match(/^(\w+)\:(.*)$/)
                  lang    = m[1]
                  nslist  = m[2].split
                  @ns[lang] = { }
                  nslist.each { |nspc| @ns[lang][nspc] = nspc }
               end
            }
         end

         @wiki = {
            'Wiktionary' => 'wiktionary',
            'Wikt' => 'wiktionary',
            'W' => 'wikipedia',
            'M' => 'meta',
            'N' => 'news',
            'Q' => 'quote',
            'B' => 'books',
            'Meta' => 'meta',
            'Wikibooks' => 'books',
            'Commons' => 'commons',
            'Wikisource' => 'source'
         }

         @wikispec = {
            'wikipedia' => { 'domain' => 'wikipedia.org', 'lang' => 1 },
            'wiktionary' => { 'domain' => 'wiktionary.org', 'lang' => 1 },
            'meta' => { 'domain' => 'meta.wikimedia.org', 'lang' => 0 },
            'books' => { 'domain' => 'wikibooks.org', 'lang' => 1 },
            'commons' => { 'domain' => 'commons.wikimedia.org', 'lang' => 0 },
            'source' => { 'domain' => 'sources.wikimedia.org', 'lang' => 0 },
            'news' => { 'domain' => 'wikinews.org', 'lang' => 1 },
            # Needed by the 'Q' abbreviation above, which otherwise falls
            # through to quote.wikimedia.org
            'quote' => { 'domain' => 'wikiquote.org', 'lang' => 1 },
         }

         @cs = Iconv.new("iso-8859-1", $opt['encoding'])

      end

      #TODO The % part of the # section of the URL should become a dot.

      def urlencode(s)
         CGI::escape(s).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')
      end

      def canonword(word)
         s = word.strip.squeeze(' ').tr(' ', '_').initcap

         begin
            @cs.iconv(s)
         rescue Iconv::IllegalSequence
            s
         end
      end

      def parselink(link)
         l = {
            'namespace' => '',
            'language' => $opt['default-language'],
            'wiki' => $opt['default-wiki'],
            'title' => ''
         }
         terms = link.split(':')
         l['title'] = canonword(terms.pop)
         terms.each { |term|
            next if term.nil? or term.empty?

            t = canonword(term)

            # Use the namespace list for the current language, if one is known
            ns = @ns[l['language']] || @ns['default']

            if ns.key?(t)
               l['namespace'] = ns[t]
            elsif @wiki.key?(t)
               l['wiki'] = @wiki[t]
            else
               l['language'] = t.downcase
            end
         }

         l
      end

      def canonicalize(link)
         linkdesc = parselink(link.sub(/\|.*$/, ''))

         if @wikispec.key?(linkdesc['wiki'])
            ws = @wikispec[linkdesc['wiki']]
            host = ws['domain']
            if ws['lang'] != 0
               host = linkdesc['language'] + '.' + host
            end
         else
            host = linkdesc['wiki'] + '.' + 'wikimedia.org'
         end

         uri =
            if linkdesc['namespace'].length > 0
               linkdesc['namespace'] + ':' + linkdesc['title']
            else
               linkdesc['title']
            end

         r = urlencode('http://' + host + '/wiki/' + uri)
         r
      end

      def to_s()
         "Namespace sets: " + @ns.keys.join(', ') +
         "; Wikis: " + @wiki.to_a.join(', ')
      end
   end

   def linkexpand(c, bracketlink)
      linktext =
         if m = /\[\[([^\]]+)\]\]/.match(bracketlink)
            m[1]
         else
            bracketlink
         end

      bracketlink +
         " <" + c.canonicalize(linktext) + ">"
   end

   c = Canon.new()
   re = /\[\[\s*[^\s\\][^\]]+\]\]/

   class MwlinkServlet < HTTPServlet::AbstractServlet

      def initialize(server, canonicalizer)
         super(server)
         @c = canonicalizer
      end

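      def do_GET(rq, rs)
         # query_string is nil when the request has no query part
         p = CGI.parse(rq.query_string || '')
         page = p['page'][0]
         # Reject requests without a page parameter instead of crashing
         raise HTTPStatus::BadRequest, 'missing page parameter' if page.nil? or page.empty?

         # Just for testing
         l = @c.canonicalize(page)
         rs.status = 302
         rs['Location'] = l
         # Escape the page name so it cannot inject markup into the body
         rs.body = "<html><body>\n" +
            "<a href=\"#{l}\">#{CGI.escapeHTML(page)}</a>\n" +
            "</body></html>\n"
      end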
   end

   begin
      GetoptLong::new(
         ['--default-wiki',     GetoptLong::REQUIRED_ARGUMENT],
         ['--default-language', GetoptLong::REQUIRED_ARGUMENT],
         ['--encoding',         GetoptLong::REQUIRED_ARGUMENT],
         ['--daemon',           GetoptLong::OPTIONAL_ARGUMENT]
      ).each do |k, v|
         k = k.sub(/^--/,'')

         case k

         when 'default-wiki', 'default-language', 'encoding'
            $opt[k] = v

         when 'daemon'
            $opt['daemon'] = true
            if v.empty?
               $opt['port'] = 4242
            else
               $opt['port'] = v
            end
         end
      end
   rescue GetoptLong::InvalidOption
      true
   end

   if $opt['daemon']

      port = $opt['port'].to_i

      puts "Starting daemon on port #{port}"
      s = HTTPServer.new(:Port => port)
      s.mount("/mwlink", MwlinkServlet, c)

      trap('INT') { s.shutdown }

      s.start

   else

      # Note, there are various combinations of -- appearing in normal text that
      # will break this. --daemon is the recommended method.
      if ARGV.empty?
         STDIN.each_line { |line|
            puts line.chomp.gsub(re) { |expr| linkexpand(c, expr) }
         }
      else
         puts ARGV.join(' ').gsub(re) { |expr| linkexpand(c, expr) }
      end

   end
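
The TODO comment in the script notes that percent-escapes in the section part of a URL (after the #) should be written with a dot instead. The following is a minimal sketch of one way to do that, assuming the input has already been canonicalized (spaces turned into underscores). The helper name encode_with_anchor is hypothetical and is not part of mwlink, and URI.encode_www_form_component is used here in place of the script's CGI::escape.

```ruby
require 'uri'

# Hypothetical helper, not part of mwlink: URL-encode "Title#Section" so that
# escapes in the section part use '.' where the page part uses '%'.
def encode_with_anchor(path)
  page, fragment = path.split('#', 2)
  # Keep ':' and '/' readable in the page part, as the script's urlencode does
  encoded = URI.encode_www_form_component(page).gsub(/%3A/i, ':').gsub(/%2F/i, '/')
  return encoded if fragment.nil?
  encoded + '#' + URI.encode_www_form_component(fragment).tr('%', '.')
end

puts encode_with_anchor('Foo#A_(b)')
# prints Foo#A_.28b.29
```

Splitting on the first # before escaping keeps the two parts independent, so the page part is encoded exactly as before and only the section part changes.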

Example output in text-scanning mode (the first two lines are the input, the last two are the output):

 [[Ashland (disambiguation)]] is an example of a
 [[Wikipedia:Disambiguation]] page.
 [[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example of a
 [[Wikipedia:Disambiguation]] <http://en.wikipedia.org/wiki/Wikipedia:Disambiguation> page.

Example request in daemon mode:

 GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
 GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29 --> 302 Found
 GET http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29 --> ...(page content)

The GET program is a utility distributed with Perl's libwww-perl (LWP). Also, note that Wikimedia servers forbid scripts based on the LWP Perl module.
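
A script that cannot rely on the GET utility can perform the same exchange with Ruby's Net::HTTP. The following is a minimal sketch, assuming a daemon is running at the URL passed in; the method names redirect_target and lookup are made up for illustration. Net::HTTP.get_response does not follow redirects, so the Location header sent by the daemon can be read directly.

```ruby
require 'net/http'
require 'uri'

# Return the redirect target carried by a response, or nil when the response
# is not a redirect.
def redirect_target(response)
  response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
end

# Hypothetical helper: ask a running mwlink daemon where a page lives.
# The daemon URL is supplied by the caller; the redirect is not followed.
def lookup(daemon_url)
  redirect_target(Net::HTTP.get_response(URI(daemon_url)))
end
```

With a daemon on the default port, lookup('http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29') would return the Wikipedia URL from the 302 response without fetching the page itself.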