Wikipedia:Scripts/mwlink
This Ruby program has two modes: it can run as an HTTP daemon or as a text processor. Daemon mode is preferred, since it's more efficient.
In text-scanning mode, it interprets its command line (or standard input, if no arguments are given) as text possibly containing [[wikilinks]]. It preserves the original text and appends a plain-text hyperlink after each wikilink (the http: address enclosed in angle brackets, < and >).
In daemon mode, it receives HTTP requests like http://localhost:4242/mwlink?page=wiki-page-name and redirects to the appropriate Wikimedia page. It's convenient for scripts to just use that URL rather than constructing one themselves--all they have to do is URL-escape the text between [[ and ]].
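As a minimal sketch of what a calling script has to do, the following builds the daemon URL for a page name. The helper name mwlink_url and its host and port keyword arguments are hypothetical; only the /mwlink?page= form and port 4242 come from the description above.

```ruby
require 'cgi'

# Hypothetical helper (not part of mwlink): builds the request URL for a
# locally running mwlink daemon. Host and port defaults mirror the example
# URL in the text and are assumptions about the local setup.
def mwlink_url(page, host: 'localhost', port: 4242)
  # CGI.escape performs the URL-escaping the daemon expects
  # (spaces become '+', parentheses become %28 and %29).
  "http://#{host}:#{port}/mwlink?page=#{CGI.escape(page)}"
end

puts mwlink_url('Ashland (disambiguation)')
# prints http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
```

The page name is passed exactly as it would appear between [[ and ]]; namespace, wiki, and language prefixes are left for the daemon to interpret.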
#!/usr/bin/ruby

# This script is dual-licensed under the GPL version 2 or any later
# version, at your option.  See http://www.gnu.org/licenses/gpl.txt for more
# details.

=begin

= NAME

mwlink - Linkify mediawiki-style wikilinks in plain text

= SYNOPSIS

   mwlink [options] [text-to-wikilink]

   --daemon[=port]      Run as HTTP daemon
   --encoding           Default character set encoding (utf-8)
   --default-wiki       Default wiki (wikipedia)
   --default-language   Default language (en)

= DESCRIPTION

In text-scanning mode (without the --daemon argument), the mwlink program
scans its arguments (or its standard input, if there are no arguments) for
wikilinks of the form [[link]].  It expands such links into URLs and inserts
them into the original text after the [[link]], in angle brackets
((({<})) and (({>}))).

Options are provided for specifying a default wiki (the wiki to link to if
no qualifier is given in the link) and a default language (the language to
assume if no qualifier is given), as well as the character set encoding in
use.  The built-in defaults are ((*wikipedia*)), ((*en*)) and ((*utf-8*)),
respectively.

In daemon mode (now preferred), it receives HTTP requests of the form
"http://.../page=((*wikipedia page*))" (the ((*wikipedia page*)) name is
what would appear within a [[wikilink]]).  URL-escaping is required but no
other processing, making it convenient to use from scripts.

== Initialization File

The names of namespaces vary between languages.  For example, "User:" in
English is "Benutzer:" in German.  You can specify lists of namespaces to
use for particular languages in an initialization file (({~/.mwlinkrc})).
Each entry is simply a line with the language, a colon, and a
space-separated list of namespaces in that language.  When interpreting
links for that language (either because ((*--default-language*)) was
specified or because there is a language qualifier in the link), mwlink will
recognize these names as namespaces.

All the namespaces must appear on one line; line continuation is not
supported.  Lines introduced with (({#})) (pound sign) are comments and are
ignored, along with blank lines.

Here is an example configuration containing (only) some namespaces from the
German Wikipedia.  ((*Note*)): To be kind to the wiki when this script is
uploaded, I have broken the line, but it ((*must not be broken*)) in order to
work with mwlink.

   de: Spezial Spezial_diskussion Diskussion Benutzer Benutzer_diskussion Bild
   Bild_diskussion Einordnung Einordnung_diskussion Wikipedia Wikipedia_talk WP
   Hilf Hilf_diskussion

= WARNINGS

* The program (like mediawiki) assumes links are not broken across line
  boundaries.

* The mechanism for providing an alternate list of namespaces only works
  per-language; other wikis could have different namespaces, too.

* The list of wikis and their abbreviations is doubtless incomplete.

* The initialization file mechanism is not that useful for a shared daemon.

* In command-line mode, it's very difficult to process ASCII em-dashes (--)
  correctly and still honor command-line options.  mwlink gets it wrong, and
  that's one reason daemon mode is preferred.

= AUTHOR

Demi @ Wikipedia - http://en.wikipedia.org/wiki/User:Demi

=end

require 'cgi'
require 'iconv'
require 'getoptlong'
require 'webrick'
include WEBrick

$opt = { 'default-wiki' => 'wikipedia',
         'default-language' => 'en',
         'encoding' => 'utf-8' }

class String
  def initcap()
    new = self.dup
    # Okay, I consider it dumb that a string subscripted produces an
    # integer --Demi
    new[0] = new[0].chr.upcase
    return new
  end

  def initcap!()
    self[0] = self[0].chr.upcase
    return self
  end
end

class Canon
  def initialize()
    @ns = { }
    @ns_array = %w(Media Special Talk User User_talk Project Project_talk
                   Image Image_talk MediaWiki MediaWiki_talk Template
                   Template_talk Help Help_talk Category Category_talk
                   Wikipedia Wikipedia_talk WP)
    @ns['default'] = { }
    @ns_array.each { |nspc| @ns['default'][nspc] = nspc }

    # Load per-language namespace lists from ~/.mwlinkrc, if present.
    if File::readable?(ENV['HOME'] + '/.mwlinkrc')
      IO::foreach(ENV['HOME'] + '/.mwlinkrc') { |line|
        next if line =~ /^\s*\#/
        next if line =~ /^\s*$/
        line.chomp!
        if m = line.match(/^(\w+)\:(.*)$/)
          lang = m[1]
          nslist = m[2].split
          @ns[lang] = { }
          nslist.each { |nspc| @ns[lang][nspc] = nspc }
        end
      }
    end

    # Interwiki prefixes and the internal wiki names they map to.
    @wiki = { 'Wiktionary' => 'wiktionary',
              'Wikt' => 'wiktionary',
              'W' => 'wikipedia',
              'M' => 'meta',
              'N' => 'news',
              'Q' => 'quote',
              'B' => 'books',
              'Meta' => 'meta',
              'Wikibooks' => 'books',
              'Commons' => 'commons',
              'Wikisource' => 'source' }

    # 'lang' => 1 means the host takes a language prefix (e.g. en.).
    @wikispec = {
      'wikipedia' => { 'domain' => 'wikipedia.org', 'lang' => 1 },
      'wiktionary' => { 'domain' => 'wiktionary.org', 'lang' => 1 },
      'meta' => { 'domain' => 'meta.wikimedia.org', 'lang' => 0 },
      'books' => { 'domain' => 'wikibooks.org', 'lang' => 1 },
      'commons' => { 'domain' => 'commons.wikimedia.org', 'lang' => 0 },
      'source' => { 'domain' => 'sources.wikimedia.org', 'lang' => 0 },
      'news' => { 'domain' => 'wikinews.org', 'lang' => 1 },
    }

    @cs = Iconv.new("iso-8859-1", $opt['encoding'])
  end

  #TODO The % part of the # section of the URL should become a dot.
  def urlencode(s)
    CGI::escape(s).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')
  end

  def canonword(word)
    s = word.strip.squeeze(' ').tr(' ', '_').initcap
    begin
      @cs.iconv(s)
    rescue Iconv::IllegalSequence
      s
    end
  end

  def parselink(link)
    l = { 'namespace' => '',
          'language' => $opt['default-language'],
          'wiki' => $opt['default-wiki'],
          'title' => '' }
    terms = link.split(':')
    l['title'] = canonword(terms.pop)
    terms.each { |term|
      next if term.nil? or term.empty?
      t = canonword(term)
      if @ns[l['language']] then
        ns = @ns[l['language']]
      else
        ns = @ns['default']
      end
      if ns.key?(t)
        l['namespace'] = ns[t]
      elsif @wiki.key?(t)
        l['wiki'] = @wiki[t]
      else
        l['language'] = t.downcase
      end
    }
    l
  end

  def canonicalize(link)
    linkdesc = parselink(link.sub(/\|.*$/, ''))
    if @wikispec.key?(linkdesc['wiki'])
      ws = @wikispec[linkdesc['wiki']]
      host = ws['domain']
      if ws['lang'] != 0
        host = linkdesc['language'] + '.' + host
      end
    else
      host = linkdesc['wiki'] + '.' + 'wikimedia.org'
    end
    uri = if linkdesc['namespace'].length > 0
            linkdesc['namespace'] + ':' + linkdesc['title']
          else
            linkdesc['title']
          end
    r = urlencode('http://' + host + '/wiki/' + uri)
    r
  end

  def to_s()
    "Namespace sets: " + @ns.keys.join(', ') + "; Wikis: " + @wiki.to_a.join(', ')
  end
end

def linkexpand(c, bracketlink)
  linktext = if m = /\[\[([^\]]+)\]\]/.match(bracketlink)
               m[1]
             else
               bracketlink
             end
  bracketlink + " <" + c.canonicalize(linktext) + ">"
end

c = Canon.new()
re = /\[\[\s*[^\s\\][^\]]+\]\]/

class MwlinkServlet < HTTPServlet::AbstractServlet
  def initialize(server, canonicalizer)
    super(server)
    @c = canonicalizer
  end

  def do_GET(rq, rs)
    p = CGI.parse(rq.query_string)
    # Just for testing
    l = @c.canonicalize(p['page'][0])
    rs.status = 302
    rs['Location'] = l
    rs.body = "<html><body>\n" +
              "<a href=\"#{l}\">#{p['page'][0]}</a>\n" +
              "</body></html>\n"
  end
end

begin
  GetoptLong::new(
    ['--default-wiki', GetoptLong::REQUIRED_ARGUMENT],
    ['--default-language', GetoptLong::REQUIRED_ARGUMENT],
    ['--encoding', GetoptLong::REQUIRED_ARGUMENT],
    ['--daemon', GetoptLong::OPTIONAL_ARGUMENT]
  ).each do |k, v|
    k = k.sub(/^--/,'')
    case k
    when 'default-wiki', 'default-language', 'encoding'
      $opt[k] = v
    when 'daemon'
      $opt['daemon'] = true
      if v.empty?
        $opt['port'] = 4242
      else
        $opt['port'] = v
      end
    end
  end
rescue GetoptLong::InvalidOption
  true
end

if $opt['daemon']
  port = $opt['port'].to_i
  puts "Starting daemon on port #{port}"
  s = HTTPServer.new(:Port => port)
  s.mount("/mwlink", MwlinkServlet, c)
  trap('INT') { s.shutdown }
  s.start
else
  # Note, there are various combinations of -- appearing in normal text that
  # will break this.  --daemon is the recommended method.
  if ARGV.empty?
    STDIN.each_line { |line|
      puts line.chomp.gsub(re) { |expr| linkexpand(c, expr) }
    }
  else
    puts ARGV.join(' ').gsub(re) { |expr| linkexpand(c, expr) }
  end
end
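The initialization-file rules described in the script's documentation (one "language: namespaces" entry per line, with # comments and blank lines ignored) can be sketched on their own. The helper name parse_mwlinkrc and the sample text are hypothetical, and the sketch returns arrays of namespace names where the script itself stores a hash per language.

```ruby
# Hypothetical helper (not part of mwlink) that mirrors the parsing rules
# of ~/.mwlinkrc: "lang: ns1 ns2 ..." on a single line, with '#' comment
# lines and blank lines skipped.
def parse_mwlinkrc(text)
  namespaces = {}
  text.each_line do |line|
    next if line =~ /^\s*#/   # comment line
    next if line =~ /^\s*$/   # blank line
    if (m = line.chomp.match(/^(\w+):(.*)$/))
      namespaces[m[1]] = m[2].split   # split on whitespace
    end
  end
  namespaces
end

# Illustrative sample only; a real file would list every namespace on one line.
sample = <<~RC
  # German namespaces (abbreviated for illustration)
  de: Spezial Diskussion Benutzer Benutzer_diskussion

RC

p parse_mwlinkrc(sample)
```

Lines that do not match the "language:" pattern are silently skipped, which is also how the script treats them.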
Example output:
[[Ashland (disambiguation)]] is an example of a [[Wikipedia:Disambiguation]] page.
[[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example of a [[Wikipedia:Disambiguation]] <http://en.wikipedia.org/wiki/Wikipedia:Disambiguation> page.
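The expansion shown above can be approximated with a much smaller sketch. It covers only the default case (English Wikipedia, with no namespace, wiki, or language prefixes interpreted) and omits the script's Iconv character-set conversion; the names LINK_RE, simple_url, and expand_links are hypothetical.

```ruby
require 'cgi'

# Same pattern the script uses to find [[wikilinks]] in a line of text.
LINK_RE = /\[\[\s*[^\s\\][^\]]+\]\]/

# Hypothetical, simplified stand-in for the script's canonicalize method.
# Assumes the default wiki and language; prefixes are NOT interpreted.
def simple_url(link_text, lang: 'en', domain: 'wikipedia.org')
  # Drop any |label part, normalize spaces to underscores.
  title = link_text.sub(/\|.*$/, '').strip.squeeze(' ').tr(' ', '_')
  # Capitalize the first character only, like the script's initcap.
  title = title[0].upcase + title[1..-1] unless title.empty?
  url = "http://#{lang}.#{domain}/wiki/#{title}"
  # Escape everything, then restore ':', '/' and '#', as urlencode does.
  CGI.escape(url).gsub(/%3A/i, ':').gsub(/%2F/i, '/').gsub(/%23/, '#')
end

# Appends " <url>" after every wikilink, leaving the original text intact.
def expand_links(text)
  text.gsub(LINK_RE) { |bracketed| "#{bracketed} <#{simple_url(bracketed[2..-3])}>" }
end

puts expand_links('[[Ashland (disambiguation)]] is an example.')
# prints [[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example.
```

Because the prefix handling is left out, a link such as [[de:Berlin]] would not be routed to the German Wikipedia by this sketch, whereas the full script treats an unrecognized prefix as a language code.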
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29 --> 302 Found
GET http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29 --> ...(page content)
The GET program is a utility distributed with Perl's libwww-perl (LWP). Also, note that Wikimedia servers forbid scripts based on the LWP Perl module.