User:Yurik/Query API/User Manual

1 Overview
2 Installation
3 Usage

Overview

Query API lets your applications query data directly from the MediaWiki servers. A single query can retrieve one or more pieces of information about the site, about a given list of pages, or both. Results can be returned in a machine-readable format (xml, json, php, wddx) or in a human-readable one.

Note: Query API is being migrated into the new API interface. Please use the new API, which is now a part of the standard MediaWiki engine.

New API live: http://en.wikipedia.org/w/api.php
Query API live: http://en.wikipedia.org/w/query.php
View the Source Code
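
As a rough illustration of how a request is put together, the Python 3 sketch below builds a Query API URL from parameters that appear in the samples later in this manual (format, what, titles). The helper name build_query_url and its format check are assumptions made for illustration; they are not part of Query API itself.

```python
from urllib.parse import urlencode

QUERY_URL = "http://en.wikipedia.org/w/query.php"

# Machine-readable formats named in the overview, plus "jsonfm",
# the human-readable variant mentioned under "Browser-based".
KNOWN_FORMATS = {"xml", "json", "php", "wddx", "jsonfm"}

def build_query_url(what, titles, fmt="json"):
    """Return a query.php URL (hypothetical helper, for illustration only)."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError("unknown format: %r" % fmt)
    params = [("format", fmt), ("what", what), ("titles", titles)]
    return QUERY_URL + "?" + urlencode(params)

print(build_query_url("links", "Main Page"))
```

Passing fmt="jsonfm" gives a URL that is easier to inspect in a browser while you work out which parameters you need.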

Installation

These notes cover my experience - Fortyfoxes 00:50, 8 August 2006 (UTC) - of installing query.php on a shared virtual host [1], and may not apply to all setups. I have the following configuration:

MediaWiki: 1.7.1
PHP: 5.1.2 (cgi-fcgi)
MySQL: 5.0.18-standard-log

Installation is fairly straightforward once you understand the principles. Query.php is not like other documented "extensions" to MediaWiki: it does its own thing and does not need to be integrated into the overall environment so that it can be called from within wiki pages. There is therefore nothing to register in LocalSettings.php (my first mistake).

Installation Don'ts

Explicitly - do *NOT* place a require_once( "extensions/query.php" ); line in LocalSettings.php!

Installation Do's

All Query API files must be placed two levels below the main MediaWiki directory. For example:

/home/myuserName/myDomainDir/w/extensions/botquery/query.php

where "w/" is the standard MediaWiki directory, named so that it does not clash with anything else (i.e. not MediaWiki or Wiki). This allows easier redirection with .htaccess for tidier URLs.

Apache Rewrite Rules and URLs

This is not required, but shorter URLs can make debugging easier.

In progress - have to see how pointing a subdomain (wiki.mydomain.org) at the installation affects query.php!

Short URLs with a symlink

Using the conventions above:

$ cd /home/myuserName/myDomainDir/w # change to directory containing LocalSettings.php
$ ln -s extensions/botquery/query.php .
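
If you prefer to script this step, the same link can be created from Python 3. This is only a sketch under the directory conventions above; the function name link_query_script is made up for illustration.

```python
import os

def link_query_script(wiki_root):
    """Create <wiki_root>/query.php as a relative symlink to the botquery copy."""
    # Path of the real script, relative to the MediaWiki directory.
    target = os.path.join("extensions", "botquery", "query.php")
    link = os.path.join(wiki_root, "query.php")
    if not os.path.exists(os.path.join(wiki_root, target)):
        raise FileNotFoundError(target)
    # Leave an existing link alone so the function can be re-run safely.
    if not os.path.islink(link):
        os.symlink(target, link)
    return link

# Example call, using the directory from the conventions above:
# link_query_script("/home/myuserName/myDomainDir/w")
```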

Short URLs the proper way

If you have permission to edit the "httpd.conf" file (the Apache server configuration file), it is much better to create an alias for "query.php". To do that, add the following line to the aliases section of "httpd.conf":

Alias /w/query.php "c:/wamp/www/w/extensions/botquery/query.php"

Of course, the path could be different on your system. Enjoy. --CodeMonk 16:00, 27 January 2007 (UTC)

Usage

Python

This sample uses the simplejson library found here.

import simplejson, urllib, urllib2

QUERY_URL = u"http://en.wikipedia.org/w/query.php"
HEADERS = {"User-Agent"  : "QueryApiTest/1.0"}

def Query(**args):
    args.update({
        "noprofile": "",      # Do not return profiling information
        "format"   : "json",  # Output in JSON format
    })
    req = urllib2.Request(QUERY_URL, urllib.urlencode(args), HEADERS)
    return simplejson.load(urllib2.urlopen(req))

# Request links for Main Page
data = Query(titles="Main Page", what="links")

# If exists, print the list of links from 'Main Page'
if "pages" not in data:
    print "No pages"
else:
    for pageID, pageData in data["pages"].iteritems():
        if "links" not in pageData:
            print "No links"
        else:
            for link in pageData["links"]:
                # To safely print unicode characters on the console, set 'cp850' for Windows and 'iso-8859-1' for Linux
                print link["*"].encode("cp850", "replace")
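
The nested loops above can also be wrapped in a small function that is easy to try without network access. The Python 3 sketch below assumes the response has the shape the sample reads: a "pages" mapping keyed by page ID, where each page may carry a "links" list whose entries hold the title under "*". The function name and the sample data are made up for illustration.

```python
def extract_links(data):
    """Return the link titles of every page in a decoded JSON response."""
    titles = []
    for page in data.get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["*"])
    return titles

# Hand-written sample, not real server output; the page ID and the
# titles are invented for illustration.
sample = {"pages": {"12345": {"links": [{"*": "Wikipedia"}, {"*": "Encyclopedia"}]}}}
print(extract_links(sample))
```

A response without "pages", or a page without "links", simply contributes nothing to the result instead of raising an error.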

Ruby

This example prints all the links on the Ruby (programming language) page.

 require 'net/http'
 require 'yaml'
 require 'uri'
 
 @http = Net::HTTP.new("en.wikipedia.org", 80)
 
 def query(args={})
   options = {
     :format => "yaml",
     :noprofile => ""
   }.merge args
   
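   # URI.escape was removed in Ruby 3.0; encode_www_form_component is its replacement here
   url = "/w/query.php?" << options.collect{|k,v| "#{k}=#{URI.encode_www_form_component(v)}"}.join("&")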
   
   response = @http.start do |http|
     request = Net::HTTP::Get.new(url)
     http.request(request)
   end
   YAML.load response.body
 end
 
 result = query(:what => 'links', :titles => 'Ruby (programming language)')
 
 if result["pages"].first["links"]
   result["pages"].first["links"].each{|link| puts link["*"]}
 else
   puts "no links"
 end

Browser-based

You want to use the JSON output by setting format=json. However, until you've figured out which parameters to supply to query.php and where the data will be in the result, you can use format=jsonfm instead, which is easier to read.

Once this is done, you turn the response text returned by query.php into an object (for example with eval or a JSON parser) and extract your data from it.

JavaScript

// this function attempts to download the data at url.
// if it succeeds, it runs the callback function, passing
// it the data downloaded and the article argument
function download(url, callback, article) {
   var http = window.XMLHttpRequest ? new XMLHttpRequest()
     : window.ActiveXObject ? new ActiveXObject("Microsoft.XMLHTTP")
     : false;
  
   if (http) {
      http.onreadystatechange = function() {
         // only run the callback once the request has finished successfully
         if (http.readyState == 4 && http.status == 200) {
            callback(http.responseText, article);
         }
      };
      http.open("GET", url, true);
      http.send(null);
   }
}

// convenience function for getting children whose keys are unknown
// such as children of pages subobjects, whose keys are numeric page ids
function anyChild(obj) { 
   for(var key in obj) {
      return obj[key];
   }
   return null; 
}

// tell the user a page that is linked to from article
function someLink(article) {
   // use format=jsonfm for human-readable output
   // encodeURIComponent handles non-ASCII titles correctly, unlike escape()
   var url = "http://en.wikipedia.org/w/query.php?format=json&what=links&titles=" + encodeURIComponent(article);
   download(url, finishSomeLink, article);
}

// the callback, run after the queried data is downloaded
function finishSomeLink(data, article) {
   var links;
   try {
      // convert the downloaded data into a javascript object
      var queryResult = JSON.parse(data);
      // we could combine these steps into one line
      var page = anyChild(queryResult.pages);
      links = page.links;
   } catch (someError) {
      alert("Oh dear, the JSON stuff went awry");
      // do something drastic here
      return;
   }
   
   if (links && links.length) {
      alert(links[0]["*"] + " is linked from " + article);
   } else {
      alert("No links on " + article + " found");
   }
}

someLink("User:Yurik");

How to run JavaScript examples

In Firefox, drag JSENV link (2nd) at this site to your bookmarks toolbar. While on a wiki site, click the button and copy/paste the code into the debug window. Click Execute at the top.

Perl

This example was taken from the MediaWiki Perl module code by User:Edward Chernenko.

use LWP::UserAgent;
sub readcat($)
{
   my $cat = shift;
   my $ua = LWP::UserAgent->new();
   my (@pages, @subs);   # declared locally so results do not leak between calls
 
   my $res = $ua->get("http://en.wikipedia.org/w/query.php?format=xml&what=category&cptitle=$cat");
   return unless $res->is_success();
   $res = $res->content();
 
   # good for MediaWiki module, but ugly as example!
   # it should _parse_ XML, not match known parts...
   while($res =~ /(?<=<page>).*?(?=<\/page>)/sg)
   {
       my $page = $&;
       $page =~ /(?<=<ns>).*?(?=<\/ns>)/;
       my $ns = $&;
       $page =~ /(?<=<title>).*?(?=<\/title>)/;
       my $title = $&;
 
       if($ns == 14)
       {
          my @a = split /:/, $title; 
          shift @a; $title = join ":", @a;
          push @subs, $title;
       }
       else
       {
          push @pages, $title;
       }
   }
   return(\@pages, \@subs);
}
 
my($pages_p, $subcat_p) = readcat("Unix");
print "Pages:         " . join(", ", sort @$pages_p) . "\n";
print "Subcategories: " . join(", ", sort @$subcat_p) . "\n";
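
As the comment in the sample says, the response should really be parsed rather than pattern-matched. A minimal Python 3 sketch of that idea follows, using the standard xml.etree.ElementTree parser. The element names (page, ns, title) come from the regular expressions above; the sample document, including its yurik root element, is hand-written and only assumed to resemble real output.

```python
import xml.etree.ElementTree as ET

CATEGORY_NS = "14"  # namespace number the Perl sample treats as a subcategory

def split_category(xml_text):
    """Return (pages, subcategories) title lists from a what=category response."""
    pages, subcats = [], []
    for page in ET.fromstring(xml_text).iter("page"):
        ns = page.findtext("ns", default="")
        title = page.findtext("title", default="")
        if ns.strip() == CATEGORY_NS:
            # Drop the namespace prefix, e.g. "Category:".
            subcats.append(title.split(":", 1)[-1])
        else:
            pages.append(title)
    return pages, subcats

# Hand-written sample, not real server output.
SAMPLE = """<yurik><pages>
  <page><ns>0</ns><title>Unix shell</title></page>
  <page><ns>14</ns><title>Category:Unix software</title></page>
</pages></yurik>"""
print(split_category(SAMPLE))
```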

C# (Microsoft .NET Framework 2.0)

The following function is a simplified code fragment from the DotNetWikiBot Framework.

Attention: This example needs to be revised to remove RegEx parsing of the XML data. There are plenty of XML, JSON, and other parsers available or built into the framework. --Yurik 05:44, 13 February 2007 (UTC)

using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
using System.Net;
using System.Web;

/// <summary>This internal function gets all page titles from the specified
/// category page using "Query API" interface. It gets titles portion by portion.
/// It gets subcategories too. The result is contained in "strCol" collection. </summary>
/// <param name="categoryName">Name of category with prefix, like "Category:...".</param>
public void FillAllFromCategoryEx(string categoryName)
{
    string src = "";
    StringCollection strCol = new StringCollection();
    MatchCollection matches;
    Regex nextPortionRE = new Regex("<category next=\"(.+?)\" />");
    Regex pageTitleTagRE = new Regex("<title>([^<]*?)</title>");
    WebClient wc = new WebClient();
    do {
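        // URL-encode the parameter values so that titles with spaces or "&" survive
        Uri res = new Uri(site.site + site.indexPath + "query.php?what=category&cptitle=" +
            HttpUtility.UrlEncode(categoryName) + "&cpfrom=" +
            HttpUtility.UrlEncode(nextPortionRE.Match(src).Groups[1].Value) + "&format=xml");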
        wc.Credentials = CredentialCache.DefaultCredentials;
        wc.Encoding = System.Text.Encoding.UTF8;
        wc.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
        wc.Headers.Add("User-agent", "DotNetWikiBot/1.0");
        src = wc.DownloadString(res);                   
        matches = pageTitleTagRE.Matches(src);
        foreach (Match match in matches)
            strCol.Add(match.Groups[1].Value);
    }
    while (nextPortionRE.IsMatch(src));
}
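
The portion-by-portion loop above can be sketched with a real XML parser instead of regular expressions. The Python 3 version below is an illustration only: the parameter names (what, cptitle, cpfrom, format) and the category element with its next attribute are taken from the C# fragment, while the function names, the injected fetch callable and the canned responses are assumptions. Where the category element sits in real output is not known here, so the code looks for it anywhere below the root.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urlparse, parse_qs

def category_titles(category, fetch, base="http://en.wikipedia.org/w/query.php"):
    """Collect all titles in a category, following the "next" markers.

    fetch is any callable that takes a URL and returns the response body
    as text; injecting it keeps the sketch testable without network access.
    """
    titles, cpfrom = [], ""
    while True:
        params = {"what": "category", "cptitle": category,
                  "cpfrom": cpfrom, "format": "xml"}
        root = ET.fromstring(fetch(base + "?" + urlencode(params)))
        titles.extend(t.text or "" for t in root.iter("title"))
        marker = root.find(".//category[@next]")
        if marker is None:
            return titles          # no further portion announced
        cpfrom = marker.get("next")

# Canned, hand-written responses standing in for the server.
CANNED = {
    "": '<yurik><category next="Unix software" />'
        '<pages><page><title>Unix</title></page></pages></yurik>',
    "Unix software": '<yurik><pages><page><title>Unix software</title>'
                     '</page></pages></yurik>',
}

def fake_fetch(url):
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return CANNED[query["cpfrom"][0]]

print(category_titles("Unix", fake_fetch))
```

For real use, fetch could be any function that downloads a URL and returns its text; it is kept as a parameter so that the paging logic can be exercised on its own.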

PHP

// Please remember that this example requires PHP5.
 ini_set('user_agent', 'Draicone\'s bot');
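 // This function returns the part of the data at a url / path that lies
 // between $start and $end (inclusive), or false if it cannot be found
 function fetch($url,$start,$end){
  $page = file_get_contents($url);
  if ($page === false) return false;
  $from = strpos($page, $start);
  if ($from === false) return false;
  $to = strpos($page, $end, $from);
  if ($to === false) return false;
  return substr($page, $from, $to - $from + strlen($end));
 }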
 // This grabs the RC feed (-bots) in xml format and selects everything between the pages tags (inclusive)
 $xml = fetch("http://en.wikipedia.org/w/query.php?what=recentchanges&rchide=bots&format=xml","<pages>","</pages>");
 // This establishes a SimpleXMLElement - this is NOT available in PHP4.
 $xmlData = new SimpleXMLElement($xml);
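 // This outputs a link to the curr diff of each article
 foreach($xmlData->page as $page) {
  $title = (string) $page->title;
  // escape the title for use in the URL and in the HTML text
  echo "<a href=\"http://en.wikipedia.org/w/index.php?title=". urlencode($title) . "&amp;diff=curr\">". htmlspecialchars($title) . "</a><br />\n";
 }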

Chicken Scheme

;; Write a list of html links to the latest changes
;;
;; NOTES
;; http:GET takes a URL and returns the document as a character string
;; SSAX:XML->SXML reads a character-stream of XML from a port and returns
;; a list of SXML equivalent to the XML.
;; sxpath takes an sxml path and produces a procedure to return a list of all
;; nodes corresponding to that path in an sxml expression.
;;
(require-extension http-client)
(require-extension ssax)
(require-extension sxml-tools)
;;
(define sxml
  (with-input-from-string
    (http:GET "http://en.wikipedia.org/w/query.php?what=recentchanges&rchide=bots&format=xml&rclimit=200")
    (lambda ()
      (SSAX:XML->SXML (current-input-port) '()))))
(for-each (lambda (x) (display x)(newline))
  (map
    (lambda (x)
      (string-append
        "<a href=\"http://en.wikipedia.org/w/index.php?title="
        (cadr x) "&diff=cur\">" (cadr x) "</a><br/>"))
    ((sxpath "yurik/pages/page/title") sxml)))