Wikipedia:Database queries
From Wikipedia, the free encyclopedia
[edit] Possibilities for asking SQL
[edit] Download the database yourself
You can download the database from download.wikipedia.org and use MySQL to query.
[edit] Asking developers to run queries on the live database
Some developers have the ability to run queries on the live database. You may request they do this at meta:Requests for queries.
[edit] Using an external tool
It was once possible to run an sql query tool under http://www.wikisign.org, which supported Wikipedias of major languages etc. Currently inactive.
[edit] Notes
- When copying a query from this page, some browsers copy a '*' in front of the SELECT statement. This has to be removed!
- Most of these queries return article titles. These do not contain the namespace! The namespaces have numeric codes in the separate cur_namespace field. 0 is the regular article namespace, others can be found on Wikipedia:Namespace
To avoid time-consuming or malformed queries, here are some useful suggestions (please add more - some requests might be found on this page's talk page):
Note: to convert the returned list in mysql to wiki-format use: s/(\ |\|)//g;s/^/*\[\[/;s/$/\]\]/
[edit] Quick instructions for running your own MySQL queries
This entire process can be completed in ~1 hour on a broadband connection. <screenshots coming soon>
- Download the database dump you want to query from download.wikimedia.org
- While that is downloading, also download MySQL from MySQL.com
- Install MySQL. Under Windows you might have to restart your computer, although probably not
- Under Unix you will need to create the initial database (mysql_install_db) start the mysql server process (eg mysqld_safe)
- For windows click on Start > Programs > MySQL > MySQL command line (or start > run > type 'cmd' > enter > type 'mysql' > enter); under Unix or MacOS X just type mysql in a terminal.
- type : CREATE DATABASE wikipedia;
- type USE wikipedia
- Database should be ready by now. Uncompress it (with gunzip) if it is still compressed. Type SOURCE C:\Database Location\cur_en_whatever.sql This should take ~20+-10 minutes on EN
- When it's finished, you can start entering some of the example queries found on this page. Simply type the first line of query syntax at the mysql> command line, push enter, and type the next line. Your very last line should end with a semi-colon. Keep in mind that, as is shown in the example below, your table_rows and database.name will be different than found on this page. If you named your database wikipedia then the following format should work:
Example: (sample output)
mysql> SELECT c.cur_title FROM wikipedia.cur c -> WHERE c.cur_text LIKE "%directory.google.com%";
- Also consider installing the MySQL Query Browser, which is a very pretty graphical user interface for MySQL queries. To use it, you start the MySQL server on the command line, and then run this program on top of it. Of the options it gives you, you need only to type in 'localhost' for the server address and push enter. (You will likely need to restart if installing both the MySQL server and query browser at the same time.)<screenshots coming soon>
[edit] Modifiers
To limit results:
- "SELECT ... LIMIT 20".
You can also add an offset;
- "SELECT ... LIMIT 100, 20" will give you 20 records starting at the 101st.
[edit] Queries to find new users
Here's a couple of queries to find the 20 most newly created users - useful to find people to give welcome messages. It only counts users that have made edits (most users who create a login never make a single contribution and are just passing by, so it's not really worth welcoming them).
- SELECT user_name, COUNT(*) FROM user, cur WHERE user_id=cur_user GROUP BY user_id ORDER BY user_id DESC LIMIT 20
The above query gives the user names of the 20 most recently created users and the number of edits they have made. The query is not perfect - only edits that have not been 'overwritten' by other edits are counted here. To do the same thing looking only at 'overwritten' edits use:
- SELECT user_name, COUNT(*) FROM user, old WHERE user_id=old_user GROUP BY user_id ORDER BY user_id DESC LIMIT 20
[edit] Queries to find stub articles
For short pages containing "see" (takes about 2 seconds):
- SELECT cur_title FROM cur WHERE cur_is_redirect=0 AND LENGTH(cur_text)<50 AND cur_text LIKE '%see%'
finds all pages with less than 50 chars.
[edit] Queries to find articles with one edit (new)
SELECT cur_title FROM cur WHERE cur_is_new AND cur_namespace=0;
Finds all pages that have only been edited once.
[edit] Queries to find inappropriate links
[edit] Double Redirects
Article namespace:
- SELECT ca.cur_namespace, ca.cur_title FROM cur ca, cur cb, links l WHERE ca.cur_is_redirect=1 AND cb.cur_is_redirect=1 AND l.l_from=ca.cur_title AND l.l_to=cb.cur_id AND ca.cur_namespace=0 LIMIT 250
Talk namespace:
- SELECT ca.cur_namespace, ca.cur_title, FROM cur ca, cur cb, links l WHERE ca.cur_is_redirect=1 AND cb.cur_is_redirect=1 AND l.l_from=CONCAT("Talk:",ca.cur_title) AND l.l_to=cb.cur_id AND ca.cur_namespace=1 LIMIT 250
These queries are slow. You may have to reduce the limit for them to work.
[edit] Other
List all articles that link to a file on http://meta.wikipedia.com (takes 2-3 seconds):
- SELECT cur_title FROM cur WHERE cur_text LIKE "%http://meta.wikipedia.com/upload/%" AND cur_namespace=0
To find articles that link to jpegs, both external and internal, with the old non-[[Image:]] style (this isn't perfect, it also finds a few other pages):
- SELECT cur_title FROM cur WHERE cur_text LIKE "%http://%.jp%g%" AND cur_namespace=0
For articles containing a link to itself (!!24 seconds!!)
- SELECT cur_title FROM cur,links WHERE cur_is_redirect=0 AND cur_namespace=0 AND l_from=cur_title AND l_to=cur_id LIMIT 20
For redirects to non-existing pages (note: when a page is found that is redirecting correctly, it usually means that there is text going with the redirect, 2-4 seconds):
- SELECT cur_title FROM cur,brokenlinks WHERE bl_from=cur_id AND cur_is_redirect=1
[edit] Query to find articles without bold markings
Which means they don't have bold titles in the first sentence!
- SELECT cur_title FROM cur WHERE cur_namespace=0 AND cur_is_redirect=0 AND cur_text NOT LIKE "%'''%" AND cur_text NOT LIKE "%<b>%" LIMIT 50
[edit] Query to find articles without bold markings that are not disambiguation pages
Same as above only removes disambiguation pages which don't need bolds. SELECT cur_title FROM cur WHERE cur_namespace=0 AND cur_is_redirect=0 AND cur_text NOT LIKE "%'''%" AND cur_text NOT LIKE "%<b>%" AND cur_text NOT LIKE "%disambig%" LIMIT 50
[edit] Query to find sub-page articles
This finds all articles containing a "/" character; most of these are subpages.
- SELECT cur_title FROM cur WHERE cur_namespace=0 and cur_title like "%/%" and cur_is_redirect=0
[edit] Queries to give statistics
For a count of how many articles are more than 1500 characters long, use:
- SELECT COUNT(*) FROM cur WHERE LENGTH(cur_text)>1500 AND cur_namespace=0
[edit] Queries about blocked users
- SELECT DISTINCT ipb_address, COUNT(*) AS times, MIN(ipb_reason) AS reason, user_name AS blocker, DATE_FORMAT(MIN(ipb_timestamp), '%b %e %Y %k:%i') AS date FROM ipblocks, user WHERE user_id=ipb_by GROUP BY ipb_address ORDER BY ipb_timestamp
returns the blocked IP addresses, and the first entry for the blocking (some IP addresses have been blocked by multiple people), ordered by date.
This is mostly unnecessary (but may be of interest as an example query), as the Special:Ipblocklist returns similar information.
[edit] it's...
Find all pages with "it's" for spellchecking purposes. WARNING: takes a long time. May want to apply a limit (see above)
- SELECT cur_title from cur where cur_namespace=0 and cur_text regexp "[[:<:]]it's[[:>:]]"
[edit] Vandalism
When a certain IP-number or user has been found to be vandalising, it might be that some of their problems may already have been solved by someone else and others not. The following query shows those pages which the vandal (or otherwise problematic user) was the last to edit:
- SELECT cur_title FROM cur WHERE cur_user_text="name" LIMIT 100
If the vandal is a registered user, use their Username for name, if not, use their IP-number.
[edit] Query to find sysops/developers
- SELECT user_name, user_rights from user where user_rights != "" and user_rights != "bot" ORDER BY user_name LIMIT 100
[edit] Images
The queries here produce valid links when the results are copied and pasted into a page. Not very pretty, but functional. With 70,000 images in en, these take about 60 seconds to run the first time.
[edit] All images by a given contributor (including old versions)
select distinct concat('<br>[[:Image:', img_name, ']]') as name from image where img_user_text='USER_NAME_HERE' union select distinct concat('<br>[[:Image:', oi_name , ']]') as name from oldimage where oi_user_text='USER_NAME_HERE' order by name asc;
[edit] All images by a given contributor (only current version)
select distinct concat('<br>[[:Image:', img_name, ']]') as name from image where img_user_text='USER_NAME_HERE' order by name asc;
[edit] All image pages edited by a given contributor
select distinct concat('<br>[[:Image:', cur_title, ']]') as title from cur where cur_namespace=6 and cur_user_text='USER_NAME_HERE' union select distinct concat('<br>[[:Image:', old_title , ']]') as title from old where old_namespace=6 and old_user_text='USER_NAME_HERE' order by title asc;
[edit] External Link
- A detailed description of the database schema and the meaning of various fields is available.