Wikipedia:WikiProject Red Link Recovery/Link matching script
From Wikipedia, the free encyclopedia
- Note the description below is based on SQL dumps which have been discontinued. Current dumps are XML.
The script below details how the lists of suggested alternate targets for red links were created.
- Get yourself a mysql database. It's free to download, easy to set up and runs on most types of computer. The process should work on any reasonably recent version of the mysql software.
- Download the latest page and pagelinks tables from here. There are normally labelled enwiki-latest-page.sql.gz and enwiki-latest-pagelinks.sql.gz respectively.
- Unzip these files. Most file decompression utilities can do this, for example winzip or the freely available gzip.
- Import the unzipped tables into your mysql database. To do this start a mysql session and create a new database (CREATE DATABASE redlinks;) and make sure it's your current database (USE redlinks;). Now use source <filename> to import each of your files. This make take some time, the wikipedia database contains a lot of data.
You now have all the data needed to some quite nifty red link recovery on your PC, ready to use. For more information on this process, check out Wikipedia:Database download.
Now run this script:
DROP TABLE IF EXISTS crushed_art; DROP TABLE IF EXISTS crushed_links; CREATE TABLE crushed_art ( id mediumint(7) unsigned NOT NULL, title varchar(255) BINARY NOT NULL ); CREATE TABLE crushed_links ( id mediumint(7) unsigned NOT NULL, orig varchar(255) BINARY NOT NULL, link varchar(255) BINARY NOT NULL ); INSERT INTO crushed_art SELECT page_id, page_title FROM page WHERE page_namespace = 0; INSERT INTO crushed_links SELECT pl_from, pl_title, pl_title FROM pagelinks INNER JOIN page f ON pl_from = f.page_id AND f.page_namespace = 0 LEFT OUTER JOIN page t ON pl_title = t.page_title AND t.page_namespace = 0 WHERE t.page_id IS NULL AND pl_namespace = 0;
This creates two new tables (crushed_links and crushed_art) and copies all the data related to articles in the main wikipedia namespace - namespace 0.
The reason we've made copies of the data is becasue we're about to mangle it - or as I like to call the process, crushing. The goal of this process is to transform all of the red links and the article titles into simpler forms, then check if and of the simpler red links now match any of the simpler article titles. So for example if we are considering repeated instances of the letter 's', we might transform the article title Mississippi into Misisippi, Similarly we might transform a red link to Misssisippi into Misisippi. As the red link and title now match we might have found a good candidate to suggest!
Some example crushing operations are:
- Punctuation crushing
- Diacritic crushing
- Repeated letter crushing
- Capitalisation
- Pluralisation
- Numeric
- Initialisation of middle names
Having carried out some crushing operations that we believe might have produced useful matches, we now need to pick out these matches and carry out a little checking on them:
DROP TABLE IF EXISTS suggestions; CREATE TABLE suggestions ( title varchar(255) binary NOT NULL, link varchar(255) binary NOT NULL, suggestion varchar(255) binary NOT NULL, PRIMARY KEY ( title, link, suggestion ) ); ALTER TABLE crushed_links ADD INDEX( link ); REPLACE INTO suggestions SELECT f.page_title, orig, t.page_title FROM page f, crushed_links, crushed_art, page t WHERE f.page_id = crushed_links.id AND f.page_namespace = 0 AND t.page_id = crushed_art.id AND t.page_namespace = 0 AND link = title;
Next it's often useful to carry out a bit of automated weeding out of suggestions we expect to be quite poor. These might include:
- Suggestions to transform red links into the name of the article that contains them.
- Red links that are very short after crushing. Short links are likely to match many spurious article titles.
- Links that have already been suggested and marked as incorrect by project members - see the exception list
- Suggestions that are repeated for many articles.
As a rule of thumb, I try to ensure that over 75% of all entries on a list of suggestions are valid, even if this means discarding a large number of entries that have a lesser but still significant chance of being correct. There are over 2 million red links out there - it's not worth trying to develop complex techniques when sheer numbers ensure that simple ones produce amply long lists.
DELETE FROM suggestions WHERE title = suggestion; DELETE FROM suggestions WHERE length( suggestion ) < 5; DELETE FROM suggestions WHERE EXISTS ( SELECT 1 FROM exceptions WHERE suggestions.title = exceptions.title AND suggestions.link = exceptions.link AND suggestions.suggestion = exceptions.suggestion ); SELECT count(*), link, suggestion FROM suggestions GROUP BY link, suggestion ORDER BY 1 desc LIMIT 25;
To export your cleaned-up list of suggestions, the simplest thing to do is use mysql's tee command. This copies the output of your mysql session into a file. This file can then be opened in a text-editor (I like textpad) and prepared for cutting and pasting into wikipedia. Do be cautious of trying to cut and paste results from a command window under Microsoft Windows, this can lead to uncommon characters being lost.
A simple export of your suggestions table to a text file could be done by:
tee c:\suggestions.txt; SELECT concat( '*[[', title, ']] links to [[', link, ']], try [[', suggestion, ']]' ) FROM suggestions; notee;
If your have mysql version 5.0 or newer, much tidier output can be had using the script below. Start the mysql client with the options --disable-column-names --silent to strip out any pretty textboxes. You may need to log in as your mysql sysadmin to have permission to create a stored procedure.
DROP PROCEDURE IF EXISTS report_suggestions; DELIMITER // CREATE PROCEDURE report_suggestions( group_size INT ) BEGIN DECLARE sug_pos, sug_base, done INT; DECLARE title, link, suggestion VARCHAR(255); DECLARE sug CURSOR FOR SELECT * FROM suggestions; DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' SET done = 1; SET sug_pos = 0; SET sug_base = 0; SET done = 0; OPEN sug; REPEAT FETCH sug INTO title, link, suggestion; IF NOT done THEN IF sug_pos = 0 THEN SELECT concat( '=== ', sug_base, ' - ', sug_base + group_size - 1, ' ===' ); SET sug_base = sug_base + group_size; SET sug_pos = group_size - 1; ELSE SET sug_pos = sug_pos - 1; END IF; SELECT concat( '*[[', title, ']] links to [[', link, ']], try [[', suggestion, ']]' ); END IF; UNTIL done END REPEAT; CLOSE sug; END; // DELIMITER ; call report_suggestions( 10 );