User:Dispenser/Link checker


For the tool itself, see tools:~dispenser/view/Main Page. Feel free to edit this documentation.
Link checker
Design by: Dispenser
Written in: Python, JavaScript
Platform: Backend: Python; Frontend: Acid2-compliant web browser with JavaScript support
Available in: English
Development status: Active
Genre: Wikipedia Tools
Website: Website

Link checker is a tool that runs on the Wikimedia Toolserver and checks the external links on a wiki. After parsing a page for external links, it connects to each server and determines how each link responds. A set of heuristics classifies each link, and the results are listed. The tool can be run "on the fly" for individual pages, or it can scan a project's pages periodically. With a JavaScript-capable browser it is possible to fix a link by pointing it to the correct address or by changing it to the address of an archiving service; the changes can then be saved using the "Save changes" button.
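
As a minimal sketch of the checking step (illustrative Python only, not the tool's actual code; the real checklink.py library is far more thorough), a single link can be fetched and its response reduced to a rough classification:

  # Illustrative sketch only. Fetch one external link and summarize
  # the HTTP response; urlopen() follows redirects automatically.
  from urllib.request import Request, urlopen
  from urllib.error import HTTPError, URLError

  def check_link(url, timeout=10):
      """Return a rough (status, note) pair for one external link."""
      req = Request(url, method="HEAD")     # HEAD avoids downloading the body
      try:
          resp = urlopen(req, timeout=timeout)
          if resp.geturl() != url:          # final URL after any redirects
              return resp.status, "redirected to " + resp.geturl()
          return resp.status, "working"
      except HTTPError as e:                # 4xx/5xx responses
          return e.code, "client error" if e.code < 500 else "server error"
      except URLError as e:                 # DNS failure, refused connection, ...
          return None, "connection issue: %s" % e.reason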


Background

Linkrot is a problem for many websites, including Wikipedia, and it grows as Wikipedia includes references to ever more sources: websites, news articles, books, papers, patents, government archives, and video. Some dead links are caused by content being moved without proper redirection; others start requiring micropayments after a certain time period; and others simply vanish. With nearly a hundred links in an article, it becomes an ordeal to ensure that all the links and references are working correctly, even in the featured articles that appear on the main page.

Some Wikipedians have already built tools to scan for dead links, and there are giant, aged lists such as Wikipedia:Dead external links. However, repairing the dead links on such lists required much work: checking whether the link was still there, searching for a replacement, and editing the link. Much of this is repetitive and an inefficient use of a person's time. This tool attempts to make the process as efficient as possible.

Interface

Tools ▼   Save changes
Jimmy Wales
302 | Moved Temporarily | {Wired (magazine)|2006-02-14} Wikipedia Founder Edits Own Bio [wired.com] | Dated archive url
302 | Found | In Search of an Online Utopia [msn.com] | Changes domain and redirect to /
Notes
  1. The name of the article appears above each set of links.
  2. The HTTP status code is in the left column, with the human-readable message in the next.
  3. The external link as it appears on Wikipedia is in the third column; it may contain extra meta-information extracted from the template.
  4. The last column displays the analysis. In this example the tool determined that one of the 302 redirects was likely a dead link; see #Classifications below.
  5. "Save changes" is used after setting actions with the drop-down.

Repair

Once the page has fully loaded, select an article to work on. Click on each flagged link to make sure the tool has correctly identified the problem (errors can be reported on the talk page). If the link is incorrect, you can try a Google search to locate it again, right-click and copy the URL, and paste it into the prompt created by the "Input correct URL" or "Input archive URL" option. The color of the box on the left changes to indicate the type of replacement that will be performed on the URL. When you're finished, click "Save changes"; the tool will merge your changes and present a preview or a diff before letting you save.
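
Behind "Save changes", the chosen replacement URLs are substituted into the page's wikitext before the preview or diff is shown. A hypothetical sketch of that substitution step (the real mergeChanges.py also handles previews, diffs, and submission to the server):

  # Hypothetical sketch of the URL-substitution step behind "Save changes".
  def apply_replacements(wikitext, replacements):
      """replacements: {old_url: new_url} pairs chosen in the interface."""
      for old, new in replacements.items():
          wikitext = wikitext.replace(old, new)
      return wikitext

  page = 'See [http://example.com/old-page Founder edits own bio].'
  print(apply_replacements(page,
        {'http://example.com/old-page': 'http://example.com/new-page'}))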

Redirects

There are principally two types of redirect in use: 301 (permanent redirect) and 302 (temporary redirect). For the former, it is recommended that the link be updated to the new address; for the latter, updating is optional and should be reviewed by a human operator.

Some links might be access redirects, used to avoid the need to log into a system; these can be regarded as permalinks. Finally, there are redirects that point to fake or soft-404 pages. Do not blindly change these links!
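
Telling a 301 from a 302 requires inspecting the first response before any redirect is followed. A minimal sketch (not the tool's code) that disables urllib's automatic redirect handling:

  # Disable automatic redirect handling so the first hop's status is visible.
  from urllib.error import HTTPError
  from urllib.request import HTTPRedirectHandler, Request, build_opener

  class NoRedirect(HTTPRedirectHandler):
      def redirect_request(self, req, fp, code, msg, headers, newurl):
          return None    # make urllib raise HTTPError instead of following

  def first_hop(url):
      """Return (status_code, redirect_target_or_None) for the first response."""
      opener = build_opener(NoRedirect())
      try:
          resp = opener.open(Request(url, method="HEAD"), timeout=10)
          return resp.status, None          # 2xx: no redirect at all
      except HTTPError as e:                # 3xx (and 4xx/5xx) land here
          return e.code, e.headers.get("Location")

  code, target = first_hop("http://example.com/old")
  if code == 301:
      print("permanent redirect; update the article's URL to", target)
  elif code == 302:
      print("temporary redirect to", target, "- review by hand")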

Archives

The Wayback Machine is a valuable tool for dead-link repair. The simplest way to get the list of archived copies from archive.org is to click on the row. You can also load the results manually and paste them in using the "Use archive URL" option. The software will attempt to insert the URL using the archiveurl parameter of {{cite web}}.
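
The lookup can also be scripted against the Wayback Machine's availability API (an illustration only, not how the tool itself queries archive.org):

  # Look up the closest archived snapshot of a dead URL (illustrative).
  import json
  from urllib.parse import urlencode
  from urllib.request import urlopen

  def closest_snapshot(dead_url, timestamp=None):
      """Return the closest Wayback Machine snapshot URL, or None."""
      query = {"url": dead_url}
      if timestamp:                 # e.g. "20060214" to prefer that date
          query["timestamp"] = timestamp
      with urlopen("https://archive.org/wayback/available?" + urlencode(query),
                   timeout=10) as resp:
          data = json.load(resp)
      closest = data.get("archived_snapshots", {}).get("closest")
      return closest["url"] if closest and closest.get("available") else None

  # The result can then be filled into the archiveurl parameter of {{cite web}}.
  print(closest_snapshot("http://example.com/article", "20060214"))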

Tips

  • Most non-news links can be found again by searching for the title of the link. This is the default setup for searching.
  • Links can be taken from the Google results by right-clicking, selecting "Copy Link Location", and inputting them through the drop-down.
  • Always check the link by clicking on it (not the row): some websites do not like how the tool sends requests (false positives), or the tool wasn't smart enough to handle a site's incorrect error handling (false negatives). See the sketch after this list.
  • Non-HTML documents can sometimes be found by searching for their file names.
  • If Google turns up the same link, leave it be: it has recently or temporarily become dead, and you will not find a replacement until Google's index is updated.
  • You may wish to email the webmaster asking them to use redirection to keep the old links working.
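
One common cause of the false positives mentioned above is that some servers answer differently depending on the request headers. A small sketch (illustrative values only) of re-checking a link with a browser-like User-Agent:

  # Re-check a link with a browser-like User-Agent (illustrative value);
  # some servers reject unfamiliar clients, producing false positives.
  from urllib.request import Request, urlopen

  BROWSER_UA = "Mozilla/5.0 (compatible; link-check example)"

  def status_with_ua(url, user_agent=BROWSER_UA):
      req = Request(url, headers={"User-Agent": user_agent}, method="HEAD")
      with urlopen(req, timeout=10) as resp:
          return resp.status

  print(status_with_ua("https://example.com/"))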

Classifications

Identifier | Rank | Meaning | Action
Working (White) | 0 | The link appears to work. | No action necessary.
Message (Green) | 1 | An HTTP move (redirect) has occurred; the link should work but should be checked. | If the server responded with HTTP 301, the link should be updated.
Warn (Yellow) | 2 | The link could pose a problem to users: expiring news sources, a subscription requirement, or a low signal-to-noise ratio. | If the link is expiring, ensure that all critical details are filled in so that someone can find an offline copy.
Server Error or Connection Issue (Blue) | 3 | A 5xx server error or a connection issue. | For a server error, contact the webmaster to fix the problem. For a connection issue, check whether the Whois record is still valid.
Heuristically determined (Orange) | 4 | The tool thinks the link is dead: a 404 within the redirect chain, or a redirect to the root (/) of the website. | Check the link; if it is dead, use archiveurl with an archived copy from the Internet Archive, otherwise tag it with {{dead link}}.
Client Error (Red) | 5 | The server has confirmed the link as dead. | Ensure the link is correct and contains no stray wiki markup. If possible, use archiveurl with an archived copy from the Internet Archive; otherwise tag it with {{dead link}}.
Bad link (Purple) | 6 | A spam link or a Google Cache link. | Parking links should be removed. Google Cache links should be converted back to the regular link or to an archiveurl.
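
The ranking above could be computed from a response roughly as follows (a hedged sketch; the tool's real heuristics in checklink.py are more elaborate):

  # Map an HTTP status (plus simple hints) to the (rank, identifier) above.
  def classify(status, final_path=None, flags=()):
      if "spam" in flags or "google-cache" in flags:
          return 6, "Bad link (Purple)"
      if status is None or status >= 500:
          return 3, "Server Error or Connection Issue (Blue)"
      if 400 <= status < 500:
          return 5, "Client Error (Red)"
      if status in (301, 302):
          if final_path == "/":      # redirect to the site root: likely dead
              return 4, "Heuristically determined (Orange)"
          return 1, "Message (Green)"
      if "subscription" in flags or "expiring" in flags:
          return 2, "Warn (Yellow)"
      return 0, "Working (White)"

  print(classify(302, final_path="/"))   # (4, 'Heuristically determined (Orange)')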

Files

Web accessible

linkchecker.py (11 KiB)
The rendering engine for files and output from checklinks.py, plus job management (disabled).
webchecklinks.py (4.6 KiB)
"On the fly" link checker, links to the checklink.py library, formats for the web. Limited support for other languages.
jsonchecklinks.py (2.4 KiB)
Similar to the above, but outputs JSON instead of HTML.
mergeChanges.py (6.1 KiB)
Merges and appends strings to an existing page and submits them to the server for previewing, diffs, and saving.
url_info.py (5.1 KiB)
Displays how the URL redirects. If the result ends with a client error, it will attempt to retrieve search results from the Internet Archive.

Non accessible

checklink.py (41 KiB)
Library that parses MediaWiki pages and evaluates external links.
checklinks.py (13 KiB)
Command-line interface for the checklink.py library; runs atop pywikipedia.
parser.py (5.0 KiB)
Wiki markup to HTML parser.
wikipedia.py (13 KiB)
Minimal implementation of the pywikipedia library for webchecklinks.py.
