Wikipedia:WikiProject Red Link Recovery/Strategy

Ways in which red links can be 'fixed'

With over 2 million red links in the 15 May 2005 database dump, red links are not a soluble problem, nor should they be - it's perfectly reasonable to link to something that needs an article but doesn't have one yet. That said, there are quite a few red links that aren't necessary. As an initial stab at the problem, I see three ways in which unnecessary red links can be dealt with.

Unlinking text

I believe that if only one red link exists for a topic then there's a chance that the topic in question isn't one an article should exist for. I have in the past produced a list of such things (User:Topbanana/Reports/This article contains the only link to a missing article) - this didn't prove a popular report.

We need to find better heuristics for identifying text that could usefully be unlinked.

Changing what links point to

A very rough and ready analysis has shown that at least 25% of the current crop of red links could usefully point to articles that already exist. I've started tackling this by automatically producing lists of suggested fixes for links, concentrating initially on heuristics that give low numbers of false positives. What few false positives are produced will be retained and used to ensure they're not generated again.

Adding new articles

The "most wanted" list has consistently proved an effective and popular method of getting articles produced that many things link to. Initially I suggest keeping this list as up-to-date as possible, with a medium-term goal of producing specialised most wanted lists for particular fields of interest.

- TB 18:23, 2005 Jun 23 (UTC)

Add redirects

Additional option: add redirects. -- User:Docu

Detection strategies

Below are listed all currently proposed strategies for identifying alternate targets for red links:

  • Character-based permutations
    • Diacritics (e vs é) - Tried, > 90% success
    • Capitalisation - Tried, > 80% success
    • Punctuation - Tried, > 90% success
    • Repeated letters - Tried, > 85% success
    • Transposed letters - Tried, limited success (<60% accurate) - need to work out which transposed letters are most problematic
    • Phonetic confusions (e.g. gh/ff) - Tried, limited success (<75% accurate)
  • Word-based permutations
    • Common spelling errors (from the typo team's list)
    • Titles in names (dr/dr./doc/doctor) - Tried, produced no results, although I may have botched it
    • Symbolic abbreviations (and/&)
    • Numbers (1/one and 1st/first) - Tried, >85% success
    • Roman numerals ( III, 3, three, third ) - Tried, >80% success
    • Abbreviations in general (ltd/ltd./limited)
    • UK/US spelling differences (armor/armour)
    • Non-English words (French/Français)
    • Removal of common disambiguation suffixes, e.g. (film), (book), (band)
    • Singulars/plurals - Tried, not very successful (~75% success)
    • Names with alternate spellings (Mohammed, Muhammad) - Tried, small result set but very successful
    • Generational suffixes in names (Jr / Jr. / , Jr / , Jr. / Junior / (junior) / , junior / the younger / , younger)
    • Homonyms
  • General strategies
    • Regexp distance (matches by removing, adding or changing only one character) - the French incarnation of this project does this already
    • Typing distance (hitting a key near the one you meant to)
    • Initialised middle names (Fred W. Bloggs/Fred William Bloggs) - Tried, >80% success but problems with roman numerals
    • Omitted middle names (Fred Bloggs/Fred William Bloggs)
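
As a rough illustration of how a few of the strategies above might be automated, here is a minimal Python sketch. It is not the code used to produce the project's reports; the function names (normalise, suggest_targets, within_one_edit) are made up for this example. It folds diacritics, capitalisation, punctuation and repeated letters into a single comparison key, and separately checks whether two titles differ by at most one added, removed or changed character. Combining several heuristics in one key like this will likely produce more false positives than running each one on its own.

```python
import re
import unicodedata


def normalise(title: str) -> str:
    """Reduce a title to a key that ignores diacritics, case,
    punctuation and repeated letters (illustrative only)."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Ignore capitalisation.
    lowered = stripped.casefold()
    # Drop punctuation, keeping letters, digits, underscores and whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Collapse runs of the same letter (digits are left alone).
    collapsed = re.sub(r"([^\W\d_])\1+", r"\1", no_punct)
    # Tidy up whitespace.
    return " ".join(collapsed.split())


def suggest_targets(red_links, existing_titles):
    """Map each red link to existing titles that share its normalised key."""
    index = {}
    for existing in existing_titles:
        index.setdefault(normalise(existing), []).append(existing)
    suggestions = {}
    for link in red_links:
        key = normalise(link)
        if key in index:
            suggestions[link] = index[key]
    return suggestions


def within_one_edit(a: str, b: str) -> bool:
    """True if b can be made from a by adding, removing or changing
    at most one character (identical strings also count)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one character changed
    return a[i:] == b[i + 1:]  # one character added to b
```

For example, suggest_targets(["Emile Zolla"], ["Émile Zola", "Paris"]) pairs the red link with "Émile Zola", and within_one_edit("Armor", "Armour") is true. Any real run would still need the false-positive list described earlier.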

Also, things that red links might contain that indicate they are not correct:

  • Mismatched brackets - Tried, some problems with chemical formulae but very successful (>90%)
  • Mismatched quotes - Tried, very successful (>95%)
  • Unlikely punctuation (e.g. ??) - Tried, quite successful (>85%)
  • Triple letters - normal English words never contain these
  • Unlikely character combinations (e.g. q with no following u)
  • Badly formed image links (e.g. missing Image:) - Very successful (>98%!)
  • Badly formed external links (e.g. starting http)
  • Badly formed template links
  • Links to the User namespace
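
Several of the checks listed above are simple enough to sketch in a few lines of Python. The following is only an illustration, not the code behind the reports; the function names, the exact punctuation patterns and the list of image file extensions are assumptions made for this example. The template-link check is left out because the list does not say what "badly formed" means there.

```python
import re


def mismatched_brackets(title: str) -> bool:
    """True if round, square or curly brackets do not pair up."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in title:
        if ch in closers.values():
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return True
    return bool(stack)


def suspicious_reasons(title: str) -> list:
    """Return the reasons (possibly none) a red link title looks malformed."""
    lower = title.casefold()
    reasons = []
    if mismatched_brackets(title):
        reasons.append("mismatched brackets")
    # An odd number of double quotes cannot pair up.
    if title.count('"') % 2 == 1:
        reasons.append("mismatched quotes")
    # Assumed examples of unlikely punctuation; the list only names "??".
    if re.search(r"\?\?|!!|,,", title):
        reasons.append("unlikely punctuation")
    # Three identical letters in a row; roman numerals such as "III"
    # are a known source of false positives for this check.
    if re.search(r"([^\W\d_])\1\1", lower):
        reasons.append("triple letters")
    # q not followed by u; names such as "Iraq" will be flagged too.
    if re.search(r"q(?!u)", lower):
        reasons.append("q without u")
    if lower.startswith(("http://", "https://")):
        reasons.append("external link")
    # Assumed set of image extensions, with no Image: prefix.
    if re.search(r"\.(jpe?g|png|gif|svg)$", lower) and not lower.startswith("image:"):
        reasons.append("image link missing Image: prefix")
    if lower.startswith("user:"):
        reasons.append("User namespace")
    return reasons
```

A title such as "Foo (film" is reported for mismatched brackets, while an ordinary title such as "Paris" returns an empty list. As with the chemical formulae noted above, each check has its own false positives and would need reviewing before any links are changed.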

Some word and character analysis of the red links in the database might highlight further potential strategies.