ReCAPTCHA
reCAPTCHA is a user-dialogue system originally developed by Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum at Carnegie Mellon University's main Pittsburgh campus, and acquired by Google in September 2009.[1] Like the CAPTCHA interface, reCAPTCHA asks users to enter words seen in distorted text images onscreen. By presenting two words it both protects websites from bots attempting to access restricted areas[2] and helps digitize the text of books.
The reCAPTCHA service supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.
reCAPTCHA has worked on digitizing the archives of The New York Times and books from Google Books.[3] As of 2012, thirty years of The New York Times had been digitized, and the project planned to complete the remaining years by the end of 2013. The complete archive of The New York Times, more than 13 million articles in total, can now be searched from NYTimes.com. The archive is divided into two search sets: 1851–1980 and 1981–present (http://www.nytimes.com/ref/membercenter/nytarchive.html).[4]
The system has been reported as displaying over 100 million CAPTCHAs every day,[3] on sites such as Facebook, TicketMaster, Twitter, 4chan, CNN.com, StumbleUpon,[5] Craigslist (since June 2008),[6] and the U.S. National Telecommunications and Information Administration's digital TV converter box coupon program website (as part of the US DTV transition).[7]
reCAPTCHA's slogan is "Stop spam, read books."[8]
Origin
Distributed Proofreaders was the first project in which volunteers gave their time to decipher scanned text that could not be read by OCR. It worked with Project Gutenberg to digitize public domain works.
The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn,[9] and was aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles".[10]
Operation
Scanned text is subjected to analysis by two different optical character recognition programs. Their respective outputs are then aligned with each other by standard string-matching algorithms and compared both to each other and to an English dictionary. Any word that the two OCR programs decipher differently, or that is not in the English dictionary, is marked as "suspicious" and converted into a CAPTCHA. The suspicious word is displayed, out of context, along with a control word whose reading is already known. If the human types the control word correctly, the system accepts the response to the questionable word as probably valid. If enough users were to type the control word correctly but mistype the second word, which OCR had failed to recognize, the digital version of the document could end up containing the incorrect word.
The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered valid. Words that are consistently given a single identity by human judges are later recycled as control words.[11] If the first three guesses match each other but do not match either of the OCRs, they are considered a correct answer, and the word becomes a control word.[12] When six users reject a word before any correct spelling is chosen, the word is discarded as unreadable.[12]
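The flagging and scoring rules described above can be illustrated with a minimal Python sketch. It is not the service's actual code: the function names, the use of `None` to represent a user rejecting a word, and the assumption that the two OCR outputs are already aligned word for word are all hypothetical choices made for illustration. Only the weights (0.5 per OCR reading, 1 per human reading), the 2.5-point threshold, and the six-rejection limit come from the description.

```python
from collections import Counter

OCR_WEIGHT = 0.5     # value of each OCR program's reading
HUMAN_WEIGHT = 1.0   # value of each human reading
ACCEPT_AT = 2.5      # points at which a reading is considered valid
MAX_REJECTS = 6      # rejections after which the word is discarded


def suspicious_words(ocr_a, ocr_b, dictionary):
    """Return the indexes of words that the two OCR passes read differently
    or that are missing from the dictionary.

    Assumption: ocr_a and ocr_b are already aligned word for word.
    """
    flagged = []
    for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        if a != b or a.lower() not in dictionary:
            flagged.append(i)
    return flagged


def resolve(ocr_readings, human_readings):
    """Tally readings of one suspicious word and return (status, word).

    A human reading of None stands for a user rejecting the word
    (a hypothetical representation chosen for this sketch).
    """
    scores = Counter()
    for reading in ocr_readings:
        scores[reading] += OCR_WEIGHT
    rejects = 0
    for reading in human_readings:
        if reading is None:
            rejects += 1
            if rejects >= MAX_REJECTS:
                return ("discarded", None)
            continue
        scores[reading] += HUMAN_WEIGHT
        if scores[reading] >= ACCEPT_AT:
            return ("accepted", reading)
    return ("pending", None)


# Two OCR passes disagree on word 1, and word 2 is not in the dictionary.
print(suspicious_words(["the", "quick", "brovvn"],
                       ["the", "qnick", "brovvn"],
                       {"the", "quick", "brown"}))   # [1, 2]

# Three matching human guesses outweigh two differing OCR readings.
print(resolve(["npon", "upou"], ["upon", "upon", "upon"]))  # ('accepted', 'upon')
```

Under these weights, a reading backed by one OCR program needs two agreeing humans to reach 2.5 points, while a reading backed by neither program needs three, which matches the "first three guesses" rule in the text.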
The original reCAPTCHA method was designed to show the questionable words separately, as out-of-context corrections, rather than in use, such as within a phrase of five words from the original document.[13] The control word might also suggest a misleading context for the second word: a request of "/metal/ /fife/" might be entered as "metal file", because the logical connection between filing and a metal tool is more common than the musical instrument "fife".[citation needed]
In 2012, reCAPTCHA began using photographs of house numbers taken from Google's Street View project, in addition to scanned words.[14]
Implementation
The reCAPTCHA tests are displayed from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment),[15] but the reCAPTCHA software itself is not open source.[16]
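The server-side half of this exchange might look roughly like the following Python sketch. It makes no network call and is not taken from the official libraries. The form field names (`privatekey`, `remoteip`, `challenge`, `response`) and the two-line reply format are assumptions modeled on the legacy verification exchange and should be checked against the developer's guide; both function names are hypothetical.

```python
from urllib.parse import urlencode


def build_verify_request(private_key, remote_ip, challenge, response):
    """Form-encode the fields the site's server posts back to the service.

    Assumption: the field names follow the legacy verification exchange.
    """
    return urlencode({
        "privatekey": private_key,   # the site's secret key
        "remoteip": remote_ip,       # address of the user who answered
        "challenge": challenge,      # identifier of the CAPTCHA shown
        "response": response,        # the words the user typed
    })


def parse_verify_reply(body):
    """Interpret a reply whose first line is 'true' or 'false' and whose
    second line, on failure, carries an error code (assumed format).
    """
    lines = body.strip().splitlines()
    ok = bool(lines) and lines[0].strip() == "true"
    error = lines[1].strip() if not ok and len(lines) > 1 else None
    return ok, error


print(build_verify_request("secret", "203.0.113.7", "abc123", "metal fife"))
print(parse_verify_reply("false\nincorrect-captcha-sol"))
```

Keeping the request building and reply parsing separate from the HTTP call makes the logic easy to test without contacting the service.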
reCAPTCHA also offers plugins for several web-application platforms, such as ASP.NET, Ruby, and PHP, to ease the implementation of the service.[17]
Security
The purpose of a CAPTCHA system is to prevent automated access to a system by computer programs or "bots". On 14 December 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed a solve rate of 18%.[18][19][20]
On 1 August 2010, Chad Houck gave a presentation at the DEF CON 18 Hacking Conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time.[21][22] The reCAPTCHA system was modified on 21 July 2010, before Houck was to speak on his method. Houck adapted his method to the modified system, which he described as an "easier" CAPTCHA, and determined a valid response 31.8% of the time. Houck also mentioned security defenses in the system, including a high-security lockout if an invalid response is given 32 times in a row.[23]
On 26 May 2012, Adam, C-P and Jeffball of DC949 gave a presentation at the LayerOne hacker conference detailing how they were able to achieve an automated solution with an accuracy rate of 99.1%.[24] Their tactic was to use a form of artificial intelligence known as machine learning to analyse the audio version of reCAPTCHA, which is available for the visually impaired. Google released a new version of reCAPTCHA just hours before their talk, making major changes to both the audio and visual versions of the service. In this release, the audio version was increased in length from 8 seconds to 30 seconds and became much more difficult to understand, for both humans and bots. In response to this update and the following one, the members of DC949 released two more versions of Stiltwalker, which beat reCAPTCHA with an accuracy of 60.95% and 59.4% respectively. After each successive break, Google updated reCAPTCHA within a few days. According to DC949, Google often reverted to features that had been previously hacked.
In an August 2012 presentation given at BsidesLV 2012, DC949 called the latest version "unfathomably impossible for humans"; they were not able to solve the challenges manually either.[24] The web accessibility organization WebAIM reported in May 2012, "Over 90% of respondents [screen reader users] find CAPTCHA to be very or somewhat difficult."[25]
On 27 June 2012, Claudia Cruz, Fernando Uceda, and Leobardo Reyes (a group of students from Mexico) published a paper showing a system running on reCAPTCHA images with an accuracy of 82%.[26] The authors have not said whether their system can solve recent reCAPTCHA images, although they claim that their approach is intelligent OCR and is robust to some changes.
reCAPTCHA frequently modifies its system, forcing attackers to keep updating their decoding methods, which may frustrate potential abusers.[citation needed]
Only words that both OCR programs failed to recognize are used as control words. Thus, any program that can recognize these words with non-negligible probability would represent an improvement over state-of-the-art OCR programs.[12]
Derivative projects
reCAPTCHA also created the Mailhide project, which protects email addresses on web pages from being harvested by spammers.[27] By default, the email address is converted into a format that does not allow a crawler to see the full address; for example, "mailme@example.com" would be converted to "mai...@example.com". The visitor then clicks on the "..." and solves the CAPTCHA to obtain the full email address. One can also edit the pop-up code so that none of the address is visible.
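The masking step can be sketched in a few lines of Python. This is an illustration only, not Mailhide's actual code: the function name `mask_email` is hypothetical, and keeping the first three characters of the local part is an assumption inferred from the single example above.

```python
def mask_email(address, visible=3):
    """Hide most of the local part of an email address.

    Assumption: the first `visible` characters stay readable; the real
    service may choose this number differently.
    """
    local, _, domain = address.partition("@")
    if not domain:
        raise ValueError("not an email address: %r" % address)
    return local[:visible] + "...@" + domain


print(mask_email("mailme@example.com"))             # mai...@example.com
print(mask_email("mailme@example.com", visible=0))  # ...@example.com
```

Setting `visible=0` corresponds to the variant in which none of the address is shown before the CAPTCHA is solved.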
References
1. "Teaching computers to read: Google acquires reCAPTCHA". Google. Retrieved 2009-09-16.
2. Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (PDF). Science 321 (5895): 1465–1468. doi:10.1126/science.1160379. PMID 18703711.
3. "reCAPTCHA FAQ". Google. Retrieved 2011-06-12.
4. Luis von Ahn (2009). NOVA ScienceNow s04e01 (Television production). Event occurs at 46:58. "The New York Times has this huge archive, over 130 years of newspaper archive there. And we've done maybe about 20 years so far of The New York Times in the last few months, and I believe we're going to be done next year by just having people do a word at a time."
5. Rubens, Paul (2007-10-02). "Spam weapon helps preserve books". BBC.
6. "Fight Spam, Digitize Books". Craigslist Blog. June 2008.
7. TV Converter Box Program.
8. "reCAPTCHA: Stop Spam, Read Books". Google.com. Retrieved 2013-07-10.
9. "Full Interview: Luis von Ahn on Duolingo". Spark, November 2011. Cbc.ca. 2011-11-30. Retrieved 2013-07-10.
10. Hutchinson, Alex (March 2009). "Human Resources: The job you didn't even know you had". The Walrus. pp. 15–16.
11. Timmer, John (2008-08-14). "CAPTCHAs work—for digitizing old, damaged texts, manuscripts". Ars Technica. Retrieved 2008-12-09.
12. Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (PDF). Science 321 (5895): 1465–1468. doi:10.1126/science.1160379. PMID 18703711.
13. "questionable validity of results if words are presented out of context". Google Groups, August 29, 2008. Groups.google.com. Retrieved 2013-07-10.
14. "Google Now Using ReCAPTCHA To Decode Street View Addresses". TechCrunch. 2012-03-29. Retrieved 2013-07-10.
15. "FAQ". reCAPTCHA.net.
16. "reCAPTCHA: Stop Spam, Read Books". google.com. Retrieved 14 January 2014.
17. "Developer's Guide - reCAPTCHA — Google Developers". developers.google.com. Retrieved 14 January 2014.
18. "Strong CAPTCHA Guidelines".
19. "Google's reCAPTCHA busted by new attack".
20. "Google's reCAPTCHA dented".
21. "Def Con 18 Speakers". defcon.org.
22. "Decoding reCAPTCHA Paper". Chad Houck.
23. "Decoding reCAPTCHA Power Point". Chad Houck.
24. "Project Stiltwalker".
25. "Screen Reader User Survey #4 Results".
26. Claudia Cruz-Perez; Fernando Uceda-Ponga; Leobardo Reyes-Cabrera (27 June 2012). Carrasco-Ochoa, Jesús Ariel; Martínez-Trinidad, José Francisco; Olvera López, José Arturo; Boyer, Kim L., eds. Pattern Recognition. México. pp. 155–165. ISBN 978-3-642-31148-2. Archived from the original on 30 June 2012. Retrieved 23 January 2013.
27. "Mailhide: Free Spam Protection". reCAPTCHA.net.
External links
- Official website
- Try reCAPTCHA at google.com
- ReCAPTCHA: The job you didn't even know you had Two-page article in The Walrus magazine
- Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham and Manuel Blum. 2008. "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" Science 12 September 2008: Vol. 321 no. 5895 pp. 1465–1468. http://dx.doi.org/10.1126/science.1160379
- "Luis von Ahn: Massive-scale online collaboration", YouTube video of the "TEDtalksDirector" channel, uploaded 2011-12-06.
- Luis von Ahn at TED
- Example of an unexpected risk in using reCAPTCHA