Utilizing Wasted Human Effort: reCAPTCHA

Most of us are familiar with CAPTCHA. Many of us know that it stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. Few of us probably know that this seemingly trivial application is helping to digitize old printed material and preserve human knowledge for posterity!

CAPTCHA is a challenge-response test used in web applications to determine whether a user is a human or a robot spewing spam. Most of us have seen the typical CAPTCHA that asks us to do some trivial arithmetic (e.g. 1 + 2 = ?) or to type in a number of distorted characters when we are about to submit a form on a web page.

This solution is based on the premise that current computer programs cannot read distorted text as well as humans can.
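The challenge-response flow is easiest to see in the arithmetic variant. The following is only a minimal sketch: the function names `make_challenge` and `check_response` are made up for illustration and do not come from any particular CAPTCHA module.

```python
import random


def make_challenge(rng=random):
    """Build a trivial arithmetic challenge and the answer the server expects."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    question = f"{a} + {b} = ?"
    return question, a + b


def check_response(expected, submitted):
    """Return True only if the submitted text is the expected number."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Anything that is not a number fails the test.
        return False
```

In a real form the server would keep the expected answer (for example in the session) and never send it to the browser along with the question.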

A very innovative extension of this scheme has been devised: reCAPTCHA. Humans around the world type more than 200 million CAPTCHAs every day (source: http://www.google.com/recaptcha/learnmore), and CAPTCHAs have proven effective at preventing large-scale abuse of online services. But while deciphering these CAPTCHAs, people perform a task that computers cannot accomplish. Can the mental effort each person spends solving a CAPTCHA, which adds up to thousands of human hours of work, be put to better use?

Google, which now owns reCAPTCHA, is working on a project to digitize physical books and other forms of text (e.g. NY Times) printed before the computer age. The pages are photographically scanned and the resulting bitmap images are transformed into text files by optical character recognition (OCR) software. This transformation into text is useful because the books can then be indexed, searched, and stored in a format that can be easily analyzed and manipulated. One of the stumbling blocks in the digitization process is that OCR is far from perfect at deciphering the words in bitmap images of scanned texts. For older prints with faded ink and yellowed pages, OCR cannot recognize about 20% of the words.

Using human transcribers to correct these errors is expensive. So, instead of using random characters generated by a computer, reCAPTCHA sends out the words that the OCR programs cannot recognize. The human users on the other side perform the role of a transcriber.

The catch here is that the system must be able to verify the answer submitted by the user. How is that possible if we send only those words that the OCR could not recognize? The answer is that reCAPTCHA sends two words to the user: one for which the answer is known, called the "control" word, and a second, unknown word from the scanned documents. If the user types in the control word correctly, the system can have some confidence that the response to the unknown word "could" also be correct. By showing the same unknown word to multiple users and scoring their combined responses, the system can identify the word with high confidence! The exact process and the details of various scenarios can be found in this paper (http://www.sciencemag.org/content/321/5895/1465).
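The control-word idea can be sketched in a few lines of Python. This is a simplified illustration, not the scoring scheme from the paper: the function name `resolve_unknown_word` and the thresholds `min_votes` and `min_share` are assumptions chosen for the example.

```python
from collections import Counter


def normalize(text):
    """Compare answers case-insensitively and ignore stray whitespace."""
    return text.strip().lower()


def resolve_unknown_word(responses, min_votes=3, min_share=0.6):
    """Pick a transcription for one unknown word from many users' answers.

    Each response is (control_answer, control_typed, unknown_typed).
    min_votes and min_share are illustrative thresholds, not real values.
    Returns the agreed word, or None if there is not enough agreement yet.
    """
    votes = Counter()
    for control_answer, control_typed, unknown_typed in responses:
        # Trust a user's guess only if they got the control word right.
        if normalize(control_typed) == normalize(control_answer):
            votes[normalize(unknown_typed)] += 1

    total = sum(votes.values())
    if total < min_votes:
        return None  # too few trusted answers so far

    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None
```

For example, if four users pass their control word and three of them type "upon" for the unknown word, the sketch accepts "upon"; with only two trusted answers, or an even split, it returns `None` and the word would be shown to more users.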

To top it all, implementing this service is a breeze! With a content management system like Drupal, all a simple math-based CAPTCHA needs is downloading the right module and enabling it on the pages that you want. If you want a reCAPTCHA, there is one additional bit of configuration: you need to register with www.recaptcha.net and get a pair of public and private keys to set up with your website. That's it!

The idea of using "wasted" human processing power to solve problems that computers cannot solve is indeed very innovative. Foldit (http://www.fold.it/) is another "crowdsourcing" experiment: it relies on humans' pattern-recognition and puzzle-solving abilities, which make them more efficient than existing computer programs at protein-folding tasks.

Comments

reCAPTCHA

Very nice article Abhijit. Using 'wasted' human effort in CAPTCHA is innovative indeed.