In my last article I briefly mentioned OCRs and decaptcha services. I covered the basics of how each of them works, but without going into details. If you still haven’t read my last article then don’t worry, I’ll explain each of them in more detail here and show you all you need to know in order to decide which one to pick.
Optical Character Recognizer (OCR)
OCR algorithms are an important branch of the Digital Image Processing field of study. Nowadays they’re used to build intelligent scanners and to speed up the digitization of century-old documents without having to employ an army of typists. OCRs have been used for a large variety of tasks, depending on the technology available at the time, but it wasn’t until the late 1990s that OCR algorithms attained enough maturity and resource availability to read images as complex as captchas.
OCRs vs Captchas: A short history lesson
Captchas became popular back in the day when blogs made their big leap on the internet. Before anyone had Facebook or a Twitter account… almost everyone had a blog. You could see news articles about grannies having their own blogs, about how anyone could push their thoughts into the blogosphere, and a bunch of other nonsense similar to what Twitter’s social media experts say today on a regular basis.
It wasn’t long before an SEO backlink builder noticed this and saw the potential in all these random blogs that were appearing. In order to make their websites rank higher, they created automated robots (or bots) that went into blogs and left a comment with a link. This link (usually with anchor text) passed link juice from the blog page to the backlinker’s target webpage, giving the target page more rank and making it appear higher in the search results. NOTE: This is a legacy method that has long been detected by the search engines and the bloggers and is now practically useless unless done properly. If you’re interested in getting to know SEO 101, I recommend that you check out http://seomoz.com and read their starter guides.
This method and other similar ones made bloggers and other social websites create defense mechanisms to avoid these “intrusions”. This is where the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) image was broadly introduced. Early captchas were plain simple to read. They were made this way because OCRs weren’t widespread and were really hard to code. So you would basically find pictures with a white background and numbers or capital letters in plain Arial font as the only thing stopping people from taking action.
It didn’t take long for skilled programmers to take these CAPTCHAs and just destroy them. And then the fighting began. For the last 10 years there has been a constant ping-pong table duel between OCR decaptcha developers and CAPTCHA makers. –“Here is the latest unbreakable captcha”. –“Here’s the decaptcha for the former unbreakable captcha”. And things haven’t changed just yet. Now the reigning almost unbreakable CAPTCHA is the famous (or infamous if you hate it like I do) re-captcha.
Re-CAPTCHA was born as an open source project to digitize books. What it basically does is take text from scanned books, detect the words and create a two-word CAPTCHA. The catch here is that one of the words couldn’t be read by the Re-CAPTCHA Book OCR while the other word was read perfectly. The aim of Re-CAPTCHA is to have humans read and type in the word that was unreadable to the Re-CAPTCHA Book OCR, and to use the other word as the actual challenge that users need to solve correctly to pass. This is why sometimes you see captchas like this:
There are currently no working OCRs in the market for Re-CAPTCHA, but it will be only a matter of time until one comes out.
About Captcha OCRs
Captcha OCRs are sold as closed-source software built to handle only 1 type of captcha. They have simple input-output commands that, in their simplest form, take the path of the captcha image as input and return the decoded captcha text as output. If well engineered, captcha OCRs can return the text in less than 1 second and handle multi-threading.
Captcha OCRs can’t read every captcha that you throw at them. They have a success rate that depends on the complexity of the code and its flexibility to support the different character shapes of each captcha type. Just think of captcha OCRs as having the same reading ability as a 4 year old child: they struggle to recognize some letters and they get confused when distinguishing similar characters.
When using captcha OCRs you will be faced with two things:
- Read Rate: Ability to read the captcha characters.
- Correctness Rate: Ability to correctly recognize the captcha characters.
Still confused? Just think of the Read Rate as the child’s ability to recognize a word. When the child sees the word “DOG”, he or she recognizes a string of characters that makes up a word. The Correctness Rate is the child’s ability to recognize each letter of the word and not confuse it with any other letter in the alphabet. It’s the same with captcha OCRs: the Read Rate is the OCR’s ability to recognize that the image is actually a captcha it can read, and the Correctness Rate is the OCR’s ability to correctly recognize each character of the captcha text.
Let’s see an example:
If you have an OCR with a 50% Read Rate and a 50% Correctness Rate working to decode a medium difficulty captcha, then out of every 1000 CAPTCHAs (of the same type) only 500 will be read, and out of those 500 only 250 will have all their characters recognized correctly. NOTE: OCR developers may inflate their Correctness Rate by not considering the Read Rate in their specs. Always go thoroughly over all the specs of the OCR you’re going to purchase before you make any payments.
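The arithmetic in this example is just the two rates multiplied together. Here is a minimal sketch in Python; the function name `expected_solves` is my own invention for illustration and not part of any OCR product.

```python
def expected_solves(total_captchas, read_rate, correctness_rate):
    """Return (captchas read, captchas fully solved) for a batch.

    read_rate and correctness_rate are fractions between 0 and 1.
    """
    read = total_captchas * read_rate
    solved = read * correctness_rate
    return read, solved


# The example above: 50% Read Rate, 50% Correctness Rate, 1000 captchas.
read, solved = expected_solves(1000, 0.50, 0.50)
print(f"read: {read:.0f}, solved: {solved:.0f}")  # read: 500, solved: 250
```

The number that matters is the product of the two rates. A developer who quotes only the 50% Correctness Rate is leaving out that the OCR really solves 25% of what you feed it.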
Where to buy an OCR and how much does it cost?
OCRs are usually custom built by developers that can be found through freelance websites and forums. The cost of a captcha OCR varies according to its complexity. The last captcha OCR I bought read a very complex captcha with a 65% Read Rate and a 45% Correctness Rate and cost me USD $4,000. I bought this OCR because I had to decode almost 1M CAPTCHAs per day and the investment would pay for itself in a couple of months, but there’s always a 3-4 month wait before it gets implemented.
The price of a pre-built OCR can be slightly lower than a custom built one, but the existing pre-built OCRs in the market work mainly for deprecated captcha images.
About Human Decaptcha Services
With the constant battles being waged between captcha makers and OCR makers, there was a sudden decrease in OCR software availability and a higher demand for decaptcha. This is when Human Decaptcha Services came into play.
It all started with the sudden increase in outsourcing programming projects to third-world countries. The rent-an-Indian-programmer hype grew so much that more and more people living in developing countries got into the outsourcing business. Everyone from virtual assistants, designers, project managers, illustrators and copywriters to, eventually, data entry specialists joined the outsourcing industry. When data entry personnel came into play they were just digitizing books and company documents. But it didn’t take long before someone replaced books with captchas, and Human Decaptcha Services were born.
How do Human Decaptcha Services Work?
The process to upload a captcha to a decaptcha service is pretty straightforward:
- Grab the captcha image and save it locally.
- Upload the captcha image to the Decaptcha Service through their HTTP interface or API client.
- Wait for a text response.
- Grab the returned text and insert it into the captcha text box.
What’s behind that whole process is a group of hired data entry personnel who are constantly decoding every captcha thrown at them. Once you upload the captcha, it is assigned to a queue and eventually passed on to one of the hundreds of human decaptchers working for that service. Human Decaptcha Services usually have a response time of between 10 and 30 seconds and can have a correctness rate of around 94%.
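The four steps above can be sketched in a few lines of Python. This is only a rough outline, not any real service’s client: `upload` and `poll` are hypothetical callables standing in for whatever the service’s HTTP interface or API client actually gives you, and the 60 second timeout is my own assumption based on the 10 to 30 second response times.

```python
import time
from pathlib import Path


def solve_captcha(image_path, upload, poll, timeout=60.0, interval=2.0,
                  sleep=time.sleep):
    """Upload a locally saved captcha and wait for the decoded text.

    upload(image_bytes) -> ticket   and   poll(ticket) -> text or None
    are placeholders for whatever the service's API client provides.
    Returns the decoded text, or None if nothing came back in time.
    """
    image_bytes = Path(image_path).read_bytes()   # step 1: image saved locally
    ticket = upload(image_bytes)                  # step 2: upload to the service
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:            # step 3: wait for a response
        text = poll(ticket)
        if text is not None:
            return text                           # step 4: caller types this in
        sleep(interval)
    return None
```

Passing `upload` and `poll` in as arguments keeps the waiting logic the same no matter which service or client type you end up using.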
Human Decaptcha Services charge per 1,000 decaptchas. The prices range from $1 to around $8 per 1,000 decaptchas. When you make your payment, the money you put into your account is converted into captcha credits, and for every decaptcha processed your account is charged according to the Human Decaptcha Service’s rate. You also have the ability to report incorrect decaptchas, so you only pay for correct ones.
UPDATE: I’ve made a list of the best decaptcha services, take a look at the next post.
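To get a feel for the billing, here is a small sketch of the per-1,000 pricing. The function name and the batch numbers are made up for illustration, and it assumes every reported incorrect decaptcha is actually refunded, which may vary from service to service.

```python
def service_cost(submitted, reported_incorrect, price_per_1000):
    """Cost in dollars when reported wrong answers are not charged."""
    billable = submitted - reported_incorrect
    return billable / 1000 * price_per_1000


# Hypothetical batch: 10,000 captchas, 600 of them reported as incorrect.
for price in (1.0, 8.0):  # the two ends of the price range mentioned above
    cost = service_cost(10_000, 600, price)
    print(f"${price:.0f} per 1,000 -> ${cost:.2f}")
```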
Uploading a CAPTCHA to a Human Decaptcha Service
Human Decaptcha Services provide API clients in different languages. Some of the API clients use plain HTTP requests while others use a socket-based system. I personally recommend choosing socket-based API clients: in a test I ran in the past, I saw 20% faster speed with a socket-based API than with an HTTP request API.
Integrating the Decaptcha Service’s client varies according to the decaptcha service, API client type and API client language. But there are two things to always keep an eye on when using any of them:
- Watch out for the function to mark CAPTCHAs as incorrect. It’s easy to forget to implement this, but it’s the best tool for not paying for wrong decaptchas (remember that where there are humans, there are errors).
- Always integrate at least 2 Decaptcha Services. ALL OF THEM are prone to errors given the high volume of requests that they get per day.
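Both points can be combined in one small wrapper. The sketch below is only an illustration under my own assumptions: the `Service` class, its `solve` and `report_incorrect` fields and the `captcha_id` they exchange are hypothetical names, so map them onto whatever your chosen services’ API clients really expose.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Service:
    """Hypothetical wrapper around one decaptcha service's API client."""
    name: str
    solve: Callable[[bytes], Optional[Tuple[str, str]]]  # -> (captcha_id, text)
    report_incorrect: Callable[[str], None]              # takes the captcha_id


def solve_with_fallback(image: bytes, services: List[Service]):
    """Try each service in order; return (service, captcha_id, text) or None."""
    for service in services:
        try:
            result = service.solve(image)
        except Exception:
            continue  # service is down or overloaded, move on to the next one
        if result is not None:
            captcha_id, text = result
            return service, captcha_id, text
    return None


def solve_and_submit(image: bytes, services: List[Service],
                     site_accepts: Callable[[str], bool]) -> bool:
    """Solve a captcha, submit the answer, and report it if it was wrong."""
    solved = solve_with_fallback(image, services)
    if solved is None:
        return False
    service, captcha_id, text = solved
    if site_accepts(text):
        return True
    service.report_incorrect(captcha_id)  # don't pay for a wrong answer
    return False
```

Here `site_accepts` stands for your own code that types the text into the captcha box and checks whether the target page let you through. The wrong answer is reported to the same service that produced it, since that is the account that was charged.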
For Non Programmers:
Hire a freelancer for your custom script. That’s as easy as it gets 🙂 . You can hire freelancers with experience in these types of integrations for under $200. For this I recommend oDesk and Freelancer.com. NOTE: If you’ve never used these types of websites before, I advise caution before you hire anyone. Make sure you hire someone with good referrals and investigate bids on similar projects before you choose the winning bidder.
OCR vs Decaptcha Services
| | OCR | Human Decaptcha Service |
|---|---|---|
| Initial Cost | VERY HIGH: If you don’t have at least $3000 in your pocket, don’t even look into this option. | VERY LOW: You can start using a Decaptcha Service from $7 (Deathbycaptcha.com). |
| Speed | VERY FAST: An OCR responds in less than 1 second if it’s properly made. | SLOW: Decaptcha Services are operated by humans, so the decaptchas take between 10 and 30 seconds. |
| Correctness Rate | LOW: OCRs have a solve rate of around 40% for difficult CAPTCHAs. | HIGH: Decaptcha Services have a correctness rate of over 90%. |
| Setup Time | VERY LONG: You will need to wait 2-3 months to get the OCR coded. | SHORT: Integrating the Decaptcha Service with your script can be done in a matter of hours. |
| Risk of Investment | BIG GAMBLE: I’ve both lost and saved thousands of dollars with OCRs. You’re always one simple change away from the OCR becoming unusable, so it’s always a risk to buy. | MANAGEABLE RISK: Since the initial investment is low, you can see if it’s profitable for you to keep on paying for the service. |
Remember the OCR that cost me $4000? It became useless a week later due to some minor changes in the captcha. That’s why I suggest that you only buy an OCR as a high risk investment: it comes with the opportunity to save hundreds of dollars, but also with the chance of losing your entire investment, as I’ve experienced twice.
I always advise everyone to go for Decaptcha Services first and then, according to their experience, decide if it’s profitable to buy an OCR.
Phew, long article 🙂 … next I’ll write about my Decaptcha Services Top Picks.