OCR Engine: How To Convert a Good Start Into a Great Finish
by Amit Jnagal, on June 14, 2017 10:45:00 AM PDT
In my past life, when I used to consult with companies that struggled with software performance challenges like apps that crashed, hung, or slowed down to a crawl, I saw a consistent pattern in the kind of applications that ran into performance issues.
Most of them used a heavy mix of frameworks - codebases that were not developed by their team and were more or less black boxes. And almost every team had the same reason for using frameworks - they gave you a head start and a fundamental approach for building a solution.
They gave everyone a good start.
But in my experience, frameworks have almost always failed to give a great finish, especially when it came to performance. Most teams would get to 60-70% of the solution very rapidly, but the framework would bring them to their knees when they tried to fine-tune that last 10% through customization or configuration.
Statements like "Well, you can't do that in this framework" or "No, it does not work like that" have made several teams compromise on design in ways that eventually came back to haunt them as performance challenges.
Having spent a few years helping enterprises make sense of data and images, I have started spotting the same trend in solutions that are built using OCR engines - they give you a good start but fail to give you a good finish.
Anyone can get to a level of 60-70% accuracy when it comes to the quality of OCR extraction - but the nuances of the last 30-40% of data accuracy bring you to your knees.
From my experience, the best OCR solution is usually not just an OCR solution but an OCR solution with solid analytics and machine learning capabilities added to it. These additions help you compensate for the shortcomings of the raw extraction. Over time, they can eliminate almost all manual intervention and deliver much more accurate extraction results.
Machine learning on top of OCR solutions works with one of two training approaches, depending on the specific extraction case:
- Domain-based extraction - This approach helps when you know beforehand what kind of data extraction you are after. Let’s say you were trying to extract features of wines from a set of wine ratings and notes that you have OCR-ed. Before you can do the feature extraction, you may consider running data modeling algorithms on a large collection of existing wine notes to figure out trends and topics. Once you build a learning model, you can then deploy it on top of the OCR-extracted data. This will not only help you extract features but also help in automatically fixing the OCR output - text that was read incorrectly by the OCR engine.
- Data-based extraction - If your extraction case is generic and you are unlikely to know in advance what kind of data you will need to extract, domain-based extraction does not work. The data could be an invoice or a scanned page of a book. In this case, an unsupervised learning algorithm can be run over a large volume of data. The system would need to use a number of signals, such as the source of the data, the words in the OCR data, meta tags on the file, geographical location, etc., to first make a best guess at categorizing the data into one of many buckets, one per domain. Extraction models can then be built on top of each of these buckets to improve accuracy.
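To make the two approaches a little more concrete, here is a minimal sketch in Python. It is a deliberate simplification and not a real learning model: a word list built from existing reference texts stands in for the learned domain model, a word-overlap count stands in for the unsupervised categorization step, and the standard library's `difflib.get_close_matches` does the fixing of misread words. The names `REFERENCE_NOTES`, `build_vocabulary`, `guess_bucket`, and `correct_text`, the sample texts, and the `0.75` cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical reference texts, one small collection per bucket. In practice
# these would be large collections of existing wine notes, invoices, etc.
REFERENCE_NOTES = {
    "wine": [
        "full bodied red with notes of cherry and oak",
        "crisp white wine with citrus aroma and mineral finish",
    ],
    "invoice": [
        "invoice number and total amount due on receipt",
        "payment terms net thirty days from invoice date",
    ],
}


def build_vocabulary(notes):
    """Collect the set of known words from a collection of reference texts."""
    vocabulary = set()
    for note in notes:
        vocabulary.update(note.lower().split())
    return vocabulary


# One vocabulary per bucket; a stand-in for a per-domain extraction model.
VOCABULARIES = {
    name: build_vocabulary(notes) for name, notes in REFERENCE_NOTES.items()
}


def guess_bucket(ocr_text):
    """Best guess: pick the bucket whose vocabulary shares the most words
    with the OCR output (only one of the signals the article mentions)."""
    tokens = set(ocr_text.lower().split())
    return max(
        VOCABULARIES,
        key=lambda name: len(tokens.intersection(VOCABULARIES[name])),
    )


def correct_text(ocr_text, vocabulary, cutoff=0.75):
    """Replace words missing from the vocabulary with their closest known
    word, leaving them untouched when nothing is similar enough."""
    known_words = sorted(vocabulary)
    corrected = []
    for token in ocr_text.lower().split():
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, known_words, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


if __name__ == "__main__":
    raw = "crisp whlte wine with c1trus aroma"  # simulated OCR misreads
    bucket = guess_bucket(raw)
    print(bucket)                                    # wine
    print(correct_text(raw, VOCABULARIES[bucket]))   # crisp white wine with citrus aroma
```

The design point is the ordering: the text is first routed to a bucket, and only then corrected against that bucket's vocabulary, so a misread word is matched against words that are plausible in its domain. A production system would replace both stand-ins with trained models and use more signals than word overlap.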
Whether it's a marathon run or building complex systems like OCR extraction, getting a good start is easy but it is getting a good finish that makes all the difference.