Understanding IDP: Data Validation and Feedback Loop
by Sujith Parakkunnath, on December 9, 2021 10:34:28 AM PST
According to Gartner, "The market for document capture, extraction, and processing is highly fragmented. Data and analytics leaders should use this research to understand the process flow and differentiated capabilities offered by intelligent document processing solutions". Gartner's recently released "Infographic: Understand Intelligent Document Processing" covers these 6 critical flows in IDP.
1. Capture or Ingestion
2. Document Preprocessing
3. Document Classification
4. Data Extraction
5. Validation and Feedback Loop
6. Data Integration
Source: Gartner, Infographic: Understand Intelligent Document Processing, Shubhangi Vashisth et al., 22 September 2021
This is the fourth post in the series exploring Data Validation and Feedback Loop.
When it comes to IDP systems, one of the key evaluation parameters is the accuracy they offer. Rather than depending only on the quality of the extraction process, IDP systems tap into external signals to improve accuracy. Data validation against an external source is one of many such signals.
When you think of these signals, try to draw a parallel to how modern-day GPS location systems work. You may know that GPS systems measure the distance of the subject from three or more satellites and apply a technique called trilateration (often loosely called triangulation) to find the point where those distances intersect. It is impossible to accurately pinpoint the location of the subject with a signal from just one satellite.
To relate to this problem, stick out your arm, raise a finger, and close one eye. You will notice that with one eye closed, you lose the sense of distance. You cannot really tell how far away your finger is. Getting visual signals from both eyes gives you true depth perception. Similarly, GPS systems use three or more signals to accurately place the subject's location. Opening an IDP conversation with satellites is quite a stretch, but the point to note here is that more signals lead to higher accuracy. In the same way, data validation and feedback loops are techniques used by modern IDP systems to improve accuracy and thereby mature faster. An efficient data validation system can lift your IDP accuracy by 15 to 20%. Let's see how.
If IDP is the best option to automate data processing, what does data validation add to it? Data validation, as the name suggests, is the process of checking the extracted data on two counts: whether the right data is being extracted, and whether the extracted data itself is accurate. A typical use case for data validation is exception handling, such as weeding out documents that are out of scope. For example, you may have a list of vendors and want to extract only documents from those vendors, or a receipt may be mixed in among the invoices you are processing and need to be disregarded. If you experience these or similar cases, then you need data validation.
Let us look at a scenario for data validation. Imagine you are extracting information from a loan document. Borrowers have availed loans from different banks, and you want to check each lender against the list of approved lenders or banks in your system so that you can differentiate between approved and unapproved lenders. In this case, you implement data validation techniques where the IDP system usually connects to the third-party database through APIs, or to a set of data in the IDP vendor's cloud system that is synced daily or periodically from the third-party database. Let me simplify this. You are extracting a loan document where the borrower has availed a loan from Bank of America, and Bank of America is one of your approved lenders. With data validation, you can attach an identifier to it, for example by listing the lender as a lien-holder in the extraction results.
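The lender check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Infrrd's implementation: the `APPROVED_LENDERS` set, the `normalize` helper, and the `is_approved_lender` flag are hypothetical names, and a real system would load the list from the third-party database or its synced copy.

```python
import re

# Hypothetical approved-lender list; in practice this would be fetched
# through an API or read from a periodically synced copy.
APPROVED_LENDERS = {"bank of america", "example savings bank"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def validate_lender(extracted_name: str) -> dict:
    """Return the extracted lender with a flag marking approved lenders."""
    approved = normalize(extracted_name) in APPROVED_LENDERS
    return {"lender": extracted_name, "is_approved_lender": approved}


print(validate_lender("Bank of America"))
print(validate_lender("Unknown Credit Union"))
```

Normalizing before the lookup keeps small extraction differences, such as stray punctuation or extra spaces, from turning an approved lender into a false mismatch.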
Data validation is one of the key factors that bring a sharp increase in extraction accuracy, which means your IDP models mature much sooner. Let me give you a ballpark figure. After analyzing the extraction results of our customers for the past few months, we have observed that Infrrd's data validation algorithms immediately raise accuracy by around 10 percentage points. It means if the IDP system was providing 80% accuracy without data validation, it may give 90% accuracy or more with data validation.
There are different types of validation. The most common ones are:
Pattern-based validation: Here, the data is validated based on patterns. For example, the vehicle identification number (VIN), which is a unique identifier for a car, is a combination of digits and capital letters and usually constitutes 17 characters. This number has a pattern: the first 3 characters represent the manufacturer, characters 4 to 8 may be alphanumeric and describe the vehicle, and so on. VINs also never contain the letters I, O, or Q, precisely because they are easily confused with digits. In this case, pattern-based data validation detects and corrects extraction errors in the VIN, including tricky ones, such as the number 1 and the capital letter I getting interchanged.
Dictionary-based validation: This is done against a set of data in the system. For example, you can verify that the extracted invoice approver name matches the name of the approver in the IDP system. The same approach applies to any field with a known set of values: with a list of valid currency codes, for instance, dictionary-based validation detects and corrects the currency code.
Context-based validation: This is done where the same value is relevant in two contexts. For example, you are extracting an insurance document in which the collision deductible and the comprehensive deductible always have the same value, say 500. In such cases, the ML models may confuse the two contexts because the values are identical and may learn incorrectly, which may eventually cause a dip in accuracy. So, to tell apart different contexts that carry similar or identical values, context-based validation is the way forward.
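To make the pattern-based case concrete, here is a minimal Python sketch of VIN validation. It assumes that an I or O in the OCR output is a misread 1 or 0, which is reasonable because VINs exclude those letters; the function name `correct_vin` is hypothetical, and a production system would likely add further checks.

```python
import re
from typing import Optional

# VINs use digits and capital letters but never I, O, or Q.
# Assumption: an I or O in the OCR output is a misread 1 or 0.
CONFUSABLES = str.maketrans({"I": "1", "O": "0"})
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")


def correct_vin(raw: str) -> Optional[str]:
    """Return a cleaned 17-character VIN, or None if it still fails the pattern."""
    # Strip spaces and hyphens, uppercase, then fix the confusable letters.
    candidate = re.sub(r"[\s-]", "", raw.upper()).translate(CONFUSABLES)
    return candidate if VIN_PATTERN.fullmatch(candidate) else None


print(correct_vin("IM8GDM9AXKPO42788"))  # 1M8GDM9AXKP042788
print(correct_vin("1M8GDM9AXKP04278"))   # None (only 16 characters)
```

Values that come back as `None` cannot be repaired by the pattern alone and would be good candidates for human review.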
So, how do you implement data validation in IDP solutions? One of the key strategies is configuring business rules.
Modern IDP solutions mostly validate extracted data using business rules. Let us say you have an expense management system to process invoices. You are extracting relevant information from these invoices using an OCR system. In the initial stages, the extraction accuracy is not expected to be high. However, you have an agreement with your IDP vendor that an expected level of accuracy can be achieved in a specific timeframe. Now, how do you regularly measure the improvements in accuracy? You can do this by configuring business rules.
Business rules can be configured in an IDP solution in two ways, either through customization from the backend or through the user interface. In modern IDP solutions, business rules are a high-value offering in the user interface, where you can configure them based on your requirements.
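As a rough idea of what configured business rules might look like, the Python sketch below checks invoices against two rules and reports the share that pass, which can be tracked over time as a proxy for accuracy. The rules, the field names (`subtotal`, `tax`, `total`), and the `Rule` class are all assumptions made for illustration; they are not taken from any specific IDP product.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    check: Callable[[Dict[str, Decimal]], bool]  # True means the rule passes


# Hypothetical rules for an invoice; the field names are assumptions.
RULES: List[Rule] = [
    Rule("total_is_positive", lambda inv: inv["total"] > 0),
    Rule("total_matches_parts", lambda inv: inv["subtotal"] + inv["tax"] == inv["total"]),
]


def failed_rules(invoice: Dict[str, Decimal]) -> List[str]:
    """Return the names of the rules this invoice violates."""
    return [rule.name for rule in RULES if not rule.check(invoice)]


def pass_rate(invoices: List[Dict[str, Decimal]]) -> float:
    """Share of invoices that pass every rule."""
    if not invoices:
        return 0.0
    passed = sum(1 for inv in invoices if not failed_rules(inv))
    return passed / len(invoices)


good = {"subtotal": Decimal("100.00"), "tax": Decimal("8.25"), "total": Decimal("108.25")}
bad = {"subtotal": Decimal("100.00"), "tax": Decimal("8.25"), "total": Decimal("180.25")}
print(failed_rules(bad))       # ['total_matches_parts']
print(pass_rate([good, bad]))  # 0.5
```

Keeping rules as data rather than hard-coded logic is what makes it possible to expose them through a user interface later.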
Automated Accuracy Improvement
Any correction performed by your data entry or correction user acts as input to the system. Modern ML-based IDP systems automatically learn from these corrections so that the accuracy of future extractions improves. The feedback loop brings the best results when corrections are integrated with extraction.
When you extract data, human-in-the-loop (HITL) plays the role of correcting the data that are extracted with low confidence. IDP solutions assign a confidence score while extracting data at a granular level, usually at the field level. So, each field that is automatically extracted has a confidence score assigned to it. You can decide the fields that need correction based on the confidence score.
Let us take an example. You are extracting the invoice number, merchant name, merchant address, and total amount from an invoice. In this case, you set a high confidence threshold for critical fields, such as the invoice number. If the invoice number is not extracted with a confidence score at or above that threshold, it is served to a human for correction.
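The field-level check in this example can be sketched as follows. The threshold values, the field names, and the sample scores are made up for illustration; real systems choose thresholds based on how critical each field is to the business.

```python
from typing import Dict, List, Tuple

# Hypothetical per-field thresholds; critical fields get a stricter bar.
THRESHOLDS: Dict[str, float] = {
    "invoice_number": 0.95,
    "total_amount": 0.90,
}
DEFAULT_THRESHOLD = 0.80


def fields_needing_review(extraction: Dict[str, Tuple[str, float]]) -> List[str]:
    """Return the fields whose confidence score falls below their threshold."""
    return [
        name
        for name, (_value, confidence) in extraction.items()
        if confidence < THRESHOLDS.get(name, DEFAULT_THRESHOLD)
    ]


# Each field maps to (extracted value, confidence score).
extraction = {
    "invoice_number": ("INV-1042", 0.91),
    "merchant_name": ("Acme Supplies", 0.88),
    "merchant_address": ("12 Main St", 0.83),
    "total_amount": ("108.25", 0.97),
}
print(fields_needing_review(extraction))  # ['invoice_number']
```

Here the invoice number is flagged even though its score is higher than the merchant name's, because its threshold is stricter.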
Some companies outsource corrections to manage costs. However, the chances are that they incur higher costs in the long run. Let us say you have an OCR system to extract data, but corrections are outsourced to a BPO team because it is cheaper or more convenient than employing data entry or correction users. What you miss here is an IDP system that matures over the long term, because the corrections never flow back into it, and such a system could drastically reduce the correction effort in the future.
Infrrd's IDP solution has an integrated dashboard to perform corrections where the feedback loop is automated. There are patent-pending capabilities Infrrd offers to ensure efficient and intelligent analysis of data before triggering a feedback loop.
After Infrrd's IDP automatically extracts the data, two things can happen based on the maturity of the models: either a document goes through Straight Through Processing, or it is served for correction. If some fields are extracted with low confidence, the corresponding documents are sent to queues for correction by a data entry user.
The queues are configured based on the confidence score assigned by the system during extraction.
The corrections performed by the data entry user act as feedback for the system to learn, and this ensures improved accuracy in future extractions.
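The flow described above, from extraction to either Straight Through Processing or a correction queue, and from corrections back into the system, might be sketched like this. It is a simplified illustration and not a description of Infrrd's product: the queue names, the two thresholds, and the `FeedbackStore` class are hypothetical, and how the stored corrections are used for retraining is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeedbackStore:
    """Collects corrections so they can later be used as learning signals."""
    examples: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, field_name: str, extracted: str, corrected: str) -> None:
        # Skip fields the user left unchanged.
        if extracted != corrected:
            self.examples.append((field_name, extracted, corrected))


def route(confidences: Dict[str, float], stp_threshold: float = 0.9,
          review_threshold: float = 0.6) -> str:
    """Pick a destination from the lowest field confidence in the document."""
    if not confidences:
        return "full_correction_queue"
    lowest = min(confidences.values())
    if lowest >= stp_threshold:
        return "straight_through"
    if lowest >= review_threshold:
        return "quick_review_queue"
    return "full_correction_queue"


store = FeedbackStore()
print(route({"invoice_number": 0.97, "total_amount": 0.93}))  # straight_through
print(route({"invoice_number": 0.72, "total_amount": 0.93}))  # quick_review_queue
store.record("invoice_number", "INV-1O42", "INV-1042")
print(store.examples)
```

Routing on the lowest field confidence is one simple policy; a weighted score across fields would be an equally valid design choice.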
There you go. Ensure that you choose a futuristic IDP solution to stay competitive. It means choosing an IDP solution that offers excellent extraction and classification features and has excellent data validation and feedback loop capabilities to manage variations and inaccuracies efficiently.
Here is a table that depicts the industry-relevant data validation and feedback loop features and Infrrd's capabilities:
Business Rules Through Configuration
Self Service Business Rules: On The Roadmap
Automated Accuracy Improvements
In our next post, we explore Gartner's description of Integration and how Infrrd stacks up.