What Mick Jagger Taught Me about Data Extraction from Tables

Anusha Venkatesh
IDP Evangelist

If you try sometime, you might find you get what you need.

You can’t always get what you want.

You can’t always get what you want.

You can’t always get what you want.

But, if you try sometime, you might find…

You get what you need.

Last night, I went to a reception. And, when I saw this woman with a glass of wine in her hand, I immediately thought of how wrong The Rolling Stones were.

Specifically, I considered how wrong they were when it comes to extracting data from tables.

See, you CAN always get what you want...AND what you need. Faster.

Even if it doesn’t seem like you can.

Even if you have data that doesn’t seem consumable.

But, first...you have to truly understand the problem you’re up against.

Let’s face it. Extracting information from tables to feed an automation process is as complex as it is difficult.

When even human readers struggle to understand the information presented in tables... when most are challenged the moment a mental operation (like adding) is needed to grasp all the necessary information... it may seem like Mick and friends were right.

“You can’t always get what you want” is further validated when we recognize that automating the information extraction process from tables is a challenge few have fully conquered.

Why? It’s because when it comes to tables:

  • Data is difficult to extract.
  • The extracted data is rarely consumable.
  • Information is not uniform throughout the table.
  • The extracted information can be missing or invalid.
  • Contextual information is usually lost.

As a result, tables almost always get the manual workflow treatment. That’s you trying sometimes: pulling workers away from more valuable work to extract the data by hand, so you get what you need.

But, what if there was a way to get both?

What if you could always get what you want when it comes to extracting all that valuable data trapped in tables?

The way to address any challenge is not just to identify the problem but truly understand it.

Try our table extraction demo

So let’s dive in on how to extract data from tables, then clean and transform that data so it can be consumed by an automation system or an ML platform.

That way, you’ll get what you want—the speed and cost-savings of automation—and what you need—the extracted gold trapped in your tables.
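To make the "clean and transform" part a little more concrete, here is a minimal Python sketch. It is an illustration, not a description of any particular product. It assumes the table has already been extracted into a grid of strings whose first row is the header. The function names `rows_to_records` and `missing_fields` and the sample data are made up for this example.

```python
import json
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def rows_to_records(grid: List[List[str]]) -> List[Record]:
    """Turn a grid whose first row is the header into a list of records."""
    header = [name.strip() for name in grid[0]]
    records = []
    for row in grid[1:]:
        # Pad short rows so every record has every column.
        # (Cells beyond the header width are ignored in this sketch.)
        padded = list(row) + [""] * (len(header) - len(row))
        record = {}
        for name, value in zip(header, padded):
            cleaned = value.strip()
            record[name] = cleaned if cleaned else None  # None marks a missing cell
        records.append(record)
    return records


def missing_fields(records: List[Record]) -> List[Tuple[int, str]]:
    """Return (record_index, column_name) pairs for every missing cell."""
    return [
        (index, name)
        for index, record in enumerate(records)
        for name, value in record.items()
        if value is None
    ]


# Hypothetical extracted table: stray spaces, an empty cell, and a short row.
grid = [
    ["Item", "Qty", "Unit price"],
    [" Widget ", "4", "2.50"],
    ["Gasket", "", "0.75"],
    ["Bolt", "10"],
]

records = rows_to_records(grid)
print(json.dumps(records, indent=2))  # JSON an automation system could consume
print(missing_fields(records))        # [(1, 'Qty'), (2, 'Unit price')]
```

Flagging the gaps instead of guessing at them keeps the missing or invalid values visible to whoever reviews the data downstream.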

What is a table?

I know. Sounds silly, right? That is until you consider that there’s no standard for what makes a table...no universal definition.

Tables are an intuitive and universal way of presenting large sets of data, findings, and information.  

What’s more? A table is more than its data.

Tables contain a variety of data and information (e.g. words, digits, formulas, or images) and are embedded in a variety of document types (e.g. plain text, image, handwritten, or web pages).

But wait, there’s more!

We can’t forget relationships and context.

And that’s where tables are unique.

A table displays multi-dimensional information using a two-dimensional, linear format. It’s a set of data and data context presented in a non-standard format.

Because of this unique feature—displaying not only sets of information but their relationships and context—tables present a challenge for data extraction. And, that challenge expands when you consider there’s no universal format for a table: You might have three suppliers send you three different tables in three different formats...all trying to measure and show the same thing.

The Challenges of Tables

So let’s call out all the challenges tables present. That way, we can see how they may be overcome. That way we can see if we CAN get what we want, as well as what we need.

Challenge 1: No standard structural layouts or visual relationships

The structure of a table is determined by the structure and the relationships of its cells.

Tables capture multi-dimensional information using a two-dimensional, linear format, and there is no standard table layout. Layouts differ in, for example:

  • How lines are used
  • How formatted (e.g., bold or italicized) text is used
  • Header location: Headers can be in two places—the top row or the first column
  • Use of borders: A table may or may not have a border, and a borderless table is harder to locate
  • Variation in styles: separators for cells, rows, and columns
  • Nested tables: one table inside another
  • Cells spanning across columns and rows show hierarchical grouping of data
  • Word wrap and column merge allow content to span across multiple lines and cells
  • Multi-page tables that are used for long data displays. In some cases, headers repeat
  • Tables that float and have text around them
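Spanning cells are a good example of why layout matters. The sketch below, using only Python's standard library, reads a small HTML table and copies each spanning cell into every grid position it covers, so that every data cell lines up under its full set of headers. It assumes a single, non-nested table with well-formed markup. The class and function names and the sample table are hypothetical.

```python
from html.parser import HTMLParser


class SpanAwareTableParser(HTMLParser):
    """Collect the cells of one simple (non-nested) HTML table with their spans."""

    def __init__(self):
        super().__init__()
        self.rows = []       # each row is a list of (text, rowspan, colspan)
        self._cell = None    # text fragments of the cell being read, or None
        self._spans = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:            # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            a = dict(attrs)
            self._spans = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.rows[-1].append((text, *self._spans))
            self._cell = None


def expand_spans(rows):
    """Copy each spanning cell into every grid position it covers."""
    filled = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in filled:      # skip positions taken by a span from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


html = """
<table>
  <tr><th rowspan="2">Region</th><th colspan="2">Sales</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>North</td><td>10</td><td>12</td></tr>
</table>
"""

parser = SpanAwareTableParser()
parser.feed(html)
grid = expand_spans(parser.rows)
for row in grid:
    print(row)
```

After expansion, the value 12 sits under both "Sales" and "Q2", which is the hierarchical grouping a human reads off the merged header at a glance.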

Challenge 2: Data visualization for humans…only

Formatting is for visual consumption by humans, not technology. Often, table design is poor. Consider the header rows that aren’t explicit, but rather implied based on the table or the supporting text.

Challenge 3: Cell content that uses various formats (letters, numbers, symbols, etc.)

When cell values are presented using different syntactic representation patterns—like symbols, images, text, abbreviations, or mathematical notations—extraction requires knowledge of all possible presentation patterns.
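Here is a small Python sketch of what "knowing the presentation patterns" can look like for numeric cells. It handles only a few patterns: currency symbols, thousands separators, accounting-style negatives in parentheses, percentages, and placeholder text. It assumes US-style numbers, with a comma for thousands and a period for decimals. The `PLACEHOLDERS` set and the function name `parse_number` are illustrative assumptions, not a complete catalogue.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical markers that mean "no value" in a cell.
PLACEHOLDERS = {"", "-", "–", "n/a", "na", "none"}


def parse_number(cell: str) -> Optional[Decimal]:
    """Parse one table cell into a Decimal, or return None if it holds no number."""
    text = cell.strip()
    if text.lower() in PLACEHOLDERS:
        return None

    # Accounting style: (45) means -45.
    negative = text.startswith("(") and text.endswith(")")
    percent = "%" in text

    # Keep only digits, the decimal point, and a minus sign.
    digits = re.sub(r"[^0-9.\-]", "", text)
    try:
        value = Decimal(digits)
    except InvalidOperation:
        return None  # e.g. plain text, or a range like "12-14"

    if negative:
        value = -value
    if percent:
        value = value / 100
    return value


for cell in ["$1,234.50", "(45)", "12%", "n/a", "TBD"]:
    print(repr(cell), "->", parse_number(cell))
```

Every new supplier format tends to add another pattern to a function like this, which is exactly why the challenge grows with the number of table sources.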

Challenge 4: Multiple languages in cell content

The table and its cells may use different languages or domain-specific jargon.

Challenge 5: Cell content varies in density and formats

The content of the cells can be numbers or text. But what happens when cell content is dense, containing ambiguous, short chunks of text with the use of acronyms and abbreviations? To decode tables, text must be made clearer, with abbreviations and acronyms fully defined.
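One simple way to picture that decoding step is a glossary lookup. The Python sketch below expands abbreviations in a header or cell using a small dictionary. The `GLOSSARY` entries and the function name `expand_abbreviations` are hypothetical. A real glossary would be specific to your domain and would need to deal with ambiguous entries (here, "no" is assumed to always mean "number").

```python
import re

# Hypothetical, domain-specific glossary: abbreviation -> full term.
GLOSSARY = {
    "qty": "quantity",
    "amt": "amount",
    "inv": "invoice",
    "no": "number",
    "po": "purchase order",
}


def expand_abbreviations(text, glossary=GLOSSARY):
    """Replace each known abbreviation (with or without a trailing period)."""

    def replace(match):
        word = match.group(0)
        key = word.lower().rstrip(".")
        return glossary.get(key, word)  # unknown words pass through unchanged

    return re.sub(r"[A-Za-z]+\.?", replace, text)


for header in ["Inv. No.", "PO Amt", "Qty shipped", "Total due"]:
    print(header, "->", expand_abbreviations(header))
```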

Challenge 6: Document types may vary

A document and table can be in a PDF, text, image, HTML, or another format. Some formats are more challenging than others. For example, a PDF typically stores text and lines as positioned marks on a page, with no internal representation of a table structure, which makes it difficult to extract tables for analysis.
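To see why the lack of table structure hurts, imagine that all you get from a page is a bag of words with coordinates. The Python sketch below rebuilds rows by grouping words whose vertical positions are close together, then ordering each row left to right. The input format, the `y_tolerance` value, and the sample coordinates are assumptions for illustration, and y is assumed to grow down the page. Real documents also need column detection, which this sketch does not attempt.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (x, y, text): left edge, vertical position, content


def group_into_rows(words: List[Word], y_tolerance: float = 3.0) -> List[List[str]]:
    """Group positioned words into rows, then order each row left to right."""
    rows = []  # each entry is [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))   # close enough: same row
        else:
            rows.append([y, [(x, text)]])   # start a new row
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical words as they might come back from a page, in no useful order.
words = [
    (200.0, 100.5, "Qty"),
    (50.0, 100.0, "Item"),
    (300.0, 99.8, "Price"),
    (50.0, 120.0, "Widget"),
    (200.0, 120.4, "4"),
    (300.0, 119.9, "2.50"),
]

for row in group_into_rows(words):
    print(row)
```

In HTML, the rows and cells would already be spelled out in the markup. Here they have to be inferred from geometry, and a wrong tolerance is enough to merge or split rows.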

How to Get What You Want...AND What You Need

This list may not be all-inclusive, but it’s a good start. When it comes to data extraction, tables are tough.

When you want to free your workforce to focus on more value-based action (and when DON’T you want that?), you want to be able to enable the automation and ML processes to take the action instead.

To do this takes just four steps.

Once you understand the challenges tables present...and how to overcome them, getting your prescription filled is a whole lot better than standing in line at the Chelsea drugstore with that Mr. Jimmy character.

Want to get what you want AND what you need? Let's Chat!

Start a conversation to get what you want and learn a lot more about what it takes to overcome tabular data extraction challenges.

Frequently asked questions

What does your pricing model look like?

We price based on the annual volume of pages and the complexity of the document type. We can get you preliminary pricing once we have outlined a solution. Let's do this.

To know more, book a 15-min session with an IDP expert

How can I try Infrrd before I commit to a full deployment?

Sure. The first step is to schedule a guided demo where you get to jump into the thick of it. After you explore our solution, you can try a proof of concept. When you're ready, you can deploy the system to one use case. Then more use cases. Then across your enterprise.

How does your system integrate with others in my enterprise?

We play nice. Our solutions are API-based. Your documents are fed into the solution using APIs. And extracted data is sent out through APIs. We use REST APIs.

Does your solution run in the cloud or on premise?

Our solution is cloud-native but is also designed for on-premises deployments. Your choice on how you want to deploy it.

Does Infrrd run on mobile or desktop devices?

Glad you asked. Our data extraction process runs on servers. We have found that performance and accuracy decline when running on a desktop or mobile device. (Remember, Infrrd is running a powerful AI stack.)

Does your system work out of the box or does it require training?

Common documents and use cases work out of the box.  The cool thing is your solution will improve as the system learns from your documents upfront and over time.

How does your solution handle corrections?

Did you know no system is 100% accurate all the time? When extraction errors occur, you want to correct them. We provide a simple UI that your business analyst will use to make corrections.

Does your solution work with handwriting?

Our solution excels at data extraction from handwriting.  We've got proprietary methods and techniques that do the trick.  It's pretty cool.  See for yourself.
