Learning from similarity and information extraction from structured documents

It is a human tendency to formulate assumptions when analyzing the difficulty of information extraction from documents. We automatically assume it is easier to extract information in the form of named entities from a set of similar documents. Nonetheless, similar-looking documents have a distinct set of problems: the named entities in these document types vary in size, i.e., in the number of characters, words, height, width, and location. These variations cannot be handled using heuristics or pre-trained language models. When we have exhausted all the modeling options, we look for a new direction. The methods presented here explore deep learning techniques to improve information extraction results [4]. To evaluate the techniques, a dataset with more than 25,000 documents has been compiled, anonymized, and published. It is already known that convolutions, graph convolutions, and self-attention can work together and exploit all the information present in a structured document. Here, we examine siamese networks, concepts of similarity, one-shot learning, and context/memory awareness as deep learning techniques. Information extraction is not a new problem. It starts with a collection of texts, then transforms these into information that is more readily digested and analyzed: it isolates relevant text fragments, extracts pertinent information from the fragments, and then pieces the targeted information together in a coherent framework [1]. The relevant collection of texts for this study is the content of business documents such as invoices, pro forma invoices, and debit notes.
Figure: Example of an invoice and an extraction system together with its output.
A classical heuristic to improve a target metric is to provide more relevant information to the network. This idea is fundamental even for simpler templating techniques, as the problem cannot necessarily be solved using templates alone. The research question is whether a similarity-based mechanism, in various model implementations, can improve an existing solution. As background, we have assessed that none of these simpler methods is well-suited to structured documents (like invoices), since such documents generally have no fixed layout, language, caption set, delimiters, or fonts. For example, invoices vary across countries, companies, and departments and change over time. To retrieve any information from a structured document, it must be understood.
One-shot learning and similarity

In one-shot learning, we are usually able to correctly identify classes by comparing an input with already-known data. One-shot learning works well when the concept of similarity is utilized. For similarity to work, two types of data must be recognized – unknown and known. For the known data, target values are available to the method and/or the model. To classify an unknown input, the usual practice is to assign it the same class as the most similar known input. In this work, a siamese network provides the similarity, meaning the retrieval of similar documents to be compared; this is performed using a nearest neighbor search in the embedding space. The loss used for similarity learning is called triplet loss because it is applied to a triplet of samples (A anchor/reference, P positive, N negative) for each data point:
L(A, P, N) = max(||f(A) − f(P)||² − ||f(A) − f(N)||² + α, 0)
where α is the margin between positive and negative pairs and f is the model function mapping inputs to the embedding space; ||·|| denotes the Euclidean norm.
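As a minimal sketch in TensorFlow (the framework the models here are implemented in), assuming the anchor, positive, and negative batches have already been embedded by the shared model f; the margin value is illustrative:

```python
import tensorflow as tf

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on embedded anchor/positive/negative batches.

    f_a, f_p, f_n: (batch, embedding_dim) outputs of the shared model f.
    alpha: the margin hyperparameter (0.2 is an illustrative choice).
    """
    pos_dist = tf.reduce_sum(tf.square(f_a - f_p), axis=-1)  # ||f(A)-f(P)||²
    neg_dist = tf.reduce_sum(tf.square(f_a - f_n), axis=-1)  # ||f(A)-f(N)||²
    return tf.maximum(pos_dist - neg_dist + alpha, 0.0)      # hinge at zero
```

Minimizing this loss pulls the positive sample toward the anchor and pushes the negative sample away until it is at least α farther than the positive one.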
Methodology

The main unit of our scope is every individual word on every individual page of each document. For the scope of this work, we define a word as a text segment separated from the rest of the text by (at least) a white space, and we will not consider any other text segmentation.

Inputs and outputs: Conceptually, a document's entire page is the input to the whole system. Each word – together with its positional information (or word-box for short) – is to be classified into zero, one, or more target classes as the output. We are dealing with a multi-label problem with 35 possible classes in total.

The dataset and the metric: Overall, we have a dataset with 25,071 PDF documents, totaling 35,880 pages. The documents come from various vendors, layouts, and languages and are split into training, validation, and test sets at random (80% / 10% / 10%). The validation set is used for model selection and early stopping. The metric is the F1 score aggregated over all classes by the micro-averaging principle, i.e., true and false positives and negatives are pooled across all classes before a single score is computed. The architecture used in the paper is called a simple data extraction model [2].
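As a small, self-contained illustration of the micro-averaged F1 (using scikit-learn here is an assumption; any equivalent implementation works):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy shapes only: 4 word-boxes, 35 possible classes, multi-label targets.
y_true = np.zeros((4, 35), dtype=int)
y_pred = np.zeros((4, 35), dtype=int)
y_true[0, 3] = y_pred[0, 3] = 1   # a correctly extracted field
y_true[1, 7] = 1                  # a missed field (false negative)
y_pred[2, 7] = 1                  # a spurious field (false positive)

# Micro-averaging pools true/false positives and negatives over all
# 35 classes before computing one F1 score (here: 0.5).
print(f1_score(y_true, y_pred, average="micro"))
```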
The features of each word-box are:
- Geometrical – used to construct the graph for the graph CNN; also the reading order and the normalized (left, top, right, bottom) coordinates.
- Textual – the counts of all characters, of numbers, the length of the word, and the counts of the first two and last two characters; the trainable word features are one-hot encoded, deaccented, lowercase characters.
- Image – each word-box is cropped from the page image.

Five inputs – the downsampled picture, the features of all word-boxes, 40 one-hot encoded characters for each word-box, neighbor ids, and a position id given by geometric ordering – are concatenated to generate an embedding vector. The position embedding follows the transformer approach. The image input is stacked, max-pooled, and morphologically dilated to generate 32 float features. All the features are simply concatenated before any attention, dense, or graph convolution layers are applied. The basic building block ends with each word-box embedded into a feature space of a specified dimension (640 unless said otherwise in a specific experiment). For the "simple data extraction model", the following layer is a sigmoidal layer with binary cross-entropy as the loss function – a standard setting, since the output of this model is meant to solve a multi-class, multi-label problem.

The learning framework spans: 1) The system keeps a notion of already-known documents in a reasonably sized set. 2) When a "new" or "unknown" page is presented to the system, it searches for the most similar page (given any reasonable algorithm) among the known pages. 3) The model is allowed to use all the information from both pages (and "learn from similarity") to make the prediction.

Nearest neighbor definition: For one-shot learning to work on a new and unknown page (sometimes denoted reference), the system always needs a known (also denoted similar or nearest) document with familiar annotations at its disposal. The embedding for the nearest neighbor search [3] is prepared by removing the last layer from a document classification model and adding a simple pooling layer. The modified model outputs 4,850 float features based only on the image input, and these features are assigned to each page as its embedding. The embeddings are held fixed during training and inference and are computed only once in advance.

Baselines: We do not have any benchmark for comparison, so several models are prepared as baselines (the Copypaste idea is sketched after this list):
- Simple data extraction model – without any access to the nearest known page.
- Copypaste – overlays the target classes from the nearest known page's word-boxes; it provides a counterpart to triplet loss and pairwise classification.
- Oracle – always correctly predicts the classes of the nearest known page.
- Fully linear model without picture features – provides a counterpart to the query-answer approach.
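A minimal sketch of the Copypaste baseline idea, assuming matching by nearest word-box centers (the exact overlay rule in the original work may differ):

```python
import numpy as np

def copypaste_overlay(ref_boxes, near_boxes, near_classes):
    """Copy target classes from the nearest known page's word-boxes.

    ref_boxes / near_boxes: (n, 4) and (m, 4) arrays of normalized
    (left, top, right, bottom) coordinates.
    near_classes: (m, 35) multi-label targets of the nearest page.
    Each reference box receives the classes of the geometrically
    closest nearest-page box (box-center distance is an assumption).
    """
    ref_c = (ref_boxes[:, :2] + ref_boxes[:, 2:]) / 2    # box centers
    near_c = (near_boxes[:, :2] + near_boxes[:, 2:]) / 2
    d = np.linalg.norm(ref_c[:, None, :] - near_c[None, :, :], axis=-1)
    return near_classes[d.argmin(axis=1)]                # (n, 35) overlay
```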
Model architectures: Every architecture is trained as a whole, no pre-training or transfer learning takes place, and every model is always implemented as a single computation graph in tensorflow.
- Triplet loss architecture – uses siamese networks canonically with triplet loss.
- Pairwise classification – uses a trainable classifier pairwise over all combinations of word-box features from the reference and nearest pages.
- Query answer architecture (or "QA" for short) – uses the attention transformer as an answering machine for the question of which annotated nearest-page word-box (and hence which class) is most similar to each reference word-box. A filtering mechanism keeps only the annotated word-boxes from the nearest page, and a tiling mechanism takes two sequences – the sequence of reference page word-boxes and the sequence of filtered nearest page word-boxes – and produces a bipartite matrix pairing every element of the first with every element of the second (sketched below).
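A minimal sketch of such a tiling step, assuming plain feature concatenation of every reference/nearest pair (names and shapes are illustrative):

```python
import tensorflow as tf

def tile_pairs(ref_feats, near_feats):
    """Pair every reference word-box with every filtered nearest-page word-box.

    ref_feats:  (n_ref, d)  reference-page word-box features.
    near_feats: (n_near, d) filtered nearest-page word-box features.
    Returns a (n_ref, n_near, 2*d) bipartite tensor of concatenated pairs,
    ready for a pairwise classifier or attention layer.
    """
    n_ref, n_near = tf.shape(ref_feats)[0], tf.shape(near_feats)[0]
    a = tf.tile(ref_feats[:, None, :], [1, n_near, 1])   # repeat along columns
    b = tf.tile(near_feats[None, :, :], [n_ref, 1, 1])   # repeat along rows
    return tf.concat([a, b], axis=-1)
```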
The model selected in each experimental run was always the one that performed best on the validation set in terms of loss. The basic building blocks present in every architecture were usually set to produce a feature space of dimensionality 640.

Table 1: The results can be interpreted as the model reaching its maximal reasonable complexity at one transformer layer and a smaller feature space.
Table 2: The low score indicates that it is not enough to simply overlay a similar known page over the unknown page, as the dataset does not contain completely identical layouts.
Table 3: Only roughly 60% of word-boxes have their counterpart (class-wise) in the found nearest page.
Table 4: Linear baseline performance justifies the progress from the basic Copypaste model towards trainable architectures with similarity.
Table 5: Pairwise classification performed better than simple Copypaste, but still worse than the linear architecture.
Table 6: It verifies that the visual, geometric, and textual features are all important for good-quality results.
Conclusion: We have verified that all parts of the architecture are needed during training and prediction for the Query Answer model to achieve the highest score. What is the effect of the size of the datasets? By exploring the effect of the size of the training dataset and/or the search space for the nearest pages, we could ask if (and when) the model needs to be retrained and determine what a difficult-to-extract document looks like. How can the means of generalization be improved? Currently, the method generalizes to unseen documents but not to new classes of words: the model must be retrained whenever a new class is to be detected and extracted. The model fits into a single consumer-grade GPU and trains from scratch in at most four days using only one CPU process.
References:
[1] Cowie, J., Lehnert, W.: Information extraction. Commun. ACM 39, 80–91 (1996)
[2] Holecek, M., Hoskovec, A., Baudis, P., Klinger, P.: Table understanding in structured documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 158–164 (2019)
[3] Burkov, A.: Machine Learning Engineering. True Positive Incorporated (2020)
[4] Holecek, M.: Learning from similarity and information extraction from structured documents (2020)