Azure, AWS and GCP Document Extraction — Part 1
Dark Data
We live in an era where the ongoing computerisation of business and personal life has created, and will continue to generate, immense amounts of data. By some estimates there are currently around 40 zettabytes of data residing in data stores around the world. That’s:
40,000,000,000,000,000,000,000 bytes or
40,000,000,000 terabytes
Or put another way, enough to fill 40 billion 1TB M1 MacBook Pros, which happens to be what I’m writing this document on. That’s a lot of data, far too much to really comprehend. Add to this that a significant percentage of data is rarely used after collection. This so-called Dark Data sits around waiting for the day it might be useful, if only the computing resources were available to make sense of it. Much of this Dark Data consists of written documents: think Word docs, PDFs, scans, etc.
Given that we tend to write down important and useful information, having all these documents in formats that are not readily usable by Machine Learning and AI presents a problem for organisations wishing to become more data-driven.
There is promise though: for a few years now, the big cloud players have been offering solutions that extract data from these types of documents and return not just the textual content but also some or all of it in a structured format. Accuracy varies depending on the document, and the marketing does not make it clear how the services compare to each other.
I have recently worked with DocumentAI from Google Cloud Platform; in this case the business had tens of thousands of Purchase Orders in PDF format submitted monthly. Although each of these documents contains orders for the company’s products, numerous different formats and table headings were used, making it a challenging problem.
As a consultant working in data science, when asked to extract structured data from tens of thousands of documents on a short timescale, I’d rather rely on PaaS AI to solve the problem than develop my own solution and then be responsible for supporting it. This way the customer gets value early on and gains experience with the end-to-end system. Then, if required, a more bespoke solution can be developed.
Products Used
The products used in this article are:
- Google Cloud Platform: DocumentAI
- Azure Cognitive Services: Form Recognizer
- Amazon Web Services: Textract
Data Set
Focusing on PDF documents containing forms, tables and general text, I have collected an assortment of documents; the full set is available for download here.
Below is an example of a Purchase Order, which has a somewhat similar structure to an invoice, but different enough that invoice parsers will not extract all the information you might need for, say, inserting the order into SAP.
Cloud Based Services
Google Cloud Platform: DocumentAI
“Google Cloud’s Vision OCR (optical character recognition) and form parser technology uses industry-leading deep-learning neural network algorithms to perform text, character, and image recognition in over 200 languages with exceptional accuracy.”
DocumentAI has a Python SDK that we can use from a Jupyter Notebook. To get started with DocumentAI we are first required to create a “Processor”, with the option of general purpose or specialised processors. Within the product offering, GCP have decided to target some industry-specific verticals by building on the base capability. Below is the list of processors available (some on request) at the time of writing. For this evaluation we will use the Form Parser since it offers a base capability to extract text, key-value and table entities; a minimal calling example follows the lists below.
General Processors:
- Document OCR (Optical Character Recognition)
- Form parser
- Document splitter
Specialised Processors:
- W9
- 1040
- W2
- 1099-MISC
- 1003
- Invoice
- Receipt
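To make this concrete, below is a minimal sketch of calling a Form Parser processor with the google-cloud-documentai Python client. The project, location, processor ID and file name are placeholders, and exact attribute names may vary slightly between client library versions.

```python
# Minimal sketch: send a PDF to a DocumentAI Form Parser processor.
# PROJECT_ID, LOCATION, PROCESSOR_ID and the file path are placeholders;
# non-"us" locations may also need a regional api_endpoint via client_options.
from google.cloud import documentai_v1 as documentai

PROJECT_ID = "my-project"
LOCATION = "us"
PROCESSOR_ID = "1234567890abcdef"  # the Form Parser processor you created

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("purchase_order.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(),
                                          mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
document = result.document


def anchor_text(layout, text):
    """Resolve a layout's text anchor back into the document's full text."""
    return "".join(
        text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


# Key-value pairs and tables are reported per page.
for page in document.pages:
    for field in page.form_fields:
        print(anchor_text(field.field_name, document.text).strip(), "->",
              anchor_text(field.field_value, document.text).strip())
    print(f"page {page.page_number}: {len(page.tables)} table(s) detected")
```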
Azure Cognitive Services
“Accurately extract text, key/value pairs and tables from documents, forms, receipts and business cards without manual labelling by document type or intensive coding or maintenance. Utilise Form Recognizer’s Custom Forms, Pre-built and Layout APIs to extract information from your documents in an organised manner.”
Azure offers a slightly different product; they have three categories:
- Layout — base capability
- Pretrained — models trained on specific document types such as receipts
- Custom — leverages the Layout API base capability to transfer-learn on customer-specific documents
It’s clear from this split that Azure has considered how organisations wish to access this type of ML capability. When developer time is expensive, offering an API to customise the core platform product is attractive, since it reduces the amount of code written and provides a clear answer to the “how do I improve the results?” question that is invariably asked by stakeholders.
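For reference, here is a minimal sketch of the Layout capability using the azure-ai-formrecognizer Python SDK, assuming a v3.2+ client where Layout is exposed as the “prebuilt-layout” model; the endpoint, key and file name are placeholders.

```python
# Minimal sketch: extract layout (text, tables) with Azure Form Recognizer.
# ENDPOINT, KEY and the file path are placeholders for your own resource.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
KEY = "<your-key>"

client = DocumentAnalysisClient(ENDPOINT, AzureKeyCredential(KEY))

with open("purchase_order.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

# Tables come back with row/column indices per cell; header cells are
# flagged via the cell kind (e.g. "columnHeader") when detected.
for i, table in enumerate(result.tables):
    print(f"table {i}: {table.row_count} rows x {table.column_count} columns")
    for cell in table.cells:
        print(f"  [{cell.row_index},{cell.column_index}] "
              f"({cell.kind}): {cell.content}")
```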
AWS Textract
“Amazon Textract is a fully managed machine learning service that automatically extracts printed text, handwriting, and other data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.”
AWS provides the Textract API as a single endpoint and does not appear to have any document-specific ‘processors’ or configurations. Instead it relies on its ability to successfully extract all required content from the document and return it in a structured format. Developers are then required to structure the returned object according to their needs. It is an interesting approach and provides insight into how both Azure and GCP have positioned themselves to compete against AWS.
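As a minimal sketch (file names are placeholders), the boto3 call below asks Textract to analyse a single-page document for tables and forms and tallies the block types that come back; multi-page PDFs instead go through the asynchronous StartDocumentAnalysis flow against S3.

```python
# Minimal sketch: synchronous Textract analysis of a single-page document.
# For multi-page PDFs the asynchronous start_document_analysis API (reading
# from S3) is required instead. The file name here is a placeholder.
from collections import Counter

import boto3

textract = boto3.client("textract")

with open("purchase_order_page1.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# Textract returns a flat list of blocks (PAGE, LINE, WORD, TABLE, CELL,
# KEY_VALUE_SET, ...) linked by relationships; structuring them is up to us.
print(Counter(block["BlockType"] for block in response["Blocks"]))

# Table cells carry row/column indices; child WORD blocks hold the text.
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
print(f"{len(cells)} table cells detected")
```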
Code
Code to work with these services is implemented in Python Jupyter Notebooks running in the notebook service of each cloud. These notebooks are available at the gists linked below for each platform.
GCP Notebook
AWS Notebook
Azure Notebook
Results
The full set of results for the input documents are available here
Shown below are two examples of tables extracted by each service. Content is rendered in blue for the table cells and red for the overall table outline, as returned from the different APIs. If a specific header row is highlighted by the API results it is rendered in yellow.
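The rendering itself is simple to reproduce; a sketch of the approach, assuming pdf2image for rasterising the page and polygon coordinates normalised to [0, 1] from each API, with a hypothetical draw_regions helper, might look like this.

```python
# Sketch: overlay extraction results on a rendered PDF page.
# Assumes pdf2image is installed and polygons are lists of (x, y) points
# normalised to [0, 1] relative to the page size.
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from pdf2image import convert_from_path


def draw_regions(pdf_path, page_no, tables, cells, headers):
    """Render one page and outline tables (red), cells (blue), headers (yellow)."""
    page = convert_from_path(pdf_path)[page_no]
    w, h = page.size

    fig, ax = plt.subplots(figsize=(10, 14))
    ax.imshow(page)
    for polys, colour in ((tables, "red"), (cells, "blue"), (headers, "yellow")):
        for poly in polys:
            pts = [(x * w, y * h) for x, y in poly]
            ax.add_patch(Polygon(pts, fill=False, edgecolor=colour, linewidth=2))
    ax.axis("off")
    plt.show()
```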
Google Cloud Platform
AWS Textract
Azure Cognitive Services
Summary
From a developer perspective all three of these services are easy to work with and provide enough language bindings to make most integrations smooth. As expected the cloud providers offer enough sample code and documentation to help out in this area.
Looking at the results, I would have to say that GCP DocumentAI lags behind both Azure and AWS in the core table extraction capability, although not by a huge margin. In both examples shown above, the columns are not extracted correctly. If you examine all the results, it becomes clear that for some documents DocumentAI will work just fine.
Header rows are important for tables, and although DocumentAI provides an explicit header row for each table, this header row is not correct for one of the documents shown above. That makes relying on it, for example to identify tables based on their columns, unreliable. My main fear is that architects may design a system on the assumption that this technology works as stated, leading to project delays because more development is required to get accurate and usable values.
Azure and AWS both avoid merging the text just above the table into the extracted content. They don’t however provide an explicit header row, meaning that more development effort is required to obtain the header and identify tables.
From personal experience of using these services, relying on the header being identical across similar tables is best avoided; there are always enough edge cases to produce complex, hard-to-understand code. A better approach is to build a classifier that assigns tables to the categories that make sense for the use case, with the side benefit that improving the classifier can be built into the workflow via active learning. That is enough for a separate article in itself, so I won’t cover it here beyond the small sketch below.
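To give a flavour of that approach, here is a minimal scikit-learn sketch that classifies tables from their concatenated header and cell text; the example tables, labels and categories are purely illustrative.

```python
# Sketch: classify extracted tables into use-case categories from their text.
# The training examples and labels below are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training example is the concatenated text of a table's cells.
table_texts = [
    "item no description qty unit price total",
    "part number material quantity delivery date",
    "invoice number due date amount vat",
]
labels = ["order_lines", "order_lines", "invoice_summary"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(table_texts, labels)

# Most likely 'order_lines' given the toy training data above.
print(clf.predict(["description qty price total"]))
```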
I would have to recommend either AWS Textract or Azure Cognitive Services for this task. If you have to use GCP DocumentAI, then depending on the documents it may work fine.
For all of the products though, the best chance of success will be to plan for additional developer effort to build tooling that helps users correct the output, since none of these services can yet achieve the accuracy required to replace humans.
Hope you found this useful. I’ll follow up with some more analysis another time since it’s quite a large area and only the surface has been scratched here. If you need any advice with the integration of these tools please reach out to Cogniflare, hello@cogniflare.io; we would be more than happy to assist you.