Amazon wants to make it easier to extract text and data from tables, forms and virtually any document. The company announced its new fully managed Textract service, which uses machine learning to extract text and data without the need for manual review or custom code.
According to the company, traditional OCR technologies can only do so much due to their inability to recognize layouts like forms and tables, often spitting out one big text heap.
Textract uses machine learning to automatically extract text and data from a wide variety of document formats, including scans, PDFs and photos. Customers can then feed the extracted output into Amazon’s database and analytics services, the company said.
“Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This is a time-consuming and often inaccurate process that produces an output requiring extensive post-processing before it can be put in a format that is usable by other applications,” the company explained in an announcement.
The company claims Textract goes beyond traditional OCR technologies by identifying the context of the information it reads, such as a name or Social Security number on a tax form, or a product SKU and quantity on a warehouse inventory report.
Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data in the form of JSON text annotated with the page number, section, form labels, and data types. Additionally, the company explains, developers will be able to analyze and query extracted text and data with database and analytics services such as Amazon Elasticsearch Service, Amazon DynamoDB and Amazon Athena. Developers can also integrate Textract with machine learning services such as Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate and Amazon SageMaker.
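For readers who want a feel for that S3-in, JSON-out flow, here is a minimal Python sketch. It is not taken from Amazon's announcement. It assumes the boto3 SDK's Textract client and its `analyze_document` call, and the helper names (`analyze_s3_document`, `form_fields`, `text_of`) and the tiny `SAMPLE_RESPONSE` are made up for illustration. The sample only mimics the general shape of a response: a flat list of blocks in which form labels point to their values.

```python
def text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def form_fields(response):
    """Map each detected form label to the text of its value."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        label = text_of(block, blocks_by_id)
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(text_of(blocks_by_id[value_id], blocks_by_id))
        fields[label] = " ".join(values)
    return fields


def analyze_s3_document(bucket, key):
    """Ask Textract to analyze a file already stored in S3.

    Requires boto3 and AWS credentials; bucket and key are placeholders
    supplied by the caller.
    """
    import boto3  # imported here so the parsing helpers work without it

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )


# Hand-written stand-in for a response, for illustration only.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

if __name__ == "__main__":
    print(form_fields(SAMPLE_RESPONSE))
```

With a real document, the result of `analyze_s3_document` would be passed to `form_fields` in place of the sample. The resulting dictionary is the kind of structured output that could then be loaded into a database or search service.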
“In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions,” said Swami Sivasubramanian, the vice president of Amazon Machine Learning.