AWS CEO Announces Textract to Extract Data Without Machine Learning Skills

AWS CEO Andy Jassy announced Amazon Textract at the AWS re:Invent 2018 conference. Textract allows AWS customers to automatically extract formatted data from documents without losing the structure of the data. Best of all, there are no machine learning skills required to use Textract. It’s something that many data-intensive enterprises have been requesting for many years.

Amazon Launches Textract to Easily Extract Usable Data

Our customers are frustrated that they can’t get more of all that text and data in documents into the cloud, so they can actually do machine learning on top of it. So we worked with our customers, we thought about what might solve these problems, and I’m excited to announce the launch of Amazon Textract. This is an OCR++ service to easily extract text and data from virtually any document, and there is no machine learning experience required.

This is important: you don’t need to have any machine learning experience to use Textract. Here’s how it generally works. Below is a pretty typical document. It has a couple of columns and a table in the middle of the left column.

When you use OCR, it basically captures all that information straight across each row, ignoring the columns, so what you end up with is the gobbledygook you see in the box below, which is completely useless. That’s typically what happens.
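The failure described above can be illustrated with a toy sketch. It is not how any OCR engine or Textract works internally; the word positions, the `WORDS` data, and the hand-supplied `column_boundary` are all assumptions made up for the illustration. Reading words strictly left to right, line by line, interleaves the two columns, while reading each column in turn keeps the sentences intact.

```python
# Toy two-column page. Each word is (text, x, y): x is the horizontal
# position and y is the line number. All values are invented for this sketch.
WORDS = [
    ("Revenue", 0, 0), ("grew", 1, 0), ("Costs", 10, 0), ("fell", 11, 0),
    ("in", 0, 1), ("2018.", 1, 1), ("in", 10, 1), ("Q4.", 11, 1),
]


def naive_reading_order(words):
    """Read every line straight across the page, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in ordered)


def column_aware_reading_order(words, column_boundary):
    """Read the left column top to bottom, then the right column."""
    left = [w for w in words if w[1] < column_boundary]
    right = [w for w in words if w[1] >= column_boundary]
    return naive_reading_order(left) + " " + naive_reading_order(right)


if __name__ == "__main__":
    print(naive_reading_order(WORDS))
    print(column_aware_reading_order(WORDS, column_boundary=5))
```

The first call mixes the two columns into one run of text; the second returns each column as its own readable sentence.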

Let’s go through what Textract does. Textract is intelligent. Textract can tell that there are two columns here, so the text you get back reads the way it is supposed to be read. Textract can also identify that there’s a table there and lay out what that table should look like, so you can actually read and use that data in whatever you’re trying to do on the analytics and machine learning side. That’s a very different equation.
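As a rough sketch of what using the table output could look like, the Python below rebuilds rows and columns from an AnalyzeDocument-style response. It assumes the block structure documented for the Textract API (`TABLE`, `CELL`, and `WORD` blocks linked by `CHILD` relationships, with `RowIndex` and `ColumnIndex` on each cell). The helper names, the `SAMPLE_BLOCKS` data, and the file path argument are hypothetical, and the live call through `boto3` needs AWS credentials.

```python
def text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def tables_from_blocks(blocks):
    """Return each TABLE block as a list of rows, each row a list of cell strings."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    key = (cell["RowIndex"], cell["ColumnIndex"])  # 1-based indices
                    cells[key] = text_of(cell, blocks_by_id)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def analyze_tables(path):
    """Send a local image to Textract and return its tables (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing code above runs without AWS
    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()}, FeatureTypes=["TABLES"])
    return tables_from_blocks(response["Blocks"])


# Hand-written stand-in for a response, used only to exercise the parser.
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
]

if __name__ == "__main__":
    print(tables_from_blocks(SAMPLE_BLOCKS))
```

Keeping the parsing separate from the network call means the table logic can be checked against saved or hand-written responses before any document is sent to the service.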

Textract Works Great with Forms

What happens with most of these forms is that OCR can’t really read them or make them coherent at all. Sometimes templates are used that effectively memorize that a particular box holds a particular piece of data. Textract is going to work across legal forms and financial forms and tax forms and healthcare forms, and we will keep adding more and more of these.

But these forms also change every few years, and when they do, something that you thought was a Social Security number in this box turns out now to be a date of birth. What we have built Textract to do is to recognize what certain data items or objects are, so it’s able to tell that this set of characters is a Social Security number, this set of characters is a date of birth, and this set of characters is an address.

Not only can we apply it to many more forms but also if those forms change Textract doesn’t miss a beat. That is a pretty significant change in your capability in being able to extract and digitally use data that are in documents.
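A minimal sketch of reading form fields by label rather than by position, assuming the `KEY_VALUE_SET` block structure documented for the Textract AnalyzeDocument API with the `FORMS` feature type. The function names and `SAMPLE_BLOCKS` data are hypothetical, and the sketch shows only label-to-value pairing, not recognition of what kind of data a value is.

```python
def words_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def key_values_from_blocks(blocks):
    """Map each form label (KEY) to the text of its linked VALUE block."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    words_of(blocks_by_id[value_id], blocks_by_id)
                    for value_id in rel["Ids"])
        pairs[words_of(block, blocks_by_id)] = value_text
    return pairs


# Hand-written stand-in for a response, used only to exercise the parser.
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w2", "BlockType": "WORD", "Text": "of"},
    {"Id": "w3", "BlockType": "WORD", "Text": "birth:"},
    {"Id": "w4", "BlockType": "WORD", "Text": "01/02/1980"},
]

if __name__ == "__main__":
    print(key_values_from_blocks(SAMPLE_BLOCKS))
```

Because the lookup is by label text and not by where a box sits on the page, a field that moves when the form is redesigned would still be found under the same key in this sketch.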
