Amazon Textract overview

The following is a brief overview of Amazon Textract.

Service

The Textract service is divided into two APIs:

Detect Document Text API: The Detect Document Text API uses optical character recognition (OCR) technology to extract text from a provided document.

Optical Character Recognition (OCR): Detects printed text and numbers in a scan or rendering of a document. The API offers synchronous and asynchronous operations, and the results are returned in JSON format. Synchronous operations are meant for single-page images; asynchronous operations are meant for multipage documents.
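As a rough illustration of the two modes, here is a minimal Python sketch. It assumes a boto3 Textract client created elsewhere (for example with boto3.client("textract")) and passed in as the client argument. The function names, the polling interval, and the choice to poll instead of using SNS/SQS notifications are assumptions made for this example.

```python
import time


def detect_text_sync(client, image_bytes):
    """Synchronous OCR of a single-page image passed as raw bytes."""
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return response["Blocks"]


def detect_text_async(client, bucket, key, poll_seconds=5):
    """Asynchronous OCR of a document stored in S3 (e.g. a multipage PDF)."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    # This sketch treats every status other than SUCCEEDED as an error.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + result["JobStatus"])

    blocks = list(result["Blocks"])
    # Large results are paginated; follow NextToken until it is absent.
    while result.get("NextToken"):
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks
```

Passing the client in as a parameter keeps the functions easy to try against a stub before real credentials are configured.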


Optical Character Recognition (OCR)
The following diagram shows how the line "Hello, world." in the text "Hello, world. How are you?" is represented by Block objects.
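The same parent/child structure can be walked in code. The sketch below works on a hand-written sample shaped like a Textract response: the Ids are hypothetical (real ones are UUIDs) and most fields, such as Geometry and Confidence, are left out.

```python
def lines_with_words(blocks):
    """Return (line text, [word texts]) for every LINE block."""
    by_id = {block["Id"]: block for block in blocks}
    result = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        words = []
        # A LINE block points to its WORD blocks through a CHILD relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words.extend(by_id[child_id]["Text"] for child_id in rel["Ids"])
        result.append((block["Text"], words))
    return result


# Hypothetical, trimmed-down sample for the text "Hello, world. How are you?"
sample_blocks = [
    {"Id": "page-1", "BlockType": "PAGE",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1", "line-2"]}]},
    {"Id": "line-1", "BlockType": "LINE", "Text": "Hello, world.",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"Id": "line-2", "BlockType": "LINE", "Text": "How are you?",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-3", "word-4", "word-5"]}]},
    {"Id": "word-1", "BlockType": "WORD", "Text": "Hello,"},
    {"Id": "word-2", "BlockType": "WORD", "Text": "world."},
    {"Id": "word-3", "BlockType": "WORD", "Text": "How"},
    {"Id": "word-4", "BlockType": "WORD", "Text": "are"},
    {"Id": "word-5", "BlockType": "WORD", "Text": "you?"},
]

for line_text, words in lines_with_words(sample_blocks):
    print(line_text, words)
```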

Analyze Document API: The Analyze Document API extracts data from tables and key-value pairs from forms.
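A minimal sketch of a synchronous Analyze Document call follows. As an assumption of this example, the boto3 Textract client is created elsewhere and passed in; the function name and the forms/tables flags are hypothetical conveniences, while FeatureTypes accepts "TABLES" and "FORMS".

```python
def analyze_image(client, image_bytes, forms=True, tables=True):
    """Run Analyze Document on a single-page image and return its Block list."""
    feature_types = []
    if tables:
        feature_types.append("TABLES")
    if forms:
        feature_types.append("FORMS")
    if not feature_types:
        raise ValueError("Select at least one of forms or tables")

    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=feature_types,
    )
    return response["Blocks"]
```

For a real call, the client could be created with boto3.client("textract") once the credentials and region are configured.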

Key-Value Pair Extraction: Detects key-value pairs in document images automatically, which retains the inherent context of the document. Use synchronous or asynchronous operations to analyze text in a document. The results of the analysis are returned in JSON format.

Key-Value Pair Extraction
The following diagram shows how the key-value pair "Name: Ana Carolina" is represented by Block objects.
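To make the relationships concrete, here is a small sketch that rebuilds key-value pairs from a Block list. The sample data is hand-written with hypothetical Ids and only the fields the code reads; a real response carries more fields and may include other child types such as selection elements.

```python
def text_of(block, by_id):
    """Join the text of the WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        # A KEY block links to its VALUE block through a VALUE relationship.
        value_text = " ".join(
            text_of(by_id[value_id], by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        )
        pairs[text_of(block, by_id)] = value_text
    return pairs


# Hypothetical, trimmed-down sample for the pair "Name: Ana Carolina"
sample_form_blocks = [
    {"Id": "key-1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["value-1"]},
                       {"Type": "CHILD", "Ids": ["word-1"]}]},
    {"Id": "value-1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["word-2", "word-3"]}]},
    {"Id": "word-1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "word-2", "BlockType": "WORD", "Text": "Ana"},
    {"Id": "word-3", "BlockType": "WORD", "Text": "Carolina"},
]

print(key_value_pairs(sample_form_blocks))
```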

Table Extraction: Preserves the composition of data stored in tables during extraction, so the extracted data can be loaded automatically into a database using a predefined schema.

Table Extraction
The following diagram shows how a single cell in a table is represented by Block objects.
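The sketch below rebuilds a table as a list of rows from TABLE, CELL and WORD blocks, a convenient shape for loading into a database. The sample is hand-written with hypothetical Ids and only the fields the code reads.

```python
def table_rows(blocks):
    """Return every TABLE block as a list of rows, each a list of cell texts."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = by_id[cell_id]
                if cell["BlockType"] != "CELL":
                    continue
                words = []
                # An empty cell has no Relationships entry at all.
                for cell_rel in cell.get("Relationships", []):
                    if cell_rel["Type"] == "CHILD":
                        for word_id in cell_rel["Ids"]:
                            if by_id[word_id]["BlockType"] == "WORD":
                                words.append(by_id[word_id]["Text"])
                # RowIndex and ColumnIndex are 1-based.
                cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append([
            [cells.get((row, col), "") for col in range(1, n_cols + 1)]
            for row in range(1, n_rows + 1)
        ])
    return tables


# Hypothetical, trimmed-down sample: a 2x2 table with one empty cell
sample_table_blocks = [
    {"Id": "table-1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Age"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Ana"},
]

print(table_rows(sample_table_blocks))
```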

Pricing

There are no minimum fees and no upfront commitments. Amazon Textract charges per page processed, and the rate depends on whether only text is extracted or text together with tables and/or form data.

As AWS customers we have access to the APIs in a free tier for 12 months with the following restrictions:

  • 1,000 pages per month using the Detect Document Text API

  • 100 pages per month using the Analyze Document API
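As a quick planning aid, the following sketch computes how many pages in a month fall outside the free tier. It uses only the two limits listed above; the dictionary keys and the function name are hypothetical, and no prices are included because they should be taken from the pricing page.

```python
# Free-tier limits as listed above (pages per month).
FREE_PAGES_PER_MONTH = {
    "detect_document_text": 1000,
    "analyze_document": 100,
}


def billable_pages(api, pages_this_month):
    """Pages processed this month that exceed the free-tier allowance."""
    return max(0, pages_this_month - FREE_PAGES_PER_MONTH[api])


print(billable_pages("detect_document_text", 1500))
print(billable_pages("analyze_document", 80))
```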

Full price list in the following link:

https://aws.amazon.com/textract/pricing/


Testing

Below are the requirements to test the Amazon Textract API.

Dependencies:

Python 3.7

pip 19.03

Documentation:

https://serverfault.com/questions/918335/best-way-to-run-python-3-7-on-ubuntu-16-04-which-comes-with-python-3-5

https://docs.aws.amazon.com/cli/latest/userguide/install-linux.html#install-linux-pip


User configuration

Create an IAM user with the following permissions

  • AmazonTextractFullAccess
  • AmazonSQSFullAccess
  • Sufficient permissions to upload and read images from a bucket in S3

Documentation:

https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html


1. Install the AWS CLI

The AWS CLI can be installed on Linux, Windows, and macOS, in a virtualenv, or with the bundled installer. We will test it on Ubuntu 16.04.

https://docs.aws.amazon.com/cli/latest/userguide/install-linux.html#install-linux-pip

Verify that the AWS CLI installed correctly.
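One way to script this check from Python is sketched below. The function name is hypothetical; the underlying command is aws --version. Output and error streams are merged because some AWS CLI versions print the version string to stderr.

```python
import shutil
import subprocess


def aws_cli_version():
    """Return the output of `aws --version`, or None if aws is not on PATH."""
    if shutil.which("aws") is None:
        return None
    completed = subprocess.run(
        ["aws", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        universal_newlines=True,
    )
    return completed.stdout.strip()


version = aws_cli_version()
print(version if version else "AWS CLI not found on PATH")
```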

2. Configure the AWS CLI

Credentials were sent to Manuel Navarro.

User: manuel.navarro
Access Key ID: sent privately, not stored in this page
Secret Access Key: sent privately, not stored in this page

To select a region, check the Amazon Textract entry at the following link:

https://docs.aws.amazon.com/general/latest/gr/rande.html

Instructions

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

3. Test by executing a list operation
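The page does not say which list operation to run. As an assumption, the sketch below uses aws s3 ls, since the IAM user is expected to have S3 access. The function names are hypothetical.

```python
import subprocess


def list_command(region=None):
    """Build the CLI command for listing S3 buckets, optionally in a region."""
    command = ["aws", "s3", "ls"]
    if region:
        command += ["--region", region]
    return command


def run_list(region=None):
    """Run the list command; return (succeeded, output or error text)."""
    completed = subprocess.run(
        list_command(region),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode == 0, completed.stdout or completed.stderr
```

Calling run_list() after the configuration step should return True and the bucket list if the credentials and region are valid; an error message in the second element usually points at the credentials or the permissions.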



Related links:

https://docs.aws.amazon.com/aws-sdk-php/v3/api/class-Aws.Textract.TextractClient.html