

AI & Machine Learning

Building a document understanding pipeline with Google Cloud

Holt Skinner, Cloud Technical Resident
Michael Munn, Machine Learning Solutions Engineer

Michael Sherman, Machine Learning Engineer

September 20, 2019


Document understanding is the practice of using AI and machine learning to extract data and insights from text and paper sources such as emails, PDFs, scanned documents, and more. In the past, capturing this unstructured or “dark data” has been an expensive, time-consuming, and error-prone process requiring manual data entry.

Today, AI and machine learning have made great advances towards automating this process, enabling businesses to derive insights from and take advantage of this data that had been previously untapped.

In a nutshell, document understanding allows you to:

Organize documents

Extract knowledge from documents

Increase processing speed

At Google Cloud we provide a solution, Document AI, which enterprises can leverage in collaboration with partners to implement document understanding. However, many developers have both the desire and the technical expertise to build their own document understanding pipelines on Google Cloud Platform (GCP) using the individual Document AI products, without working with a partner.

If that sounds like you, this post will take you step-by-step through a complete document understanding pipeline. The Overview section explains how the pipeline works, and the step-by-step directions below walk you through running the code.

Overview

In order to automate an entire document understanding process, multiple machine learning models need to be trained and then daisy-chained together alongside processing steps into an end-to-end pipeline. This can be a daunting process, so we have provided sample code for a complete document understanding system mirroring a data entry workflow capturing structured data from documents.

Our example end-to-end document understanding pipeline consists of two components:

1. A training pipeline which formats the training data and uses AutoML to build Image Classification, Entity Extraction, Text Classification, and Object Detection models.

2. A prediction pipeline which takes PDF documents from a specified Cloud Storage bucket, uses the AutoML models to extract the relevant data from the documents, and stores the extracted data in BigQuery for further analysis.

Training Data

The training data for this example pipeline is from a public dataset containing PDFs of U.S. and European patent title pages, with a corresponding BigQuery table of manually entered data from the title pages. The dataset is hosted by the Google Public Datasets Project.

Part 1: The Training Pipeline

The training pipeline consists of the following steps:

Training data is pulled from the BigQuery public dataset. The training BigQuery table includes links to PDFs of patents from the United States and European Union.

The PDF files are converted to PNG files and uploaded to a new Cloud Storage bucket in your own project. The PNG files will be used to train the AutoML Vision models.

The PNG files are run through the Cloud Vision API to create TXT files containing the raw text from the converted PDFs. These TXT files are used to train the AutoML Natural Language models. (A sketch of these conversion and OCR steps appears after this list.)

The links to the PNG or TXT files are combined with the labels and features from the BigQuery table into a CSV file in the training data format required by AutoML. This CSV is then uploaded to a Cloud Storage bucket. Note: This format is different for each type of AutoML model.

This CSV is used to create an AutoML dataset and model. Both are named in the format patent_demo_data_%m%d%Y_%H%M%S. Note that some AutoML models can take several hours to train.
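To make the conversion and OCR steps above concrete, here is a minimal sketch of how a single PDF page could be rendered to PNG with poppler's pdftoppm, run through the Cloud Vision API, and recorded as one training CSV row. This is an illustration of the approach, not the repository's code: the helper names, bucket name, and label value are hypothetical, and (as noted above) each AutoML product expects its own CSV layout.

# Sketch only: assumes poppler-utils (pdftoppm), google-cloud-vision, and
# google-cloud-storage are installed and credentials are configured.
import csv
import subprocess
from google.cloud import storage, vision

def pdf_to_png(pdf_path: str, out_prefix: str) -> str:
    """Render the first page of a PDF to PNG using poppler's pdftoppm."""
    subprocess.run(
        ["pdftoppm", "-png", "-r", "300", "-f", "1", "-l", "1",
         "-singlefile", pdf_path, out_prefix],
        check=True,
    )
    return f"{out_prefix}.png"

def extract_text(png_path: str) -> str:
    """OCR a PNG with the Cloud Vision API and return the raw text."""
    client = vision.ImageAnnotatorClient()
    with open(png_path, "rb") as f:
        # Older client releases use vision.types.Image instead of vision.Image.
        image = vision.Image(content=f.read())
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text

def upload(bucket_name: str, local_path: str, gcs_path: str) -> str:
    """Upload a local file to Cloud Storage and return its gs:// URI."""
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(gcs_path).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{gcs_path}"

if __name__ == "__main__":
    # Hypothetical bucket name and label; the pipeline labels pages as US or EU patents.
    bucket = "my-patent-demo-bucket"
    png_uri = upload(bucket, pdf_to_png("us_006.pdf", "us_006"), "png/us_006.png")

    with open("us_006.txt", "w") as f:
        f.write(extract_text("us_006.png"))

    # One row of an AutoML Vision classification training CSV: image URI, label.
    # (Each AutoML product expects its own CSV columns, as noted above.)
    with open("image_classification_train.csv", "a", newline="") as f:
        csv.writer(f).writerow([png_uri, "us"])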

Part 2: The Prediction Pipeline

This pipeline uses the AutoML models previously trained by the pipeline above. For predictions, the following steps occur:

The patent PDFs are collected from the prescribed bucket and converted to PNG and TXT files with the Cloud Vision API (just as in the training pipeline).

The AutoML Image Classification model is called on the PNG files to classify each patent as either a US or EU patent. The results are uploaded to a BigQuery table. (A sketch of this call appears after this list.)

The AutoML Object Detection model is called on the PNG files to determine the location of any figures on the patent document. The resulting relative x, y coordinates of the bounding box are then uploaded to a BigQuery table.

The AutoML Text Classification model is called on the TXT files to classify the topic of the patent content as medical technology, computer vision, cryptocurrency, or other. The results are then uploaded to a BigQuery table.

The AutoML Entity Extraction model is called to extract predetermined entities from the patent. The extracted entities are applicant, application number, international classification, filing date, inventor, number, publication date, and title. These entities are then uploaded to a BigQuery table.

Finally, the BigQuery tables above are joined to produce a final results table with all of the properties above.
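To illustrate one of the prediction calls above, here is a minimal sketch of classifying a single patent page image with the AutoML Image Classification model and streaming the result into BigQuery. This is not the repository's code: the table name, row layout, and label values are assumptions, and the call shown uses the v1beta1-era AutoML Python client that was current when this post was written (newer releases use a slightly different request shape).

# Sketch only: assumes google-cloud-automl (v1beta1-era) and google-cloud-bigquery.
from google.cloud import automl_v1beta1 as automl
from google.cloud import bigquery

PROJECT_ID = "my-cool-project"        # hypothetical project id
MODEL_ID = "ABC1234567890"            # image classification model id from config.yaml
TABLE = "my-cool-project.patent_demo.image_classification"  # hypothetical table

def classify_patent_page(png_path: str) -> None:
    """Call the AutoML Image Classification model and record the top label."""
    prediction_client = automl.PredictionServiceClient()
    model_name = f"projects/{PROJECT_ID}/locations/us-central1/models/{MODEL_ID}"

    with open(png_path, "rb") as f:
        payload = {"image": {"image_bytes": f.read()}}

    # Returns one AnnotationPayload per candidate label, each with a confidence score.
    response = prediction_client.predict(model_name, payload)
    best = max(response.payload, key=lambda p: p.classification.score)

    rows = [{
        "file": png_path,                       # hypothetical schema
        "issuer": best.display_name,            # e.g. "us" or "eu"
        "score": best.classification.score,
    }]
    # Recent google-cloud-bigquery versions accept the table id string directly.
    errors = bigquery.Client(project=PROJECT_ID).insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

if __name__ == "__main__":
    classify_patent_page("us_006.png")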

Step-by-Step Directions

For the developers out there, here’s how you can build the document understanding pipeline of your dreams. You can find all the code in our GitHub Repository.

Before you begin:

You’ll need a Google Cloud project to run this demo. We recommend creating a new project.

We recommend running these instructions in Google Cloud Shell (Quickstart). Other environments will work but you may have to debug issues specific to your environment. A Compute Engine VM would also be a suitable environment.

1. Git clone the repo and navigate to the patents example.

git clone https://github.com/munnm/professional-services.git
cd professional-services/examples/cloudml-document-ai-patents

2. Install the necessary system dependencies. Note that in Cloud Shell system-wide changes do not persist between sessions, so if you step away while working through these step-by-step instructions you will need to rerun these commands after restarting Cloud Shell.

sudo apt-get update
sudo apt-get install -y imagemagick jq poppler-utils

3. Create a virtual environment and activate it.

virtualenv --python=python3 $HOME/patents-demo-env
source $HOME/patents-demo-env/bin/activate

When your virtual environment is active, you’ll see patents-demo-env in your command prompt. Note: If your Cloud Shell session ends, you’ll need to reactivate the virtual environment by running the second command again.

For the remaining steps, make sure you’re in the correct directory in the professional-services repo: /examples/cloudml-document-ai-patents/.

4. Install the necessary libraries into the virtual environment.

pip3 install -r requirements.txt

5. Activate the necessary APIs.

gcloud services enable vision.googleapis.com automl.googleapis.com

Note: We use the gcloud SDK to interact with various GCP services. If you are not working in Cloud Shell you’ll have to set your GCP project and authenticate by running gcloud init .

6. Edit the config.yaml file. Look for the configuration parameter project_id (under pipeline_project) and set the value to the project id where you want to build and run the pipeline. Make sure to enclose the value in single quotes, e.g. project_id: 'my-cool-project'.

Note: if you’re not used to working in a shell, run nano to open a simple text editor.

7. Also in config.yaml, look for the configuration parameter creator_user_id (under service_acct) and set the value to the email account you use to log in to Google Cloud. Make sure to enclose the value in single quotes, e.g. creator_user_id: '[email protected]'.

8. Create and download a service account key with the necessary permissions to be used by the training and prediction pipelines.

./get_service_acct.sh config.yaml

9. Run the training pipeline. This may take 3–4 hours, though some models will finish more quickly.

python3 run_training.py

Note: If Cloud Shell closes while the script is still downloading, converting, or uploading the PDFs, you will need to reactivate the virtual environment, navigate to the directory, and rerun the pipeline script. The image processing should take about 15-20 minutes, so make sure Cloud Shell doesn’t close during that time.

10. After training the models (wait about 4 hours), your Cloud Shell session has probably disconnected. Reconnect to Cloud Shell and run the following commands to reinstall dependencies, reactivate the environment, and navigate to the document understanding example code.

sudo apt-get update
sudo apt-get install -y imagemagick jq poppler-utils
source $HOME/patents-demo-env/bin/activate
cd professional-services/examples/cloudml-document-ai-patents

11. Next, you need to use the AutoML UIs to deploy the object detection and entity extraction models, and to find the ids of your models (which you will enter into config.yaml). In the UIs, you can also view relevant evaluation metrics about the model and see explicit examples where your model got it right (and wrong).

Note: Some AutoML products are currently in beta; the look and function of the UIs may change in the future.

Go to the AutoML Image Classification models UI, and make sure you are in the same project you ran the training pipeline in (top right dropdown). Note the ID of your trained image classification model. Also note the green check mark to the right of the model; if this is not a check mark, or if there is nothing listed, it means model training is still in progress and you need to wait.

In Cloud Shell, edit config.yaml. Look for the line model_imgclassifier: and on the line below you’ll see model_id:. Put the image classification model id after model_id: in single quotes, so the line looks something like model_id: 'ABC1234567890'.

Go to the AutoML Natural Language models UI, and similarly, make sure you are in the correct project, make sure your model is trained (green check mark), and note the model id of the trained text classification model.

In Cloud Shell, edit config.yaml. Look for the line model_textclassifier: and on the line below you’ll see model_id:. Put the text classification model id after model_id: in single quotes, so the line looks something like model_id: 'ABC1234567890'.

Go to the AutoML Object Detection models UI, again making sure your project is correct and your model is trained. In this UI, the model id is below the model name.

You also need to deploy the model by opening the menu under the three dots at the far right of the model entry and selecting “Deploy model”. A UI popup will ask how many nodes to deploy to; 1 node is fine for this demo. You need to wait for the model to deploy before running prediction (about 10-15 minutes); the deployment status is in the UI under “Deployed” and it will be “Yes” when the model is deployed.

In Cloud Shell, edit config.yaml. Look for the line model_objdetect: and on the line below you’ll see model_id:. Put the object detection model id after model_id: in single quotes, so the line looks something like model_id: 'ABC1234567890'.

Go to the AutoML Entity Extraction models UI, and make sure the project and region are correct. Verify the model training is complete by checking to make sure the Precision and Recall metrics are listed. Note the id as well.


To deploy the model, click the model name and a new UI view will open. Click “DEPLOY MODEL” near the top right and confirm. You need to wait for the model to deploy before running prediction (about 10-15 minutes); the UI will update when deployment is complete. (A sketch of checking deployment state from the command line appears after these steps.)


In Cloud Shell, edit config.yaml. Look for the line model_ner: and on the line below you’ll see model_id:. Put the entity extraction model id after model_id: in single quotes, so the line looks something like model_id: 'ABC1234567890'.

Before going further, make sure the model deployments are complete.
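If you’d rather check deployment status from Cloud Shell than keep refreshing the UIs, the AutoML client library can report a model’s deployment state. This is an optional helper, not part of the repository’s pipeline; the project and model id below are placeholders, and enum access differs slightly between versions of the client library.

# Optional helper, not part of the repo: polls an AutoML model's deployment state.
# Assumes the google-cloud-automl client library; enum namespaces differ slightly
# between older (v1beta1) and newer releases.
import time
from google.cloud import automl

PROJECT_ID = "my-cool-project"   # placeholder project id
MODEL_ID = "ABC1234567890"       # placeholder model id from config.yaml

client = automl.AutoMlClient()
model_name = f"projects/{PROJECT_ID}/locations/us-central1/models/{MODEL_ID}"

while True:
    model = client.get_model(name=model_name)
    print(f"{model.display_name}: {model.deployment_state}")
    if model.deployment_state == automl.Model.DeploymentState.DEPLOYED:
        break
    time.sleep(60)  # deployment typically takes 10-15 minutes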

12. To run the prediction pipeline, you’ll need some PDFs of patent first pages in a folder in Cloud Storage. You can provide your own PDFs in your own Cloud Storage location, or for demonstration purposes, you can run the following commands, which create a bucket with the same name as your project id (if it doesn’t already exist) and copy a small set of five patent first pages into a folder in the bucket called patent_sample.

PROJECT_ID=$(gcloud config get-value project)
SOURCE_BUCKET=gs://gcs-public-data--labeled-patents
DEST_BUCKET=gs://${PROJECT_ID}/patent_sample/
gsutil mb -p $PROJECT_ID gs://${PROJECT_ID}
gsutil cp ${SOURCE_BUCKET}/med_tech_14.pdf ${DEST_BUCKET}
gsutil cp ${SOURCE_BUCKET}/espacenet_fr36.pdf ${DEST_BUCKET}
gsutil cp ${SOURCE_BUCKET}/us_006.pdf ${DEST_BUCKET}
gsutil cp ${SOURCE_BUCKET}/espacenet_en67.pdf ${DEST_BUCKET}
gsutil cp ${SOURCE_BUCKET}/crypto_6.pdf ${DEST_BUCKET}

13. In the config.yaml file, fill in the configuration parameter labeled demo_sample_data (under pipeline_project) with the Cloud Storage location of the patents you want to process (in single quotes). If you’re following the example above, the parameter value is 'gs://<your-project-id>/patent_sample', substituting your project id.

Also, fill in the parameter labeled demo_dataset_id, just under demo_sample_data, with the BigQuery dataset (in single quotes) in your project where the predicted entities will be collected. For example, you can use 'patent_demo'. Note that this dataset must not exist already; the code will attempt to create the dataset and will stop running if the dataset already exists.
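If you want to confirm up front that the dataset name is unused, a quick check with the BigQuery client library looks something like the sketch below (optional, not part of the pipeline; 'patent_demo' is just the example name from above).

# Optional pre-check, not part of the repo: verify the demo dataset doesn't exist yet.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client()                      # uses your default project and credentials
dataset_id = f"{client.project}.patent_demo"    # example demo_dataset_id from above

try:
    client.get_dataset(dataset_id)
    print(f"{dataset_id} already exists; pick a different demo_dataset_id.")
except NotFound:
    print(f"{dataset_id} is free; the pipeline will create it.")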

Now, you are ready to make predictions!

14. Run the prediction pipeline. This will process all the patent PDF first pages in the Cloud Storage folder specified in the demo_sample_data parameter, and upload predictions to (and create) BigQuery tables in the dataset specified by the demo_dataset_id parameter.

python3 run_predict.py

15. Finally, go check out the dataset in the BigQuery UI and examine all the tables that have been created. There is a table called "final_view" which collects all the results in a single table. You should see a row of extracted results for each patent you processed.

Note: for this, or any of the other intermediate tables that were created during the prediction pipeline, you may need to query the table to see the results since it is so new. For example:

SELECT * FROM `<your-project-id>.<demo_dataset_id>.final_view`

If you are new to BigQuery, you can open a querying interface in the UI prepopulated with the table name by clicking the “QUERY TABLE” button visible when you select a table to view.
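You can also run the same query from Python with the BigQuery client library. Here is a minimal sketch, where the dataset name is a placeholder for the demo_dataset_id value you configured earlier:

# Sketch only: fetch the joined results from the final_view table.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials
dataset = "patent_demo"     # placeholder: your demo_dataset_id value

query = f"SELECT * FROM `{client.project}.{dataset}.final_view`"
for row in client.query(query).result():
    print(dict(row))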

Conclusion

Congratulations! You now have a fully functional document understanding pipeline that can be modified for use with any PDF documents. To modify this example to work with your own documents, the data collection and training stages will need to be modified to pull documents from a local machine or another Cloud Storage bucket rather than a public BigQuery dataset. The training data will also need to be created manually for the specific type of documents you will be using.

Now, if only we could create a robot that could scan paper documents to PDFs… For more information about the general process of document understanding, see this blog post from Nitin Aggarwal at Google Cloud. And for more information about the business use cases of Document Understanding on Google Cloud, check out this session from Google Cloud Next ’19.

POSTED IN: AI & MACHINE LEARNING—SOLUTIONS AND HOW-TO'S—GOOGLE CLOUD PLATFORM

