Building a Document Understanding Pipeline with Google Cloud | Google Cloud Blog
AI & Machine Learning

Building a document understanding pipeline with Google Cloud

Holt Skinner, Cloud Technical Resident
Michael Munn, Machine Learning Solutions Engineer
Michael Sherman, Machine Learning Engineer

September 20, 2019

Document understanding is the practice of using AI and machine learning to extract data and insights from text and paper sources such as emails, PDFs, scanned documents, and more. In the past, capturing this unstructured or "dark data" has been an expensive, time-consuming, and error-prone process requiring manual data entry. Today, AI and machine learning have made great advances toward automating this process, enabling businesses to derive insights from, and take advantage of, data that had previously been untapped.

In a nutshell, document understanding allows you to:

Organize documents
Extract knowledge from documents
Increase processing speed

At Google Cloud we provide a solution, Document AI, which enterprises can leverage in collaboration with partners to implement document understanding. However, many developers have both the desire and the technical expertise to build their own document understanding pipelines on Google Cloud Platform (GCP), without working with a partner, using the individual Document AI products. If that sounds like you, this post will take you step by step through a complete document understanding pipeline. The Overview section explains how the pipeline works, and the step-by-step directions below walk you through running the code.
Overview

In order to automate an entire document understanding process, multiple machine learning models need to be trained and then daisy-chained together, alongside processing steps, into an end-to-end pipeline. This can be a daunting process, so we have provided sample code for a complete document understanding system that mirrors a data entry workflow capturing structured data from documents.

Our example end-to-end document understanding pipeline consists of two components:

1. A training pipeline, which formats the training data and uses AutoML to build Image Classification, Entity Extraction, Text Classification, and Object Detection models.
2. A prediction pipeline, which takes PDF documents from a specified Cloud Storage bucket, uses the AutoML models to extract the relevant data from the documents, and stores the extracted data in BigQuery for further analysis.

Training Data

The training data for this example pipeline comes from a public dataset containing PDFs of U.S. and European patent title pages, with a corresponding BigQuery table of manually entered data from the title pages. The dataset is hosted by the Google Public Datasets Project.

Part 1: The Training Pipeline

The training pipeline consists of the following steps:

Training data is pulled from the BigQuery public dataset. The training BigQuery table includes links to PDF files in Google Cloud Storage of patents from the United States and European Union.
The PDF files are converted to PNG files and uploaded to a new Cloud Storage bucket in your own project. The PNG files will be used to train the AutoML Vision models.
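One way to sketch this PDF-to-PNG conversion step is with pdftoppm from poppler-utils (one of the system dependencies installed in the step-by-step directions). This is a hedged illustration of the idea, not the sample code's actual conversion logic, and the helper name is our own:

```python
import subprocess
from pathlib import Path

def pdf_to_png_cmd(pdf_path, out_dir):
    """Build a pdftoppm command that renders each page of the PDF as a
    PNG named <prefix>-<page>.png under out_dir."""
    stem = Path(pdf_path).stem
    return ["pdftoppm", "-png", str(pdf_path), str(Path(out_dir) / stem)]

# Example (requires poppler-utils to be installed):
# subprocess.run(pdf_to_png_cmd("patent.pdf", "pngs"), check=True)
```

The resulting PNGs would then be uploaded to the project's Cloud Storage bucket for AutoML Vision training.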
The PNG files are run through the Cloud Vision API to create TXT files containing the raw text from the converted PDFs. These TXT files are used to train the AutoML Natural Language models.
The links to the PNG or TXT files are combined with the labels and features from the BigQuery table into a CSV file in the training data format required by AutoML. This CSV is then uploaded to a Cloud Storage bucket. Note: this format is different for each type of AutoML model.
The CSV is used to create an AutoML dataset and model. Both are named in the format patent_demo_data_%m%d%Y_%H%M%S. Note that some AutoML models can take hours to train.

Part 2: The Prediction Pipeline

This pipeline uses the AutoML models previously trained by the training pipeline. For predictions, the following steps occur:

The patent PDFs are collected from the prescribed bucket and converted to PNG and TXT files with the Cloud Vision API (just as in the training pipeline).
The AutoML Image Classification model is called on the PNG files to classify each patent as either a US or EU patent. The results are uploaded to a BigQuery table.
The AutoML Object Detection model is called on the PNG files to determine the location of any figures on the patent document. The resulting relative x, y coordinates of the bounding box are then uploaded to a BigQuery table.
The AutoML Text Classification model is called on the TXT files to classify the topic of the patent content as medical technology, computer vision, cryptocurrency, or other. The results are then uploaded to a BigQuery table.
The AutoML Entity Extraction model is called to extract predetermined entities from the patent.
The extracted entities are applicant, application number, international classification, filing date, inventor, number, publication date, and title. These entities are then uploaded to a BigQuery table.
Finally, the BigQuery tables above are joined to produce a final results table with all of the properties above.

Step-by-Step Directions

For the developers out there, here's how you can build the document understanding pipeline of your dreams. You can find all the code in our GitHub repository.

Before you begin:

You'll need a Google Cloud project to run this demo. We recommend creating a new project.
We recommend running these instructions in Google Cloud Shell (Quickstart). Other environments will work, but you may have to debug issues specific to your environment. A Compute Engine VM would also be a suitable environment.

1. Git clone the repo and navigate to the patents example.

```
git clone https://github.com/munnm/professional-services.git
cd professional-services/examples/cloudml-document-ai-patents
```

2. Install the necessary system dependencies. Note that in Cloud Shell, system-wide changes do not persist between sessions, so if you step away while working through these step-by-step instructions you will need to rerun these commands after restarting Cloud Shell.

```
sudo apt-get update
sudo apt-get install -y imagemagick jq poppler-utils
```

3. Create a virtual environment and activate it.

```
virtualenv --python=python3 $HOME/patents-demo-env
source $HOME/patents-demo-env/bin/activate
```

When your virtual environment is active, you'll see patents-demo-env in your command prompt.
Note: If your Cloud Shell session ends, you'll need to reactivate the virtual environment by running the second command again.

For the remaining steps, make sure you're in the correct directory in the professional-services repo: /examples/cloudml-document-ai-patents/.

4. Install the necessary libraries into the virtual environment.

```
pip3 install -r requirements.txt
```

5. Activate the necessary APIs.

```
gcloud services enable vision.googleapis.com automl.googleapis.com
```

Note: We use the gcloud SDK to interact with various GCP services. If you are not working in Cloud Shell, you'll have to set your GCP project and authenticate by running gcloud init.

6. Edit the config.yaml file. Look for the configuration parameter project_id (under pipeline_project) and set the value to the project ID where you want to build and run the pipeline. Make sure to enclose the value in single quotes, e.g. project_id: 'my-cool-project'. Note: if you're not used to working in a shell, run nano to open a simple text editor.

7. Also in config.yaml, look for the configuration parameter creator_user_id (under service_acct) and set the value to the email account you use to log in to Google Cloud. Make sure to enclose the value in single quotes, e.g. creator_user_id: '[email protected]'.

8. Create and download a service account key with the necessary permissions to be used by the training and prediction pipelines.

```
./get_service_acct.sh config.yaml
```

9. Run the training pipeline. This may take 3–4 hours, though some models will finish more quickly.
```
python3 run_training.py
```

Note: If Cloud Shell closes while the script is still downloading, converting, or uploading the PDFs, you will need to reactivate the virtual environment, navigate to the directory, and rerun the pipeline script. The image processing should take about 15–20 minutes, so make sure Cloud Shell doesn't close during that time.

10. After training the models (wait about 4 hours), your Cloud Shell session has probably disconnected.
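When you reconnect, the timestamped naming convention from the overview (patent_demo_data_%m%d%Y_%H%M%S) helps you spot the AutoML datasets and models your run created. A minimal sketch of how such a name is generated; the function name is ours and the pipeline's own code may differ:

```python
from datetime import datetime

def make_run_name(prefix="patent_demo_data"):
    """Build a timestamped name following the pipeline's
    patent_demo_data_%m%d%Y_%H%M%S convention."""
    return datetime.now().strftime(f"{prefix}_%m%d%Y_%H%M%S")

print(make_run_name())  # e.g. patent_demo_data_09202019_143005
```

Datasets and models created by a single pipeline run share one timestamp, which makes them easy to group in the AutoML UI.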