Building a Cloud Platform using AWS for Data Analysis of Digital Library

Yinlin Chen, Ph.D.
Software Engineer
[email protected]
Research and Informatics, Virginia Tech Libraries

Agenda

• Virginia Tech Libraries services and dataset
• Data analytics platform
• Architecture design and considerations (AWS)
• Lessons learned and best practices
• Future work

Virginia Tech Libraries Services

• VTechData: Samvera/Sufia and Fedora
• VTechWorks: DSpace

Virginia Tech Libraries Dataset

• Log data:
  – Application log
  – Web log
• Collection content:
  – Virginia Tech ETD
  – Research content

Data Analytics Platform

• Log data analysis
  – Detect particular log events and patterns
  – Identify performance issues
  – Applications' normal and unusual behaviors
• Content analysis
  – Topics/categories
  – Trends
  – Identify record issues
  – Usage statistics

From Log Data to Visualization

From Content to Visualization

Data Analytics Pipeline

Collect → Store → Process & Analyze → Visualize

Platform Decisions

• Not building everything from scratch
• Easy to build a prototype, then ship to production
• Low maintenance needs
• Programmable, automatable, and optimizable
• Learning curve

Architecture Decisions

• Agility
• Stability
• Scalability
• Extensibility
• Security
• Auditability
• Cost

Start Small and Grow Big

• Collection size: 100 → 10,000 → 1M → 100M
• Storage size: 1MB → 1GB → 1TB → 1PB

From Simple to Complex

Log Analytics Options

• ELG: Elasticsearch, Logstash, Graylog
• ELK: Elasticsearch, Logstash, Kibana
• EKK: Elasticsearch, Kinesis, Kibana

Amazon Kinesis, Amazon Elasticsearch Service

Amazon Kinesis Services

• Use Amazon Kinesis Streams to collect and process large streams of data records in real time
• Use Amazon Kinesis Firehose to deliver real-time streaming data to destinations such as Amazon S3 and Amazon Redshift
• Use Amazon Kinesis Analytics to process and analyze streaming data with standard SQL

Data Analytics Pipeline in AWS
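As a sketch of the collect step: the Kinesis Streams PutRecords API accepts at most 500 records per call, so a collector batches log lines before sending. `chunked` is an illustrative helper name, not part of any AWS SDK; a real sender would pass each batch to boto3's `kinesis.put_records(...)`.

```python
# Illustrative batching for the collect stage. Kinesis PutRecords accepts
# at most 500 records per call, so buffered log lines are grouped into
# batches of that size before each API call.
PUT_RECORDS_LIMIT = 500

def chunked(records, batch_size=PUT_RECORDS_LIMIT):
    """Yield successive batches no larger than the PutRecords limit."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

For example, 1,200 buffered log lines would go out as three calls of 500, 500, and 200 records.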

Collect → Store → Process & Analyze → Visualize

Collect:
• Data Import – AWS Import/Export Snowball
• Real-time – Amazon Kinesis Streams, Amazon Kinesis Firehose
• Message Queuing – Amazon SQS
• Web/App Servers – Amazon EC2
• IoT – AWS IoT

Store:
• Object Store – Amazon S3, Amazon Glacier
• RDBMS – Amazon RDS
• NoSQL – Amazon DynamoDB
• Search – Amazon CloudSearch

Process & Analyze:
• Hadoop Ecosystem – Amazon EMR
• Real-time – AWS Lambda
• Real-time Analytics – Amazon Kinesis Analytics
• Elasticsearch Analytics – Amazon Elasticsearch Service
• Data Warehousing – Amazon Redshift
• Machine Learning – Amazon Machine Learning
• Process & Move Data – AWS Data Pipeline

Visualize:
• Business Intelligence & Data Visualization – Amazon QuickSight

Data Analytics Architecture

Collect → Store → Process & Analyze → Visualize

Data Source → Amazon S3 → Amazon Elasticsearch Service → Kibana

AWS Services

• Amazon S3
• AWS Lambda
• Amazon Elasticsearch Service with Kibana
• AWS Config
• Amazon SNS
• IAM
• CloudWatch
• CloudTrail
• VPC

Collect and Store

• The largest S3 object size is 5TB
• The largest object that can be uploaded in a single PUT is 5GB
• Use multipart upload when the object size is larger than 100MB
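A minimal sketch of applying these limits when choosing an upload strategy. The thresholds mirror the bullets above; the 10,000-part cap is an additional S3 multipart constraint. `upload_plan` is an illustrative helper, not an AWS API.

```python
# Decide how to upload an object to S3 given the size limits above:
# 5TB max object, 5GB max single PUT, multipart recommended over 100MB,
# and at most 10,000 parts per multipart upload.
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

MAX_OBJECT = 5 * TB          # largest S3 object
MULTIPART_CUTOFF = 100 * MB  # recommended multipart threshold
MAX_PARTS = 10_000           # S3 multipart part limit

def upload_plan(size_bytes, part_size=100 * MB):
    """Return ('put', 1) for a single PUT or ('multipart', n_parts)."""
    if size_bytes > MAX_OBJECT:
        raise ValueError("object exceeds the 5TB S3 limit")
    if size_bytes <= MULTIPART_CUTOFF:
        return ("put", 1)
    n_parts = -(-size_bytes // part_size)   # ceiling division
    while n_parts > MAX_PARTS:
        part_size *= 2                      # grow parts to stay under the cap
        n_parts = -(-size_bytes // part_size)
    return ("multipart", n_parts)
```

In practice, boto3's `TransferConfig`/`upload_file` automates exactly this decision; the sketch just makes the arithmetic visible.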

Log data, collection content, and other data sources are stored in Amazon S3 buckets.

Process Data in S3 and Send to Elasticsearch

A file uploaded to S3 triggers a Lambda function, which ingests the content into Elasticsearch.

Content Transformation Using AWS Lambda
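The S3-to-Elasticsearch step can be sketched as a Lambda handler. The index name `vtlogs` and the document shape are illustrative assumptions; a real handler would also fetch the object from S3 with boto3 and POST the body to the Amazon Elasticsearch Service `_bulk` endpoint.

```python
# Sketch of the S3-triggered Lambda: pull bucket/key out of the S3 event
# notification and build an Elasticsearch bulk-indexing payload.
import json

def build_bulk_body(event, lines, index="vtlogs"):
    """Turn an S3 event plus the object's log lines into a _bulk body."""
    record = event["Records"][0]["s3"]
    source = f'{record["bucket"]["name"]}/{record["object"]["key"]}'
    actions = []
    for line in lines:
        actions.append(json.dumps({"index": {"_index": index}}))
        actions.append(json.dumps({"source": source, "message": line}))
    return "\n".join(actions) + "\n"   # _bulk bodies end with a newline

def handler(event, context):
    # In Lambda the object body would be read from S3 here; this sketch
    # stands in a fixed line for illustration.
    body = build_bulk_body(event, ["example log line"])
    return {"lines": body.count("\n")}
```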

"2018-01-30T08:13:13.200Z","192.168.2.6","prod_log: 202.248.84.158 vtechworks.lib.vt.edu [30/Jan/2018:03:13:15 -0500] GET /bitstream/handle/10919/33268/WGraf_Thesis_2005.pdf?sequence=1 HTTP/1.1 Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko"

Amazon Elasticsearch Service

• AWS managed service
• Log analytics
• Full-text search
• Integrated with Kibana
• Integrated with other AWS services
• Different instance types come with different limits
  – EBS volume size: GB to TB
  – HTTP request payloads: 10MB or 100MB

Amazon Elasticsearch Service Best Practices

• Double the index size if deploying an index replica
• Size based on storage requirements
  – Amazon Elastic Block Store (EBS) per instance
  – Example: a 2TB corpus needs 8 instances when using 512GB EBS per instance (4 instances for the primary index, doubled for one replica)
• Use General Purpose SSD (GP2) EBS volumes
• Use 3 dedicated master nodes for production deployments
• Enable zone awareness
• Set indices.fielddata.cache.size to 40%

AWS CLI, Console and SDK
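The 2TB sizing example above follows from two of the bullets: one replica doubles the storage need, and the instance count is that total divided by the EBS volume per instance, rounded up. A sketch of the arithmetic, with `instances_needed` as an illustrative helper:

```python
# Sizing rule from the best practices above: replicas multiply the storage
# need, and the instance count is the ceiling of total storage over the
# per-instance EBS volume.
def instances_needed(corpus_gb, ebs_per_instance_gb, replicas=1):
    total_gb = corpus_gb * (1 + replicas)       # one replica doubles the index
    return -(-total_gb // ebs_per_instance_gb)  # ceiling division
```

A 2TB (2,048GB) corpus with one replica and 512GB EBS per instance gives 4,096 / 512 = 8 instances, matching the example.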

• AWS CLI

Visualize - Kibana

Pipeline components: Amazon S3 bucket, AWS Lambda, Amazon Elasticsearch Service, Amazon SNS topic, Amazon EMR, IAM, CloudWatch, CloudTrail, AWS Config

Design Principles and Patterns in AWS

• Decouple the data analytics pipeline
• Use the right tools
• Use managed services
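An illustrative sketch of the decoupling principle: each pipeline stage reads from an inbound queue and writes to an outbound one, so stages scale and fail independently. In AWS the queues would be SQS or Kinesis; here plain deques stand in, and all names are illustrative.

```python
# Each stage shares no state with its neighbors except the queues between
# them, mirroring the collect -> store -> analyze decoupling above.
from collections import deque

def run_stage(fn, inbox, outbox):
    """Drain inbox through fn into outbox; stages touch only their queues."""
    while inbox:
        outbox.append(fn(inbox.popleft()))

raw, stored, analyzed = deque(["log line"]), deque(), deque()
run_stage(str.upper, raw, stored)   # stand-in "collect/store" stage
run_stage(len, stored, analyzed)    # stand-in "analyze" stage
```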

Data → Amazon Kinesis → Amazon S3 (storage) → Amazon EMR (process) → Amazon DynamoDB / Amazon Redshift (storage) → Amazon QuickSight (access) → Insight

Future Work

• Resiliency, automation, and optimization
• Batch/real-time analytics
  – Amazon EMR
  – Kinesis Streams/Analytics
• Pattern learning, recognition, and prediction
  – Machine learning
  – Deep learning

Batch/Real-time Analytics

Data source: logs stored in Amazon S3, processed by Amazon EMR clusters running Spark, Presto, Hive, and Tez.

Machine Learning Predictions

• Load the dataset into Amazon S3
• Query for predictions with the Amazon ML batch API
• Store prediction results in Amazon S3
• Load predictions into Amazon Redshift
• Store/retrieve structured data and predictions in Amazon Redshift
• Send answers to the user application

Other Cloud Platforms

vs AWS

Q & A

Supported by the Virginia Tech Libraries Beyond Boundaries project and the AWS Cloud Credits for Research program

Yinlin Chen [email protected]