Building a Cloud Platform using AWS for Data Analysis of Digital Library
Yinlin Chen, Ph.D., Software Engineer, [email protected]
Research and Informatics, Virginia Tech Libraries

Agenda
• Virginia Tech Libraries services and dataset
• Data analytics platform
• Architecture design and considerations
• Amazon Web Services (AWS)
• Lessons learned and best practices
• Future work

Virginia Tech Libraries Services
• VTechData: Samvera/Sufia and Fedora
• VTechWorks: DSpace

Virginia Tech Libraries Dataset
• Log data:
– Application logs
– Web logs
• Collection content:
– Virginia Tech ETDs
– Research content

Data Analytics Platform
• Log data analysis
– Detect particular log events and patterns
– Identify performance issues
– Applications' normal and unusual behaviors
• Content analysis
– Topics/categories
– Trends
– Identify record issues
– Usage statistics

From Log Data to Visualization

From Content to Visualization

Data Analytics Pipeline
Collect → Store → Analyze → Process & Visualize

Platform Decisions
• Not building everything from scratch
• Easy to build a prototype, then ship to production
• Low maintenance needs
• Programmable, automatable, and optimizable
• Manageable learning curve

Architecture Decisions
• Agility
• Stability
• Scalability
• Extensibility
• Security
• Auditability
• Cost

Start Small and Grow Big
• Collection size: 100 → 10,000 → 1M → 100M
• Storage size: 1 MB → 1 GB → 1 TB → 1 PB

From Simple to Complex

Log Analytics Options
• ELG: Elasticsearch, Logstash, Grafana/Graylog • ELK: Elasticsearch, Logstash, Kibana • EKK: Elasticsearch, Kinesis, Kibana
(Diagram: Amazon Kinesis, Amazon Elasticsearch)

Amazon Kinesis Services
• Use Amazon Kinesis Streams to collect and process large streams of data records in real time
• Use Amazon Kinesis Firehose to deliver real-time streaming data to destinations such as Amazon S3 and Amazon Redshift
• Use Amazon Kinesis Analytics to process and analyze streaming data with standard SQL

Data Analytics Pipeline in AWS
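As a concrete illustration of the Kinesis Streams bullet above, here is a minimal sketch of how a log producer might shape a record before sending it. The stream name and payload fields are assumptions for illustration, not part of the original slides; the boto3 call is shown only in a comment.

```python
import json

def build_kinesis_record(log_line, source_host):
    # Kinesis orders records per shard by PartitionKey; keying on the source
    # host keeps each server's log lines in order within one shard.
    return {
        "StreamName": "library-log-stream",  # hypothetical stream name
        "Data": json.dumps({"host": source_host, "line": log_line}).encode(),
        "PartitionKey": source_host,
    }

record = build_kinesis_record("GET /handle/10919/33268 200", "vtechworks")
# With boto3, this dict maps directly onto the PutRecord call:
#   boto3.client("kinesis").put_record(**record)
print(record["PartitionKey"])  # vtechworks
```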
Collect → Store → Analyze → Process & Visualize

• Collect (real-time): Amazon Kinesis Firehose, Amazon Kinesis Streams
• Collect (data import): AWS Import/Export Snowball
• Store (object store): Amazon S3, Amazon Glacier
• Analyze (Hadoop ecosystem): Amazon EMR
• Analyze (real-time): AWS Lambda, Amazon Kinesis Analytics
• Analyze (Elasticsearch analytics): Amazon Elasticsearch Service
• Visualize (business intelligence & data visualization): Amazon QuickSight
• Message queuing: Amazon SQS
• RDBMS: Amazon RDS
• Data warehousing: Amazon Redshift
• Web/app servers: Amazon EC2
• NoSQL: Amazon DynamoDB
• Machine learning: Amazon Machine Learning
• Search: Amazon CloudSearch
• Elasticsearch analytics: Amazon Elasticsearch Service
• IoT: AWS IoT
• Process & move data: AWS Data Pipeline

Data Analytics Architecture
Collect → Store → Analyze → Process & Visualize

Data Source → Amazon S3 → Amazon Elasticsearch → Kibana

AWS Services
• Amazon S3
• AWS Lambda
• Amazon Elasticsearch Service with Kibana
• AWS Config
• Amazon SNS
• IAM
• CloudWatch
• CloudTrail
• VPC

Collect and Store
• The largest S3 object size is 5 TB
• The largest object that can be uploaded in a single PUT is 5 GB
• Use multipart upload when the object is larger than 100 MB
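The three limits above fold into a simple upload-strategy check. This sketch just encodes the thresholds from the bullets; the function name is illustrative.

```python
# S3 upload limits, per the bullets above.
MAX_OBJECT_SIZE = 5 * 1024**4        # 5 TB: largest S3 object
MAX_SINGLE_PUT = 5 * 1024**3         # 5 GB: largest single PUT
MULTIPART_THRESHOLD = 100 * 1024**2  # 100 MB: switch to multipart upload

def upload_strategy(size_bytes):
    """Pick an upload strategy for an object of the given size."""
    if size_bytes > MAX_OBJECT_SIZE:
        return "too large for S3"
    if size_bytes > MAX_SINGLE_PUT:
        return "multipart upload (required)"
    if size_bytes > MULTIPART_THRESHOLD:
        return "multipart upload (recommended)"
    return "single PUT"

print(upload_strategy(50 * 1024**2))   # single PUT
print(upload_strategy(200 * 1024**2))  # multipart upload (recommended)
print(upload_strategy(10 * 1024**3))   # multipart upload (required)
```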
(Diagram: log data, collection content, and other sources flow into Amazon S3 buckets)

Process Data in S3 and Send to Elasticsearch
A file uploaded to S3 triggers a Lambda function, which ingests the content into Elasticsearch.

Content Transformation Using AWS Lambda
"2018-01-30T08:13:13.200Z","192.168.2.6","prod_log: 202.248.84.158 vtechworks.lib.vt.edu [30/Jan/2018:03:13:15 -0500] GET /bitstream/handle/10919/33268/WGraf_Thesis_2005.pdf?sequence=1 HTTP/1.1 Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko"
Amazon Elasticsearch Service
• AWS managed service
• Log analytics
• Full-text search
• Integrated with Kibana
• Integrated with other AWS services
• Different instance types come with different limits
– EBS volume size: GB to TB
– HTTP request payload: 10 MB or 100 MB

Amazon Elasticsearch Service Best Practices
• Double the index size if deploying index replicas
• Size based on storage requirements
– Amazon Elastic Block Store (EBS) per instance
– Example: a 2 TB corpus needs 8 instances when using 512 GB EBS per instance (2 TB doubled for a replica = 4 TB; 4 TB / 512 GB = 8)
• Use General Purpose SSD (gp2) EBS volumes
• Use 3 dedicated master nodes for production deployments
• Enable zone awareness
• Set indices.fielddata.cache.size to 40%

AWS CLI, Console, and SDK
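The sizing rule in the best-practices bullets above (double the corpus for a replica, then divide by the EBS volume per instance) works out as in this small sketch:

```python
import math

def instances_needed(corpus_gb, ebs_per_instance_gb, replicas=1):
    # One replica doubles the index footprint ("double the index size").
    total_gb = corpus_gb * (1 + replicas)
    return math.ceil(total_gb / ebs_per_instance_gb)

# The slides' example: a 2 TB corpus with 512 GB EBS per instance.
print(instances_needed(2048, 512))  # 8
```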
• AWS CLI

Visualize - Kibana

(Diagram: Amazon S3 bucket, AWS Lambda, Amazon Elasticsearch, Amazon SNS topic, Amazon EMR, IAM, CloudWatch, CloudTrail, AWS Config)

Design Principles and Patterns in AWS
• Decouple the data analytics pipeline
• Use the right tools
• Use managed services
Data → Insight

Amazon Kinesis → Amazon S3 (storage) → Amazon EMR (process) → Amazon DynamoDB / Amazon Redshift (storage) → Amazon QuickSight (access)

Future Work
• Resiliency, automation, and optimization
• Batch/real-time analytics
– Amazon EMR
– Kinesis Streams/Analytics
• Pattern learning, recognition, and prediction
– Machine learning
– Deep learning

Batch/Real-time Analytics
(Diagram: logs stored in Amazon S3 as the data source, queried by Amazon EMR clusters running Spark, Presto, Hive, and Tez)

Machine Learning Predictions

• Query for predictions with the Amazon ML batch API
• Load the dataset into Amazon S3
• Store prediction results in Amazon S3
• Load predictions into Amazon Redshift
• Send answers to the user application
(Diagram: structured data in Amazon Redshift; dataset and predictions in Amazon S3; user application)

Other Cloud Platform
• Google Cloud Platform vs. AWS

Q & A
Supported by the Virginia Tech Libraries Beyond Boundaries project and the AWS Cloud Credits for Research program
Yinlin Chen [email protected]