Building a Cloud Platform using AWS for Data Analysis of Digital Library

Yinlin Chen, Ph.D.
Software Engineer
[email protected]
Research and Informatics, Virginia Tech Libraries

Agenda

• Virginia Tech Libraries services and dataset
• Data analytics platform
• Architecture design and considerations (AWS)
• Lessons learned and best practices
• Future work

Virginia Tech Libraries Services

• VTechData: Samvera/Sufia and Fedora
• VTechWorks: DSpace

Virginia Tech Libraries Dataset

• Log data:
  – Application log
  – Web log
• Collection content:
  – Virginia Tech ETD
  – Research content

Data Analytics Platform

• Log data analysis
  – Detect particular log events and patterns
  – Identify performance issues
  – Applications' normal and unusual behaviors
• Content analysis
  – Topics/categories
  – Trends
  – Identify record issues
  – Usage statistics

From Log Data to Visualization

From Content to Visualization

Data Analytics Pipeline

Collect → Store → Process & Analyze → Visualize

Platform Decisions

• Not building everything from scratch
• Easy to build a prototype, then ship to production
• Low maintenance needs
• Programmable, automatable, and optimizable
• Learning curve

Architecture Decisions

• Agility
• Stability
• Scalability
• Extensibility
• Security
• Auditability
• Cost

Start Small and Grow Big

• Collection size: 100 → 10,000 → 1M → 100M
• Storage size: 1MB → 1GB → 1TB → 1PB

From Simple to Complex

Log Analytics Options

• ELG: Elasticsearch, Logstash, Graylog
• ELK: Elasticsearch, Logstash, Kibana
• EKK: Elasticsearch, Kinesis, Kibana

Amazon Kinesis, Amazon Elasticsearch Service

Amazon Kinesis Services

• Use Amazon Kinesis Streams to collect and process large streams of data records in real time
• Use Amazon Kinesis Firehose to deliver real-time streaming data to destinations such as Amazon S3 and Amazon Redshift
• Use Amazon Kinesis Analytics to process and analyze streaming data with standard SQL

Data Analytics Pipeline in AWS
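As a sketch of the collect step: the Kinesis Streams PutRecords API accepts at most 500 records per call, so a collector batches log lines before sending. `chunked` is an illustrative helper name, not part of any AWS SDK; a real sender would pass each batch to boto3's `kinesis.put_records(...)`.

```python
# Illustrative batching for the collect stage. Kinesis PutRecords accepts
# at most 500 records per call, so buffered log lines are grouped into
# batches of that size before each API call.
PUT_RECORDS_LIMIT = 500

def chunked(records, batch_size=PUT_RECORDS_LIMIT):
    """Yield successive batches no larger than the PutRecords limit."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

For example, 1,200 buffered log lines would go out as three calls of 500, 500, and 200 records.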

Collect → Store → Process & Analyze → Visualize

Collect:
• Data Import – AWS Import/Export Snowball
• Real-time – Amazon Kinesis Streams, Amazon Kinesis Firehose
• Message Queuing – Amazon SQS
• Web/App Servers – Amazon EC2
• IoT – AWS IoT

Store:
• Object Store – Amazon S3, Amazon Glacier
• RDBMS – Amazon RDS
• NoSQL – Amazon DynamoDB
• Search – Amazon CloudSearch

Process & Analyze:
• Hadoop Ecosystem – Amazon EMR
• Real-time – AWS Lambda
• Real-time Analytics – Amazon Kinesis Analytics
• Elasticsearch Analytics – Amazon Elasticsearch Service
• Data Warehousing – Amazon Redshift
• Machine Learning – Amazon Machine Learning
• Process & Move Data – AWS Data Pipeline

Visualize:
• Business Intelligence & Data Visualization – Amazon QuickSight

Data Analytics Architecture

Collect → Store → Process & Analyze → Visualize

Data Source → Amazon S3 → Amazon Elasticsearch Service → Kibana

AWS Services

• Amazon S3
• AWS Lambda
• Amazon Elasticsearch Service with Kibana
• AWS Config
• Amazon SNS
• IAM
• CloudWatch
• CloudTrail
• VPC

Collect and Store

• The largest S3 object size is 5TB
• The largest object that can be uploaded in a single PUT is 5GB
• Use multipart upload when the object size is larger than 100MB
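A minimal sketch of applying these limits when choosing an upload strategy. The thresholds mirror the bullets above; the 10,000-part cap is an additional S3 multipart constraint. `upload_plan` is an illustrative helper, not an AWS API.

```python
# Decide how to upload an object to S3 given the size limits above:
# 5TB max object, 5GB max single PUT, multipart recommended over 100MB,
# and at most 10,000 parts per multipart upload.
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

MAX_OBJECT = 5 * TB          # largest S3 object
MULTIPART_CUTOFF = 100 * MB  # recommended multipart threshold
MAX_PARTS = 10_000           # S3 multipart part limit

def upload_plan(size_bytes, part_size=100 * MB):
    """Return ('put', 1) for a single PUT or ('multipart', n_parts)."""
    if size_bytes > MAX_OBJECT:
        raise ValueError("object exceeds the 5TB S3 limit")
    if size_bytes <= MULTIPART_CUTOFF:
        return ("put", 1)
    n_parts = -(-size_bytes // part_size)   # ceiling division
    while n_parts > MAX_PARTS:
        part_size *= 2                      # grow parts to stay under the cap
        n_parts = -(-size_bytes // part_size)
    return ("multipart", n_parts)
```

In practice, boto3's `TransferConfig`/`upload_file` automates exactly this decision; the sketch just makes the arithmetic visible.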

Log data, collection content, and other data sources are stored in Amazon S3 buckets.

Process Data in S3 and Send to Elasticsearch

A file uploaded to S3 triggers a Lambda function, which ingests the content into Elasticsearch.

Content Transformation Using AWS Lambda
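The S3-to-Elasticsearch step can be sketched as a Lambda handler. The index name `vtlogs` and the document shape are illustrative assumptions; a real handler would also fetch the object from S3 with boto3 and POST the body to the Amazon Elasticsearch Service `_bulk` endpoint.

```python
# Sketch of the S3-triggered Lambda: pull bucket/key out of the S3 event
# notification and build an Elasticsearch bulk-indexing payload.
import json

def build_bulk_body(event, lines, index="vtlogs"):
    """Turn an S3 event plus the object's log lines into a _bulk body."""
    record = event["Records"][0]["s3"]
    source = f'{record["bucket"]["name"]}/{record["object"]["key"]}'
    actions = []
    for line in lines:
        actions.append(json.dumps({"index": {"_index": index}}))
        actions.append(json.dumps({"source": source, "message": line}))
    return "\n".join(actions) + "\n"   # _bulk bodies end with a newline

def handler(event, context):
    # In Lambda the object body would be read from S3 here; this sketch
    # stands in a fixed line for illustration.
    body = build_bulk_body(event, ["example log line"])
    return {"lines": body.count("\n")}
```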

"2018-01-30T08:13:13.200Z","192.168.2.6","prod_log: 202.248.84.158 vtechworks.lib.vt.edu [30/Jan/2018:03:13:15 -0500] GET /bitstream/handle/10919/33268/WGraf_Thesis_2005.pdf?sequence=1 HTTP/1.1 Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko"

Amazon Elasticsearch Service

• AWS managed service
• Log analytics
• Full-text search
• Integrated with Kibana
• Integrated with other AWS services
• Different instance types come with different limits
  – EBS volume size: GB to TB
  – HTTP request payloads: 10MB or 100MB

Amazon Elasticsearch Service Best Practices

• Double the index size if deploying an index replica
• Size based on storage requirements
  – Amazon Elastic Block Store (EBS) per instance
  – Example: a 2TB corpus needs 8 instances when using 512GB EBS per instance (4 instances for the primary index, doubled for one replica)
• Use General Purpose SSD (GP2) EBS volumes
• Use 3 dedicated master nodes for production deployments
• Enable zone awareness
• Set indices.fielddata.cache.size to 40%

AWS CLI, Console and SDK
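The 2TB sizing example above follows from two of the bullets: one replica doubles the storage need, and the instance count is that total divided by the EBS volume per instance, rounded up. A sketch of the arithmetic, with `instances_needed` as an illustrative helper:

```python
# Sizing rule from the best practices above: replicas multiply the storage
# need, and the instance count is the ceiling of total storage over the
# per-instance EBS volume.
def instances_needed(corpus_gb, ebs_per_instance_gb, replicas=1):
    total_gb = corpus_gb * (1 + replicas)       # one replica doubles the index
    return -(-total_gb // ebs_per_instance_gb)  # ceiling division
```

A 2TB (2,048GB) corpus with one replica and 512GB EBS per instance gives 4,096 / 512 = 8 instances, matching the example.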

• AWS CLI

Visualize - Kibana

Pipeline components: Amazon S3 bucket, AWS Lambda, Amazon Elasticsearch Service, Amazon SNS topic, Amazon EMR, IAM, CloudWatch, CloudTrail, AWS Config

Design Principles and Patterns in AWS

• Decouple the data analytics pipeline
• Use the right tools
• Use managed services
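An illustrative sketch of the decoupling principle: each pipeline stage reads from an inbound queue and writes to an outbound one, so stages scale and fail independently. In AWS the queues would be SQS or Kinesis; here plain deques stand in, and all names are illustrative.

```python
# Each stage shares no state with its neighbors except the queues between
# them, mirroring the collect -> store -> analyze decoupling above.
from collections import deque

def run_stage(fn, inbox, outbox):
    """Drain inbox through fn into outbox; stages touch only their queues."""
    while inbox:
        outbox.append(fn(inbox.popleft()))

raw, stored, analyzed = deque(["log line"]), deque(), deque()
run_stage(str.upper, raw, stored)   # stand-in "collect/store" stage
run_stage(len, stored, analyzed)    # stand-in "analyze" stage
```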

Data → Amazon Kinesis → Amazon S3 (storage) → Amazon EMR (process) → Amazon DynamoDB / Amazon Redshift (storage) → Amazon QuickSight (access) → Insight

Future Work

• Resiliency, automation, and optimization
• Batch/real-time analytics
  – Amazon EMR
  – Kinesis Streams/Analytics
• Pattern learning, recognition, and prediction
  – Machine learning
  – Deep learning

Batch/Real-time Analytics

Data source: logs stored in Amazon S3, processed by Amazon EMR clusters running Spark, Presto, Hive, and Tez.

Machine Learning Predictions

• Load the dataset into Amazon S3
• Query for predictions with the Amazon ML batch API
• Store prediction results in Amazon S3
• Load predictions into Amazon Redshift
• Store/retrieve structured data and predictions in Amazon Redshift
• Send answers to the user application

Other Cloud Platforms

vs AWS

Q & A

Supported by the Virginia Tech Libraries Beyond Boundaries project and the AWS Cloud Credits for Research program

Yinlin Chen [email protected]