Distributed workflow execution: how containers saved the day
The case of OpenMinTeD
Developers Team @ GRNET S.A.

Who are we?
GRNET: Greek Research and Technology Network, www.grnet.gr
GRNET developers involved:
• Stavros Sachtouris, [email protected]
• Theodoros Sotiropoulos, [email protected]
• Kostas Vogias, [email protected]
• Nick Vathis, [email protected]

About OpenMinTeD
• Open Mining Infrastructure for Text & Data
• www.openminted.eu
• Funding: European Commission
• Partners: many European academic/research institutes and enterprises
• Goal: an open service to create, discover, share and re-use knowledge extracted from a wide range of scientific resources

OpenMinTeD
• TDM tools: Text & Data mining software
• Corpora: large datasets
• Workflows: chains of TDM tools
• Users: international academia and more...

GRNET
• ~okeanos: a large IaaS cloud, thousands of VMs + storage
• Synnefo: IaaS software developed by GRNET

~okeanos: a Synnefo-powered IaaS
• okeanos.grnet.gr
• Serves the Greek and European research and academic community since 2011
• Over 9000 VMs!
• 3 datacenters all over the country
• One of the largest non-commercial clouds in the EU

~okeanos: a Synnefo-powered IaaS
Features:
• Virtual Machines
• OS images (mostly Linux)
• IPs (v4 and v6)
• Virtual Private Networks
• Movable volumes
• Object storage
• UI, API, CLI, libraries, resizing, quotas, image creator, back-up tool, syncing client and more
Does NOT feature:
• End-user services (e.g. WordPress)
• Management panels (e.g. cPanel)
• Service-level monitoring
• Service-level accounting

Multi-document text summarization: a text mining example
Input: lots of texts, PDFs, etc. on the same subject
==> convert everything to text
==> segment into sentences
==> lemmatize
==> part-of-speech tagging (grammar tags on words)
==> find terms (term grammar, statistical tools)
==> find sentences that are summary-worthy
==> remove redundancy
Output: a summary of multiple documents

Text and Data Mining (TDM) Tools
[Diagram: Input (corpora) --> TDM tool --> Output (e.g., extracted metadata)]
• Corpora: huge bodies of text; can be small or big; some standards exist
• Metadata: e.g. annotations, frequencies, etc.
• TDM tools: diverse runtimes (OS, PL, libs, etc.); various developers; cannot be trusted

TDM Workflows
Should be able to:
• edit
• save
• load
• re-run
[Diagram: Corp. A --> Tool A --> Result A --> Tool B --> Result B; Corp. B --> Tool C --> Result C --> Tool D --> Result D]

OpenMinTeD Requirements
[Diagram: Corpora and Workflows enter an Execution Environment, which produces Results]

Key Requirements
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Decisions (against those requirements)
• All TDM tools must be dockerized
• No user management or accounting at cluster level
• Monitor the TDM tools (rather than the stack)

How to "chain" diverse tools
• All tools must be docker images
• Run tools as docker containers
• Containers mount volumes from a shared space

$ docker run -v /shared/input-for-A:/shared/input-for-A \
             -v /shared/input-for-B:/shared/input-for-B tdm-tool-A
$ docker run -v /shared/input-for-B:/shared/input-for-B \
             -v /shared/input-for-C:/shared/input-for-C tdm-tool-B
...
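Concretely, "chaining" is nothing more than running the containers in order against the same shared mount: tool A's output directory is tool B's input directory. The sketch below illustrates the idea; the tdm-tool-A/tdm-tool-B image names are hypothetical, and it assumes the NFS share is mounted at /shared on every worker. It is a minimal sketch, not the production runner.

    #!/bin/sh
    # Run a two-step chain over the shared FS; each step starts only
    # after the previous container has exited.
    set -e
    mkdir -p /shared/input-for-B /shared/input-for-C
    docker run --rm \
      -v /shared/input-for-A:/shared/input-for-A:ro \
      -v /shared/input-for-B:/shared/input-for-B \
      tdm-tool-A
    docker run --rm \
      -v /shared/input-for-B:/shared/input-for-B:ro \
      -v /shared/input-for-C:/shared/input-for-C \
      tdm-tool-B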
[Architecture diagram: behind the AAI, a Workflow Service (REST API, WF editor, WF engine) connects a Registry + UI, Accounting, a Cluster Manager, a Tool Registry, a Scheduler and a Monitor; workflows run on a pool of Workers over a shared FS. Corpora flow in, results flow out, and both the Workers and the shared FS can scale.]

Key Decisions (continued)
• Each component runs on (probably) its own VM
• Open-source components
• Components with a REST API

Components
[Diagram mapping the architecture to concrete technology: Galaxy is the workflow engine behind the REST API; the Mesos master with Chronos is the cluster manager/scheduler; a Docker Registry serves the tool images; Prometheus + Grafana handle monitoring; each Worker is a Mesos slave running the Docker engine and cAdvisor; the shared FS is NFS with LVM.]

[Diagram: a Galaxy-chronos runner submits Galaxy jobs to Chronos, and a Galaxy-pithos driver connects Galaxy to the Pithos+ Object Store; Galaxy and the NFS run on the same host.]

Key Decisions (continued)
• Modular architecture
• Choose something simple for now, we will refine it later

Mesos Cluster
Mesos:
• Modular
• Can use many different schedulers simultaneously
• Scalability, high availability
• Supports docker
Chronos:
• Short-lived tasks
• Queue of tasks
• There are issues...
[Diagram: a Mesos master with Chronos drives four Mesos slaves over NFS with LVM]

Mesos Cluster can expand
• Need more processing power? --> add Workers
• Need more storage? --> add volumes

Why not Docker Swarm?
Swarm is scalable, highly available and easy to manage, but there is no custom scheduler.
[Diagram: Swarm over four Swarm Nodes on NFS with LVM]

Technologies we considered
• Mesos/Marathon: good for persistent services, not for one-off tasks
• Docker Swarm: meets all requirements, except the custom scheduler
• Kubernetes: meets all requirements, but too complicated, an overkill
• Hadoop-YARN/Slider: nice for a custom scheduler and optimal for some TDM operations, but those optimizations are not utilized with docker containers, and it cannot cooperate with many tools
• Mesos/Chronos: meets all requirements; Chronos is a pain...
• Mesos/custom scheduler: future work...

Storage
[Diagram: data moves between the user's storage (local disk, Pithos+ Object Store) and the cluster; Workers share NFS with LVM, and both Workers and storage can scale.]

Storage Performance
Q: Isn't NFS slow?
A: Maybe, but it currently performs faster than other parts of the system!
Q: What is the bottleneck, then?
A: Upload/download from the cloud and/or the user's disk.
Q: How can you optimize?
A: Separate upload/download from execution; upload/download directly to the shared FS. Any other ideas?

Too many workflows to run?
• Galaxy enforces resource usage limits
• Galaxy and Chronos have job queues (see the Chronos sketch below)
• Monitoring and alerts
• Scale as soon as you get alerted!
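For a feel of the scheduler side, here is roughly how one dockerized tool run can be queued on Chronos through its REST API. This is a minimal sketch, not the Galaxy-chronos runner itself: the Chronos host/port, registry host, job name and resource numbers are illustrative assumptions, and the "R1//PT30M" schedule is intended as "run once, starting now".

    # Queue one run of tdm-tool-A as a short-lived Chronos job; the container
    # mounts the shared FS exactly like the plain docker run shown earlier.
    $ curl -L -H "Content-Type: application/json" -X POST \
        http://chronos-host:4400/scheduler/iso8601 \
        -d '{
          "name": "tdm-tool-A-task-42",
          "schedule": "R1//PT30M",
          "epsilon": "PT30M",
          "command": "",
          "cpus": 1.0,
          "mem": 1024,
          "container": {
            "type": "DOCKER",
            "image": "registry-host:5000/tdm-tool-A",
            "volumes": [
              {"hostPath": "/shared/input-for-A", "containerPath": "/shared/input-for-A", "mode": "RO"},
              {"hostPath": "/shared/input-for-B", "containerPath": "/shared/input-for-B", "mode": "RW"}
            ]
          }
        }'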
Monitoring, Alerts
[Diagram: on each Worker (a Mesos slave with the Docker engine), cAdvisor gathers container resource consumption at VM level; Prometheus aggregates resource consumption at cluster level; Grafana visualizes it; email alerts fire for overused resources.]

Are these alerts enough?
Workflow-related alerts:
• Cluster resources used simultaneously (CPU, RAM, disk space)
• Only interested in docker container consumption
System-wide alerts:
• NFS is full
• Host disks are full
• Services break down

Accounting
Relate:
• Task ID
• User ID
• Resource consumption
[Diagram: the AAI provides the User ID and Task ID to the Workflow service; the Task ID is preserved on container metadata, and cAdvisor gathers per-container resource consumption at VM level, so consumption can be attributed back to tasks and users.]

Scalability
• Need more containers? --> add Workers
• Need more storage? --> add volumes
• Storage bandwidth ???

Scalability
NFS is limited by the network bandwidth. Solutions:
• Tune NFS caching
• Create a data reallocation service (defeats the purpose of Mesos)
• Make each tool request its data from a service (against design goals)
• Use a more scalable object store (e.g. HDFS, Pithos+, S3)

Docker images and containers clean-up
What if a host is stuffed with docker images and stopped containers? Solutions:
• Manual deletion of images on alert...
• Clean up everything periodically
• Clean up when execution finishes (bad performance)
• NFS storage instead of a docker registry (does not seem possible)
• Sophisticated garbage collection?
• A combination of the above? Something else?

Docker images and containers clean-up
What if a new image has the same tag as the old one? Solutions:
• Manual deletion of images on update
• Clean up when execution finishes (bad performance)
• Delete the image from all nodes when updating the docker registry

Sophisticated Garbage Collection
Why is it so hard? Because it has to take into account:
• the size of each image
• the usage patterns of each image (Least Recently Used)
• image updates, which require tight integration with the docker registry (abstract the docker registry away behind a RESTful service)
• statistics that should ideally be cluster-wide, not node-specific

Setup and operations
• Provision VMs with a kamaki script (kamaki is the de facto Synnefo API client)
• Ansible + manual ssh (but one day it's gonna be just Ansible...)

In the future
• An extra Mesos/Marathon to keep the components alive
• Nagios alerts
• Automatic scaling with triggers and bash/kamaki scripting (a sketch follows below)
• Support provisioning for all major clouds (OpenStack, ...
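As a taste of the bash/kamaki scaling idea above: when a monitoring alert fires, a trigger script could grow the cluster by one worker. A minimal sketch, assuming a hypothetical worker flavor/image and an add-worker.yml Ansible playbook; the real IDs come from kamaki flavor list and kamaki image list.

    #!/bin/sh
    # Sketch: add one worker VM to the Mesos cluster via the Synnefo API.
    set -e
    NAME="mesos-worker-$(date +%s)"
    # Create the VM (the IDs below are placeholders for a real flavor and OS image).
    kamaki server create --name "$NAME" --flavor-id 42 --image-id WORKER-IMAGE-ID
    # Once the VM is ACTIVE, configure it as a Mesos slave with the same Ansible
    # roles used for the initial setup:
    # ansible-playbook -l "$NAME" add-worker.yml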