Distributed workflow execution: how containers saved the day The case of OpenMinTeD
Developers Team @ GRNET S.A. Who are we?
GRNET: Greek Research and Technology Network, www.grnet.gr
GRNET developers involved: • Stavros Sachtouris, [email protected] • Theodoros Sotiropoulos, [email protected] • Kostas Vogias, [email protected] • Nick Vathis, [email protected] About OpenMinTeD
• Open Mining Infrastructure for Text & Data • www.openminted.eu • Funding: European Commission • Partners: many European academic/research institutes and enterprises
• Goal: an open service to create, discover, share and re-use knowledge extracted from a wide range of scientific resources.

OpenMinTeD:
• TDM tools: Text & Data Mining
• Corpora: large datasets
• Workflows: chains of TDM tools
• Users: international academia and more...

GRNET:
• ~okeanos: large IaaS cloud; thousands of VMs + storage
• Synnefo: IaaS software developed by GRNET

~okeanos: a Synnefo-powered IaaS
• okeanos.grnet.gr
• Serves the Greek and European research and academic community since 2011
• Over 9000 VMs!
• 3 Datacenters all over the country
• One of the largest non-commercial clouds in the EU

~okeanos: a Synnefo-powered IaaS
Features: • Virtual Machines • OS Images (mostly Linux) • IPs (v4 and v6) • Virtual Private Networks • Movable volumes • Object storage • UI, API, CLI, libraries, Resizing, Quotas, Image creator, back-up tool, syncing client and more
Does NOT feature:
• End user services (e.g. wordpress)
• Management panels (e.g. CPanel)
• Service-level monitoring
• Service-level accounting

Multi-document text summarization: a text mining example
Input: lots of texts, PDFs, etc. on the same subject
==> convert everything to text
==> segment into sentences
==> lemmatize
==> part-of-speech tagging (grammar tags on words)
==> find terms (term grammar, statistical tools)
==> find sentences that are summary-worthy
==> remove redundancy
Output: summary of multiple documents

Text and Data Mining (TDM) Tools
[Diagram: Input (corpora) ==> TDM tool ==> Output (e.g., extracted metadata)]

TDM tools:
• Diverse runtimes (OS, PL, libs, etc.)
• Various developers
• Cannot be trusted

Corpora:
• Huge bodies of text
• Can be small or big

Metadata:
• Some standards exist (e.g. annotations, frequencies, etc.)

TDM Workflows
Should be able to:
• edit
• save
• load
• re-run
[Diagram: workflows chain tools, each consuming corpora and/or earlier results: Corp. A ==> Tool A ==> Result A ==> Tool B ==> Result B; Corp. B ==> Tool C ==> Result C ==> Tool D ==> Result D]

OpenMinTeD Requirements
[Diagram: Corpora and Workflows feed the Execution Environment, which produces Results]

Key Requirements
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Decisions:
• All TDM tools must be dockerized
• No user management or accounting at cluster level
• Monitor the TDM tools (rather than the stack)

How to "chain" diverse tools
• All tools must be docker images • Run tools as docker containers • Containers mount volumes from a shared space
$ docker run \
    -v /shared/input-for-A:/input \
    -v /shared/input-for-B:/output \
    tdm-tool-A
$ docker run \
    -v /shared/input-for-B:/input \
    -v /shared/input-for-C:/output \
    tdm-tool-B
...
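The same chaining idea can be simulated end-to-end with plain Python callables standing in for the dockerized tools (directory layout and names below are illustrative, not part of the actual stack): each tool reads every file in its input directory and writes results to an output directory on the shared space, and the runner wires stage N's output to stage N+1's input.

```python
# Sketch: each "tool" is a black box that reads files from an input
# directory and writes results to an output directory -- exactly how the
# dockerized TDM tools see the shared NFS space. Plain callables stand in
# for `docker run` invocations; all names are illustrative.
from pathlib import Path


def run_tool(tool, in_dir: Path, out_dir: Path) -> None:
    """Apply one tool to every file in in_dir, writing into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for f in in_dir.iterdir():
        if f.is_file():
            (out_dir / f.name).write_text(tool(f.read_text()))


def run_chain(tools, corpus_dir: Path, shared: Path) -> Path:
    """Run tools in sequence; stage N's output dir is stage N+1's input."""
    src = corpus_dir
    for i, tool in enumerate(tools):
        dst = shared / f"input-for-{i + 1}"  # next stage reads from here
        run_tool(tool, src, dst)
        src = dst
    return src  # directory holding the final results
```

Since every tool only sees two directories, any dockerized tool that honors the same contract can be dropped into the chain without the runner knowing anything about its runtime.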
Key Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Decisions:
• All TDM tools must be dockerized
• No user management or accounting at cluster level
• Monitor the TDM tools (rather than the stack)

Registry + UI AAI
[Architecture diagram: Registry + UI and AAI sit in front of the Workflow Service (REST API, WF editor, WF engine); the service drives a Cluster Manager, Tool Registry, Scheduler and Monitor; tasks run on Workers over a Shared FS, with both Workers and Shared FS able to scale out; corpora go in, results come out, and Accounting tracks usage. The same diagram is then shown without the user-facing parts: REST API, Cluster Manager, Tool Registry, Scheduler, Monitor, WF engine, Workers, Shared FS.]
Key Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Decisions:
• Each component runs on (probably) its own VM
• Open source components
• Components with REST API

Components
[Components diagram: concrete technology for each component]
• Workflow service: Galaxy (REST API)
• Cluster manager: Mesos master
• Scheduler: Chronos
• Tool registry: Docker Registry
• Monitoring: Prometheus + Grafana
• Worker: Mesos slave + Docker engine + cAdvisor
• Shared FS: NFS with LVM (Workers and NFS scale out)

Galaxy glue, shown on a second diagram:
• Galaxy-chronos runner (Galaxy ==> Chronos)
• Galaxy-pithos driver (Galaxy <==> Pithos+ Object Store)
• Galaxy and NFS on the same host
Key Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Key Decisions:
• Modular architecture
• Choose something simple for now, we will refine it later

Mesos Cluster
Mesos:
• Modular
• Can use many different schedulers simultaneously
• Scalability, high availability
• Supports docker

Chronos:
• Short-lived tasks
• Queue of tasks
• There are issues...
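For reference, a one-off dockerized task handed to Chronos is described by a JSON job like the one built below. The field names follow the Chronos job schema, but the image name, volume path, resources and endpoint are placeholders, and the exact schema varies across Chronos versions.

```python
# Sketch of a Chronos job definition for a one-off dockerized TDM task.
# Values (image, paths, resources) are placeholders; the schema and the
# submission endpoint may differ between Chronos versions.
import json


def chronos_job(name, image, host_dir, command, cpus=1.0, mem=2048):
    return {
        "name": name,
        "schedule": "R1//PT1M",  # "R1" = run exactly once
        "command": command,
        "cpus": cpus,
        "mem": mem,              # MiB
        "container": {
            "type": "DOCKER",
            "image": image,
            "volumes": [{
                "hostPath": host_dir,        # the shared NFS space
                "containerPath": "/shared",
                "mode": "RW",
            }],
        },
    }


payload = json.dumps(chronos_job("tdm-tool-a", "registry:5000/tdm-tool-a",
                                 "/mnt/nfs/shared", "run-tool /shared"))
# A scheduler would POST this to Chronos, along the lines of:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d "$payload" http://chronos:4400/scheduler/iso8601
```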
[Diagram: Chronos and the Mesos master scheduling on four Mesos slaves over NFS with LVM]

Mesos Cluster can expand
Need more processing power? --> add Workers
Need more storage? --> add volumes

[Diagram: the same cluster, with Mesos slaves and NFS volumes scaling out]

Why not Docker Swarm?
Docker Swarm:
• Scalability
• High availability
• Easy to manage
• ...but no custom scheduler

[Diagram: Docker Swarm nodes over NFS with LVM, both scaling out]

Technologies we considered
• Mesos/Marathon: good for persistent services, not for one-off tasks
• Docker Swarm: meets all requirements, except the custom scheduler
• Kubernetes: meets all requirements, but too complicated, an overkill
• Hadoop-YARN/Slider: nice for a custom scheduler, optimal for some TDM operations, but those optimizations are not utilized with docker containers and it cannot cooperate with many tools
• Mesos/Chronos: meets all requirements, but Chronos is a pain...
• Mesos/Custom scheduler: future work...
[Diagram: data flows between the user's storage (local disk), the Pithos+ Object Store, Galaxy, and the Workers' shared NFS with LVM]

Storage Performance
Q: Isn't NFS slow? A: Maybe, but it currently performs faster than other parts of the system!
Q: What is the bottleneck, then? A: Upload/download from cloud and/or user's disk.
Q: How can you optimize? A: Separate upload/download from execution, upload/download directly to shared FS. Any other ideas? Too many workflows to run?
• Galaxy enforces resource usage limits
• Galaxy and Chronos have job queues
• Monitoring and alerts
• Scale as soon as you get alerted! AAI Monitoring, Alerts
[Diagram: on each Worker (Mesos slave + Docker engine), cAdvisor gathers container resource consumption at VM level; Prometheus aggregates resource consumption at cluster level; Grafana sends email alerts for overused resources]

Are these alerts enough?
Workflow-related alerts:
• Cluster resources used simultaneously (CPU, RAM, disk space)
• Only interested in docker container consumption
System-wide alerts: • NFS is full • Host disks are full • Services break down AAI
Accounting relates:
• Task ID
• User ID
• Resource consumption

The task ID is preserved on container metadata.
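One way to make that relation concrete, assuming the scheduler sets a `task_id` docker label at launch (the label name is our convention): cAdvisor, depending on version and flags, exports container labels on its metrics as `container_label_<name>`, so Prometheus can aggregate usage per task.

```python
# Sketch: attach the workflow task ID as a docker label at launch time;
# cAdvisor then exposes it on every container metric as
# container_label_task_id, which PromQL can group and filter by.
# The label name `task_id` is our own convention.


def docker_run_args(image, task_id):
    """docker run arguments that preserve the task ID on container metadata."""
    return ["docker", "run", "--label", f"task_id={task_id}", image]


def cpu_seconds_by_task(task_id, window="5m"):
    """PromQL: CPU usage rate summed over all containers of one task."""
    return ('sum(rate(container_cpu_usage_seconds_total'
            f'{{container_label_task_id="{task_id}"}}[{window}]))')
```

Accounting then only has to join the task ID (known to Galaxy, together with the user ID) against these per-task aggregates.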
[Diagram: Galaxy knows the Task ID and User ID; on each Worker (Mesos slave, Docker engine, NFS), cAdvisor gathers container resource consumption at VM level, and Prometheus/Grafana aggregate it]

Scalability
Need more containers? --> add Workers
Need more storage? --> add volumes
But what about storage bandwidth?

[Diagram: Chronos and the Mesos master over scaling Mesos slaves and a scaling NFS with LVM; storage bandwidth is flagged with "???"]

Scalability
NFS is limited by the network bandwidth. Solutions:
• Tune NFS caching
• Create a data reallocation service (defeats the purpose of Mesos)
• Make each tool request its data from a service (against design goals)
• Use a more scalable object store (e.g. HDFS, Pithos+, S3)

Docker images and containers clean up
[Architecture diagram as before: Cluster Manager, Tool Registry, Scheduler, Monitor, WF engine, Workers, Shared FS]

Docker images and containers clean up
• What if a host is stuffed with docker images and stopped containers?
Solutions: • Manual deletion of images on alert... • Clean up everything periodically • Clean up when execution finishes (bad performance) • NFS storage instead of docker registry (does not seem possible) • Sophisticated garbage collection? • Combination of the above? Something else? Docker images and containers clean up
• What if a new image has the same tag as the old one?
Solutions: • Manual deletion of images on update • Clean up when execution finishes (bad performance) • Delete image from all nodes when updating docker registry Sophisticated Garbage Collection
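A sophisticated collector could decide evictions with a Least Recently Used policy. A minimal sketch of that decision, assuming per-image size and last-used statistics have already been gathered (the input format below is our own convention, and the gathering is the hard part):

```python
# Sketch of the LRU eviction decision for docker images on a worker:
# evict the least recently used images until enough space is freed.
# The dict keys ('name', 'size', 'last_used') are our own convention.


def images_to_evict(images, bytes_needed):
    """images: iterable of dicts with 'name', 'size' (bytes) and
    'last_used' (epoch seconds). Returns image names to delete,
    oldest-used first, until at least bytes_needed would be freed."""
    evict, freed = [], 0
    for img in sorted(images, key=lambda i: i["last_used"]):
        if freed >= bytes_needed:
            break
        evict.append(img["name"])
        freed += img["size"]
    return evict
```

In practice the statistics should be cluster-wide, so this decision would run against aggregated data rather than one node's view.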
Why is it so hard? Because it has to take into account: • Size of each image • Usage patterns of each image (Least Recently Used) • Needs tight integration with docker registry to monitor image updates (abstract away docker registry with a RESTful service) • Statistics ideally should be cluster-wide and not node specific Setup and operations
• Provision VMs with a kamaki script (kamaki: the de facto Synnefo API client)
• Ansible + manual ssh (but one day it's gonna be just ansible...)
In the future • An extra Mesos/Marathon to keep components alive • Nagios alerts • Automatic scaling with triggers and bash/kamaki scripting • Support provisioning for all major clouds (OpenStack, Azure, etc.) Security
• Application layer • Reverse Proxy • Web Application Firewall • Security updates • Network layer • Minimal exposure • Firewall rules • Physical layer • On-premises installation Network Layer
Three types of accesses: • Component to Component access • Operations Support access • External API access
Component to Component access can be restricted to internal network routing, as provided by the IaaS platform (e.g. Synnefo private networks) Network Layer
Operations Support needs access to: • Chronos UI • Mesos master UI • Grafana UI • SSH on all machines
Access to these services can be restricted with a VPN or firewall rules to allow only specific subnets. Network Layer
External API callers need access to: • Galaxy • Docker Registry • Prometheus
Access to these services can be restricted with firewall rules to allow only specific IP addresses. Network Layer Workflows Accounting
External APIs: Galaxy REST API, Docker Registry API(?), Prometheus API

[Diagram: these APIs are the externally exposed surface of the stack (Galaxy, Docker Registry, Prometheus/Grafana, Chronos, Mesos master, Workers, NFS with LVM); everything else stays internal]

Application Layer
A case against web application firewalls: • External APIs • Tentative APIs • Prometheus API has a query language that must be parsed
Also: • Operations must be in full control of WAF • All tools are open source – there is no extra overhead if the WAF is configured by an independent party A Special Case of Security
Someone might run arbitrary code on our platform:
• Bitcoin/Ethereum mining
• TOR exit node
• TCP proxy
• DoS attack
• Attack against the infrastructure
Do we need to mitigate these? Probably not (except for the last one). Each user is AAI authenticated and each resource is accounted externally. Thank you Questions?