Distributed workflow execution: how containers saved the day
The case of OpenMinTeD

Developers Team @ GRNET S.A.

Who are we?

GRNET: Greek Research and Technology Network, www.grnet.gr

GRNET developers involved:
• Stavros Sachtouris, [email protected]
• Theodoros Sotiropoulos, [email protected]
• Kostas Vogias, [email protected]
• Nick Vathis, [email protected]

About OpenMinTeD

• Open Mining Infrastructure for Text & Data
• www.openminted.eu
• Funding: European Commission
• Partners: many European academic/research institutes and enterprises
• Goal: an open service to create, discover, share and re-use knowledge extracted from a wide range of scientific resources.

OpenMinTeD | GRNET

OpenMinTeD brings:
• TDM tools: Text & Data Mining tools
• Corpora: large datasets
• Workflows: chains of TDM tools
• Users: international academia and more...

GRNET brings:
• ~okeanos: a large IaaS cloud with thousands of VMs plus storage
• Synnefo: IaaS software developed by GRNET

~okeanos: a Synnefo-powered IaaS

• okeanos.grnet.gr
• Serves the Greek and European research and academic community since 2011
• Over 9000 VMs!
• 3 datacenters all over the country
• One of the largest non-commercial clouds in the EU

~okeanos: a Synnefo-powered IaaS

Features:
• Virtual Machines
• OS Images (mostly Linux)
• IPs (v4 and v6)
• Virtual Private Networks
• Movable volumes
• Object storage
• UI, API, CLI, libraries, resizing, quotas, image creator, back-up tool, syncing client and more

Does NOT feature:
• End-user services (e.g. WordPress)
• Management panels (e.g. cPanel)
• Service-level monitoring
• Service-level accounting

Multi-document text summarization: a text mining example

Input: lots of texts, PDFs, etc. on the same subject
==> convert everything to text
==> segment into sentences
==> lemmatize
==> part-of-speech tagging (grammar tags on words)
==> find terms (term grammar, statistical tools)
==> find sentences that are summary-worthy
==> remove redundancy
Output: summary of multiple documents

Text and Data Mining (TDM) Tools

[Diagram: Input (corpora) → TDM tool → Output (e.g. extracted metadata)]

• Corpora: can be small or big; huge bodies of text; some standards exist
• Metadata: e.g. annotations, frequencies, etc.
• TDM tools: diverse runtimes (OS, PL, libs, etc.); various developers; cannot be trusted

TDM Workflows

Should be able to:
• edit
• save
• load
• re-run

[Diagram: Corpus A → Tool A → Result A → Tool B → Result B → Tool C → Result C → Tool D → Result D]

OpenMinTeD Requirements

[Diagram: Corpora + Workflows → Execution Environment → Results]

Key Requirements

• Scalable: how many workflows can run simultaneously?

• Easy to migrate to other clouds

• Replaceable service components

• Custom task scheduler

• Isolated tool execution

Key Requirements | Key Decisions

Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Decisions:
• All TDM tools must be dockerized
• No user management or accounting at cluster level
• Monitor the TDM tools (rather than the stack)

How to "chain" diverse tools

• All tools must be docker images
• Run tools as docker containers
• Containers mount volumes from a shared space

# Mount paths inside the container (/in, /out) are tool-specific.
$ docker run -v /shared/input-for-A:/in -v /shared/input-for-B:/out tdm-tool-A
$ docker run -v /shared/input-for-B:/in -v /shared/input-for-C:/out tdm-tool-B
...


[Diagram: abstract architecture — a Workflow Service (REST API, WF editor, WF engine) together with external Registry + UI, AAI and Accounting sits on top of a cluster layer (Cluster Manager, Scheduler, Tool Registry, Monitor); Workers and a Shared FS scale horizontally; corpora flow in, results flow out]

Key Requirements | Key Decisions

Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Decisions:
• Each component runs on (probably) its own VM
• Open source components
• Components with a REST API

Components

[Diagram: concrete components — Galaxy provides the REST API and workflow engine; a master node runs Mesos master, Chronos, Docker Registry and Grafana; each Worker runs a Mesos slave, Docker engine and cAdvisor; Workers scale out; NFS with LVM is the shared FS and also scales]

Galaxy

[Diagram: Galaxy integration — a galaxy-chronos runner submits jobs to Chronos on the Mesos master; a galaxy-pithos driver moves data to/from the Pithos+ Object Store; Prometheus, Grafana and the Docker Registry run alongside; Galaxy and NFS live on the same host]

Key Requirements | Key Decisions

Requirements:
• Scalable: how many workflows can run simultaneously?
• Easy to migrate to other clouds
• Replaceable service components
• Custom task scheduler
• Isolated tool execution

Decisions:
• Modular architecture
• Choose something simple for now; we will refine it later

Mesos Cluster

Mesos:
• Modular
• Can use many different schedulers simultaneously
• Scalability, high availability
• Supports docker

Chronos (a job-submission sketch follows the diagram below):
• Short-lived tasks
• Queue of tasks
• There are issues...

[Diagram: Mesos master + Chronos scheduling over Mesos slaves, backed by NFS with LVM]
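With Chronos driving the short-lived tasks, a workflow step boils down to registering a one-off, dockerized job through its REST API. The following is only a minimal, hypothetical sketch: the host/port, job name, image, mount paths and resource figures are placeholders, newer Chronos releases prefix the endpoint with /v1, and the empty command assumes the tool image's default entrypoint does the work.

# Hypothetical sketch: run tdm-tool-A once ("R1"), as a docker container on the cluster.
$ curl -X POST http://chronos-master.example:4400/scheduler/iso8601 \
    -H 'Content-Type: application/json' \
    -d '{
      "name": "tdm-tool-A-task-42",
      "schedule": "R1//PT1M",
      "command": "",
      "cpus": 1.0,
      "mem": 2048,
      "container": {
        "type": "DOCKER",
        "image": "docker-registry.example:5000/tdm-tool-A:latest",
        "volumes": [
          {"hostPath": "/shared/input-for-A", "containerPath": "/in", "mode": "RO"},
          {"hostPath": "/shared/input-for-B", "containerPath": "/out", "mode": "RW"}
        ]
      }
    }'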

Mesos Cluster can expand

• Need more processing power? --> add Workers
• Need more storage? --> add volumes

[Diagram: the same Mesos master + Chronos cluster, with Mesos slaves and NFS volumes scaling out]

Why not Docker Swarm?

Docker Swarm offers scalability, high availability and easy management, but no custom scheduler.

[Diagram: Docker Swarm nodes scaling out over NFS with LVM]

Technologies we considered

• Mesos/Marathon: good for persistent services, not for one-off tasks
• Docker Swarm: meets all requirements except a custom scheduler
• Kubernetes: meets all requirements, but too complicated / overkill
• Hadoop-YARN/Slider: nice for a custom scheduler and optimal for some TDM operations, but the optimizations are not utilized with docker containers and it cannot cooperate with many tools
• Mesos/Chronos: meets all requirements; Chronos is a pain...
• Mesos/Custom scheduler: future work...

[Diagram: data paths — the user's local disk and the Pithos+ Object Store feed Galaxy; the master node runs Mesos master, Chronos, Docker Registry, Prometheus and Grafana; Workers scale out over NFS with LVM]

Storage Performance

Q: Isn't NFS slow?
A: Maybe, but it currently performs faster than other parts of the system!

Q: What is the bottleneck, then?
A: Upload/download from the cloud and/or the user's disk.

Q: How can you optimize?
A: Separate upload/download from execution; upload/download directly to the shared FS (see the sketch below). Any other ideas?
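A minimal sketch of the "directly to the shared FS" idea: pull the corpus from Pithos+ onto the NFS mount from the host that exports it, so Workers never copy it over the user's uplink. The paths are placeholders and kamaki's argument syntax may differ between releases.

# Hypothetical sketch, run on the host that exports the shared FS.
$ kamaki file download /pithos/corpora/pubmed-2016.tar.gz \
    /mnt/nfs/shared/corpora/pubmed-2016.tar.gz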

Too many workflows to run?

• Galaxy enforces resource usage limits

• Galaxy and Chronos have job queues

• Monitoring and alerts

• Scale as soon as you get alerted!

Monitoring, Alerts

[Diagram: monitoring — cAdvisor on each Worker (Mesos slave + Docker engine) gathers container resource consumption at VM level; Prometheus aggregates resource consumption at cluster level; Grafana visualizes it and email alerts fire for overused resources; AAI, Workflows and Accounting hook in externally]

Are these alerts enough?

Workflow-related alerts:
• Cluster resources used simultaneously (CPU, RAM, disk space)
• Only interested in docker container consumption (see the query sketch below)

System-wide alerts:
• NFS is full
• Host disks are full
• Services break down
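For the docker-container-only figure, a cluster-level number can be pulled straight from Prometheus, which scrapes each Worker's cAdvisor. A minimal sketch, assuming the standard cAdvisor metric names, that docker containers appear under the /docker/ cgroup id, and a placeholder Prometheus host:

# Hypothetical sketch: total CPU rate of all docker containers across the cluster
# over the last 5 minutes (Prometheus HTTP query API).
$ curl -G 'http://prometheus.example:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{id=~"/docker/.+"}[5m]))'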

Accounting

Relate:
• Task ID (from the Workflow service)
• User ID (from AAI)
• Resource consumption (from monitoring)

• Preserve the task ID on container metadata (sketched below)
• cAdvisor gathers container resource consumption on VM level

[Diagram: AAI and the Workflow service feed the User ID and Task ID into Accounting; the master node (Mesos master, Chronos, Docker Registry, Prometheus, Grafana) and the Workers (Mesos slave, Docker engine, cAdvisor, NFS) supply the consumption data]
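One way to preserve the task ID on container metadata is a docker label set at launch time; cAdvisor normally re-exports docker labels as container_label_* Prometheus labels, so consumption can then be grouped per task. The label names, IDs and grouping below are illustrative assumptions, not the project's actual scheme:

# Hypothetical sketch: stamp the container with workflow metadata at launch...
$ docker run --label task_id=wf-42-step-3 --label user_id=alice \
    -v /shared/input-for-A:/in -v /shared/input-for-B:/out tdm-tool-A

# ...then aggregate CPU seconds per task in Prometheus (assumes cAdvisor is
# configured to export docker labels).
$ curl -G 'http://prometheus.example:9090/api/v1/query' \
    --data-urlencode 'query=sum by (container_label_task_id) (container_cpu_usage_seconds_total)'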

Scalability

• Need more containers? --> add Workers
• Need more storage? --> add volumes (see the LVM sketch below)
• Storage bandwidth ???

[Diagram: Mesos master + Chronos over Mesos slaves and NFS with LVM, both scaling out]
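The "add volumes" path can stay online: attach a new ~okeanos volume to the NFS host, add it to the LVM volume group and grow the exported filesystem. A minimal sketch, assuming the new disk appears as /dev/vdc, the volume group and logical volume are named shared_vg/shared_lv, and the filesystem is ext4:

$ pvcreate /dev/vdc                                 # prepare the new disk for LVM
$ vgextend shared_vg /dev/vdc                       # add it to the shared volume group
$ lvextend -l +100%FREE /dev/shared_vg/shared_lv    # grow the logical volume
$ resize2fs /dev/shared_vg/shared_lv                # grow the ext4 filesystem online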

Scalability

NFS is limited by the network bandwidth. Solutions:
• Tune NFS caching
• Create a data reallocation service (defeats the purpose of Mesos)
• Make each tool request its data from a service (against design goals)
• Use a more scalable object store (e.g. HDFS, Pithos+, S3)

Docker images and containers clean up

[Diagram: the abstract architecture again — docker images and stopped containers accumulate on the Workers]

• What if a host is stuffed with docker images and stopped containers?

Solutions:
• Manual deletion of images on alert...
• Clean up everything periodically (sketched below)
• Clean up when execution finishes (bad performance)
• NFS storage instead of a docker registry (does not seem possible)
• Sophisticated garbage collection?
• Combination of the above? Something else?
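The "clean up everything periodically" option can be as simple as a cron job on every Worker. A minimal sketch; the prune subcommands need Docker 1.13 or newer:

# Hypothetical sketch: periodic cleanup on each Worker, e.g. from cron.
$ docker container prune -f    # remove all stopped containers
$ docker image prune -a -f     # remove all images not referenced by any container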

Docker images and containers clean up

• What if a new image has the same tag as the old one?

Solutions:
• Manual deletion of images on update
• Clean up when execution finishes (bad performance)
• Delete the image from all nodes when updating the docker registry

Sophisticated Garbage Collection

Why is it so hard? Because it has to take into account:
• The size of each image
• The usage patterns of each image (Least Recently Used)
• Tight integration with the docker registry to monitor image updates (abstract away the docker registry with a RESTful service)
• Statistics should ideally be cluster-wide and not node-specific

Setup and operations

• Provision VMs with a kamaki script (kamaki is the de facto Synnefo API client); see the sketch below
• Ansible + manual ssh (but one day it will be just Ansible...)
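A minimal sketch of the provisioning step: create one more Worker VM with kamaki and hand it to Ansible. The VM name, flavor/image IDs and inventory names are placeholders, and the exact flags can vary between kamaki releases:

$ kamaki server create --name openminted-worker-5 \
    --flavor-id 42 --image-id 1234-abcd-5678 --wait
$ ansible-playbook -i inventory workers.yml --limit openminted-worker-5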

In the future:
• An extra Mesos/Marathon to keep components alive
• Nagios alerts
• Automatic scaling with triggers and bash/kamaki scripting
• Support provisioning for all major clouds (OpenStack, Azure, etc.)

Security

• Application layer: reverse proxy, web application firewall, security updates
• Network layer: minimal exposure, firewall rules
• Physical layer: on-premises installation

Network Layer

Three types of access:
• Component-to-component access
• Operations Support access
• External API access

Component-to-component access can be restricted to internal network routing, as provided by the IaaS platform (e.g. Synnefo private networks).

Network Layer

Operations Support needs access to:
• Chronos UI
• Mesos master UI
• Grafana UI
• SSH on all machines

Access to these services can be restricted with a VPN or firewall rules that allow only specific subnets (sketched below).
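A minimal sketch of such firewall rules on a master node, assuming the operations subnet is 10.0.1.0/24 and the default ports for Chronos (4400), Mesos master (5050) and Grafana (3000):

# Hypothetical sketch: only the operations subnet reaches the admin UIs.
$ iptables -A INPUT -s 10.0.1.0/24 -p tcp -m multiport --dports 4400,5050,3000 -j ACCEPT
$ iptables -A INPUT -p tcp -m multiport --dports 4400,5050,3000 -j DROP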

Network Layer

External API callers need access to:
• Galaxy
• Docker Registry
• Prometheus

Access to these services can be restricted with firewall rules that allow only specific IP addresses.

Network Layer

[Diagram: externally exposed endpoints — the Galaxy REST API, the Docker Registry API (?) and the Prometheus API; behind them the usual stack: Mesos master, Chronos, Docker Registry, Prometheus, Grafana, Galaxy, Workers and NFS with LVM]

Application Layer

A case against web application firewalls:
• External
• Tentative APIs
• The Prometheus API has a query language that must be parsed

Also:
• Operations must be in full control of the WAF
• All tools are open source – there is no extra overhead if the WAF is configured by an independent party

A Special Case of Security

Someone might run arbitrary code on our platform:

• Bitcoin/Ethereum mining
• TOR exit node
• TCP proxy
• DoS attack
• Attack against the infrastructure

Do we need to mitigate these? Probably not (except for the last one). Each user is AAI-authenticated and each resource is accounted externally.

Thank you! Questions?