Azure Databricks Operator github.com/microsoft/azure-databricks-operator
Azadeh Khojandi Senior Software Engineer @AzadehKhojandi
Jordan Knight Principal Software Engineer @jakkaj
Agenda
Why, What, How
https://aka.ms/azure-databricks-operator
Why?
Highly-cohesive, loosely-coupled, highly-directional, complex-configuration, multi-component, multi-technology, multi-platform, high-scale, high-availability, low-latency, big-data system.
A.K.A. -> a stream processing pipeline
Pipelines are complex
• Streams
• Transforms
• Storage
• Security
• Dev, test, prod environments
• External dependencies (Databricks, PaaS services)
Pipelines are complex
• Highly cohesive system, loosely coupled components
• Relationships are king
• Configuration can get messy because it is based on relationships
• Especially when a component doesn’t exist at design time
• … like external PaaS services
Pipelines are complex
• Build a re-usable catalog of components
• Create many pipelines to suit particular use cases
• … easily (maybe even using a UI?!)
Problem: Building it
We had to build a pipeline system that:
• Has unknown component config and relationships at build time
• Is highly flexible and re-usable
• Has 1-n custom transforms
• Supports forks in the data
Idea: Central Configuration
• Single point of configuration for a pipeline
• All components, relationships, parameters defined in the same place
• Sounds like a DAG!
DAG as config
• Start with a DAG view of the system
• Can visualize the DAG, get it right
• DevOps takes the DAG and “makes it real”
• DAG is essentially converted to Helm Charts (using a special tool we’ve built)
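As a rough illustration of the "DAG as config" idea, the sketch below keeps every component and its upstream relationships in one place and derives an order in which they could be deployed. It does not reproduce the DAG-to-Helm tool mentioned on the slide; the component names, the `kind` values, and the `deployment_order` helper are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each component lists the components it reads from.
PIPELINE = {
    "ingest-stream": {"kind": "EventHub", "inputs": []},
    "clean": {"kind": "DatabricksJob", "inputs": ["ingest-stream"]},
    "enrich": {"kind": "DatabricksJob", "inputs": ["clean"]},
    "archive": {"kind": "Storage", "inputs": ["clean"]},  # a fork in the data
    "serve": {"kind": "Storage", "inputs": ["enrich"]},
}


def deployment_order(pipeline):
    """Return component names so that every component follows its inputs."""
    referenced = {dep for spec in pipeline.values() for dep in spec["inputs"]}
    unknown = referenced - set(pipeline)
    if unknown:
        raise ValueError(f"undefined components: {sorted(unknown)}")
    graph = {name: set(spec["inputs"]) for name, spec in pipeline.items()}
    # static_order() raises graphlib.CycleError (a ValueError) if the
    # configuration contains a cycle, i.e. it is not a DAG.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in deployment_order(PIPELINE):
        print(name, PIPELINE[name]["kind"])
```

Because relationships live in the same structure as the components, a missing dependency or an accidental cycle is caught before anything is deployed.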
Kubernetes to the rescue
• Create desired state configuration platform
• Flexible extension options (CRDs and operators!)
Custom components
• … like Databricks jobs, Event Hubs, etc.
• Operators to the rescue
Operators
• Allow us to create a re-usable module simply
• Are represented as first-class objects in the system
• Easy to represent in the DAG
• Easy to represent in Helm
• Easy to deploy, update and remove
• … are becoming well known; skills are building in the industry
What?
https://docs.microsoft.com/en-us/azure/azure-databricks
• Large-scale data processing
• Speed
• Scalable to petabytes of data and clusters of thousands of nodes
• Connectors to S3, Azure Blob Storage, Redshift, Kafka
• Security
• Unified platform and collaboration
[Diagram: Kubernetes cluster with a Master (API Server) and Node 1 … Node n, the Databricks Operator, and a Databricks Job resource defined in a YAML file (Cluster: …, Notebook Path: …, Params: …)]
https://github.com/Azure-Samples/twitter-databricks-analyzer-cicd
How? & Lessons learned
Kubebuilder
https://github.com/kubernetes-sigs/kubebuilder
Adding custom resources
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind.
Webhook admission controllers
• Mutating (set values, e.g. defaults)
• Validating (synchronous input validation)
Schema CRDs
• OpenAPI v3 schema definition
• Declarative validation (simple)
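To make the mutating/validating split concrete, here is a minimal sketch of the two steps as plain functions over a resource spec. It is not the operator's webhook code: the field names (`notebook_path`, `params`, `timeout_seconds`), the default values, and the `admit` helper are assumptions made for illustration. The ordering, mutation first and validation second, mirrors how Kubernetes runs admission.

```python
import copy

# Hypothetical defaults a mutating step might apply to a job spec.
DEFAULTS = {"params": {}, "timeout_seconds": 3600}


def mutate(spec):
    """Mutating step: return a copy of spec with missing fields defaulted."""
    out = copy.deepcopy(spec)
    for key, value in DEFAULTS.items():
        out.setdefault(key, copy.deepcopy(value))
    return out


def validate(spec):
    """Validating step: return error messages; an empty list means admit."""
    errors = []
    path = spec.get("notebook_path")
    if not isinstance(path, str) or not path.startswith("/"):
        errors.append("notebook_path must be an absolute path string")
    if not isinstance(spec.get("params"), dict):
        errors.append("params must be a mapping")
    timeout = spec.get("timeout_seconds")
    if type(timeout) is not int or timeout <= 0:
        errors.append("timeout_seconds must be a positive integer")
    return errors


def admit(spec):
    """Mutate first, then validate the mutated object."""
    mutated = mutate(spec)
    errors = validate(mutated)
    return (not errors, mutated, errors)


if __name__ == "__main__":
    print(admit({"notebook_path": "/Shared/etl"}))
    print(admit({"notebook_path": "relative", "timeout_seconds": 0}))
```

Simple structural rules like these can often be expressed declaratively in the CRD's OpenAPI v3 schema instead; a webhook is mainly worth it when defaults or checks depend on more than one field.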
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
Declarative API
A declarative API lets you declare (specify) the desired state of your resource; a controller then tries to keep the current state of Kubernetes objects in sync with that desired state.
Custom resource Custom controller
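The pairing of a custom resource with a custom controller can be sketched as a reconcile function: the resource holds the desired state, and the controller repeatedly closes the gap between it and what actually exists. This is a toy model, not the operator's implementation; a plain dictionary stands in for the external system (for example, jobs in a Databricks workspace), and the job names and fields are hypothetical.

```python
def reconcile(desired, current):
    """Make `current` match `desired` in place and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            current[name] = dict(spec)
            actions.append(("create", name))
        elif current[name] != spec:
            current[name] = dict(spec)
            actions.append(("update", name))
    for name in list(current):
        if name not in desired:
            del current[name]
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    # Desired state, as it might be declared in a custom resource.
    desired = {"etl": {"notebook_path": "/Shared/etl", "cluster": "small"}}
    # Current state of the (simulated) external system.
    current = {"old-job": {"notebook_path": "/Shared/old", "cluster": "small"}}

    print(reconcile(desired, current))  # first pass changes things
    print(reconcile(desired, current))  # second pass finds nothing to do
```

The key property is idempotence: once current and desired state agree, running the loop again takes no action, which is what lets a controller safely re-run on every change or restart.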
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
CRD
Being used for third-party extensions
• Add-ons
• Stateful orchestration (e.g. MySQL)
• Domain-specific APIs (e.g. Istio)
• Higher levels of abstraction (e.g. Knative)
More recently in Kubernetes
• Storage Snapshots (integrates w/ PersistentVolumes)
• RuntimeClass (integrates w/ Kubelet and CRI)
• CSI
https://speakerdeck.com/thockin/crds-arent-just-for-add-ons
Vision
Now: most/all new APIs are CRDs
Eventually: Everything becomes a CRD (except things to run CRDs)
There should be nothing that we can do that you can’t
• Built-in: Namespaces, CRDs, Admission, etc.
• CRDs: Pods, Services, Nodes, Deployments, ...
• Kubernetes is a set of operators
Tim Hockin @thockin, Principal Software Engineer
https://speakerdeck.com/thockin/crds-arent-just-for-add-ons
KubeCon NA, Seattle, 12/2018
Kind
https://kind.sigs.k8s.io/
Requires go (1.11+) and docker
GO111MODULE="on" go get sigs.k8s.io/kind@<version> && kind create cluster
Dev Containers
remote
- https://aka.ms/vscode
Thank You
Star our GitHub repo, raise your requests, or send a PR: https://github.com/microsoft/azure-databricks-operator
@AzadehKhojandi @jakkaj