
Azure Databricks Operator
https://github.com/microsoft/azure-databricks-operator

Azadeh Khojandi Senior Software Engineer @AzadehKhojandi

Jordan Knight Principal Software Engineer @jakkaj

Agenda

Why, What, How

https://aka.ms/azure-databricks-operator

@AzadehKhojandi @jakkaj

Why?

Highly-cohesive, loosely-coupled, highly-directional, complex-configuration, multi-component, multi-technology, multi-platform, high-scale, high-availability, low-latency, big-data system.

A.K.A. a stream processing pipeline

Pipelines are complex

• Streams
• Transforms
• Storage
• Security
• Dev, test, prod environments
• External dependencies (Databricks, PaaS services)

Pipelines are complex

• Highly cohesive system, loosely coupled components
• Relationship is king
• Configuration can get messy and is based on relationships
• Especially when a component doesn’t exist at design time
• … like external PaaS services etc.

Pipelines are complex

• Build a re-usable catalog of components
• Create many pipelines to suit particular use cases
• … easily (maybe even using a UI?!)

Problem: Building it

We had to build a pipeline system that:
• Has unknown component config and relationships at build time
• Is highly flexible and re-usable
• Has 1-n custom transforms
• Has forks in the data

Idea: Central Configuration

• Single point of configuration for a pipeline
• All components, relationships, parameters defined in the same place
• Sounds like a DAG!

DAG as config

• Start with a DAG view of the system
• Can visualize the DAG, get it right
• DevOps takes the DAG and “makes it real”
• DAG is essentially converted to Helm Charts (using a special tool we’ve built)
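A minimal sketch of the "DAG as config" idea, assuming Python 3.9+ for the standard-library graphlib module. The component names are hypothetical, and this is not the conversion tool mentioned above; it only illustrates how one central DAG definition can yield a valid deployment order and expose the forks in the data.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a component, each value is the set of
# components it depends on (its upstream nodes in the DAG).
PIPELINE = {
    "eventhub-in": set(),
    "transform-clean": {"eventhub-in"},
    "transform-enrich": {"transform-clean"},
    "storage-raw": {"transform-clean"},       # fork in the data
    "storage-curated": {"transform-enrich"},
}


def deployment_order(dag):
    """Return components so that every dependency precedes its dependents."""
    # TopologicalSorter raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(dag).static_order())


def downstream(dag, node):
    """Return the components that directly consume the output of `node`."""
    return sorted(name for name, deps in dag.items() if node in deps)


if __name__ == "__main__":
    print(deployment_order(PIPELINE))
    print(downstream(PIPELINE, "transform-clean"))
```

Because every relationship lives in one structure, a cycle or a missing component shows up when the DAG is checked rather than at deploy time.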

Kubernetes to the rescue

• Create desired state configuration platform
• Flexible extension options (CRDs and operators!)

Custom components

• … like Databricks jobs, Event Hubs etc.
• Operators to the rescue

Operators

• Allow us to create a re-usable module simply
• Represented as a first-class object in the system
• Easy to represent in the DAG
• Easy to represent in Helm
• Easy to deploy, update and remove
• … are becoming well known, with skills building in the industry

What?

https://docs.microsoft.com/en-us/azure/azure-databricks

• Large-scale data processing
• Speed
• Scalable to petabytes of data and clusters of thousands of nodes
• Connectors to S3, Azure Blob Storage, Kafka
• Security
• Unified platform and collaboration

[Diagram: Kubernetes cluster with a master (API server) and nodes 1 … n; a YAML file, the Databricks Operator, and a Databricks job (Cluster: …, Notebook Path: …, Params: …)]
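A rough sketch of what a manifest carrying the three fields from the diagram (cluster, notebook path, params) could look like when built in code. The apiVersion, kind, and spec field names below are hypothetical placeholders, not the operator's real schema; check the repository for the actual resource definition.

```python
import json


def databricks_job_manifest(name, cluster, notebook_path, params):
    """Build a custom-resource manifest as a plain dict.

    apiVersion, kind, and the spec field names are hypothetical placeholders,
    not the operator's real schema.
    """
    return {
        "apiVersion": "example.com/v1alpha1",   # hypothetical group/version
        "kind": "DatabricksJob",                # hypothetical kind
        "metadata": {"name": name},
        "spec": {
            "cluster": cluster,
            "notebookPath": notebook_path,
            "params": dict(params),
        },
    }


if __name__ == "__main__":
    manifest = databricks_job_manifest(
        name="sample-job",
        cluster={"numWorkers": 2},              # hypothetical cluster settings
        notebook_path="/Shared/sample",         # hypothetical notebook path
        params={"input": "tweets"},
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The point of the shape is that the job becomes a declared object with a name and a spec, which is what lets it sit in a Helm chart next to the rest of the pipeline.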

https://github.com/Azure-Samples/twitter-databricks-analyzer-cicd

How? & Lessons learned

Kubebuilder
https://github.com/kubernetes-sigs/kubebuilder

Adding custom resources

A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind

Webhook admission controllers
• Mutating (set values, e.g. defaults)
• Validating (synchronous input validation)

CRD schema
• OpenAPI v3 schema definition
• Declarative validation (simple)
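The two webhook roles can be illustrated with a small sketch in plain Python. This is not Kubebuilder's webhook API; the field names (notebookPath, timeoutSeconds, params) and the default value are assumptions made up for the example. It only shows the split of duties: mutation fills in defaults, validation accepts or rejects the result.

```python
import copy

DEFAULT_TIMEOUT_SECONDS = 3600  # hypothetical default value


def mutate(spec):
    """Mutating step: return a copy of the spec with defaults filled in."""
    result = copy.deepcopy(spec)
    result.setdefault("timeoutSeconds", DEFAULT_TIMEOUT_SECONDS)
    result.setdefault("params", {})
    return result


def validate(spec):
    """Validating step: return a list of error messages (empty means admit)."""
    errors = []
    path = spec.get("notebookPath")
    if not isinstance(path, str) or not path.startswith("/"):
        errors.append("notebookPath must be an absolute path string")
    timeout = spec.get("timeoutSeconds")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        errors.append("timeoutSeconds must be a positive integer")
    return errors


def admit(spec):
    """Run mutation first, then validation on the mutated object."""
    mutated = mutate(spec)
    return mutated, validate(mutated)


if __name__ == "__main__":
    print(admit({"notebookPath": "/Shared/sample"}))
    print(admit({"notebookPath": "relative/path", "timeoutSeconds": 0}))
```

Simple range and type checks like these are the kind that can often live in the OpenAPI v3 schema instead; a webhook earns its keep when a rule needs logic the schema cannot express.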

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Declarative API

A declarative API lets you declare the desired state of your resource; a controller then tries to keep the current state of the Kubernetes objects in sync with that desired state.
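The desired-state idea can be sketched as a tiny reconcile function. This is an in-memory illustration only, with hypothetical resource names, and not the operator's actual controller code: it compares the declared state with the observed state and returns the actions needed to close the gap.

```python
def reconcile(desired, actual):
    """Compute the actions that move `actual` toward `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(actions, desired, actual):
    """Apply the actions to a copy of the actual state and return it."""
    state = dict(actual)
    for verb, name in actions:
        if verb == "delete":
            del state[name]
        else:
            state[name] = desired[name]
    return state


if __name__ == "__main__":
    desired = {"job-a": {"workers": 2}, "job-b": {"workers": 1}}
    actual = {"job-a": {"workers": 1}, "job-c": {"workers": 4}}
    actions = reconcile(desired, actual)
    print(actions)
    print(apply(actions, desired, actual) == desired)
```

Running reconcile again once the states match yields no actions, which is the property that makes the loop safe to repeat.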

Custom resource Custom controller

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

CRD

Being used for third-party extensions
• Add-ons
• Stateful orchestration (e.g. MySQL)
• Domain-specific (e.g. Istio)
• Higher levels of abstraction (e.g. Knative)

More recently in Kubernetes
• Storage Snapshots (integrates w/ PersistentVolumes)
• RuntimeClass (integrates w/ Kubelet and CRI)
• CSI

https://speakerdeck.com/thockin/crds-arent-just-for-add-ons

Vision

Now: most/all new APIs are CRDs
Eventually: Everything becomes a CRD (except things to run CRDs)
There should be nothing that we can do that you can’t

• Built-in: Namespaces, CRDs, Admission, etc.
• CRDs: Pods, Services, Nodes, Deployments, ...
• Kubernetes is a set of operators

Tim Hockin @thockin
Principal Software Engineer
https://speakerdeck.com/thockin/crds-arent-just-for-add-ons
KubeCon NA, Seattle, 12/2018

Kind
https://kind.sigs.k8s.io/

With go (1.11+):

GO111MODULE="on" go get sigs.k8s.io/kind@<version> && kind create cluster

Dev Containers

remote

- https://aka.ms/vscode

Thank You

Star our GitHub repo, raise your requests, or send a PR:
https://github.com/microsoft/azure-databricks-operator

@AzadehKhojandi @jakkaj