
Azure Databricks Operator
github.com/microsoft/azure-databricks-operator

Azadeh Khojandi, Senior Software Engineer, @AzadehKhojandi
Jordan Knight, Principal Software Engineer, @jakkaj

Agenda
Why, What, How
https://aka.ms/azure-databricks-operator

Why?
Highly-cohesive, loosely-coupled, highly-directional, complex-configuration, multi-component, multi-technology, multi-platform, high-scale, high-availability, low-latency, big-data system.

A.K.A -> a stream processing pipeline

Pipelines are complex
• Streams
• Transforms
• Storage
• Security
• Dev, test, prod environments
• External dependencies (Databricks, PaaS services)

Pipelines are complex
• Highly cohesive system, loosely coupled components
• Relationship is king
• Configuration can get messy, is based on relationships
• Especially when it doesn’t exist at design time
• … like external PaaS services etc.

Pipelines are complex
• Build a re-usable catalog of components
• Create many pipelines to suit particular use cases
• … easily (maybe even using a UI?!)

Problem: Building it
We had to build a pipeline system that:
• Has unknown component config and relationships at build time
• Is highly flexible and re-usable
• Has 1-n custom transforms
• Forks in data

Idea: Central Configuration
• Single point of configuration for a pipeline
• All components, relationships, parameters defined in the same place
• Sounds like a DAG!
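
A minimal sketch of the central-configuration idea, assuming the pipeline is written down as a plain dictionary of components and their upstream inputs. The component names, kinds, and parameters are hypothetical and not taken from the talk; the sketch only shows how a single DAG definition can yield a safe deployment order and reveal forks in the data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: every component, its parameters and its upstream
# relationships are declared in one place. All names are illustrative only.
PIPELINE = {
    "ingest-stream": {"kind": "EventHub", "params": {"partitions": 4}, "inputs": []},
    "clean": {"kind": "DatabricksJob", "params": {"notebook": "/clean"}, "inputs": ["ingest-stream"]},
    "enrich": {"kind": "DatabricksJob", "params": {"notebook": "/enrich"}, "inputs": ["clean"]},
    "archive": {"kind": "Storage", "params": {"container": "raw"}, "inputs": ["clean"]},
}


def deployment_order(pipeline):
    """Return component names so that each one comes after all of its inputs."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(spec["inputs"]) for name, spec in pipeline.items()}
    # static_order() raises graphlib.CycleError if the config is not a DAG.
    return list(TopologicalSorter(graph).static_order())


def downstream_of(pipeline, name):
    """Return the components that consume `name` directly (a fork if more than one)."""
    return sorted(n for n, spec in pipeline.items() if name in spec["inputs"])


if __name__ == "__main__":
    for component in deployment_order(PIPELINE):
        print(component, PIPELINE[component]["kind"])
    print("fork after 'clean':", downstream_of(PIPELINE, "clean"))
```

Keeping relationships in the same structure as the parameters means a cyclic or dangling configuration can be rejected before anything is deployed.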

DAG as config
• Start with a DAG view of the system
• Can visualize the DAG, get it right
• DevOps takes the DAG and “makes it real”
• DAG is essentially converted to Helm Charts (using a special tool we’ve built)

Kubernetes to the rescue
• Create a desired-state configuration platform
• Flexible extension options (CRDs and operators!)

Custom components
• … like Databricks jobs, EventHubs etc.
• Operators to the rescue

Operators
• Allow us to create a re-usable module simply
• Are represented as first-class objects in the system
• Easy to represent in the DAG
• Easy to represent in Helm
• Easy to deploy, update and remove
• … are becoming well known; skills are building in the industry

What?
https://docs.microsoft.com/en-us/azure/azure-databricks
• Large-scale data processing
• Speed
• Scalable to petabytes of data and clusters of thousands of nodes
• Connectors to S3, Azure Blob Storage, Redshift, Kafka
• Security
• Unified platform and collaboration

[Diagram: Kubernetes cluster with a master (API server) and nodes 1 … n running the Databricks Operator; a YAML file defines a Databricks Job (Cluster: …, Notebook Path: …, Params: …)]
https://github.com/Azure-Samples/twitter-databricks-analyzer-cicd

How? & Lessons learned

Kubebuilder
https://github.com/kubernetes-sigs/kubebuilder

Adding custom resources
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind.

Webhook admission controllers
• Mutating (set values, e.g. defaults)
• Validating (synchronous input validation)

Schema CRDs
• OpenAPI v3 schema definition
• Declarative validation (simple)
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Declarative API
A declarative API lets you declare the desired state of your resource; a controller then tries to keep the current state of Kubernetes objects in sync with that desired state.
Custom resource + custom controller
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

CRDs: being used for third-party extensions
• Add-ons
• Stateful orchestration (e.g. MySQL)
• Domain-specific APIs (e.g. Istio)
• Higher levels of abstraction (e.g. Knative)
More recently in Kubernetes
• Storage Snapshots (integrates w/ PersistentVolumes)
• RuntimeClass (integrates w/ Kubelet and CRI)
• CSI
https://speakerdeck.com/thockin/crds-arent-just-for-add-ons

Vision
Now: most/all new APIs are CRDs
Eventually: everything becomes a CRD (except things to run CRDs)
There should be nothing that we can do that you can’t
• Built-in: Namespaces, CRDs, Admission, etc.
• CRDs: Pods, Services, Nodes, Deployments, ...
• Kubernetes is a set of operators
Tim Hockin, @thockin, Principal Software Engineer
KubeCon NA, Seattle, 12/2018
https://speakerdeck.com/thockin/crds-arent-just-for-add-ons

Kind
https://kind.sigs.k8s.io/
Requires go (1.11+) and docker
GO111MODULE="on" go get sigs.k8s.io/kind@<version> && kind create cluster

Dev Containers
https://aka.ms/vscode-remote

Thank You
Star our GitHub repo, raise your requests, or send a PR
https://github.com/microsoft/azure-databricks-operator
@AzadehKhojandi @jakkaj
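
As a rough illustration of the "Declarative API" slide (a custom resource declares desired state, a custom controller keeps current state in sync with it), here is a minimal in-memory sketch. It is not the operator's implementation, which is built with Kubebuilder; the spec keys `cluster`, `notebook_path`, and `params` are assumptions loosely modelled on the fields named in the diagram, and no Kubernetes or Databricks API is called.

```python
def reconcile(desired, current):
    """Return the actions needed to move `current` towards `desired`.

    Both arguments map a job name to its spec (a plain dict). This is a
    hypothetical stand-in for a controller's reconcile step.
    """
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions, desired, current):
    """Apply the actions to a copy of `current` and return the new state."""
    new_state = dict(current)
    for verb, name in actions:
        if verb == "delete":
            del new_state[name]
        else:  # "create" and "update" both copy the desired spec
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    # Hypothetical desired state, as it might be declared in a YAML file.
    desired = {
        "tweets-job": {"cluster": "small", "notebook_path": "/tweets", "params": {"lang": "en"}},
        "report-job": {"cluster": "small", "notebook_path": "/report", "params": {}},
    }
    # Hypothetical current state observed by the controller.
    current = {
        "tweets-job": {"cluster": "small", "notebook_path": "/tweets", "params": {"lang": "fr"}},
        "old-job": {"cluster": "large", "notebook_path": "/old", "params": {}},
    }
    actions = reconcile(desired, current)
    print(actions)
    current = apply_actions(actions, desired, current)
    print(reconcile(desired, current))  # nothing left to do once states match
```

The point of the pattern is that the loop is idempotent: running it again once the states match produces no further actions.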