Apache Airflow overview Viktor Kotliar

Data Knowledge Catalog Meeting (TPU/NRC KI), Thursday 23 Jul 2020, 11:00 → 12:00 Europe/Moscow

Introduction

● Apache Airflow is an open-source workflow management platform.

● It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows.

● From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a Top-Level Apache Software Foundation project in January 2019.

● written in Python, and workflows are created via Python scripts

● designed under the principle of "configuration as code"

https://github.com/apache/airflow
https://airflow.apache.org/

2 Principles

● Airflow is a platform to programmatically author, schedule and monitor workflows.

● Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks.

● The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

● Rich command line utilities make performing complex surgeries on DAGs a snap.

● The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

● When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

3 Principles 2

● Airflow is not a data streaming solution.

● Tasks do not move data from one to the other (though tasks can exchange metadata!)

● Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as slightly more dynamic than a database structure would be.

● Airflow workflows are expected to look similar from one run to the next; this allows for clarity around the unit of work and continuity.

4 Concepts DAG

● In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

● A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.

5 Concepts Task

● A Task defines a unit of work within a DAG; it is represented as a node in the DAG graph, and it is written in Python.

● Each task is an implementation of an Operator, for example a PythonOperator to execute some Python code, or a BashOperator to run a Bash command.

● The task implements an operator by defining specific values for that operator, such as a Python callable in the case of PythonOperator or a Bash command in the case of BashOperator (a second sketch follows the example below).

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = DummyOperator(task_id='task_1')
    task_2 = DummyOperator(task_id='task_2')
    task_1 >> task_2  # Define dependencies: task_2 runs after task_1
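As a further illustration of operator-specific values, here is a minimal sketch assuming Airflow 1.10-style imports; the DAG id, task ids, callable and command are illustrative, not from the talk. A PythonOperator is parameterized with a python_callable, a BashOperator with a bash_command.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def greet():
    # The "specific value" for PythonOperator: an arbitrary Python callable.
    print('Hello from a PythonOperator')


with DAG('operator_examples', start_date=datetime(2016, 1, 1),
         schedule_interval='@daily') as dag:
    say_hello = PythonOperator(task_id='say_hello', python_callable=greet)
    # The "specific value" for BashOperator: a Bash command to run.
    list_files = BashOperator(task_id='list_files', bash_command='ls -l')
    say_hello >> list_files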

6 Concepts Task

7 Concepts Operator

● While DAGs describe how to run a workflow, Operators determine what actually gets done by a task.

● An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently. In fact, they may run on two completely different machines.

● If two operators need to share information, like a filename or a small amount of data, you should consider combining them into a single operator. If it absolutely can't be avoided, Airflow does have a feature for operator cross-communication called XCom (a sketch follows this list).

● Airflow provides operators for many common tasks, including:
– BashOperator - executes a bash command
– PythonOperator - calls an arbitrary Python function
– EmailOperator - sends an email
– SimpleHttpOperator - sends an HTTP request
– MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
– Sensor - an Operator that waits (polls) for a certain time, file, database row, S3 key, etc.
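A minimal sketch of XCom cross-communication, assuming the Airflow 1.10-style PythonOperator API; the DAG id, task ids and the pushed filename are purely illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def push_filename(**context):
    # Push a small piece of metadata for downstream tasks.
    context['ti'].xcom_push(key='filename', value='/tmp/report.csv')


def pull_filename(**context):
    # Pull the value published by the upstream task 'produce'.
    filename = context['ti'].xcom_pull(task_ids='produce', key='filename')
    print('Will process %s' % filename)


with DAG('xcom_example', start_date=datetime(2020, 7, 1),
         schedule_interval=None) as dag:
    produce = PythonOperator(task_id='produce', python_callable=push_filename,
                             provide_context=True)
    consume = PythonOperator(task_id='consume', python_callable=pull_filename,
                             provide_context=True)
    produce >> consume

Note that XCom values are stored in the Airflow metadata database and are meant for small pieces of metadata, not for moving datasets between tasks.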

8 Concepts Executor

● Executors are the mechanism by which task instances get run.

● Airflow has support for various executors. The executor currently in use is determined by the executor option in the core section of the configuration file (a configuration sketch follows this list).

● Sequential Executor ( This executor will only run one task instance at a time. )

● Debug Executor (The DebugExecutor is meant as a debug tool and can be used from IDE)

● Local Executor (LocalExecutor runs tasks by spawning processes in a controlled fashion in different modes)

● Dask Executor (allows you to run Airflow tasks in a Dask Distributed cluster.)

● Celery Executor (is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …))

● Kubernetes Executor (will create a new pod for every task instance.)

● Scaling Out with Mesos (community contributed) - runs Airflow tasks either directly on Mesos slaves, requiring each Mesos slave to have Airflow installed and configured, or inside a Docker container that has Airflow installed, which is run on a Mesos slave.
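As a minimal sketch of where the executor choice lives (assuming the standard Airflow configuration mechanism), the value can be inspected programmatically; it is normally set in the [core] section of airflow.cfg or through the AIRFLOW__CORE__EXECUTOR environment variable.

# Minimal sketch: read which executor this Airflow installation is configured
# to use, e.g. SequentialExecutor, LocalExecutor, CeleryExecutor, ...
from airflow.configuration import conf

print(conf.get('core', 'executor'))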

9 Architecture

● Workers - Execute the assigned tasks

● Scheduler - Responsible for adding the necessary tasks to the queue

● Web server - HTTP Server provides access to DAG/task status information

● Database - Contains information about the status of tasks, DAGs, Variables, connections, etc.

● Celery - Queue mechanism. Please note that the Celery queue consists of two components (a configuration sketch follows this list):

● Broker - Stores commands for execution

● Result backend - Stores status of completed commands
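A minimal sketch of the two Celery components as configuration values, assuming the CeleryExecutor and the standard [celery] section of airflow.cfg.

# Minimal sketch: the broker and result backend used by the CeleryExecutor
# are two separate settings in the [celery] section of airflow.cfg.
from airflow.configuration import conf

print(conf.get('celery', 'broker_url'))      # broker, e.g. a RabbitMQ/Redis URL
print(conf.get('celery', 'result_backend'))  # result backend, e.g. a database URL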

10 Pros and cons

● a small but complete toolkit for creating and managing data processing workflows: three kinds of operators (sensors, handlers and transfers), a launch schedule for each chain of tasks, logging of failures

● a graphical web interface for building data pipelines, which provides a relatively low entry barrier to the technology; the user can visually track the data life cycle in chains of linked tasks represented as a directed acyclic graph

● an extensible REST API, which makes it relatively easy to integrate Airflow into an existing corporate IT landscape and to configure data pipelines flexibly, for example by passing POST parameters to a DAG (a sketch follows this list)

● program code in Python

● integration with many data sources and services

● its own metadata repository

● scalability, thanks to the modular architecture and a message queue, for an unlimited number of DAGs
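As an illustration of passing POST parameters to a DAG, here is a sketch against the experimental REST API of Airflow 1.10; the host, DAG id and payload are hypothetical, and API authentication is assumed to be disabled.

import requests

# Trigger a DAG run and pass parameters; inside the DAG they are available
# as dag_run.conf.
response = requests.post(
    'http://airflow-host:8080/api/experimental/dags/my_dag/dag_runs',
    json={'conf': {'report_date': '2020-07-23'}},
)
response.raise_for_status()
print(response.json())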

11 Pros and cons

● implicit dependencies during installation

● large overhead (a delay of 5–10 seconds) for queuing DAGs and prioritizing tasks at launch

● after-the-fact notifications of failures in the data pipeline: in the Airflow interface, logs appear only after a job, for example a Spark job, has finished. Monitoring how a data pipeline is executing in real time therefore has to be done elsewhere, for example in the YARN web interface. This is exactly the way of working with Spark and Airflow adopted at the online cinema IVI
