Apache Airflow is an open source, distributed workflow management platform built for data orchestration. Maxime Beauchemin started the Airflow project at Airbnb, and after its success the project was adopted by the Apache Software Foundation: it entered the Apache Incubator in 2016 and became a top-level Apache project in 2019. Airflow lets users programmatically author, schedule, and monitor their data pipelines. A key feature of Airflow is that it makes it easy to build scheduled data pipelines using a flexible Python framework.
Introduction to DAG – Directed Acyclic Graph
The DAG is the essential component of Apache Airflow. DAG stands for Directed Acyclic Graph, and in Airflow a DAG is defined in a Python script that represents the DAG's structure (tasks and their dependencies) as code. The graph has nodes and edges; the edges are always directed, and the graph must not contain cycles. In other words, a DAG is a data pipeline. A node in a DAG is a task, such as downloading a file from S3, querying a MySQL database, or sending an email. We'll cover planning a DAG and the other basics step by step, and later in this article you'll develop your first workflow: a simple operator scheduled and monitored by Airflow.
Defining a DAG
A DAG is a collection of tasks organized to reflect their relationships and dependencies. One advantage of this model is that it provides a relatively simple technique for executing pipelines. Another benefit of task-based pipelines is that the work is partitioned cleanly into discrete tasks instead of relying on a single monolithic script for everything.
The acyclic property is essential: it keeps the graph simple and prevents tasks from getting caught in circular dependencies. Airflow takes advantage of the acyclic nature of DAGs to resolve and execute these task graphs efficiently.
Airflow DAGs Best Practices
Follow the best practices below when implementing an Airflow DAG on your system.
- Writing Clean DAGs
- Designing Reproducible Tasks
- Handling Data Efficiently
- Managing the Resources
1. Writing Clean DAGs
It's easy to get confused when creating an Airflow DAG. DAG code can quickly become unnecessarily complex or difficult to understand, especially when team members write their DAGs in vastly different programming styles.
- Use style conventions: Adopting a consistent, clean programming style and applying it uniformly across all your Apache Airflow DAGs is one of the first steps toward clean and consistent DAGs. The easiest way to make code more transparent and understandable is to follow commonly used style conventions.
- Centralized credential management: Because Airflow DAGs interact with many different systems, such as databases and cloud storage, they accumulate many kinds of credentials. Fortunately, retrieving connection data from the Airflow connection store makes it easy to keep credentials out of custom code (see the connection sketch after this list).
- Group tasks together using task groups: Complex Airflow DAGs can be challenging to understand because of the sheer number of operations involved. The task group feature introduced in Airflow 2 helps manage these complex systems: task groups divide tasks into smaller groups, making the DAG structure more manageable and easier to understand (see the task group sketch after this list).
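For the credential point above, here is a minimal sketch of reading connection data from the Airflow connection store rather than hard-coding secrets. It assumes an Airflow 2.x style import and a connection ID such as 'my_postgres' that you have already created in the Airflow UI or CLI:

from airflow.hooks.base import BaseHook

def get_db_uri():
    # Fetch a connection that was configured centrally in Airflow
    # ('my_postgres' is a placeholder connection ID).
    conn = BaseHook.get_connection('my_postgres')
    # Build a connection string without keeping credentials in the DAG code.
    return f'postgresql://{conn.login}:{conn.password}@{conn.host}:{conn.port}/{conn.schema}'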
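And for the task group point, a minimal Airflow 2 sketch (the DAG name, task IDs, and commands are placeholders) that collapses two cleaning tasks into a single node of the graph view:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

with DAG('grouped_dag', start_date=days_ago(1), schedule_interval='@daily') as dag:
    with TaskGroup(group_id='cleaning') as cleaning:
        # Both tasks appear under the collapsible 'cleaning' group in the UI.
        clean_a = BashOperator(task_id='clean_a', bash_command='echo clean a')
        clean_b = BashOperator(task_id='clean_b', bash_command='echo clean b')
    report = BashOperator(task_id='report', bash_command='echo report')
    # The whole group must finish before the report task runs.
    cleaning >> report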
2. Designing Reproducible Tasks
Aside from writing good DAG code, one of the most demanding parts of creating a successful DAG is making tasks reproducible. This means the user can rerun a task and get the same result, even if the task runs at a different time.
- Tasks must always be idempotent: Idempotence is one of the most important properties of a good Airflow task. It ensures consistency and resilience in the face of failures: no matter how often you run an idempotent task, the result is always the same (see the sketch after this list).
- Task results should be deterministic: To build reproducible tasks and DAGs, tasks should be deterministic. A deterministic task always returns the same output for the same input.
- Design tasks using a functional paradigm: It's easier to create reproducible tasks using a functional programming paradigm. Functional programming means writing programs that treat computation primarily as the application of mathematical functions while avoiding mutable data and changing state.
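As an illustration of these three points, the sketch below is a hypothetical task callable (the process_sales name and the paths are made up) that reads and writes files named after the run's logical date and overwrites its output, so rerunning it for the same date is idempotent and its result depends only on its input:

from pathlib import Path

def process_sales(ds, input_dir='/data/raw', output_dir='/data/clean'):
    # `ds` is the run's logical date as a YYYY-MM-DD string (Airflow's {{ ds }} template).
    source = Path(input_dir) / f'sales_{ds}.csv'
    target = Path(output_dir) / f'sales_{ds}_clean.csv'
    # Pure transformation of the input: drop empty lines.
    rows = [line for line in source.read_text().splitlines() if line.strip()]
    # Overwrite rather than append, so reruns for the same date produce the same file.
    target.write_text('\n'.join(rows))

Such a callable could then be wired into the DAG with a PythonOperator, passing the run's date in through its arguments (for example via the {{ ds }} template).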
3. Handling Data Efficiently
Airflow DAGs that process large amounts of data should be carefully designed to be as efficient as possible.
- Limit the data processed: The most effective approach to data management is to limit processing to the minimum amount of data necessary to achieve the intended result. This includes thoroughly examining the data sources and assessing whether each of them is actually required.
- Incremental processing: The main idea behind incremental processing is to split the data into (time-based) ranges and have each DAG run process one range independently. Users can take further advantage of incremental processing by running filter/aggregate steps early in the process and performing the heavier analysis on the reduced output (see the sketch after this list).
- Don't store data on your local file system: When working with data in Airflow, it's tempting to write intermediate data to the local file system. However, Airflow runs multiple tasks in parallel, possibly on different workers, so downstream tasks may not be able to access that data. The easiest way around this problem is to use shared storage that all Airflow workers can access when performing tasks simultaneously.
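A minimal sketch of the incremental-processing idea, assuming a placeholder export.py script and output path; the {{ ds }} template is rendered by Airflow to each run's logical date, so every DAG run handles exactly one daily partition:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG('incremental_dag', start_date=days_ago(3), schedule_interval='@daily') as dag:
    # Each run processes only its own date's partition, so a backfill of
    # N days becomes N small, independent jobs instead of one huge one.
    export_daily_partition = BashOperator(
        task_id='export_daily_partition',
        bash_command='python export.py --date {{ ds }} --output /data/aggregates/{{ ds }}/',
    )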
4. Managing the Resources
Dealing with large volumes of data can overburden the Airflow cluster, so properly managing resources helps reduce this burden.
- Manage concurrency using pools: When running many processes in parallel, numerous tasks may need access to the same resource. Airflow uses resource pools to regulate how many tasks can access a given resource; each pool has a fixed number of slots that grant access to the associated resource (see the sketch after this list).
- Detect long-running tasks with SLAs and alerts: Airflow's SLA (Service Level Agreement) mechanism lets users track task performance. A user can assign an SLA timeout to a DAG or to individual tasks, and if even one task takes longer than the specified SLA timeout, Airflow notifies the user and records the SLA miss.
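A hedged sketch combining both ideas on one task; it assumes a pool named 'db_pool' that you have created beforehand in the Airflow UI or CLI, and a placeholder script name:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG('resource_managed_dag', start_date=days_ago(1), schedule_interval='@daily') as dag:
    heavy_query = BashOperator(
        task_id='heavy_query',
        bash_command='python run_heavy_query.py',
        pool='db_pool',              # only as many parallel runs as the pool has slots
        sla=timedelta(minutes=30),   # Airflow records an SLA miss if the task is not done in time
    )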
Creating your first DAG
In Airflow, a DAG is a Python script containing a set of tasks and their dependencies. What each task does is determined by the task's operator. For example, defining a task with a PythonOperator means that the task consists of running Python code.
To create our first DAG, let’s first start by importing the necessary modules:
# We start by importing the DAG object
from airflow import DAG
# We need to import the operators we will use in our tasks
from airflow.operators.bash_operator import BashOperator
# Then the days_ago function, plus timedelta for the retry delay and schedule interval
from airflow.utils.dates import days_ago
from datetime import timedelta
Then you can define a dictionary containing the default arguments you want to pass to the DAG. These arguments are then applied to all operators in the DAG.
# Initialize the default arguments passed to the DAG
default_args = {
    'owner': 'airflow',
    'start_date': days_ago(5),
    'email': ['airflow@my_first_dag.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
As you can see, Airflow provides several standard arguments that make DAG configuration even more accessible. For example, you can easily define the number of retries and the retry delay for DAG execution.
Then you can define the DAG itself, like so:
my_first_dag = DAG(
    'first_dag',
    default_args=default_args,
    description='First DAG',
    schedule_interval=timedelta(days=1),
)
The first parameter, first_dag, is the ID of the DAG, and schedule_interval defines the interval between two consecutive executions of the DAG.
The next step is to define the DAG’s tasks:
task_1 = BashOperator(
    task_id='first_task',
    bash_command='echo 1',
    dag=my_first_dag,
)

task_2 = BashOperator(
    task_id='second_task',
    bash_command='echo 2',
    dag=my_first_dag,
)
Finally, we need to specify the dependencies. In this case task_2 must run after task_1:
task_1.set_downstream(task_2)
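Equivalently, you can express the same dependency with Airflow's bitshift syntax:

task_1 >> task_2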
Running the DAG
By default, DAGs live in the ~/airflow/dags folder. After first testing the individual tasks with the 'airflow test' command to make sure everything is configured correctly, you can run the DAG for a specific date range with the 'airflow backfill' command:
airflow backfill my_first_dag -s 2020-03-01 -e 2020-03-05
Finally, start the Airflow scheduler with the 'airflow scheduler' command, and Airflow will make sure the DAG runs at the defined interval.
Conclusion
As we saw today, Apache Airflow makes it quite simple to implement essential ETL pipelines. An Apache Airflow workflow is a DAG that clearly defines tasks and their dependencies, and this article covered the basics of DAGs, the best practices for designing them, and how to create and run a simple DAG of our own.
Many large organizations today rely on Airflow to coordinate critical data processes, and integrating data from Airflow and other data sources into your cloud data warehouse or other destination for further business analysis is essential.
Upcoming articles will cover operators in more depth, along with concepts such as XCOM (passing data between tasks), variables, and branching, and will build a more complex workflow.