Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it straightforward to set up and operate end-to-end data pipelines in the cloud.
Organizations use Amazon MWAA to enhance their business workflows. For example, C2i Genomics uses Amazon MWAA in their data platform to orchestrate the validation of algorithms processing cancer genomics data in billions of records. Twitch, a live streaming platform, manages and orchestrates the training and deployment of its recommendation models for over 140 million active users. They use Amazon MWAA to scale, while significantly improving security and reducing infrastructure management overhead.
Today, we are announcing the availability of Apache Airflow version 2.8.1 environments on Amazon MWAA. In this post, we walk you through some of the new features and capabilities of Airflow now available in Amazon MWAA, and how you can set up or upgrade your Amazon MWAA environment to version 2.8.1.
Object storage
As data pipelines scale, engineers struggle to manage storage across multiple systems with unique APIs, authentication methods, and conventions for accessing data, requiring custom logic and storage-specific operators. Airflow now offers a unified object storage abstraction layer that handles these details, letting engineers focus on their data pipelines. Airflow object storage uses fsspec to enable consistent data access code across different object storage systems, thereby streamlining infrastructure complexity.
The following are some of the feature's key benefits:
- Portable workflows – You can switch storage services with minimal changes to your Directed Acyclic Graphs (DAGs)
- Efficient data transfers – You can stream data instead of loading it into memory
- Reduced maintenance – You don't need separate operators, making your pipelines straightforward to maintain
- Familiar programming experience – You can use Python modules, like shutil, for file operations
To use object storage with Amazon Simple Storage Service (Amazon S3), you need to install the package extra s3fs with the Amazon provider (apache-airflow-providers-amazon[s3fs]==x.x.x).
In the sample code below, you can see how to move data directly from Google Cloud Storage to Amazon S3. Because Airflow's object storage uses shutil.copyfileobj, the objects' data is read in chunks from gcs_data_source and streamed to amazon_s3_data_target.
For more information on Airflow object storage, refer to Object Storage.
XCom UI
XCom (cross-communications) allows for the passing of data between tasks, facilitating communication and coordination between them. Previously, developers had to switch to a different view to see XComs related to a task. With Airflow 2.8, XCom key-values are rendered directly on a tab within the Airflow Grid view, as shown in the following screenshot.
The new XCom tab provides the following benefits:
- Improved XCom visibility – A dedicated tab in the UI provides a convenient and user-friendly way to see all XComs associated with a DAG or task.
- Improved debugging – Being able to see XCom values directly in the UI is helpful for debugging DAGs. You can quickly see the output of upstream tasks without needing to manually pull and inspect them using Python code.
Activity context logger
Managing task lifecycles is crucial for the smooth operation of data pipelines in Airflow. However, certain challenges have persisted, particularly in scenarios where tasks are unexpectedly stopped. This can occur for various reasons, including scheduler timeouts, zombie tasks (tasks that remain in a running state without sending heartbeats), or instances where the worker runs out of memory.
Traditionally, such failures, particularly those triggered by core Airflow components like the scheduler or executor, weren't recorded in the task logs. This limitation required users to troubleshoot outside the Airflow UI, complicating the process of pinpointing and resolving issues.
Airflow 2.8 introduced a significant improvement that addresses this problem. Airflow components, including the scheduler and executor, can now use the new TaskContextLogger to forward error messages directly to the task logs. This feature allows you to see all the relevant error messages related to a task's run in one place. This simplifies the process of figuring out why a task failed, offering a complete perspective of what went wrong within a single log view.
The following screenshot shows how the task is detected as zombie, and the scheduler log is included as part of the task log.
You need to set the environment configuration parameter enable_task_context_logger to True to enable the feature. Once it's enabled, Airflow can send logs from the scheduler, the executor, or callback run context to the task logs, and make them available in the Airflow UI.
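In Airflow's configuration this parameter lives in the logging section, so in airflow.cfg form it looks like the following (in the Amazon MWAA console, the equivalent is adding the configuration option logging.enable_task_context_logger with the value True on your environment):

```ini
[logging]
enable_task_context_logger = True
```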
Listener hooks for datasets
Datasets were introduced in Airflow 2.4 as a logical grouping of data sources to create data-aware scheduling and dependencies between DAGs. For example, you can schedule a consumer DAG to run when a producer DAG updates a dataset. Listeners enable Airflow users to create subscriptions to certain events happening in the environment. In Airflow 2.8, listeners are added for two dataset events: on_dataset_created and on_dataset_changed, effectively allowing Airflow users to write custom code to react to dataset management operations. For example, you can trigger an external system, or send a notification.
Using listener hooks for datasets is straightforward. Complete the following steps to create a listener for on_dataset_changed:

- Create the listener (dataset_listener.py):
- Create a plugin to register the listener in your Airflow environment (dataset_listener_plugin.py):
For more information on how to install plugins in Amazon MWAA, refer to Installing custom plugins.
Set up a new Airflow 2.8.1 environment in Amazon MWAA
You can initiate the setup in your account and preferred Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you're adopting infrastructure as code (IaC), you can automate the setup using AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.
Upon successful creation of an Airflow version 2.8.1 environment in Amazon MWAA, certain packages are automatically installed on the scheduler and worker nodes. For a complete list of installed packages and their versions, refer to Apache Airflow provider packages installed on Amazon MWAA environments. You can install additional packages using a requirements file.
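For example, a requirements.txt entry adding the s3fs extra mentioned earlier could look like the following (the version is a placeholder; pin it against the constraints for your environment's Airflow version):

```text
# Pin against the constraints file for your Airflow version
apache-airflow-providers-amazon[s3fs]==x.x.x
```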
Upgrade from older versions of Airflow to version 2.8.1
You can take advantage of these latest capabilities by upgrading your older Airflow version 2.x-based environments to version 2.8.1 using in-place version upgrades. To learn more about in-place version upgrades, refer to Upgrading the Apache Airflow version or Introducing in-place version upgrades with Amazon MWAA.
Conclusion
In this post, we discussed some important features introduced in Airflow version 2.8, such as object storage, the new XCom tab added to the grid view, task context logging, listener hooks for datasets, and how you can start using them. We also provided some sample code to show implementations in Amazon MWAA. For the complete list of changes, refer to Airflow's release notes.
For more details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.
Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Authors
Mansi Bhutada is an ISV Solutions Architect based in the Netherlands. She helps customers design and implement well-architected solutions in AWS that address their business problems. She is passionate about data analytics and networking. Beyond work, she enjoys experimenting with food, playing pickleball, and diving into fun board games.
Hernan Garcia is a Senior Solutions Architect at AWS based in the Netherlands. He works in the financial services industry, supporting enterprises in their cloud adoption. He is passionate about serverless technologies, security, and compliance. He enjoys spending time with family and friends, and trying out new dishes from different cuisines.