airflow by apache

Workflow orchestration platform for programmatic scheduling and monitoring

Created 11 years ago
45,517 stars

Top 0.7% on SourcePulse

Project Summary

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It targets engineers and power users who need data pipelines and other task sequences to be maintainable, versionable, testable, and collaborative. Because workflows are defined as code, they can be reviewed, tested, and version-controlled like any other software, which reduces the risk of errors in production.

How It Works

Airflow represents workflows as Directed Acyclic Graphs (DAGs), written in Python. The core components are a scheduler that triggers and monitors tasks, a metadata database, and a web-based user interface. Workers execute the tasks in a DAG, and a task runs only after the tasks it depends on have finished. Airflow encourages idempotent tasks, so that re-running a task produces the same result. Its XCom feature passes small amounts of metadata between tasks; heavy data processing should be delegated to external systems.
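
The dependency ordering and metadata passing described above can be sketched without Airflow itself. The following is a minimal illustration using only the Python standard library; it is not Airflow's API. The task names, the `TASKS` and `DEPENDENCIES` mappings, the `run_pipeline` helper, and the file path are all hypothetical, and the plain `xcom` dictionary is only a stand-in for the XCom feature.

```python
from graphlib import TopologicalSorter


def extract(xcom):
    # Return only small metadata (a hypothetical path and a row count),
    # not the data itself, mirroring how XCom is meant to be used.
    return {"path": "/tmp/raw.csv", "rows": 3}


def transform(xcom):
    # Read the metadata published by the upstream task.
    upstream = xcom["extract"]
    return {"rows": upstream["rows"]}


def load(xcom):
    return {"loaded_rows": xcom["transform"]["rows"]}


# Hypothetical pipeline: task id -> callable.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Task id -> set of task ids that must finish first.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline():
    """Run every task after its dependencies; return the order and the metadata."""
    xcom = {}   # stand-in for XCom: small values keyed by task id
    order = []
    for task_id in TopologicalSorter(DEPENDENCIES).static_order():
        xcom[task_id] = TASKS[task_id](xcom)
        order.append(task_id)
    return order, xcom


if __name__ == "__main__":
    order, xcom = run_pipeline()
    print(order)
    print(xcom["load"])
```

Each task here returns the same value for the same inputs, so re-running the pipeline gives the same result, which is the idempotency property the text recommends. A real scheduler would also persist state and handle retries, which this sketch omits.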

Quick Start & Requirements

Installation is primarily supported via pip install apache-airflow, with official Docker images also available. For repeatable installations, users should pin dependencies with the project's published constraint files. Key requirements include:

  • Python: 3.10 or higher (specific versions vary by Airflow release).
  • Databases: PostgreSQL (14+ recommended), MySQL (8.0+), or SQLite (for testing only).
  • Kubernetes: Support for versions 1.26+ (depending on Airflow version).
  • Operating System: POSIX-compliant systems (Linux, macOS) are recommended. Windows users can utilize WSL2 or Linux Containers. Production environments must be Linux-based.
  • Documentation: Official documentation is available at https://airflow.apache.org/docs/.
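
A constraint-based install, as mentioned above, pairs a pinned Airflow version with a constraint file matching the local Python version. The sketch below builds that pip command, assuming the constraint files follow the URL pattern published in the Airflow installation documentation. The helper names `constraints_url` and `pip_install_args` are hypothetical, and the version numbers are placeholders chosen for illustration, not a recommendation.

```python
import sys


def constraints_url(airflow_version, python_version=None):
    """Build the constraint-file URL for an Airflow and Python version pair."""
    if python_version is None:
        # Default to the major.minor version of the running interpreter.
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_args(airflow_version, python_version=None):
    """Return the pip command as an argument list; nothing is executed here."""
    return [
        "pip",
        "install",
        f"apache-airflow=={airflow_version}",
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]


if __name__ == "__main__":
    # Placeholder versions for illustration only.
    print(" ".join(pip_install_args("3.0.0", "3.10")))
```

The function only assembles the command, so it can be inspected or logged before being run in a shell or passed to `subprocess.run`.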

Highlighted Details

  • Workflows defined as code enable dynamic generation and parameterization.
  • A rich UI provides visualization of DAGs, task statuses, and historical runs.
  • The framework is extensible: it ships a wide array of built-in operators, and task parameters can be customized with Jinja templating.
  • While not a streaming solution, Airflow can process real-time data in batches.

Maintenance & Community

As an Apache Software Foundation project, Airflow benefits from a strong community-driven development model. Contributions are managed via a detailed process, including agent-assisted PR management. The project lists approximately 500 known organizational adopters and receives sponsorship for its CI infrastructure. Community interaction is facilitated through official documentation, chat channels, and community information pages.

Licensing & Compatibility

Apache Airflow is distributed under the Apache License 2.0, which generally permits commercial use and integration into closed-source projects with standard attribution requirements. Production environments are officially supported on Linux-based operating systems.

Limitations & Caveats

Airflow is not designed for real-time streaming workloads, although it can process streaming data in batches. Native Windows support is not a high priority; Windows users need workarounds such as WSL2 or Linux Containers. SQLite is explicitly not recommended for production use, and MariaDB is neither tested nor recommended.

Health Check

  • Last Commit: 4 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1,346
  • Issues (30d): 239

Star History

392 stars in the last 30 days
