
stop paying for free software with your Mondays.

Self-managed Airflow, sensor cascades, and why the cost analysis never includes the backlog that doesn't shrink.

April 28, 2025

The senior engineer on my team had the entire DAG dependency graph memorized. Every upstream sensor. Every downstream dependency. Every pipeline that would cascade red if one specific table was late on Tuesday morning.

I thought that was impressive. It is not impressive. It is a warning sign.

That knowledge should live in the system. When it lives in a person, that person is the single point of failure for your entire data platform. And they are asleep at 2am when the sensor times out.

Here is the thing about self-managed Airflow that nobody puts in the cost analysis. It is free to deploy. It is not free to operate. Every DAG you add is another file the scheduler parses on every heartbeat. Every pipeline you build is another row accumulating state in a Postgres metadata database that you will tune, capacity-plan, and eventually crisis-manage. The workers, the webserver, Redis for the Celery executor, the upgrade path from one major version to the next — all of that is yours. You own it. It does not appear on a line item. It appears in the backlog of projects that never got built because your team was doing something else.

"But managed services like MWAA take care of the infrastructu—" They take care of some of it. They do not take care of the limitations. MWAA runs months behind the latest Airflow release. You cannot use the KubePodOperator natively. You are locked to the Celery executor. When you want to upgrade from one Airflow version to another, you provision a new environment and migrate your existing installation over. There is no turnkey upgrade. There is a project you did not budget for.

The sensor cascade is the failure mode everyone who has run Airflow at scale knows. You connect DAGs to each other with sensors. One sensor times out. Everything downstream refuses to run. The blast radius is invisible unless you have the graph memorized. You clear the tasks in the right sequence, you trigger things in the right order, you spend three hours on a Monday morning being a manually operated restart button. I did this more times than I want to admit. The last time I did it, I moved to Astro within six weeks.
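If you have not lived this, the coupling looks roughly like the sketch below. Illustrative only: the DAG names, task names, and timeout are invented, and it is written against the Airflow 2 sensor API, not copied from anything I ran.

```python
# Illustrative only: the cross-DAG sensor pattern that cascades.
# All DAG and task names here are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_reporting",
    start_date=datetime(2025, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
):
    # Blocks until the upstream DAG's load task succeeds for the
    # matching logical date. If the table is late on Tuesday, this
    # times out -- and every DAG sensing *this* DAG fails in turn.
    wait_for_orders = ExternalTaskSensor(
        task_id="wait_for_orders_load",
        external_dag_id="upstream_orders",
        external_task_id="load_orders_table",
        timeout=60 * 60,       # give up after an hour
        mode="reschedule",     # free the worker slot while waiting
    )

    build_report = EmptyOperator(task_id="build_report")

    wait_for_orders >> build_report
```

Multiply that by every pair of connected DAGs and you have the graph the senior engineer memorized.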

Airflow 3 is the actual reason. Not the managed infrastructure. Not the support. Airflow 3.

Ten years of Airflow and this is the most significant architectural change in the project's history. Asset-based scheduling. DAG versioning. Remote execution so tasks run in your infrastructure instead of the platform's workers. Backfills that work without tribal knowledge. An architecture that decouples task execution from the metadata database so the database stops being the bottleneck.

I wanted to be on it when it shipped. I did not want to be waiting for MWAA to validate the release eight months later.

I switched to asset scheduling and the sensor cascade stopped being my problem. Each DAG declares the assets it produces. Downstream runs when the assets update. Not when a sensor decides to check. The failure mode changed structurally, not because I got better at operating sensors.
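Roughly what that declaration looks like, sketched against the Airflow 3 Asset API. A sketch, not my production code; every name is invented.

```python
# Illustrative Airflow 3 asset scheduling; names are hypothetical.
from datetime import datetime

from airflow.sdk import DAG, Asset, task

orders_table = Asset("warehouse_orders")

# Producer: declares the asset it updates via `outlets`.
with DAG(
    dag_id="upstream_orders",
    start_date=datetime(2025, 1, 1),
    schedule="0 5 * * *",
    catchup=False,
):
    @task(outlets=[orders_table])
    def load_orders_table():
        ...  # load the table; Airflow records the asset event on success

    load_orders_table()

# Consumer: runs when the asset updates. No sensor, no polling,
# no timeout to cascade.
with DAG(
    dag_id="downstream_reporting",
    start_date=datetime(2025, 1, 1),
    schedule=[orders_table],
    catchup=False,
):
    @task
    def build_report():
        ...

    build_report()
```

The dependency is now data the scheduler owns, not a poll loop a human has to babysit.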

What I did not expect: I lost my visibility. The sensor failures had been, accidentally, my blast-radius monitoring system. Less red, but I no longer knew what was downstream of any given failure. So I built a Control DAG. A DAG to monitor all the other DAGs. Airflow observing Airflow. This is as unhinged as it sounds and it worked perfectly.
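For the curious, a reconstruction of the shape it took. Not the original code: it polls the stable REST API for failed runs instead of reading the metadata database, and the endpoint, credentials, and alerting are all assumptions stubbed in to keep the sketch self-contained.

```python
# A "Control DAG" sketch: Airflow observing Airflow. Hypothetical
# names throughout; written against the Airflow 2 stable REST API
# (/api/v1) with an assumed basic-auth monitoring user.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.decorators import task

AIRFLOW_API = "http://localhost:8080/api/v1"  # assumed endpoint

with DAG(
    dag_id="control_dag",
    start_date=datetime(2025, 1, 1),
    schedule=timedelta(minutes=15),
    catchup=False,
):
    @task
    def find_failed_runs() -> list[dict]:
        # "~" means all DAGs: list recent failed runs platform-wide.
        resp = requests.get(
            f"{AIRFLOW_API}/dags/~/dagRuns",
            params={"state": "failed", "order_by": "-execution_date", "limit": 50},
            auth=("monitor", "***"),  # assumed basic-auth user
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["dag_runs"]

    @task
    def alert(failed: list[dict]):
        # Replace with Slack/PagerDuty; printing keeps the sketch runnable.
        for run in failed:
            print(f"FAILED: {run['dag_id']} run {run['dag_run_id']}")

    alert(find_failed_runs())
```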

Astro Observe replaced it. Task-level lineage. Downstream impact visible immediately. AI-powered root cause analysis. SLA monitoring without standing up Prometheus separately.

This is going to sound like a pitch. It is a pitch. It is also what I actually run. I know that is exactly what someone pitching something would say. I do not have a way around that. The Monday morning I described stopped happening when I moved. I cannot make that sound neutral.

The support is different too. When you have a scheduler problem on Astro, the person who answers sometimes has commits in the scheduler itself. That is not a guarantee. It happens enough to matter. The people who built the thing maintain the thing. With MWAA you file a ticket and receive a link to documentation that exists because people like you filed the same ticket before you.

89% of Airflow users in the 2026 State of Airflow report expect to use it for revenue-generating solutions this year. The orchestration layer is becoming the AI layer. The pipelines that feed models, the context that makes AI work in production — data engineers are building the architecture that the next five years run on.

The question is how much of that time is spent building versus how much is spent maintaining the platform they are supposed to be building on.

Most teams pay more in infrastructure hours than they would have paid in a subscription. They never add it up. They just have a backlog that does not shrink.

the senior engineer who had the dependency graph memorized left the company.

we had two bad mondays before we figured out where everything was.

i do not have anyone with the graph memorized now.

it lives in the system.

that is how it should have been from the start.

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.