PyData Meetup Vilnius · November 28, 2024 · Vilnius

Data Lineage: Where "It Depends" Finally Gets an Answer

Everyone agrees data lineage is important. Nobody agrees on how to actually get it. This talk walks through the real options: dbt's built-in lineage, Airflow's OpenLineage integration, Dagster's asset-based approach, and the DIY path using SQLGlot. Each one has tradeoffs, and I have opinions about all of them.

Download slides (PDF)

The Lineage Market Is Real Money

The data lineage market sits at around $1.5 to $2 billion. Companies are paying serious money for tools that answer "where did this number come from?" and "what breaks if I change this table?" As data ecosystems get more complex, the need for visibility keeps growing.

So the question is not whether you need lineage. It is how you get it without overcomplicating your stack or blowing your budget.

dbt: Lineage That Comes Free (Until It Doesn't)

If you already use dbt, you get lineage for free. The DAG view shows you which models depend on which. For smaller projects, it might be all you need.

But the UI gets messy once your project grows. Column-level lineage, impact analysis, and anything beyond the basic graph requires dbt Cloud. The pattern is familiar: the basic feature is free, the useful features cost money.

Also: dbt lineage only covers your transformation layer. It does not know about your ingestion pipelines, your BI tools, or anything outside the dbt project.

Airflow + OpenLineage: Promising, With Caveats

OpenLineage is an open standard for collecting lineage metadata, and Airflow has an integration for it. In theory, this gives you lineage across your entire orchestration layer, not just transformations.

In practice: not all operators have OpenLineage integrations. If you wrote custom operators (and who hasn't), you need to implement the lineage extraction yourself. The SQL parsing does not handle non-standard queries well. Column-level lineage is not ready yet. Combine dynamic tasks with SQL and things get complicated fast.

It is improving, but it is not a plug-and-play solution today.

Dagster: Built-in, But at a Cost

Dagster's asset-based model means lineage is baked into how you define your pipelines. Every asset knows its upstream and downstream dependencies. This is genuinely nice if you are starting fresh.

The community is still relatively small compared to Airflow. It requires a real mindset shift from task-based orchestration. And the advanced lineage features follow the same pattern as everyone else: extra features, extra money.

DIY with SQLGlot

SQLGlot is a SQL parser that understands 24+ dialects, has zero dependencies, and is pure Python. You give it a SQL query, it gives you back an Abstract Syntax Tree (AST) that you can walk to extract which tables are read from and which are written to. That is lineage.

The talk includes a walkthrough of increasing complexity. Start with a simple SELECT. Then add joins. Then CTEs. Then subqueries. Then the real world shows up: SELECT *, dynamic SQL, and queries that were clearly written at 2am. Each step adds complexity to the parsing logic, but SQLGlot handles most of it.

If you use dbt, compile your models first and then parse the resolved SQL. The full pipeline: extract SQL, parse with SQLGlot, push results to a database, add an API on top. Not a polished product. But it is yours, and it does not come with a monthly bill.

Key takeaways

"It depends" is the honest answer to "how should I do lineage?" But now you know what it depends on.
dbt lineage works for the transformation layer. Beyond that, you need something else.
OpenLineage is promising but not production-ready for all use cases.
Dagster gives you lineage for free if you buy into the asset model.
SQLGlot lets you build lineage yourself if you are willing to maintain it.
Every vendor eventually charges for the features you actually need. Plan accordingly.