PyCon Lithuania 2024 · April 5, 2024 · Vilnius
Write-Audit-Publish Pattern in Modern Data Pipelines
Your pipeline ran successfully. The data looks wrong. Sound familiar? The Write-Audit-Publish pattern is a simple idea: write data to a staging area, run quality checks on it, and only then promote it to where consumers can see it. If the checks fail, the bad data never reaches production.
This talk covers why the pattern exists, how to implement it yourself with SQL, and how Snowflake Dynamic Tables and Apache Iceberg handle it natively.
A Brief History of "Trust Me, the Data Is Fine"
ETL was the original pattern: extract data, transform it, load it into the warehouse. Then ELT flipped the order because cloud warehouses made it cheap to load raw data first and transform later. Along the way, databases gave us constraints (NOT NULL, UNIQUE, FOREIGN KEY) to catch bad data at write time.
But here is the thing. Most modern data warehouses do not enforce constraints. Snowflake accepts constraint syntax, but apart from NOT NULL the constraints are informational only. BigQuery lets you declare primary and foreign keys, but only as NOT ENFORCED. You can declare a primary key and still insert duplicate rows. The safety net looks real, but it has holes.
So the question becomes: if the database will not protect you, who will?
Write, Audit, Publish
The pattern is three steps. Write: load your data into a staging area. Not the production table. A separate location where nobody is querying it yet. Audit: run your quality checks. Row counts, null checks, schema validation, business rules, whatever matters for this dataset. Publish: if the checks pass, promote the data to the production table. If they fail, stop. Alert someone. Do not let bad data through.
Simple concept. The tricky part is implementing it in a way that does not double your pipeline complexity.
The DIY Approach
Building WAP yourself with SQL: write to a staging table, run audit queries against it (row counts match expectations, no nulls in required columns, no duplicate keys, values within expected ranges), and if all checks pass, swap the staging table into production using a table rename or INSERT INTO ... SELECT.
You can orchestrate this in Airflow, Dagster, or any scheduler. The audit step is just a set of SQL queries that return pass/fail. If any query fails, the DAG stops. The production table stays untouched.
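The DIY flow above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database as a stand-in for the warehouse; the table names, audit rules, and sample rows are hypothetical, and in a real pipeline each step would be a task in your scheduler:

```python
import sqlite3

# In-memory database stands in for the warehouse; all names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")          # production
conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")  # staging

# Write: load the new batch into staging, never into production directly.
conn.executemany("INSERT INTO orders_staging VALUES (?, ?)",
                 [(1, 9.99), (2, 24.50), (3, 5.00)])

# Audit: each query counts bad rows; anything above zero fails the batch.
audits = {
    "no null keys": "SELECT COUNT(*) FROM orders_staging WHERE id IS NULL",
    "no duplicate keys": """SELECT COUNT(*) FROM (
        SELECT id FROM orders_staging GROUP BY id HAVING COUNT(*) > 1)""",
    "no negative amounts": "SELECT COUNT(*) FROM orders_staging WHERE amount < 0",
}
failures = [name for name, sql in audits.items()
            if conn.execute(sql).fetchone()[0] > 0]

# Publish: promote only if every audit passed; otherwise production stays untouched.
if not failures:
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")
else:
    raise RuntimeError(f"Audit failed, batch not published: {failures}")
```

The same shape maps directly onto a DAG: one task writes, one task runs the audit queries and raises on failure, one task publishes.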
It works. It is not fancy. But it works, and you control every part of it.
Snowflake Dynamic Tables and Streams
Snowflake gives you tools that make WAP easier without building everything from scratch. Dynamic Tables automatically refresh when upstream data changes. Streams track changes to a table (inserts, updates, deletes) so you can process only the delta.
Combine these and you get a pattern where data lands in a raw table, a Stream captures the new rows, a Dynamic Table applies transformations, and your audit logic runs before the data is visible to consumers. Snowflake handles the orchestration of refreshes. You handle the quality rules. A good middle ground between fully DIY and fully managed.
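As a rough sketch of that setup, the statements below show the Stream and Dynamic Table DDL an orchestrator task might run, collected as Python strings. The object names (`raw_events`, `clean_events`, `etl_wh`) and the audit rule are hypothetical, not from the talk:

```python
# Hypothetical Snowflake DDL for the raw table -> Stream -> Dynamic Table flow.
SNOWFLAKE_SETUP = [
    # A Stream tracks inserts, updates, and deletes on the raw landing table.
    "CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events",
    # A Dynamic Table re-materializes the transformation as upstream data changes.
    """CREATE OR REPLACE DYNAMIC TABLE clean_events
         TARGET_LAG = '5 minutes'
         WAREHOUSE = etl_wh
       AS SELECT id, event_type, amount
          FROM raw_events
          WHERE id IS NOT NULL""",
]

# The audit step stays yours: queries that gate what consumers see,
# each expected to return zero bad rows.
AUDIT_QUERIES = {
    "no duplicate ids": """SELECT COUNT(*) FROM (
        SELECT id FROM clean_events GROUP BY id HAVING COUNT(*) > 1)""",
}
```

Snowflake schedules the refreshes based on `TARGET_LAG`; the audit queries are the part you still write and wire into your pipeline.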
Apache Iceberg: Branch, Audit, Merge
Apache Iceberg supports table branching, similar to how git works with code. You create a branch of your table, write new data to the branch, run audits against the branch, and if everything passes, merge it into the main branch. If audits fail, drop the branch. The main table never sees the bad data.
This is the cleanest implementation of WAP available today. The branch is isolated. Consumers querying the main branch see consistent data throughout. And you get time travel for free, so you can always go back to a previous state if something slips through.
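The branch-audit-merge cycle can be sketched as the Spark SQL statements a pipeline would issue against an Iceberg table. The catalog, table, and branch names (`cat`, `db.orders`, `etl_audit`) are illustrative:

```python
# Hypothetical Spark SQL for Iceberg's branch-audit-merge flow.
WAP_STEPS = [
    # Write: create an audit branch and load the new batch into it, not into main.
    "ALTER TABLE cat.db.orders CREATE BRANCH etl_audit",
    "INSERT INTO cat.db.orders.branch_etl_audit SELECT * FROM staging_batch",
    # Audit: query the branch directly; main is untouched while checks run.
    "SELECT COUNT(*) FROM cat.db.orders.branch_etl_audit WHERE id IS NULL",
    # Publish: fast-forward main to the audited branch if every check passes...
    "CALL cat.system.fast_forward('db.orders', 'main', 'etl_audit')",
    # ...or drop the branch if they fail; main never sees the bad data.
    "ALTER TABLE cat.db.orders DROP BRANCH etl_audit",
]
```

Whether the pipeline runs the fast-forward or the drop depends on the audit result; either way, readers of the main branch see a consistent table the whole time.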
The downside: you need to be running Iceberg. If you are on a traditional data warehouse without Iceberg support, this option is not available to you.
Key takeaways
- Most warehouse constraints are informational, not enforced. Do not rely on them to catch bad data.
- Write-Audit-Publish is a simple pattern: stage data, check it, promote it only if it passes.
- You can build WAP yourself with SQL and any orchestrator. It does not require special tooling.
- Snowflake Dynamic Tables and Streams make WAP easier if you are already on Snowflake.
- Apache Iceberg's branching model is the cleanest WAP implementation available, but requires Iceberg.
- The pattern costs almost nothing to implement and saves you from the "the data looked fine when I loaded it" conversation.
Bad data that never reaches production is infinitely cheaper to fix than bad data that does.