
Unique keys are not optional in analytical incrementals

Incremental analytical models need an explicit notion of row identity. Without it, merges drift, updates go missing, and review of correctness turns into guesswork.

By Ivan Richter

Last updated: Mar 24, 2026



The rule

Analytical incrementals need an explicit unique key.

If we can’t say what one row represents and how that row should be matched on a later run, we don’t have a safe incremental model. We have a table that’s hoping append logic will somehow behave like state management.

That’s part of the broader reason we prefer structured, reviewable modeling over ad hoc transformation code. Once shared data products start changing over time, structure stops being aesthetic and starts being operational.

Row identity is the contract

A unique key isn’t just something the warehouse needs for a merge statement. It’s the contract that explains what one row means.

If a row represents one order, the key should identify that order. If it represents one order line, the key should identify that order line. If it represents one user-day state, the key should identify that user and that day. The key should match the grain of the model instead of being a random technical convenience somebody pasted in to make the merge compile.
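A minimal Python sketch of that idea, with illustrative column names: the declared key columns simply mirror the grain of each model, so the identity tuple for a row is readable straight from the model's intent.

```python
# Hypothetical rows at three different grains; all names are illustrative.
order_row      = {"order_id": "o-17", "total": 99.0}
order_line_row = {"order_id": "o-17", "line_no": 2, "sku": "A-1"}
user_day_row   = {"user_id": "u-3", "as_of_date": "2026-03-01", "status": "active"}

def row_key(row, key_columns):
    """Build the identity tuple for a row from its declared key columns."""
    return tuple(row[c] for c in key_columns)

# The key columns match the grain of each model, not a technical convenience.
assert row_key(order_row, ["order_id"]) == ("o-17",)
assert row_key(order_line_row, ["order_id", "line_no"]) == ("o-17", 2)
assert row_key(user_day_row, ["user_id", "as_of_date"]) == ("u-3", "2026-03-01")
```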

Without that contract, incremental behavior gets vague fast. Which row gets updated? Which row should survive? What counts as a correction versus a new fact? Reviewers can’t answer those questions from the model because the model never made identity explicit in the first place.

A merge only works if the key is honest

People talk about “doing a merge” like the merge itself solved the hard part. It didn’t.

A merge is only as sound as the identity rule behind it. If the key doesn’t actually represent the analytical entity in the table, the merge just applies confidence to the wrong boundary. The system still updates rows. It just isn’t clear that it’s updating the right ones.

That’s why weak keys are so dangerous. They make the model look more disciplined than it is. The SQL seems formal. The materialization seems intentional. Meanwhile the table is drifting because the identity rule was never strong enough to support the behavior people expected from it.
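A toy upsert makes the failure concrete. This is a sketch, not any warehouse's actual merge implementation: with an honest key at the order-line grain, both lines survive; with a key that is too coarse, the merge still "works" and silently overwrites a sibling row.

```python
def merge(target, batch, key_columns):
    """Toy upsert: match batch rows into target on the declared key."""
    for row in batch:
        target[tuple(row[c] for c in key_columns)] = row
    return target

# Two lines of the same order (illustrative data).
batch = [
    {"order_id": "o-1", "line_no": 1, "qty": 2},
    {"order_id": "o-1", "line_no": 2, "qty": 5},
]

# Honest key at the order-line grain: one row per line survives.
by_line = merge({}, batch, ["order_id", "line_no"])
assert len(by_line) == 2

# Too-coarse key: the second line silently overwrites the first.
by_order = merge({}, batch, ["order_id"])
assert len(by_order) == 1
assert by_order[("o-1",)]["line_no"] == 2
```

Nothing errors in the second case, which is exactly the danger: the system updates rows, just not the right ones.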

Late change still needs a path

A unique key is necessary, but it isn’t enough by itself.

If source records can change after first arrival, the model also needs a reliable way to notice those changes. Late-arriving dimensions, corrected statuses, refunded orders, changed child entities, and source replays do not care that you wrote a merge. They only care whether the incremental logic knows which existing rows need another pass.

That’s why change detection matters. A key tells you how to match. Change detection tells you when to revisit. You need both.

Without that second piece, the model might preserve identity cleanly and still miss the moments where an old row stopped being current.
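One common shape for that second piece is a watermark plus a lookback window. The sketch below assumes the source exposes a trustworthy `updated_at` column, which is itself an assumption worth checking; the lookback re-selects recent rows so late corrections get another pass.

```python
from datetime import datetime, timedelta

def rows_to_reprocess(source_rows, high_watermark, lookback=timedelta(days=3)):
    """Select rows changed since the watermark, minus a lookback window
    that also re-picks recent rows in case of late corrections."""
    cutoff = high_watermark - lookback
    return [r for r in source_rows if r["updated_at"] > cutoff]

rows = [
    {"id": 1, "updated_at": datetime(2026, 3, 10)},  # old, already settled
    {"id": 2, "updated_at": datetime(2026, 3, 22)},  # inside the lookback window
    {"id": 3, "updated_at": datetime(2026, 3, 24)},  # genuinely new
]
picked = rows_to_reprocess(rows, high_watermark=datetime(2026, 3, 23))
assert [r["id"] for r in picked] == [2, 3]
```

The key then decides what happens to each re-picked row: it matches back to the existing row instead of creating a duplicate.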

Weak identity creates ugly failure modes

A lot of ugly incremental behavior starts with pretending a key is close enough.

Maybe the model keys on order_id even though one order can fan out to multiple rows. Maybe it keys on a synthetic hash that changes whenever a non-essential field moves. Maybe it appends blindly because nobody could agree what should count as the same record. None of those are harmless shortcuts. They’re how teams end up with duplicated facts, partial updates, and tables that quietly get less trustworthy over time.
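The hash case is worth seeing concretely. In this sketch (column names are illustrative), hashing a volatile field makes the "same" record look brand new on every run, so the merge inserts duplicates instead of updating; hashing only the identity columns keeps the key stable.

```python
import hashlib

def hash_key(row, columns):
    """Derive a surrogate key by hashing the listed columns."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode()).hexdigest()

# The same order line observed on two runs; only a volatile field moved.
row_v1 = {"order_id": "o-9", "line_no": 1, "last_seen_at": "2026-03-01"}
row_v2 = {"order_id": "o-9", "line_no": 1, "last_seen_at": "2026-03-02"}

# Hashing the volatile field makes the record look new every run ...
volatile = ["order_id", "line_no", "last_seen_at"]
assert hash_key(row_v1, volatile) != hash_key(row_v2, volatile)

# ... while hashing only the identity columns keeps the key stable.
stable = ["order_id", "line_no"]
assert hash_key(row_v1, stable) == hash_key(row_v2, stable)
```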

By the time people notice, the symptom is usually stale rows. The warehouse still returns answers. The answers just get increasingly wrong around the edges where real systems actually change.

The grain has to make the key obvious

The right key starts with the right grain.

We try to model around the business entity or decision the row is supposed to represent, not around whatever cleanup residue happened to fall out of staging. That’s the same stance behind decision boundaries. If a row exists only because a landing step split, flattened, or reformatted something in a convenient way, the key is often compensating for the wrong model shape.

A healthy key usually feels obvious once the model intent is clear. When it doesn’t, that usually means the table still hasn’t decided what one row is supposed to mean.

What we do when there is no real key

If the source doesn’t provide a clean identifier, we don’t wave the problem away and call the model incremental anyway.

Sometimes we build a stable composite key. Sometimes we remodel the grain. Sometimes we decide the transformation hasn’t earned an incremental path yet. Those options may be inconvenient, but they’re still cheaper than carrying a table whose identity rules are fuzzy and only seem to work while nobody looks too closely.
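When we do build a composite key, we also verify it actually holds the grain before trusting a merge to it. A minimal sketch, with made-up column names standing in for whatever stable business columns the source provides:

```python
from collections import Counter

def composite_key(row, columns):
    """Build a composite key from stable business columns."""
    return tuple(row[c] for c in columns)

def duplicate_keys(rows, columns):
    """Return any key values that appear more than once in the batch.
    A non-empty result means the proposed key does not match the grain."""
    counts = Counter(composite_key(r, columns) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"source": "shop", "external_ref": "A1", "event_date": "2026-03-01"},
    {"source": "shop", "external_ref": "A1", "event_date": "2026-03-02"},
]

# (source, external_ref) alone collapses two distinct facts ...
assert duplicate_keys(rows, ["source", "external_ref"]) != {}
# ... adding event_date yields a key that matches the grain.
assert duplicate_keys(rows, ["source", "external_ref", "event_date"]) == {}
```

A check like this belongs in the model's tests, so the identity contract fails loudly instead of drifting quietly.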

The worst option is pretending ambiguity is acceptable because the table “is mostly append-only.” That’s usually how a temporary shortcut gets promoted into platform behavior.

Why this becomes a review problem

A weak or missing key doesn’t just create bad data. It makes the model harder to review.

Once identity is fuzzy, reviewers can’t inspect incremental behavior with confidence because the most basic question stays unresolved. What exactly is being updated over time? If the answer is hand-wavy, the rest of the logic becomes hand-wavy too. Merge rules, replacement logic, late change handling, and stale-row prevention all end up sitting on top of a boundary the model never made explicit.

That’s why unique keys are not a warehouse formality. They’re part of what makes the model legible.

The point

Unique keys are not optional because incrementals are not just about processing less data. They’re about updating the right record on purpose.

If a model doesn’t have a real identity contract, it hasn’t earned an incremental path yet.
