
Physical vs logical storage: a dataset classification rule for SMEs

Physical versus logical storage billing is not a warehouse philosophy debate. It is a dataset classification choice based on change rate, retention behavior, and how much storage churn the table creates.

By Ivan Richter

Last updated: Mar 29, 2026

4 min read


Treat storage billing as classification, not doctrine

Physical versus logical storage is one of those BigQuery decisions that gets framed far too dramatically. It isn’t a statement about maturity, taste, or warehouse philosophy. It’s a dataset classification choice. The only useful question is which billing model matches the way the data actually behaves.

It belongs under cost guardrails, not under performance tuning or architectural symbolism. Different datasets create different patterns of storage churn, retained history, and rewrite pressure. The billing model should follow that behavior. If it doesn’t, the warehouse ends up with a neat explanation and a messier bill.

Classify by change pattern, not by ownership or sentiment

The first split that matters is how the data changes. Append-heavy raw zones, mutation-heavy curated datasets, and overwrite-heavy staging layers do not produce the same storage economics, even when they sit in the same platform and support the same downstream reporting.

That becomes clearer once the classification stays close to the table behavior instead of drifting toward org-chart logic. A landing zone shaped by streaming-first ingestion is not the same storage problem as a curated layer that gets updated in place, and neither of those is the same as staging data that is routinely replaced. Put all three under one billing assumption and the decision is already wrong before the invoice arrives.

The dataset class should be explicit. Not because the warehouse needs more labels, but because billing choices age badly when the data behavior behind them stays vague.
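One way to make the class explicit is to derive it from observable signals rather than assert it. The sketch below is a hypothetical illustration, not BigQuery guidance: the function name, inputs, and thresholds are all invented for this example, and real warehouses would tune them against their own churn data.

```python
# Hypothetical sketch: derive a dataset class from two observable signals.
# The thresholds below are illustrative assumptions, not recommendations.

def classify_dataset(rewrite_ratio: float, mutated_fraction: float) -> str:
    """rewrite_ratio: bytes rewritten per month divided by active bytes.
    mutated_fraction: share of rows updated or deleted in place per month."""
    if rewrite_ratio >= 1.0:
        # The table is effectively replaced wholesale each month.
        return "staging_overwrite_heavy"
    if mutated_fraction > 0.05:
        # Updated in place, but slowly enough to stay a curated layer.
        return "curated_slowly_changing"
    # Mostly grows; in-place churn is negligible.
    return "raw_append_only"

print(classify_dataset(0.02, 0.0))  # raw_append_only
```

The point of writing it down, even this crudely, is that the classification becomes something the warehouse can check, not just something someone once believed.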

Retention and time travel are part of the cost model

Storage billing discussions often fixate on the active table surface and underweight historical behavior. That is where the decision usually gets blurry. Time travel matters. Retention matters. Rewrite frequency matters. A table that mostly accumulates rows and stays still behaves very differently from one that is regularly overwritten or mutated in place.

That doesn’t mean every SME warehouse needs to start trimming time travel windows aggressively on day one. It does mean the billing choice is incomplete until the dataset has an explicit retention story. If the system can’t say how much history is actually needed and why, it isn’t really tuning storage policy. It’s carrying default history and calling it intentional.

The same applies to overwrite-heavy zones that only exist to support a short transformation step. Those datasets often don’t need generous history, and leaving them with one anyway can quietly inflate cost without buying much protection.
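In BigQuery the time travel window is set per dataset. A trimmed window for a throwaway staging dataset can look like the following, where `staging_jobs` is a placeholder dataset name; the configurable range was 2 to 7 days (48 to 168 hours) at the time of writing, so check current limits before relying on this.

```sql
-- Shrink time travel for a short-lived staging dataset
-- (dataset name is a placeholder; 48 hours is the minimum window).
ALTER SCHEMA staging_jobs
SET OPTIONS (max_time_travel_hours = 48);
```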

This has nothing to do with performance

One useful thing about this decision is how much cleaner it gets once the boundary is held properly. Physical versus logical storage is not a latency tool. It won’t make dashboards faster. It won’t calm a busy compute path. It won’t rescue a weak serving model. If the warehouse feels slow, the problem belongs somewhere else.

This choice should stay separate from reservation design. Storage billing is about how bytes are accounted for over time. Compute design is about how work gets executed. Mixing those discussions usually produces bad decisions in both directions. A storage lever gets asked to solve a runtime problem, and the actual runtime problem stays right where it was.

A small classification default is enough

This doesn’t need a grand taxonomy. A few stable dataset classes are usually enough to make the decision useful.

```yaml
dataset_classes:
  raw_append_only:
    billing_model: physical
    time_travel_window: 2_days

  curated_slowly_changing:
    billing_model: logical
    time_travel_window: 7_days

  staging_overwrite_heavy:
    billing_model: logical
    time_travel_window: 2_days
```

That’s not scripture. It’s a starting posture. Append-heavy raw data often fits physical billing because the table mostly grows and in-place churn stays limited. Curated datasets with slower change patterns often fit logical billing cleanly enough. Overwrite-heavy or mutation-heavy layers deserve more suspicion before any easy savings story gets accepted, because physical billing also counts time travel and fail-safe bytes, and heavy rewrites generate plenty of both.

The point isn’t to memorize a mapping. The point is to force the warehouse to describe dataset behavior in a way that billing policy can actually follow.
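Applying the mapping is a per-dataset DDL statement. The dataset name below is a placeholder; note also that BigQuery restricts how soon a dataset's billing model can be changed again after a switch, so this is not a lever to flip casually.

```sql
-- Move an append-heavy raw dataset to physical (compressed) billing.
-- Dataset name is a placeholder for illustration.
ALTER SCHEMA raw_events
SET OPTIONS (storage_billing_model = 'PHYSICAL');
```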

Don’t optimize this before the categories are stable

A lot of SME warehouses should leave this decision alone for a while. If storage is still a minor share of total cost, if datasets are still moving between layers, or if mutation and retention patterns still aren’t clear, flipping billing models early usually adds complexity before it adds clarity.

We’d rather wait until the categories are real. Once the warehouse can say which datasets are append-heavy, which ones churn, which ones get rewritten, and which ones need retained history, the choice gets much simpler. Before that, it is mostly guesswork.

That delay is not a sign that the platform is behind. It’s usually the disciplined move. Storage optimization only helps when the warehouse has already become legible enough to classify.
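When the warehouse is finally ready to look, BigQuery exposes both accountings side by side in `INFORMATION_SCHEMA.TABLE_STORAGE`, so the comparison can be measured rather than argued. A query along these lines (the region qualifier is an example; substitute your own) shows where physical bytes, time travel, and fail-safe overhead actually sit:

```sql
-- Compare logical vs physical footprint per dataset.
-- `region-us` is an example region qualifier.
SELECT
  table_schema AS dataset,
  SUM(active_logical_bytes)      / POW(1024, 3) AS active_logical_gib,
  SUM(total_physical_bytes)      / POW(1024, 3) AS total_physical_gib,
  SUM(time_travel_physical_bytes) / POW(1024, 3) AS time_travel_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY dataset
ORDER BY total_physical_gib DESC;
```

Multiply each column by the current list prices for the two billing models and the classification either pays for itself or it doesn’t.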

The rule

Choose storage billing per dataset, based on how the data changes and how much history it actually needs. Keep that decision separate from performance and compute discussions. And if the warehouse still can’t describe its dataset classes cleanly, wait.
