Physical vs logical storage: a dataset classification rule for SMEs
Physical versus logical storage billing is not a warehouse philosophy debate. It is a dataset classification choice based on change rate, retention behavior, and how much storage churn the table creates.
Treat storage billing as classification, not doctrine
Physical versus logical storage is one of those BigQuery decisions that gets framed far too dramatically. It isn’t a statement about maturity, taste, or warehouse philosophy. It’s a dataset classification choice. The only useful question is which billing model matches the way the data actually behaves.
It belongs under cost guardrails, not under performance tuning or architectural symbolism. Different datasets create different patterns of storage churn, retained history, and rewrite pressure. The billing model should follow that behavior. If it doesn’t, the warehouse ends up with a neat explanation and a messier bill.
Classify by change pattern, not by ownership or sentiment
The first split that matters is how the data changes. Append-heavy raw zones, mutation-heavy curated datasets, and overwrite-heavy staging layers do not produce the same storage economics, even when they sit in the same platform and support the same downstream reporting.
That becomes clearer once the classification stays close to the table behavior instead of drifting toward org-chart logic. A landing zone shaped by streaming-first ingestion is not the same storage problem as a curated layer that gets updated in place, and neither of those is the same as staging data that is routinely replaced. Put all three under one billing assumption and the decision is already wrong before the invoice arrives.
The dataset class should be explicit. Not because the warehouse needs more labels, but because billing choices age badly when the data behavior behind them stays vague.
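One way to keep the class explicit is to record it on the dataset itself. A minimal sketch, assuming BigQuery's `ALTER SCHEMA ... SET OPTIONS (labels = ...)` DDL; the dataset names and the class registry are illustrative, not prescribed:

```python
# Sketch: make each dataset's class explicit as a BigQuery schema label.
# Dataset names and the CLASSIFICATION registry are hypothetical examples.

DATASET_CLASSES = {"raw_append_only", "curated_slowly_changing", "staging_overwrite_heavy"}

# Every dataset gets an explicit class; an unclassified dataset is an error.
CLASSIFICATION = {
    "raw_events": "raw_append_only",
    "curated_customers": "curated_slowly_changing",
    "staging_loads": "staging_overwrite_heavy",
}

def label_statement(dataset: str) -> str:
    """Emit DDL that records the dataset class as a schema label."""
    cls = CLASSIFICATION[dataset]  # KeyError here means "unclassified", which is the point
    if cls not in DATASET_CLASSES:
        raise ValueError(f"unknown dataset class {cls!r} for {dataset}")
    return (
        f"ALTER SCHEMA `{dataset}` "
        f"SET OPTIONS (labels = [('dataset_class', '{cls}')]);"
    )

for ds in CLASSIFICATION:
    print(label_statement(ds))
```

The label buys nothing by itself; it forces the classification to exist somewhere queryable instead of living in someone's head.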
Retention and time travel are part of the cost model
Storage billing discussions often fixate on the active table surface and not enough on historical behavior. That's where the decision usually gets blurry. Time travel matters, not least because physical billing charges for time travel and fail-safe bytes while logical billing does not. Retention matters. Rewrite frequency matters. A table that mostly accumulates rows and stays still behaves very differently from one that is regularly overwritten or mutated in place.
That doesn’t mean every SME warehouse needs to start trimming time travel windows aggressively on day one. It does mean the billing choice is incomplete until the dataset has an explicit retention story. If the system can’t say how much history is actually needed and why, it isn’t really tuning storage policy. It’s carrying default history and calling it intentional.
The same applies to overwrite-heavy zones that only exist to support a short transformation step. Those datasets often don’t need generous history, and leaving them with one anyway can quietly inflate cost without buying much protection.
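Trimming history is a one-line dataset change. A minimal sketch, assuming BigQuery's `max_time_travel_hours` schema option, which accepts 48 to 168 hours in 24-hour steps; the dataset name is illustrative:

```python
# Sketch: emit DDL that trims a dataset's time travel window.
# Assumes BigQuery's max_time_travel_hours option (valid range: 48-168
# hours, i.e. 2-7 days, in whole-day steps). Dataset name is hypothetical.

def time_travel_ddl(dataset: str, days: int) -> str:
    hours = days * 24
    if not 48 <= hours <= 168:
        raise ValueError("BigQuery allows time travel windows of 2 to 7 days")
    return (
        f"ALTER SCHEMA `{dataset}` "
        f"SET OPTIONS (max_time_travel_hours = {hours});"
    )

print(time_travel_ddl("staging_loads", 2))
# ALTER SCHEMA `staging_loads` SET OPTIONS (max_time_travel_hours = 48);
```

The validation is the interesting part: it makes the retention story explicit per dataset instead of inherited as a platform default.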
This has nothing to do with performance
One useful thing about this decision is how much cleaner it gets once the boundary is held properly. Physical versus logical storage is not a latency tool. It won’t make dashboards faster. It won’t calm a busy compute path. It won’t rescue a weak serving model. If the warehouse feels slow, the problem belongs somewhere else.
This choice should stay separate from reservation design. Storage billing is about how bytes are accounted for over time. Compute design is about how work gets executed. Mixing those discussions usually produces bad decisions in both directions. A storage lever gets asked to solve a runtime problem, and the actual runtime problem stays right where it was.
A small classification default is enough
This doesn’t need a grand taxonomy. A few stable dataset classes are usually enough to make the decision useful.
```yaml
dataset_classes:
  raw_append_only:
    billing_model: physical
    time_travel_window: 2_days
  curated_slowly_changing:
    billing_model: logical
    time_travel_window: 7_days
  staging_overwrite_heavy:
    billing_model: logical
    time_travel_window: 2_days
```

That's not scripture. It's a starting posture. Append-heavy raw data often fits physical billing because the table mostly grows and churn in place stays limited. Curated datasets with slower change patterns often fit logical billing cleanly enough. Overwrite-heavy or mutation-heavy layers deserve more suspicion before any easy savings story gets accepted.
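The mapping above can be applied mechanically. A sketch, assuming BigQuery's `storage_billing_model` and `max_time_travel_hours` schema options; the class table mirrors the mapping, and the dataset name is a hypothetical example:

```python
# Sketch: turn the dataset_classes mapping into per-dataset DDL.
# storage_billing_model and max_time_travel_hours are real BigQuery
# schema options; the dataset name below is illustrative.

DATASET_CLASSES = {
    "raw_append_only":         {"billing_model": "physical", "time_travel_days": 2},
    "curated_slowly_changing": {"billing_model": "logical",  "time_travel_days": 7},
    "staging_overwrite_heavy": {"billing_model": "logical",  "time_travel_days": 2},
}

def storage_policy_ddl(dataset: str, dataset_class: str) -> str:
    """One ALTER SCHEMA statement per dataset, derived from its class."""
    policy = DATASET_CLASSES[dataset_class]
    return (
        f"ALTER SCHEMA `{dataset}` SET OPTIONS ("
        f"storage_billing_model = '{policy['billing_model'].upper()}', "
        f"max_time_travel_hours = {policy['time_travel_days'] * 24});"
    )

print(storage_policy_ddl("raw_events", "raw_append_only"))
```

Whether this runs as a one-off script or in CI matters less than the direction of the dependency: the class drives the billing setting, never the other way around.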
The point isn’t to memorize a mapping. The point is to force the warehouse to describe dataset behavior in a way that billing policy can actually follow.
Don’t optimize this before the categories are stable
A lot of SME warehouses should leave this decision alone for a while. If storage is still a minor share of total cost, if datasets are still moving between layers, or if mutation and retention patterns still aren’t clear, flipping billing models early usually adds complexity before it adds clarity.
We’d rather wait until the categories are real. Once the warehouse can say which datasets are append-heavy, which ones churn, which ones get rewritten, and which ones need retained history, the choice gets much simpler. Before that, it is mostly guesswork.
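Once the categories are real, the comparison itself is just arithmetic over byte counts BigQuery already exposes (the `INFORMATION_SCHEMA.TABLE_STORAGE` view reports logical, physical, time travel, and fail-safe bytes per table). A sketch with illustrative placeholder prices; check current BigQuery pricing for your region before trusting any numbers:

```python
# Sketch: estimate logical vs physical monthly billing for a dataset from
# its stored byte counts. Prices below are illustrative assumptions, not
# authoritative; physical billing also charges for time travel and
# fail-safe bytes, logical billing does not.

GIB = 1024 ** 3

# Assumed per-GiB monthly prices (placeholders)
LOGICAL_ACTIVE, LOGICAL_LONG_TERM = 0.02, 0.01
PHYSICAL_ACTIVE, PHYSICAL_LONG_TERM = 0.04, 0.02

def monthly_estimate(active_logical, long_term_logical,
                     active_physical, long_term_physical,
                     time_travel_physical, fail_safe_physical):
    """Return (logical_cost, physical_cost) in dollars per month."""
    logical = (active_logical * LOGICAL_ACTIVE +
               long_term_logical * LOGICAL_LONG_TERM) / GIB
    physical = ((active_physical + time_travel_physical + fail_safe_physical)
                * PHYSICAL_ACTIVE +
                long_term_physical * PHYSICAL_LONG_TERM) / GIB
    return logical, physical

# Append-heavy raw zone: compresses well, little churn, so physical often wins.
logical, physical = monthly_estimate(
    active_logical=500 * GIB, long_term_logical=2000 * GIB,
    active_physical=100 * GIB, long_term_physical=400 * GIB,
    time_travel_physical=5 * GIB, fail_safe_physical=5 * GIB)
print(f"logical ~ ${logical:.2f}/mo, physical ~ ${physical:.2f}/mo")
# logical ~ $30.00/mo, physical ~ $12.40/mo
```

The shape of the inputs is the real lesson: a mutation-heavy table drags large time travel and fail-safe byte counts into the physical side, which is exactly why overwrite-heavy zones deserve suspicion before the savings story is accepted.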
That delay is not a sign that the platform is behind. It’s usually the disciplined move. Storage optimization only helps when the warehouse has already become legible enough to classify.
The rule
Choose storage billing per dataset, based on how the data changes and how much history it actually needs. Keep that decision separate from performance and compute discussions. And if the warehouse still can’t describe its dataset classes cleanly, wait.