Physical vs logical storage: a dataset classification rule for SMEs
Physical versus logical storage billing is not a warehouse philosophy debate. It is a dataset classification choice based on change rate, retention behavior, and how much storage churn the table creates.
Treat storage billing as classification, not doctrine
Physical versus logical storage is one of those BigQuery decisions that gets framed far too dramatically. It isn’t a statement about maturity, taste, or warehouse philosophy. It’s a dataset classification choice. The only useful question is which billing model matches the way the data actually behaves.
It belongs under cost guardrails, not under performance tuning or architectural symbolism. Different datasets create different patterns of storage churn, retained history, and rewrite pressure. The billing model should follow that behavior. If it doesn’t, the warehouse ends up with a neat explanation and a messier bill.
Classify by change pattern, not by ownership or sentiment
The first split that matters is how the data changes. Append-heavy raw zones, mutation-heavy curated datasets, and overwrite-heavy staging layers do not produce the same storage economics, even when they sit in the same platform and support the same downstream reporting.
That becomes clearer once the classification stays close to the table behavior instead of drifting toward org-chart logic. A landing zone shaped by streaming-first ingestion is not the same storage problem as a curated layer that gets updated in place, and neither of those is the same as staging data that is routinely replaced. Put all three under one billing assumption and the decision is already wrong before the invoice arrives.
The dataset class should be explicit. Not because the warehouse needs more labels, but because billing choices age badly when the data behavior behind them stays vague.
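One way to keep the class explicit is to record it on the dataset itself. A minimal sketch, assuming BigQuery's `ALTER SCHEMA ... SET OPTIONS (labels = ...)` DDL; the dataset names and the class registry are illustrative, not prescribed:

```python
# Sketch: make each dataset's class explicit as a BigQuery schema label.
# Dataset names and the CLASSIFICATION registry are hypothetical examples.

DATASET_CLASSES = {"raw_append_only", "curated_slowly_changing", "staging_overwrite_heavy"}

# Every dataset gets an explicit class; an unclassified dataset is an error.
CLASSIFICATION = {
    "raw_events": "raw_append_only",
    "curated_customers": "curated_slowly_changing",
    "staging_loads": "staging_overwrite_heavy",
}

def label_statement(dataset: str) -> str:
    """Emit DDL that records the dataset class as a schema label."""
    cls = CLASSIFICATION[dataset]  # KeyError here means "unclassified", which is the point
    if cls not in DATASET_CLASSES:
        raise ValueError(f"unknown dataset class {cls!r} for {dataset}")
    return (
        f"ALTER SCHEMA `{dataset}` "
        f"SET OPTIONS (labels = [('dataset_class', '{cls}')]);"
    )

for ds in CLASSIFICATION:
    print(label_statement(ds))
```

The label buys nothing by itself; it forces the classification to exist somewhere queryable instead of living in someone's head.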
Retention and time travel are part of the cost model
Storage billing discussions often fixate on the active table surface and not enough on historical behavior. That's where the decision usually gets blurry. Time travel matters, not least because physical billing charges for time travel and fail-safe bytes while logical billing does not. Retention matters. Rewrite frequency matters. A table that mostly accumulates rows and stays still behaves very differently from one that is regularly overwritten or mutated in place.
That doesn’t mean every SME warehouse needs to start trimming time travel windows aggressively on day one. It does mean the billing choice is incomplete until the dataset has an explicit retention story. If the system can’t say how much history is actually needed and why, it isn’t really tuning storage policy. It’s carrying default history and calling it intentional.
The same applies to overwrite-heavy zones that only exist to support a short transformation step. Those datasets often don’t need generous history, and leaving them with one anyway can quietly inflate cost without buying much protection.
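Trimming history is a one-line dataset change. A minimal sketch, assuming BigQuery's `max_time_travel_hours` schema option, which accepts 48 to 168 hours in 24-hour steps; the dataset name is illustrative:

```python
# Sketch: emit DDL that trims a dataset's time travel window.
# Assumes BigQuery's max_time_travel_hours option (valid range: 48-168
# hours, i.e. 2-7 days, in whole-day steps). Dataset name is hypothetical.

def time_travel_ddl(dataset: str, days: int) -> str:
    hours = days * 24
    if not 48 <= hours <= 168:
        raise ValueError("BigQuery allows time travel windows of 2 to 7 days")
    return (
        f"ALTER SCHEMA `{dataset}` "
        f"SET OPTIONS (max_time_travel_hours = {hours});"
    )

print(time_travel_ddl("staging_loads", 2))
# ALTER SCHEMA `staging_loads` SET OPTIONS (max_time_travel_hours = 48);
```

The validation is the interesting part: it makes the retention story explicit per dataset instead of inherited as a platform default.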
This has nothing to do with performance
One useful thing about this decision is how much cleaner it gets once the boundary is held properly. Physical versus logical storage is not a latency tool. It won’t make dashboards faster. It won’t calm a busy compute path. It won’t rescue a weak serving model. If the warehouse feels slow, the problem belongs somewhere else.
This choice should stay separate from reservation design. Storage billing is about how bytes are accounted for over time. Compute design is about how work gets executed. Mixing those discussions usually produces bad decisions in both directions. A storage lever gets asked to solve a runtime problem, and the actual runtime problem stays right where it was.
A small classification default is enough
This doesn’t need a grand taxonomy. A few stable dataset classes are usually enough to make the decision useful.
```yaml
dataset_classes:
  raw_append_only:
    billing_model: physical
    time_travel_window: 2_days
  curated_slowly_changing:
    billing_model: logical
    time_travel_window: 7_days
  staging_overwrite_heavy:
    billing_model: logical
    time_travel_window: 2_days
```

That's not scripture. It's a starting posture. Append-heavy raw data often fits physical billing because the table mostly grows and churn in place stays limited. Curated datasets with slower change patterns often fit logical billing cleanly enough. Overwrite-heavy or mutation-heavy layers deserve more suspicion before any easy savings story gets accepted.
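The mapping above can be applied mechanically. A sketch, assuming BigQuery's `storage_billing_model` and `max_time_travel_hours` schema options; the class table mirrors the mapping, and the dataset name is a hypothetical example:

```python
# Sketch: turn the dataset_classes mapping into per-dataset DDL.
# storage_billing_model and max_time_travel_hours are real BigQuery
# schema options; the dataset name below is illustrative.

DATASET_CLASSES = {
    "raw_append_only":         {"billing_model": "physical", "time_travel_days": 2},
    "curated_slowly_changing": {"billing_model": "logical",  "time_travel_days": 7},
    "staging_overwrite_heavy": {"billing_model": "logical",  "time_travel_days": 2},
}

def storage_policy_ddl(dataset: str, dataset_class: str) -> str:
    """One ALTER SCHEMA statement per dataset, derived from its class."""
    policy = DATASET_CLASSES[dataset_class]
    return (
        f"ALTER SCHEMA `{dataset}` SET OPTIONS ("
        f"storage_billing_model = '{policy['billing_model'].upper()}', "
        f"max_time_travel_hours = {policy['time_travel_days'] * 24});"
    )

print(storage_policy_ddl("raw_events", "raw_append_only"))
```

Whether this runs as a one-off script or in CI matters less than the direction of the dependency: the class drives the billing setting, never the other way around.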
The point isn’t to memorize a mapping. The point is to force the warehouse to describe dataset behavior in a way that billing policy can actually follow.
Don’t optimize this before the categories are stable
A lot of SME warehouses should leave this decision alone for a while. If storage is still a minor share of total cost, if datasets are still moving between layers, or if mutation and retention patterns still aren’t clear, flipping billing models early usually adds complexity before it adds clarity.
We’d rather wait until the categories are real. Once the warehouse can say which datasets are append-heavy, which ones churn, which ones get rewritten, and which ones need retained history, the choice gets much simpler. Before that, it is mostly guesswork.
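Once the categories are real, the comparison itself is just arithmetic over byte counts BigQuery already exposes (the `INFORMATION_SCHEMA.TABLE_STORAGE` view reports logical, physical, time travel, and fail-safe bytes per table). A sketch with illustrative placeholder prices; check current BigQuery pricing for your region before trusting any numbers:

```python
# Sketch: estimate logical vs physical monthly billing for a dataset from
# its stored byte counts. Prices below are illustrative assumptions, not
# authoritative; physical billing also charges for time travel and
# fail-safe bytes, logical billing does not.

GIB = 1024 ** 3

# Assumed per-GiB monthly prices (placeholders)
LOGICAL_ACTIVE, LOGICAL_LONG_TERM = 0.02, 0.01
PHYSICAL_ACTIVE, PHYSICAL_LONG_TERM = 0.04, 0.02

def monthly_estimate(active_logical, long_term_logical,
                     active_physical, long_term_physical,
                     time_travel_physical, fail_safe_physical):
    """Return (logical_cost, physical_cost) in dollars per month."""
    logical = (active_logical * LOGICAL_ACTIVE +
               long_term_logical * LOGICAL_LONG_TERM) / GIB
    physical = ((active_physical + time_travel_physical + fail_safe_physical)
                * PHYSICAL_ACTIVE +
                long_term_physical * PHYSICAL_LONG_TERM) / GIB
    return logical, physical

# Append-heavy raw zone: compresses well, little churn, so physical often wins.
logical, physical = monthly_estimate(
    active_logical=500 * GIB, long_term_logical=2000 * GIB,
    active_physical=100 * GIB, long_term_physical=400 * GIB,
    time_travel_physical=5 * GIB, fail_safe_physical=5 * GIB)
print(f"logical ~ ${logical:.2f}/mo, physical ~ ${physical:.2f}/mo")
# logical ~ $30.00/mo, physical ~ $12.40/mo
```

The shape of the inputs is the real lesson: a mutation-heavy table drags large time travel and fail-safe byte counts into the physical side, which is exactly why overwrite-heavy zones deserve suspicion before the savings story is accepted.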
That delay is not a sign that the platform is behind. It’s usually the disciplined move. Storage optimization only helps when the warehouse has already become legible enough to classify.
The rule
Choose storage billing per dataset, based on how the data changes and how much history it actually needs. Keep that decision separate from performance and compute discussions. And if the warehouse still can’t describe its dataset classes cleanly, wait.