Storage layout, partitioning and compression matter more than database choice for time‑series cost and performance

5/12/2026, 9:27:01 AM

On May 12, 2026, Nirmesh Khandelwal published an analysis showing that per‑row layout, compression and partitioning decisions have a larger impact on time‑series storage cost and query performance than the choice of database engine.

The analysis argues that how you lay out rows, compress data and partition time‑series datasets typically determines storage cost and query performance more than which database engine you pick. It tests these trade‑offs from first principles using commonly available systems, and its main practical implication is that teams should tune schema and retention strategies before switching databases. The primary audience is builders and operators who serve heavy time‑series workloads, because these design choices directly affect storage bills, write concurrency and read latencies.

The author demonstrates concrete experiments with PostgreSQL and Apache Parquet. Normalizing series identity into a separate metadata table and referencing it with a compact integer key cut storage by about 42% in the reported experiment, because repeated dimension strings (device, region, location) are stored once per series instead of per row. Khandelwal also shows the reverse effect: including high‑cardinality fields such as request IDs or session tokens inside series identity causes normalization gains to collapse, so those tokens should be excluded from the series key.
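The normalization idea can be illustrated with a minimal in-memory sketch. It is not the article's PostgreSQL schema: the SeriesRegistry class and the sample dimension values are hypothetical, and a Python dict stands in for the metadata table. It shows why stable dimensions collapse to one registry entry while a per-request token produces one entry per row.

```python
from dataclasses import dataclass, field


@dataclass
class SeriesRegistry:
    """Maps each distinct set of dimensions to a compact integer key.

    Hypothetical stand-in for a series metadata table.
    """
    _ids: dict = field(default_factory=dict)

    def series_id(self, **dimensions: str) -> int:
        # Sort the items so the same dimensions always yield the same identity.
        key = tuple(sorted(dimensions.items()))
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def __len__(self) -> int:
        return len(self._ids)


# Stable dimensions only: every sample row carries one small integer,
# and the dimension strings are stored once in the registry.
registry = SeriesRegistry()
rows = [
    (registry.series_id(device="sensor-1", region="eu-west", location="rack-4"), ts, 20.0)
    for ts in range(1000)
]
stable_series = len(registry)

# Anti-pattern: a per-request token inside the identity creates a new
# series for every row, so nothing is shared and the gain disappears.
bad_registry = SeriesRegistry()
for ts in range(1000):
    bad_registry.series_id(device="sensor-1", region="eu-west", request_id=f"req-{ts}")
exploded_series = len(bad_registry)
```

With 1,000 samples, the first registry holds a single entry and the second holds 1,000, which mirrors the collapse the article describes for high‑cardinality fields.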

On schema flexibility, storing series dimensions as jsonb (PostgreSQL) lets tags evolve without frequent migrations, but it requires a deliberate indexing policy. The piece recommends targeted partial or expression indexes to support common filters and aggregations while avoiding index sprawl and type drift; indiscriminate indexing of jsonb fields can greatly increase index size and storage costs. Choosing which fields belong to stable series dimensions versus volatile metrics should follow typical query patterns, not schema convenience.
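One way to keep jsonb indexing deliberate is to generate index DDL only for an explicit allow-list of keys. The sketch below assumes a hypothetical table named series with a jsonb column named dimensions; the allow-list and the helper name are illustrative and not taken from the article. It emits a plain PostgreSQL expression index on a single key.

```python
import re

# Hypothetical policy: only these jsonb keys are filtered on often enough to index.
INDEXED_KEYS = {"region", "device"}

# Conservative identifier check so names can be interpolated into DDL safely.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def expression_index_ddl(table: str, column: str, key: str) -> str:
    """Return PostgreSQL DDL for an expression index on one jsonb key."""
    if key not in INDEXED_KEYS:
        raise ValueError(f"{key!r} is not in the indexing allow-list")
    for name in (table, column, key):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # The doubled parentheses are required syntax for an index on an expression.
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_{key}_idx "
        f"ON {table} (({column}->>'{key}'));"
    )


ddl = expression_index_ddl("series", "dimensions", "region")
```

Requesting an index for a key outside the allow-list, such as a request ID, raises an error instead of silently growing the index set.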

Time partitioning is presented as an operationally powerful pattern: queries can prune partitions outside the requested time range, and data expiration becomes an O(1) operation because expired data is removed by dropping whole partitions rather than deleting rows, which simplifies retention and compaction. The trade‑off is a write hotspot on the current partition window; adding a second partition axis (for example, by series identity) spreads writes across partitions and narrows read scan widths. For workloads with intense recent writes, the report recommends combining time partitions with a series‑based distribution strategy to avoid single‑writer bottlenecks on the active partition.
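The two-axis layout can be sketched as a routing function plus a retention check. The partition naming scheme, the bucket count and both function names are assumptions made for illustration; a real deployment would rely on the database's own partitioning features.

```python
from datetime import datetime, timedelta, timezone

HASH_BUCKETS = 4  # assumed number of series-based sub-partitions per day


def partition_for(ts: datetime, series_id: int) -> str:
    """Route a sample to a (day, series bucket) partition name."""
    day = ts.astimezone(timezone.utc).strftime("%Y%m%d")
    # The second axis spreads writes for the current day across several buckets.
    bucket = series_id % HASH_BUCKETS
    return f"samples_{day}_h{bucket}"


def expired_partitions(partitions: list[str], now: datetime, keep_days: int) -> list[str]:
    """Return the partitions whose day falls before the retention cutoff."""
    cutoff = (now.astimezone(timezone.utc) - timedelta(days=keep_days)).strftime("%Y%m%d")
    # YYYYMMDD strings sort chronologically, so a string comparison is enough.
    return [p for p in partitions if p.split("_")[1] < cutoff]


now = datetime(2026, 5, 12, tzinfo=timezone.utc)
targets = [partition_for(now, series_id) for series_id in (7, 8)]
to_drop = expired_partitions(
    ["samples_20260501_h0", "samples_20260510_h1", "samples_20260512_h3"],
    now,
    keep_days=7,
)
```

Expiration here only selects partition names to drop; no individual rows are scanned or deleted, which is the property that makes retention cheap.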

Downsampling and rollups are shown to be the dominant lever for reducing row counts: converting five‑second samples to one‑hour aggregates reduces the number of rows by 720×. Khandelwal recommends retaining full resolution only for the recent window where detail matters and serving older queries from pre‑aggregated rollups. That approach significantly lowers storage, indexing and scan costs while preserving query performance for historical analysis.
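The 720× figure follows from 3,600 seconds per hour divided by a 5-second sampling interval. A minimal rollup sketch makes the reduction concrete; the function name, the choice of aggregates and the synthetic data are assumptions, not the article's implementation.

```python
from collections import defaultdict


def rollup_hourly(samples):
    """Aggregate (epoch_seconds, value) samples into one row per hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour.
        buckets[ts - ts % 3600].append(value)
    return [
        {
            "hour": hour,
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for hour, values in sorted(buckets.items())
    ]


# Two hours of synthetic 5-second samples.
raw = [(t, float(t % 60)) for t in range(0, 2 * 3600, 5)]
hourly = rollup_hourly(raw)
reduction = len(raw) // len(hourly)
```

Keeping min, max, average and count per bucket lets most historical queries be answered from the rollup alone, while the raw rows are retained only for the recent window.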

The practical takeaway is to measure these design choices on your own workload instead of immediately changing databases. Key recommended actions: normalize stable identifiers into a compact registry; exclude high‑cardinality tokens from series identity; adopt targeted jsonb indexing; use time partitions with an additional distribution axis; and apply downsampling policies aligned to query needs. The article includes step‑by‑step experiments and numbers so teams can make these trade‑offs measurable.

Sources

InfoQ AI/ML · 5/12/2026
