Data Lake Cost Optimization: 3 Levers
Data lake cost optimization comes down to three levers: partition pruning, file compaction, and lifecycle tiering. Here is how to tune each one in production.
The first thing to fix about data lake cost optimization is the mental model of where the money goes. Most teams open the conversation with “storage is too expensive” and aim straight at S3 or ADLS storage tiers. Storage is almost never the real problem. Object storage at scale is a rounding error compared to what the lake actually charges you for: API requests against billions of small files, full-scan queries that read partitions that should have been pruned, and snapshots that nobody expired. On a typical data lake build we’re asked to audit, 30–70% of the bill is request and scan charges that don’t show up under “storage” on the invoice and don’t get touched by lifecycle-tier toggles.
That’s why “we’ll just move cold data to Glacier” rarely moves the number. It addresses the smallest slice. The biggest savings sit in three operational levers — partitioning, compaction, lifecycle — and they’re all engineering disciplines, not storage-class settings. This post is the framework we use to cut a lake’s bill by 40–60% without changing the platform, the cloud, or the table format. The structure is intentional: partitioning controls what gets scanned, compaction controls how efficiently you scan it, lifecycle controls what stays around to be scanned at all. Get all three right and the bill compounds down; get one wrong and the others can’t save you.
What actually drives the bill in 2026
Storage prices are flat-to-falling. S3 Standard sits around $0.023/GB-month, ADLS Hot is in the same neighborhood, GCS Standard slightly higher. At a petabyte that’s ~$23k/month for raw storage — a real number, but rarely the line item that surprises a CFO.
The line items that surprise people are:
- Request charges. Every `GET`, `PUT`, `LIST`, and metadata call on the object store is metered. At small-file scale this dominates. Reading a million 1 MB files costs roughly 1,000× the request charges of reading a thousand 1 GB files for the same logical data. Glovo’s data team famously documented a 10× cost reduction on heavily queried tables just from compacting them, with most of the savings coming from fewer S3 GET requests rather than less storage.
- Scan charges. Athena, BigQuery, Redshift Spectrum, and Synapse Serverless all bill by bytes scanned. A poorly partitioned table can force the engine to scan data that is 90% irrelevant to the query, and it pays for that on every query. Bad partitioning isn’t a performance problem first; it’s a bill problem first that also shows up as latency.
- Egress and retrieval. Cross-region replication, BI tools pulling result sets, the data-science pod that’s accidentally in a different VPC — Finout’s 2026 cloud-storage analysis put request + egress + retrieval at 30–70% of an enterprise’s effective storage spend.
- Compute for table maintenance. A real cost on Iceberg/Delta lakes if you’re running compaction and snapshot expiration as one-off Spark jobs against the whole table. Managed compaction services have changed the economics here, and we’ll come back to that.
The shape of the problem: ~10–20% of the bill is storage you can tier down, ~30–50% is request and scan volume you can engineer down, and the rest is compute attached to the queries that the first two levers determine. Optimize for the storage line and you save 5%. Optimize for the request and scan lines and you save 40–60%. Same lake, same data, fundamentally different bill.
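To make the storage-versus-request arithmetic concrete, here is a minimal sketch. The storage price is the figure quoted above; the GET price is our assumption of a typical list price, and the one-GET-per-file model is a simplification (engines often issue several ranged reads per large Parquet file). Check both against your provider's current price sheet.

```python
# Illustrative prices only. STORAGE is the S3 Standard figure quoted in the text;
# GET_USD_PER_1000 is an assumed list price, not taken from the article.
STORAGE_USD_PER_GB_MONTH = 0.023
GET_USD_PER_1000 = 0.0004


def monthly_storage_usd(total_gb: float) -> float:
    """Raw storage cost for one month at the assumed per-GB price."""
    return total_gb * STORAGE_USD_PER_GB_MONTH


def get_request_usd(file_count: int, reads_per_file: int = 1) -> float:
    """Request cost of reading every file, assuming one GET per file per read."""
    return file_count * reads_per_file / 1000 * GET_USD_PER_1000


small_files = get_request_usd(1_000_000)  # a million 1 MB files
large_files = get_request_usd(1_000)      # a thousand 1 GB files

print(f"storage for 1 PB: ${monthly_storage_usd(1_000_000):,.0f}/month")
print(f"request cost, small vs large files: {small_files / large_files:,.0f}x")
```

A single pass over either layout is cheap in absolute terms. The ratio matters because a heavily queried table repeats that pass thousands of times a day.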
Lever 1: Partitioning that prunes (not partitioning that exists)
Every data-lake design includes partitioning. Most of them include the wrong partitioning, which is worse than no partitioning at all. Wrong partitioning still incurs the metadata cost of partitioning while gaining none of the scan-reduction benefit, and on Iceberg or Hive-style catalogs it can actively slow query planning.
The discipline is this: partitions exist to serve query predicates, not to organize the folder tree. A column belongs in the partition key only if a meaningful share of queries filter on it with high selectivity. Three failure modes are everywhere:
- Partitioning by high-cardinality columns. `partition by user_id` or `partition by order_id` looks like it’ll speed up point lookups. What it actually does is shatter the dataset into millions of tiny partitions, each one a directory with a tiny Parquet file, and now you have the small-files problem permanently encoded into the table layout. The catalog is overwhelmed, listing operations become a visible line on the bill, and the engine spends more time enumerating partitions than reading data.
- Partitioning by low-selectivity columns. `partition by country` on a US-only business or `partition by status` where 95% of rows are `complete` doesn’t reduce scans meaningfully. The partition column has to actually split the data into chunks queries can skip.
- Partitioning by date at the wrong grain. `partition by event_date` (one folder per day) on a table that retains 5 years and gets queried mostly by month creates 1,825 partitions when 60 would have done. Iceberg’s transform-based partitioning (`days(event_ts)`, `months(event_ts)`, `hours(event_ts)`) derives partition values from the timestamp column itself, so queries filter on `event_ts` directly and the grain can be changed later instead of being locked in at table-creation time. If you’re on Hive-style partitions and stuck, that’s a migration we cover in the data engineering work; the savings usually pay it back in a quarter.
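The grain arithmetic in the last bullet can be checked with a few lines. This is a rough sketch that ignores leap days; the function name is ours.

```python
def partition_count(retention_years: int, grain: str) -> int:
    """Approximate number of time partitions kept for a retention window."""
    per_year = {"hour": 365 * 24, "day": 365, "month": 12}
    return retention_years * per_year[grain]


for grain in ("hour", "day", "month"):
    print(grain, partition_count(5, grain))
```

For a five-year table queried mostly by month, the daily grain carries about 30 times as many partitions as the monthly one for the planner to enumerate.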
The honest test for whether partitioning is working: pull a representative week of queries from the engine’s history, calculate the average bytes-scanned-per-query, and compare it to the total table size. If queries are scanning more than ~10% of the table on average, the partition key is wrong for the workload — irrespective of how clever the partition scheme looked at design time.
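A minimal sketch of that test, assuming you have already exported bytes-scanned per query from the engine's history (the sample numbers below are invented for illustration):

```python
from statistics import mean

TB = 10**12


def avg_scan_fraction(bytes_scanned: list[int], table_bytes: int) -> float:
    """Average share of the table that each query scanned."""
    if not bytes_scanned:
        raise ValueError("need at least one query in the sample")
    return mean([b / table_bytes for b in bytes_scanned])


def pruning_is_working(bytes_scanned: list[int], table_bytes: int,
                       threshold: float = 0.10) -> bool:
    """True when queries scan no more than ~10% of the table on average."""
    return avg_scan_fraction(bytes_scanned, table_bytes) <= threshold


# Hypothetical week of queries against a 10 TB table.
week = [int(0.2 * TB), int(0.5 * TB), 4 * TB, int(0.1 * TB)]
print(avg_scan_fraction(week, 10 * TB))
print(pruning_is_working(week, 10 * TB))
```

One heavy query dominates the average here, which is a reason to also look at the distribution and not only the mean before redesigning the partition key.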
A pattern that’s underused: partition specification evolution, which Iceberg natively supports. You’re allowed to change the partition strategy as the access pattern shifts without rewriting historical data. New partitions go in under the new scheme; old data stays under the old. Most teams treat partition design as a one-shot decision at table creation. It’s not. It’s a tuning knob you should expect to revisit twice a year as the workload changes.
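As a sketch of what a partition-spec change looks like in practice, the helper below builds the statements for Iceberg's Spark SQL extensions. The table name is hypothetical, and the exact transform spelling should be checked against the Iceberg version you run; the statements only take effect when executed through a Spark session with the Iceberg extensions enabled.

```python
def evolve_partition_sql(table: str, old_field: str, new_field: str) -> list[str]:
    """Statements that add a new partition field and drop the old one.

    Existing data files keep the old layout; only new writes use the new spec.
    """
    return [
        f"ALTER TABLE {table} ADD PARTITION FIELD {new_field}",
        f"ALTER TABLE {table} DROP PARTITION FIELD {old_field}",
    ]


# Hypothetical table: move from a daily to a monthly grain.
statements = evolve_partition_sql(
    "lake.analytics.events", "days(event_ts)", "months(event_ts)"
)
for stmt in statements:
    print(stmt)  # in a real job: spark.sql(stmt)
```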
Lever 2: Compaction (and when to stop running it yourself)
The small-files problem isn’t subtle when you actually measure it. Compacting 1,440 small Parquet files into 48 files of ~150 MB each is a documented 9× query speedup on Athena, and most of that gain is request reduction, not better columnar pruning. AWS reported up to 3.2× query acceleration and 8.5× fewer read requests just from going from uncompacted ~1 MB files to right-sized files in their S3 Tables service.
The mechanics: object storage charges per request and engines plan queries per file. A 100 GB table in 100,000 1 MB files needs 100,000 file opens, 100,000 metadata reads, and 100,000 footer parses just to plan. The same 100 GB in 1,000 100 MB files needs 1,000 of each. The data is identical; the cost is not. And the small files keep arriving — every streaming write, every incremental ingest, every Kafka micro-batch produces them by default. Compaction isn’t a one-off cleanup, it’s a permanent operational responsibility.
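The planning overhead above is simple division. A small sketch, using decimal units (1 GB = 1,000 MB) and the three per-file operations named in the paragraph:

```python
PER_FILE_PLANNING_OPS = 3  # file open, metadata read, footer parse


def file_count(table_gb: float, file_mb: float) -> int:
    """Number of files needed to hold the table at a given file size."""
    return round(table_gb * 1000 / file_mb)


def planning_ops(table_gb: float, file_mb: float) -> int:
    """Total per-file operations the engine performs just to plan a full read."""
    return file_count(table_gb, file_mb) * PER_FILE_PLANNING_OPS


print(planning_ops(100, 1))    # 100 GB in 1 MB files
print(planning_ops(100, 100))  # 100 GB in 100 MB files
```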
The targets that work in practice:
- 128–512 MB per file for most engines. Spark and Trino are happy at 256–512 MB; BigQuery prefers 100 MB–1 GB; Snowflake reads from external Iceberg tables comfortably at 100–250 MB. The exact number is less important than landing somewhere in the 100 MB–1 GB band consistently.
- Trigger compaction by file-count thresholds, not time. A daily compaction job on a table that takes 100 small files per hour is leaving 23 hours of fragmented query performance on the table. Trigger when the small-file ratio crosses a threshold (we use “more than 30% of files under 32 MB”) and the work stays bounded.
- Don’t compact across access patterns. A common mistake is one giant compaction job per table. Better: compact recent partitions hot (they’re queried often, files arrived as a trickle from streaming), and leave older partitions alone unless they’re being queried. Most lake workloads are heavily time-skewed and a uniform compaction policy wastes compute on cold data.
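The threshold trigger from the second bullet can be sketched as follows. The 30% and 32 MB values are the ones quoted above; the 256 MB target and the function names are our assumptions, and the file sizes would come from your table metadata.

```python
import math

MB = 1024 * 1024


def small_file_ratio(sizes: list[int], small_bytes: int = 32 * MB) -> float:
    """Share of files below the small-file cutoff."""
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s < small_bytes) / len(sizes)


def should_compact(sizes: list[int], max_small_ratio: float = 0.30) -> bool:
    """Trigger when more than 30% of files are under 32 MB."""
    return small_file_ratio(sizes) > max_small_ratio


def target_file_count(sizes: list[int], target_bytes: int = 256 * MB) -> int:
    """Roughly how many files the partition should hold after compaction."""
    return math.ceil(sum(sizes) / target_bytes)


# Hypothetical partition: 40 streaming micro-batch files plus 60 compacted ones.
partition_files = [4 * MB] * 40 + [256 * MB] * 60
print(should_compact(partition_files), target_file_count(partition_files))
```

Running this check per partition, not per table, is what keeps the work bounded to the recent, frequently queried data.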
Iceberg adds three maintenance operations beyond compaction that quietly burn money if unowned:
- Snapshot expiration. Iceberg keeps every commit as a snapshot for time-travel and rollback. A high-write table accumulates hundreds of snapshots a day, each holding references that prevent the underlying data files from being garbage-collected. We’ve seen single tables holding 120 TB of snapshot-referenced data with zero analytics value: pure storage waste because nobody ran `expire_snapshots`. Retain 7–30 days for audit and rollback, then expire aggressively.
- Orphan file removal. Failed writes and concurrent jobs leak files that are no longer referenced by any snapshot. These don’t get deleted automatically. A `remove_orphan_files` operation reclaims them; on a busy lake, this is a real number monthly.
- Manifest rewriting. Hundreds of small manifests from individual commits inflate query planning time. Rewriting them into a few large manifests can collapse 480 unnecessary I/O round trips during planning on a single query. This is a planner-cost optimization more than a storage one, but the dollars are real on high-QPS tables.
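For self-managed tables, the three operations map onto Iceberg's Spark stored procedures. The sketch below only builds the `CALL` statements; the catalog and table names are hypothetical, the 7-day retention is the low end of the range above, and the statements need a Spark session with the Iceberg extensions to run.

```python
from datetime import datetime, timedelta, timezone


def maintenance_sql(catalog: str, table: str, now: datetime,
                    retain_days: int = 7) -> list[str]:
    """CALL statements for snapshot expiry, orphan cleanup, and manifest rewrite."""
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


# Hypothetical catalog and table names.
calls = maintenance_sql("lake", "analytics.events",
                        now=datetime(2026, 1, 31, tzinfo=timezone.utc))
for stmt in calls:
    print(stmt)  # in a real job: spark.sql(stmt)
```

Whatever scheduler runs these should alert on failure, since a maintenance job that fails quietly is indistinguishable from one that was never written.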
Now the part nobody talks about: for many teams in 2026, running compaction yourself is no longer the right call. Two managed offerings have changed the math.
- Amazon S3 Tables (GA at re:Invent 2024) ship fully managed Iceberg tables with built-in compaction, snapshot expiration, and orphan cleanup. The performance numbers AWS publishes (3.2× query acceleration, 8.5× fewer requests) match what we measure in customer environments. There’s a real per-table management fee, and at the largest scales the math sometimes goes the other way — Onehouse documented a “20× surprise” where unmanaged costs on S3 Tables exceeded a self-managed equivalent for a specific workload shape. But for most teams below petabyte-scale who would otherwise hire half an FTE to babysit table maintenance, S3 Tables is now cheaper than DIY.
- Databricks Predictive Optimization (GA mid-2024, default for new accounts since) uses an ML model to decide what to compact, when, and with which clustering keys. The numbers Databricks publishes (20× faster selective queries, 2× storage cost reduction) match audit results we’ve seen.
The decision: if your team can name one engineer whose job is partly “keep the lake’s table maintenance jobs healthy,” self-managed compaction is a fine choice and gives you tight control. If that engineer doesn’t exist, or the role keeps slipping to whoever has bandwidth, pay for the managed service. The cost of bad compaction (silently degrading query latency, ballooning request charges, occasional query failures from too many planner-tracked files) is worse than the service fee. The relevant variant of the medallion-architecture pitfalls we wrote about earlier is the silver and gold layers degrading because compaction is “scheduled but failing quietly”; a managed service fixes that failure mode by removing the option to skip it.
Lever 3: Lifecycle tiering (the lever everyone reaches for, often wrong)
Lifecycle is the most visible lever and the most commonly misused. The setup is intuitive: move objects to Infrequent Access or Glacier (S3), Cool or Archive (ADLS), Coldline or Archive (GCS) after N days. Done well, it cuts 50–80% off the storage portion of the bill on cold data. Done poorly, it costs more than it saves because retrieval charges and minimum-storage-duration penalties eat the savings.
Three patterns that hold:
- Tier by access, not by age. “Move to IA after 30 days” is a heuristic that’s right for some tables and wrong for others. A clickstream table from 2023 might still be hot because the data-science team trains models on it. A bronze landing table from last week might be cold the moment it’s promoted to silver. Use access-pattern monitoring — S3 Storage Lens, Azure Storage Insights, BigQuery audit logs — to decide which prefixes are actually cold. The 80% case is “tier by partition,” because partitions correlate with access pattern in time-series lakes.
- Don’t tier files smaller than the minimum storage charge. S3 IA charges a minimum of 128 KB per object regardless of actual size; Glacier Instant Retrieval has its own minimums. Tiering 10 KB files into IA increases the effective per-byte cost. This is exactly why compaction (Lever 2) has to come before lifecycle (Lever 3) — the order matters. Small files in cold tiers are a doubled mistake.
- Model the retrieval cost on real recall. If 5% of “cold” data gets retrieved in a typical month (a single ML retraining run can do this with one query), the retrieval and request charges for Glacier Flexible Retrieval can exceed the storage savings. Glacier Instant Retrieval costs more to store and still bills a per-GB retrieval fee, but it serves reads in milliseconds with no restore job; it’s the right tier for “cold but might be queried” data. Glacier Deep Archive is right for “we will retrieve this only under regulatory subpoena.”
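The last two bullets reduce to a small model. Every price below is a placeholder we chose for illustration, not a quote from any provider's price sheet, and the model ignores per-request charges and minimum-storage-duration penalties; substitute your own figures before drawing conclusions.

```python
def per_byte_price_multiplier(object_kb: float, min_billable_kb: float = 128) -> float:
    """How much more a small object costs per byte under a minimum billable size."""
    return max(object_kb, min_billable_kb) / object_kb


def net_monthly_saving_usd(cold_gb: float, hot_usd_gb: float, cold_usd_gb: float,
                           retrieval_usd_gb: float, recall_fraction: float,
                           reads_per_month: int = 1) -> float:
    """Storage saved by tiering, minus what it costs to read recalled data back."""
    storage_saving = cold_gb * (hot_usd_gb - cold_usd_gb)
    retrieval_cost = cold_gb * recall_fraction * reads_per_month * retrieval_usd_gb
    return storage_saving - retrieval_cost


# Placeholder prices (assumptions): hot $0.023, cold $0.004, retrieval $0.03 per GB.
prices = dict(hot_usd_gb=0.023, cold_usd_gb=0.004, retrieval_usd_gb=0.03)

print(per_byte_price_multiplier(10))  # a 10 KB object billed as 128 KB
print(net_monthly_saving_usd(100_000, recall_fraction=0.05, **prices))
print(net_monthly_saving_usd(100_000, recall_fraction=0.05, reads_per_month=20, **prices))
```

Under these placeholder prices, the same 5% of data read once leaves a net saving, while reading it 20 times in a month (for example, across repeated training runs) turns the tiering decision into a net loss.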
The under-used lifecycle action: delete. Most lakes are retaining data nobody will ever query because no retention policy was ever written down. Deleting 30% of a lake that nobody touches saves 100% of the cost of that 30%, which beats any tier transition. We did this audit on a 4 PB lake last quarter and removed 1.3 PB of unaccessed-in-18-months data after stakeholder sign-off. The lifecycle policy that did the work was one sentence: if not accessed in 18 months and not on the legal-hold list, delete. No tiering, no compaction, just delete. Some of the highest-leverage moves in cost optimization are this banal.
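That one-sentence policy translates almost directly into code. A sketch, assuming you already have last-access dates per prefix from your access logs; the paths, sizes, and class name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Prefix:
    path: str
    last_accessed: date
    size_tb: float


def months_idle(last_accessed: date, today: date) -> int:
    """Whole calendar months since the prefix was last read."""
    months = (today.year - last_accessed.year) * 12 + (today.month - last_accessed.month)
    if today.day < last_accessed.day:
        months -= 1
    return months


def deletion_candidates(prefixes: list[Prefix], legal_hold: set[str],
                        today: date, min_idle_months: int = 18) -> list[Prefix]:
    """Prefixes not accessed in 18 months and not on the legal-hold list."""
    return [
        p for p in prefixes
        if p.path not in legal_hold
        and months_idle(p.last_accessed, today) >= min_idle_months
    ]


prefixes = [
    Prefix("s3://lake/bronze/clicks/2023/", date(2024, 1, 10), 400.0),
    Prefix("s3://lake/silver/orders/", date(2026, 5, 1), 120.0),
    Prefix("s3://lake/bronze/contracts/", date(2023, 3, 2), 15.0),
]
candidates = deletion_candidates(
    prefixes, legal_hold={"s3://lake/bronze/contracts/"}, today=date(2026, 6, 15)
)
print([p.path for p in candidates], sum(p.size_tb for p in candidates))
```

The output is a candidate list for stakeholder sign-off, not a delete command; the sign-off step is part of the policy.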
The 50% cut math, and where it actually comes from
When we audit a lake and project the savings, the breakdown is almost always shaped like this:
| Lever | Typical contribution to bill cut | Where it shows on the invoice |
|---|---|---|
| Partition pruning (Lever 1) | 15–25% | Query/scan charges (Athena, BigQuery, Spectrum) |
| Compaction + Iceberg maintenance (Lever 2) | 15–25% | Request charges + compute on the query engine |
| Lifecycle tiering + deletion (Lever 3) | 10–20% | Storage-class line items |
Add them up and the ranges sum to 40–70% on paper. Because the levers overlap (a byte that lifecycle deletes can’t also be pruned or compacted), honest accounting lands in the 40–60% range, which matches what teams who’ve done all three publicly report. The mistake teams make is treating the levers as independent. They’re not. Compaction enables lifecycle to actually save money on cold data. Partitioning controls how much data compaction has to work on. Lifecycle deletion shrinks what partitioning has to organize. The three reinforce each other when sequenced correctly and undercut each other when you skip steps.
The order to run them in matters too. Don’t start with lifecycle, even though it feels easiest. Start with partitioning (the design question), then compaction (the operational discipline), then lifecycle (the optimization on what’s left). That’s the order in which earlier work amplifies later work.
How we actually run this
Three things, in order, on a new audit:
- Pull a month of query telemetry and bill detail. Bytes-scanned-per-query, file count per table, request volumes by prefix, retrieval volumes by storage class. The bill alone doesn’t tell the story; the bill plus query logs does. Skip this step and you’ll fix the wrong thing.
- Triage by Pareto. Identify the five tables driving 80% of cost. They’re almost never the five tables anyone expected. Lake cost follows a power law; the audit must follow it too.
- Sequence the three levers per table, not per lake. Each table has its own access pattern, its own right partition strategy, its own compaction cadence. A blanket policy across the lake under-optimizes every table. The work is per-table; the discipline is shared.
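The triage step can be sketched as a cumulative-share cut over per-table cost. The table names and dollar figures are invented; in a real audit the input comes from joining the bill detail with query logs.

```python
def pareto_tables(cost_by_table: dict[str, float], share: float = 0.80) -> list[str]:
    """Smallest set of tables, taken in descending cost order, covering `share` of spend."""
    total = sum(cost_by_table.values())
    picked: list[str] = []
    running = 0.0
    ranked = sorted(cost_by_table.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ranked:
        if running >= share * total:
            break
        picked.append(name)
        running += cost
    return picked


# Hypothetical monthly cost per table, in USD.
monthly_cost = {"events": 520, "orders": 310, "sessions": 90, "users": 50, "audit": 30}
print(pareto_tables(monthly_cost))
```

The levers are then sequenced table by table over this short list, starting with the most expensive one.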
This is the loop we run on a data lake consulting engagement, and it usually takes 4–6 weeks end-to-end on a mid-size estate. Most of the savings show up in the first month after the partitioning and compaction changes; lifecycle deletion lags by a quarter because of stakeholder sign-off. The bigger payoff isn’t the bill cut — it’s that a lake that’s been tuned along these three axes also runs faster, fails less, and stops being the line item finance asks about every quarter. Cost optimization done well stops looking like cost optimization and starts looking like the lake just works.
If a lake’s drifting toward the swamp on a different axis — zone design, governance, lifecycle ownership — that’s the companion read on the three failure modes and how they compound. Cost is only one of the ways a lake degrades. It’s the most measurable one, which is why it’s where most rescues start.
Data Engineer
Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.