Data Lake or Data Swamp: 3 Failure Modes
Most data lakes drift into swamps within 18 months. A practitioner's breakdown of three failure modes — zones, governance, lifecycle — and the fixes.
“Data swamp” has been a punchline for a decade. The frustrating thing about clichés is that they’re usually clichés because they keep being true — and the data-lake-to-swamp drift is still the single most common reason a data lake build we’re asked to rescue stopped being useful. The shape of the failure has barely changed since 2018. The tooling around it has gotten dramatically better. The failure rate has not.
That gap is the interesting part. If catalogs, table formats, and lifecycle policies have all improved this much, why does the swamp still happen? Because the failure isn’t tooling. It’s the set of small, reasonable-looking decisions teams make in months 1–3 of standing the lake up — decisions that compound into a degraded estate by month 18. Nobody plans a swamp; they ship a lake that looks fine and then watch trust in it erode, one untracked dataset and one expired retention rule at a time.
This post is the cut we use to assess a lake that’s halfway there: the three failure modes that predict the swamp, why teams walk into each one, and the patterns that pull a lake back. The framing (zones, governance, lifecycle) comes from where the rot actually starts in practice. Most posts treat them as a checklist. They’re not a checklist. They’re three control planes, and a gap in any one of them is enough to start the drift.
What a data swamp actually is (and isn’t)
The shorthand is wrong: a data swamp is not “a data lake with a lot of data.” Size doesn’t cause it. Plenty of petabyte-scale lakes stay healthy when the three control planes hold. Sub-terabyte lakes go swampy within 12 months when they don’t.
A swamp is a data lake that’s failed at three specific things at once:
- Findability — practitioners can’t reliably locate the right dataset for a question without asking someone who knows. The lake stops being a self-serve foundation and becomes a tribal-knowledge graph.
- Trust — even when the right dataset is found, downstream consumers don’t believe the values without re-validating against source. The lake adds work instead of removing it.
- Cost predictability — storage and query spend creep without a clear story about why. Cleanups get scheduled and then quietly skipped because nobody owns the impact.
The instructive bit: each of those three failures maps cleanly to one of the three control planes. Findability dies when zone design is loose. Trust dies when governance is aspirational. Cost predictability dies when lifecycle is unowned. The diagnosis is precise; the fix is precise; the discipline to apply the fix is the part that’s hard.
Failure mode 1: Zone design that’s filenames, not contracts
Every data-lake reference architecture you’ve ever seen has zones. Raw, staging, curated, sandbox, archive — or bronze, silver, gold, or whatever your vendor’s slide called them. Teams copy the diagram, create the folders, and assume zone design is done. Six months later the raw zone has a dozen partial transformations sitting in it because somebody needed “just this one cleaned slice” and the path was easier to write than the explanation. The curated zone has tables nobody’s curated since the engineer who built them left. The sandbox has become production by accident, because three dashboards now point at it.
The mistake under all of this is treating zones as locations rather than contracts. A folder name doesn’t enforce anything. What needs to be true at each zone boundary is what makes the zone worth having.
Concretely, a zone design that holds has four things written down and enforced, not implied:
- The promise the zone makes to readers. Raw is “faithful capture of the source, no transformations, immutable.” Curated is “conformed, validated, modeled to the business — safe to build on without re-checking.” Sandbox is “experimentation, no downstream consumers allowed.” If the promise isn’t written, every engineer invents their own version of it and the zone drifts.
- The promotion criteria. What checks must a dataset pass before it can move from raw to curated? Schema validation, row-count sanity, key uniqueness, referential integrity, freshness SLA — the specific set matters less than the existence of one. Without promotion criteria, “curated” is a folder name, not a quality claim.
- Write permissions per zone. Raw is written only by ingest jobs. Curated is written only by approved transformation pipelines. Sandbox is wide open and explicitly not for downstream consumption. If analysts can write to curated because it was easier than getting a pipeline approved, the contract is dead within a quarter.
- The naming and partitioning convention. Dataset names that encode source, entity, and grain (raw_netsuite_orders_daily) tell a reader what’s inside without opening it. Partition by ingest date in raw and by business date in curated. Boring conventions, enforced, are the difference between a lake you can navigate and one you can’t.
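One way to keep those four items from living only in a document is to express them as data that a pipeline can check. The sketch below is a minimal illustration, not a prescribed implementation: the name pattern, the principal names (ingest_job, approved_pipeline, analyst), and the three promotion checks are all assumptions chosen for the example, and the pattern only accepts single-word source and entity names.

```python
import re
from dataclasses import dataclass

# Hypothetical naming convention: <zone>_<source>_<entity>_<grain>
NAME_PATTERN = re.compile(
    r"^(?P<zone>raw|staging|curated|sandbox|archive)_"
    r"(?P<source>[a-z0-9]+)_(?P<entity>[a-z0-9]+)_"
    r"(?P<grain>hourly|daily|weekly|monthly)$"
)


@dataclass(frozen=True)
class ZoneContract:
    promise: str        # what readers may assume about data in this zone
    writers: frozenset  # principal types allowed to write here


# Promises paraphrase the bullets above; the principal names are made up.
CONTRACTS = {
    "raw": ZoneContract("faithful capture of the source, immutable",
                        frozenset({"ingest_job"})),
    "curated": ZoneContract("conformed, validated, safe to build on",
                            frozenset({"approved_pipeline"})),
    "sandbox": ZoneContract("experimentation, no downstream consumers",
                            frozenset({"ingest_job", "approved_pipeline", "analyst"})),
}


def parse_dataset_name(name):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


def can_write(zone, principal):
    """True only if the zone's contract lists this principal type as a writer."""
    return principal in CONTRACTS[zone].writers


def promotion_failures(rows, required_columns, key, min_rows=1):
    """Return the names of failed raw-to-curated checks (empty list = promotable)."""
    failures = []
    if len(rows) < min_rows:
        failures.append("row_count")
    if any(set(required_columns) - set(row) for row in rows):
        failures.append("schema")
    keys = [row.get(key) for row in rows]
    if None in keys or len(set(keys)) != len(keys):
        failures.append("key_uniqueness")
    return failures
```

In practice the write rule would be enforced by storage or catalog permissions rather than application code; the point of the sketch is that each bullet becomes something testable instead of something remembered.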
A subtle but high-leverage pattern in 2026: pair the zone boundary with a table format. Raw can stay as bare Parquet or JSON files — that’s appropriate, the contract is “what the source sent us.” But the curated zone should be Iceberg or Delta, not Parquet directories. The transactional format isn’t decorative; it’s the enforcement mechanism for the curated contract. Schema enforcement, ACID writes, time travel for audit, and compaction-as-a-feature all live in the table format. A curated zone on bare Parquet is a curated zone in name only — we covered the format-level differences in our Iceberg vs Delta vs Hudi breakdown.
The signal you’ve got this failure mode: ask three different engineers what’s in the curated zone and get three different answers. The fix is rarely a rebuild; it’s writing the four bullets above down, naming the existing datasets against them, and either promoting them up or pushing them back down. Painful for two weeks, worth it for three years.
Failure mode 2: Governance that’s a wiki page, not enforcement
Governance is where most lake rescues we run start. Not because governance is the most common failure (it’s the second most common), but because it’s the one that’s most often technically present and operationally absent. There’s a catalog. There’s a wiki. There’s a steward role on someone’s job description. The lake is still untrusted. Something between the artifact and the operating model didn’t get built.
The pattern of failure has a few recurring shapes:
The catalog that’s nobody’s job to update. Most lakes have a catalog by month 6 — Unity Catalog, Glue, Polaris, Atlan, OpenMetadata, something. Six months later half the entries are stale, a quarter are missing, and the rest don’t have owners. The catalog becomes the appearance of metadata rather than the substance of it. Nobody planned this. It happens because cataloging was a project, not a job — populated during onboarding and not maintained after.
The fix isn’t more catalog tooling. It’s CI gates on data products: a transformation pipeline can’t be deployed without a catalog entry, an owner, a description, and tagged classification. The catalog populates as a side effect of shipping, the same way code documentation does in healthy engineering teams. If your catalog has to be updated as a separate step, it won’t be.
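A CI gate of this kind can be very small. The sketch below assumes each pipeline ships a metadata manifest as a dictionary; the field names (catalog_entry, owner, description, classification) are hypothetical stand-ins for whatever your catalog actually requires, and nothing here calls a real catalog API.

```python
# Hypothetical required fields for a deployable data product.
REQUIRED_FIELDS = ("catalog_entry", "owner", "description", "classification")


def catalog_gate(manifest):
    """Return the required fields that are missing or blank.

    An empty list means the pipeline may deploy; anything else should
    fail the CI job.
    """
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(manifest.get(field) or "").strip()
    ]
```

A CI step would call catalog_gate on the manifest and exit non-zero when the returned list is not empty, so an unowned or undescribed dataset never reaches production in the first place.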
Ownership that’s a role, not a rotation. “Data steward” assigned to a person who’s already 110% allocated. Tickets queue up, nothing gets resolved, the steward stops being CC’d because no one expects a response. The cure is to make ownership per-dataset (or per-domain, in a domain-oriented architecture) and publicly visible — every dataset’s lineage page shows the owning team, and unresolved issues against a dataset show on that team’s queue. Ownership without accountability decays; ownership with accountability holds.
Access control that’s coarse-grained. Every analyst in the company has read access to everything in the curated zone. This is convenient until it isn’t: until a regulator asks who’s read what PII, until the data team wants to expose finance numbers selectively, until a layoff turns into a credential audit. Fine-grained access control through the catalog/governance layer (Unity Catalog row filters and column masks, Lake Formation tag-based access control) is now table stakes. Teams skip it because coarse-grained access works at small scale and they never go back to retrofit, until the retrofit is forced by an incident, on a deadline, against a lake too big to reshape easily. We see this most acutely in regulated estates running on data governance services, where retrofitting access control under audit pressure is one of the most expensive ways to discover you should’ve done it earlier.
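To make the idea of tag-based, column-level control concrete, here is a toy model in plain Python. It is not the Unity Catalog or Lake Formation API; the tag names and the mask_row function are invented for illustration. The rule it assumes is that a reader sees a column only when they hold every tag attached to it.

```python
MASK = "***"


def mask_row(row, column_tags, reader_clearances):
    """Return a copy of row with columns masked unless the reader is cleared.

    column_tags maps a column name to the set of tags on it (untagged
    columns are visible to everyone); reader_clearances is the set of
    tags the reader has been granted.
    """
    masked = {}
    for column, value in row.items():
        tags = column_tags.get(column, set())
        # Visible only if every tag on the column is in the reader's clearances.
        masked[column] = value if tags <= reader_clearances else MASK
    return masked
```

For example, with column_tags = {"email": {"pii"}, "amount": {"finance"}}, a reader holding only the finance tag sees amount but gets a masked email. A real governance layer applies the same kind of rule at query time and logs the access, which is what lets you answer the regulator’s question later.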
The shared shape: governance fails when it’s treated as a layer you add to the lake instead of a property of the lake. The catalog, the ownership map, and the access controls have to be part of the lake’s operating definition — what it means to be in the lake at all — not a parallel system maintained by hope.
Failure mode 3: Lifecycle that nobody owns
The third failure mode is the quietest and the most expensive. It looks like this: the lake’s storage cost was reasonable in year one, doubled in year two, and the CFO is now asking what changed. Nobody can say precisely. The technical answer — “we kept everything forever, the small-file count exploded, and a third of the storage is cold partitions of datasets nobody queries” — is correct but doesn’t survive contact with the budget meeting.
Lifecycle has three sub-failures, and a healthy lake handles all three:
Retention without a policy. Every dataset has a retention requirement — even “infinite” is a requirement, and worth saying explicitly so it can be defended. Most teams skip this entirely. Datasets land, partitions accumulate, and the implicit retention is “until someone notices the bill.” A real policy looks like a table: dataset → owner → retention class → archive destination → deletion trigger. Without the table, the answer to “can we drop the 2022 partitions?” is “probably, but let’s check with everybody” — which means the partitions stay.
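The policy table described above maps naturally onto a small record type. This is a sketch under stated assumptions: the field names mirror the columns in the text, retention is expressed in days with None standing for an explicit “infinite”, and partitions are identified by date. None of it is tied to a particular catalog or storage API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    dataset: str
    owner: str
    retention_class: str
    retention_days: Optional[int]      # None means explicitly "keep forever"
    archive_destination: Optional[str]
    deletion_trigger: str


def droppable_partitions(policy, partition_dates, today):
    """Return partition dates older than the retention window, oldest first."""
    if policy.retention_days is None:
        return []  # infinite retention: nothing is ever droppable
    cutoff = today - timedelta(days=policy.retention_days)
    return sorted(d for d in partition_dates if d < cutoff)
```

With a row like this in place, “can we drop the 2022 partitions?” becomes a function call against a policy someone owns rather than a round of messages to everybody.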
Tiering that doesn’t happen. Object storage offers tiered pricing that can cut roughly 40–70% off cold-data costs, but only if you move the data. Most lakes leave everything in the hot tier because moving it requires either dataset-level decisions (which means an owner and a policy) or automated lifecycle rules (which need a bucket layout that supports them, and most existing layouts weren’t designed with that in mind). A lake that lifts the tiering question into zone design from day one (raw archived to cold after 90 days, curated tiered by last-access date, sandbox capped at 30 days) recovers the cost of that design discipline in storage savings within a year.
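The per-zone rules in that last sentence can be written down as a single decision function. The 90-day and 30-day thresholds come from the text; the 180-day idle threshold for curated data is an assumption added for the example, since the text only says curated is tiered by last access. The action names are placeholders, not storage-provider terms.

```python
def tier_action(zone, age_days, days_since_last_access, curated_idle_days=180):
    """Decide what to do with a partition based on its zone's tiering rule."""
    if zone == "raw":
        # Raw is archived to cold storage after 90 days.
        return "move_to_cold" if age_days >= 90 else "keep_hot"
    if zone == "curated":
        # Curated is tiered by last access; the threshold is an assumed default.
        if days_since_last_access >= curated_idle_days:
            return "move_to_cold"
        return "keep_hot"
    if zone == "sandbox":
        # Sandbox data is capped at 30 days, then removed.
        return "delete" if age_days > 30 else "keep_hot"
    raise ValueError(f"no tiering rule defined for zone {zone!r}")
```

In production these rules would usually be expressed as the object store’s own lifecycle configuration; having them in one reviewable function first makes it obvious whether the bucket layout can actually support them.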
File and partition hygiene that’s optional. This one is mechanical and unsexy and matters enormously. The small-file problem — tens of thousands of 4MB files in a partition that should have a hundred 100MB files — quietly destroys query performance, inflates listing costs, and gets worse every week if nobody’s compacting. Iceberg, Delta, and Hudi all ship maintenance operations (compaction, manifest cleanup, snapshot expiration); none of them run themselves by default. They have to be scheduled, monitored, and owned, the same way a database team owns vacuum and reindex jobs. Treat compaction as “the job that pays for itself” — it does, in query cost and BI latency — and it gets done. Treat it as “we’ll get to it” and you’ll be staring at a 9x query slowdown twelve months in.
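Small-file counts are easy to measure once you have a listing of file sizes per partition. The sketch below assumes a 100MB target file size (taken from the example in the text) and two heuristics that are purely illustrative: a file counts as small when it is under half the target, and a partition is flagged when it holds more than twice the files it needs. It only plans; the actual rewrite is the table format’s compaction operation.

```python
MB = 1024 * 1024


def compaction_plan(file_sizes_bytes, target_bytes=100 * MB, small_fraction=0.5):
    """Summarize one partition's files and flag it if compaction looks worthwhile."""
    total_bytes = sum(file_sizes_bytes)
    # Files well under the target size count as small (threshold is an assumption).
    small_files = sum(
        1 for size in file_sizes_bytes if size < target_bytes * small_fraction
    )
    # Ceiling division: the fewest target-sized files that could hold the data.
    target_files = -(-total_bytes // target_bytes)
    return {
        "files": len(file_sizes_bytes),
        "small_files": small_files,
        "target_files": target_files,
        # Heuristic: flag when there are more than twice as many files as needed.
        "needs_compaction": len(file_sizes_bytes) > 2 * max(target_files, 1),
    }
```

A partition of 2,500 files at 4MB each holds 10,000MB, so the plan reports a target of 100 files and flags it. Tracking small_files per partition over time is one way to turn “small-file counts” into the measured KPI mentioned later in this section.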
The pattern under all three: lifecycle work has no champion by default. Storage costs are diffuse, performance degradation is gradual, and nobody on the team woke up wanting to spend Friday running compaction jobs. The fix is assigning lifecycle as a named responsibility — a platform engineer whose quarterly goals include retention coverage, tier ratios, and small-file counts as measured KPIs. Anything less and lifecycle becomes the work that’s always next quarter’s problem.
How to tell you’re already a swamp
Four signals, any one of which is enough to take seriously:
- Analysts ask the data team before querying anything. Self-serve is the whole point of a lake. If consumers don’t trust the dataset list, the catalog and the curated-zone contract have both failed.
- You can’t answer “what’s in this dataset?” from the catalog in under a minute. Schema, owner, last update, freshness, source, classification — if any of those require asking a person, the catalog is decorative.
- Storage growth doesn’t track business activity. Year-over-year storage is growing faster than data inflows would explain. Either retention isn’t enforced or the small-file count is exploding (often both).
- Three engineers have three answers for what’s in the curated zone. The zone contract has decayed and the lake has split into per-team mental models — exactly the silo problem the lake was supposed to dissolve.
If two or more of those are true, the lake is already swampy. If all four, the question isn’t “how do we prevent a swamp” but “how do we drain one” — and that’s a different, longer engagement.
Fixing a lake in place
A working assumption that saves a lot of pain: you almost never rebuild the lake. Rebuilds are expensive, take a year, and recreate the same failures unless the operating model changes. Almost every successful remediation we run is in-place, in three phases:
- Inventory and triage. Crawl the lake. List every dataset, its size, last write, last read (if your storage supports access logs), and assign each one to a zone and an owner — or to “orphan” if nobody claims it. Orphans get a 60-day deletion clock with email warnings. The inventory itself often surfaces 20–40% of the storage as dead, and dropping it pays for the engagement.
- Restore the contracts. Write down the zone promises, promotion criteria, and write permissions. Where the existing layout is wrong (a curated dataset that’s really a sandbox dataset), reclassify and either harden or move it. Lock the curated zone to pipeline writes only. This is where the lake starts to be navigable again.
- Re-staff the operating model. Name the catalog owner, the steward rotation, and the platform engineer who owns lifecycle. Put the KPIs on someone’s quarterly goals. Without this, phases 1 and 2 decay within a year and you’re rescuing the same lake again.
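The inventory-and-triage phase is the most mechanical of the three, so it is the easiest to sketch. The example below assumes the crawl has already produced one record per dataset; the DatasetRecord fields and the report shape are hypothetical, while the 60-day orphan clock comes from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ORPHAN_GRACE_DAYS = 60  # deletion clock for datasets nobody claims


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    size_bytes: int
    last_write: date
    last_read: Optional[date]  # None if the storage has no access logs
    owner: Optional[str]       # None if nobody has claimed the dataset


def triage(records, today):
    """Label each dataset owned or orphan and set a deletion date for orphans."""
    report = []
    for record in records:
        is_orphan = not record.owner
        report.append({
            "name": record.name,
            "status": "orphan" if is_orphan else "owned",
            "delete_on": today + timedelta(days=ORPHAN_GRACE_DAYS) if is_orphan else None,
        })
    return report


def orphan_share(records):
    """Fraction of total bytes held by unowned datasets (0.0 for an empty lake)."""
    total = sum(r.size_bytes for r in records)
    orphaned = sum(r.size_bytes for r in records if not r.owner)
    return orphaned / total if total else 0.0
```

The orphan_share figure is the number to bring to the budget conversation: it is the portion of storage that nobody is willing to defend, measured in bytes rather than asserted.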
The first two phases are technical and finite. The third phase is the one most rescues skip and the one that determines whether the rescue holds. A lake’s operating model is the artifact; the lake itself is the byproduct.
The honest summary
Data lakes don’t become swamps because the technology was wrong. They become swamps because the operating model was missing — zones treated as folders, governance treated as documentation, lifecycle treated as optional. The fixes are unglamorous, finite, and well-understood. The work is in committing to operate them.
If you’re standing a lake up and want to keep the swamp out, write the three control planes down before the first pipeline lands and treat them as the lake’s definition. If you’re staring at one that’s already drifted, the two-to-three-week inventory-and-contract reset we run before any rebuild is almost always cheaper than the rebuild you were about to scope. The lake doesn’t have to be rebuilt — it has to be operated.
Data Engineer
Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.