Iceberg vs Delta vs Hudi in 2026

After years of open table format wars, the 2026 picture is clear: Iceberg has won, but the catalog choice is now where vendor lock-in lives.

Mukesh V

Data Engineer

The “Iceberg vs Delta vs Hudi” question is the one we get most often when scoping a new data lake engagement — usually from a platform team that read three vendor blog posts in a row and ended up more confused than when they started.

Most of those posts were written between 2022 and 2024, when the answer genuinely depended on workload shape. That window is closing. As of 2026, the format choice itself is the least interesting part of the decision. The interesting part — the part that actually decides whether you can move workloads between engines five years from now — is the catalog you put in front of it.

This post is the cut we use on real engagements. It covers what changed in 2025-2026, where each format still legitimately wins, and the catalog question that most comparison posts skip entirely.

What 2026 actually looks like

Three concrete things shifted between mid-2024 and now, and they reset the entire conversation:

1. Iceberg is the format every major platform reads and writes. Snowflake writes Iceberg natively. Databricks added first-class Iceberg support in Runtime 16.4 and made Iceberg v3 public preview earlier this year. AWS Athena, Glue, EMR, Trino, Dremio, StarRocks, DuckDB, and Flink all speak it. BigQuery added external Iceberg tables as a managed surface. There is no major engine in 2026 that doesn’t.

2. Snowflake’s Polaris catalog donation closed the catalog gap. Polaris was donated to the Apache Foundation, now supports Delta Lake as well as Iceberg, and implements the Iceberg REST Catalog spec. The same spec is implemented by Unity Catalog (Databricks), AWS Glue, Nessie, and Tabular. The catalog API is no longer vendor-specific.

3. Cross-format reads are real, not aspirational. Delta UniForm exposes a Delta table to an Iceberg client without rewriting the data. Apache XTable (formerly OneTable) translates metadata between Iceberg, Delta, and Hudi. The “pick one and live with it” framing from 2023 is genuinely obsolete in workloads where two engines need to read the same physical files.

The headline: format choice has become a soft decision; catalog choice has become a hard one. That's the inversion of what most comparison articles still tell you.

Where each format still legitimately wins

The “Iceberg has won” framing is true at the platform-support layer but misleading at the workload layer. Each format still has a shape it does materially better than the other two.

Iceberg wins on engine portability and schema evolution

Iceberg’s specification is engine-independent in a way Delta’s never quite was and Hudi’s still isn’t. Hidden partitioning, partition evolution without rewriting existing data, and a manifest-list architecture that lets engines skip whole branches of metadata at query time are the advantages that make it the default for multi-engine analytics workloads.
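
To make that concrete, here is a minimal PySpark sketch of hidden partitioning and partition evolution. It assumes a Spark session already configured with the Iceberg SQL extensions; the catalog name (lake) and table are placeholders, not a recommended layout.

    # Hidden partitioning: partition by a transform of event_ts rather than a
    # separate date column. Readers filter on event_ts and Iceberg prunes
    # partitions for them. ("lake" and the table name are placeholders.)
    spark.sql("""
        CREATE TABLE lake.analytics.events (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  STRING)
        USING iceberg
        PARTITIONED BY (days(event_ts))
    """)

    # Partition evolution: switch to hourly partitioning without rewriting any
    # existing files. Old data keeps the daily layout; new writes go hourly.
    spark.sql("ALTER TABLE lake.analytics.events DROP PARTITION FIELD days(event_ts)")
    spark.sql("ALTER TABLE lake.analytics.events ADD PARTITION FIELD hours(event_ts)")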

If your estate has — or will have — more than one query engine reading the same tables, Iceberg is the only honest choice. We see this most often on:

  • Multi-cloud or hybrid estates where Snowflake serves BI and Databricks serves ML against the same data
  • Migration scenarios where you’re moving off a warehouse but want to keep optionality on the next platform
  • Federated data architectures where domain teams pick their own engines

Iceberg’s weakness is still write performance under high-frequency mutations. The default planning step is heavier than Delta’s, and committing thousands of small writes per minute punishes the metadata layer. The v3 spec and Puffin statistics partially address this, but the gap is still real.
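
The standard mitigation is scheduled maintenance. A sketch using Iceberg's built-in Spark procedures, against the same hypothetical lake catalog; the file-size target and retention cutoff are illustrative numbers, not recommendations.

    # Compact the small files that high-frequency commits leave behind.
    spark.sql("""
        CALL lake.system.rewrite_data_files(
            table   => 'analytics.events',
            options => map('target-file-size-bytes', '536870912'))
    """)

    # Expire old snapshots so metadata stops accumulating with every commit.
    spark.sql("""
        CALL lake.system.expire_snapshots(
            table      => 'analytics.events',
            older_than => TIMESTAMP '2026-01-01 00:00:00')
    """)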

Delta still wins inside Databricks (and only inside Databricks)

Delta Lake is the format Databricks ships, optimizes, and tunes against. If your workload is wholly inside Databricks — Spark transforms, Photon SQL, Mosaic AI on the same tables — Delta with Liquid Clustering, predictive optimization, and deletion vectors is genuinely faster than running Iceberg on the same platform. Databricks publishes the benchmarks; we see roughly 15-30% better throughput on mixed read/write workloads in our own engagements.
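
For reference, the Databricks-side ergonomics look roughly like this; a sketch assuming a recent Databricks Runtime, with placeholder catalog and table names.

    # Liquid Clustering replaces hive-style partitioning and ZORDER; the
    # clustering columns can be changed later without rewriting the table.
    # ("main.analytics.orders" is a placeholder.)
    spark.sql("""
        CREATE TABLE main.analytics.orders (
            order_id    BIGINT,
            customer_id BIGINT,
            order_ts    TIMESTAMP)
        USING delta
        CLUSTER BY (customer_id, order_ts)
    """)

    # Clustering is applied incrementally on OPTIMIZE; predictive optimization
    # can schedule this automatically for Unity Catalog managed tables.
    spark.sql("OPTIMIZE main.analytics.orders")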

The catch: that performance edge evaporates the moment a non-Databricks engine joins the picture. Reading Delta from Trino, Dremio, or Snowflake works through compatibility layers (UniForm, manifest sidecars) that are functional but second-class. If you can credibly commit to Databricks-only forever, Delta is fine. If “forever” feels strong, Iceberg is the safer call.

Hudi still wins on CDC-heavy mutable workloads

Hudi’s record-level index, copy-on-write/merge-on-read tradeoff, and the DeltaStreamer ingestion path are not matched by either Iceberg or Delta for one specific shape: high-frequency upserts and deletes against record-level keys with tight freshness budgets. Real-time personalization stores, GDPR-driven deletion pipelines, IoT device-state tables — Hudi is materially better at this, and the gap hasn’t closed.
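
The write path that delivers this, sketched with Hudi's standard Spark datasource options; the input DataFrame (cdc_batch_df), key fields, and path are placeholders.

    # Record-level upsert: Hudi matches incoming rows to existing records by
    # key and resolves duplicates with the precombine field (latest wins).
    hudi_options = {
        "hoodie.table.name": "device_state",                    # placeholder
        "hoodie.datasource.write.recordkey.field": "device_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # favors write latency
    }

    # cdc_batch_df stands in for the incoming CDC micro-batch.
    (cdc_batch_df.write
        .format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://lake/bronze/device_state"))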

Hudi 1.0 added native Iceberg-format output, which is interesting because it means you can run Hudi’s write path against tables that are read as Iceberg downstream. That’s the cleanest path for teams who want the streaming write characteristics without the multi-engine penalty.

The honest 2026 read on Hudi: it’s the right call when CDC is the primary workload and both other formats would force compromises. It’s the wrong call when CDC is one of several workloads and the rest are query-heavy.

The catalog is where lock-in actually lives now

This is the part most format-comparison posts skip, and it’s the part that matters in 2026.

A table format is a file layout specification. It tells engines how to read and write data files and metadata. It does not, by itself, tell you who owns the table, who can query it, what schema version is current, or how to discover it. That’s the catalog.

Five years ago, the catalog was usually Hive Metastore (or AWS Glue, which is Hive-compatible). It was a thin layer and nobody worried about it. In 2026 the catalog has become the substantive integration surface. The choices look like this:

  • Unity Catalog (Databricks) — three-level namespace, attribute-based access, fine-grained governance, automatic lineage. Most capable on paper. Federation to external catalogs is now mature. Read access from non-Databricks engines is supported via the open Iceberg REST endpoint, but Databricks remains the primary write surface.
  • Apache Polaris — open-source Iceberg REST catalog originally donated by Snowflake, now an Apache project. Vendor-neutral by design. Lighter governance feature set than Unity, but the interop story is the cleanest.
  • AWS Glue Data Catalog — Iceberg REST adapter ships with Glue. Tightly integrated with IAM and Lake Formation. Best fit for AWS-anchored estates that don’t want a separate catalog service to operate.
  • Nessie (Project Nessie) — Git-style branching and versioning at the catalog level. Niche but compelling for teams doing data engineering experiments where you want PR-style review on data changes.
  • Snowflake Open Catalog — managed Polaris service from Snowflake. Same API surface as Apache Polaris, with managed operations.

The lock-in question that used to be “what format am I writing?” is now “whose catalog am I writing through?” Catalogs implementing the open Iceberg REST spec are mostly portable. Catalogs that don’t are the new lock-in.
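
The practical payoff: a client written against the REST spec reaches Polaris, Glue's REST adapter, or Unity's Iceberg endpoint by swapping a URI. A sketch with pyiceberg; the endpoint and credential are placeholders.

    from pyiceberg.catalog import load_catalog

    # Any catalog implementing the Iceberg REST spec is reachable with the same
    # client; only the URI and credentials change between vendors.
    catalog = load_catalog(
        "lake",
        **{
            "type": "rest",
            "uri": "https://polaris.example.com/api/catalog",  # placeholder endpoint
            "credential": "client-id:client-secret",           # placeholder creds
        },
    )

    table = catalog.load_table("analytics.events")
    print(table.schema())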

We’ve started flagging this explicitly in data architecture engagements: if a vendor’s catalog is the path of least resistance for a given platform but the catalog itself is closed, that’s a strategic risk worth at least documenting before signing.

The decision matrix that actually matters in 2026

Here’s the cut we use with clients:

Situation | Format | Catalog
Multi-engine, multi-cloud, want maximum portability | Iceberg | Polaris or Glue REST
Databricks-committed, no realistic exit | Delta | Unity Catalog
CDC-heavy mutable workload, single-engine reads | Hudi | Glue or Nessie
Snowflake-anchored, want lake optionality | Iceberg | Snowflake Open Catalog (Polaris)
AWS-native, broad engine mix (Athena/EMR/Trino) | Iceberg | Glue Data Catalog
Mixed Databricks + Snowflake reads | Iceberg via UniForm | Unity, federated
Greenfield, no platform commitment yet | Iceberg | Polaris (defer commitment)

The pattern: Iceberg is the default for almost everything except where there’s a specific architectural reason not to. Delta when you’re inside Databricks and staying. Hudi when CDC is the dominant workload shape.

What about XTable and UniForm — do they make this moot?

Partially. The interop layer is real and works, but it’s not a free lunch.

Delta UniForm publishes Iceberg-compatible metadata alongside Delta files. Reading is functional. Writing through both formats simultaneously is not, and the metadata translation has a refresh latency that disqualifies it for real-time consumers. Good for “Databricks writes, Snowflake reads occasionally.” Not good for symmetric multi-engine writes.
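
Enabling it is a table-property change on the Delta side; a sketch assuming a Databricks Runtime recent enough to support UniForm, with a placeholder table name.

    # UniForm: Delta writes the data once and publishes Iceberg metadata
    # alongside it, asynchronously. Iceberg clients then read the table via
    # the catalog's Iceberg REST endpoint. (Table name is a placeholder.)
    spark.sql("""
        ALTER TABLE main.analytics.orders SET TBLPROPERTIES (
            'delta.enableIcebergCompatV2' = 'true',
            'delta.universalFormat.enabledFormats' = 'iceberg')
    """)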

Apache XTable translates metadata in either direction across all three formats. The translation runs as a separate process and adds operational complexity. For teams already running Spark jobs on a schedule, it slots in cleanly. For teams who want zero-touch interop, it’s a yes-but.

The honest framing: interop layers buy you out of bad format choices made earlier. They’re not a substitute for picking the right format up front.

What this means for a 2026 data lake build

The pattern we’ve converged on for new enterprise data lake builds in 2026 looks like this:

  • Iceberg as the default table format, regardless of which engine is “primary.” The portability premium pays for itself the first time leadership wants to evaluate a different platform.
  • Polaris or Glue Data Catalog as the catalog, not Unity Catalog or Snowflake’s proprietary catalog — even when Databricks or Snowflake is the main consumer. The catalog is the interop surface; keeping it open keeps optionality real.
  • Hudi as a targeted choice for specific CDC pipelines, materializing into Iceberg downstream. Use Hudi for what it’s uniquely good at; don’t force the rest of the estate to live with its quirks.
  • UniForm or XTable used sparingly, only where dual-engine reads against the same physical data are non-negotiable.

This is the pattern behind the S.C.A.L.E. data foundation we deploy on enterprise engagements: the storage layer stays open, the catalog stays open, and the engine choice becomes a workload-level decision rather than a platform-level commitment.
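
Concretely, the open wiring from any Spark-based engine looks like the sketch below. It assumes the Iceberg Spark runtime jar is on the classpath; the catalog name, endpoint, and warehouse path are placeholders.

    from pyspark.sql import SparkSession

    # The engine binds to the open catalog at session-config level; swapping
    # engines means repeating these config lines elsewhere, not migrating tables.
    spark = (
        SparkSession.builder
        .appName("open-lake")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "rest")
        .config("spark.sql.catalog.lake.uri",
                "https://polaris.example.com/api/catalog")  # placeholder endpoint
        .config("spark.sql.catalog.lake.warehouse", "s3://lake/warehouse")
        .getOrCreate()
    )

    spark.sql("SELECT * FROM lake.analytics.events LIMIT 10").show()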

Where the format conversation goes next

The next 18 months are about three things, in roughly this order:

  1. Catalog feature parity. Polaris and Glue need to close the governance gap with Unity Catalog. Lineage, RBAC, policy enforcement — these are still uneven across catalogs and they’re what enterprise buyers actually care about. Expect this gap to narrow fast.
  2. Iceberg v3 stabilization. Row lineage, deletion vectors, and VARIANT type are public preview today. They’ll go GA across engines through 2026. Once they do, the small-write performance gap with Delta largely closes.
  3. Vendor positioning around the catalog. Watch for Databricks and Snowflake to add features that work materially better with their own catalog than with Polaris or Glue. The format itself is open; the optimizations on top of it are where commercial differentiation will live.

The good news for buyers: the open-format era is genuinely here. The work is no longer choosing between formats — it’s choosing the catalog that keeps the format open.

If you’re picking a format for a new lake or trying to figure out whether your existing one is in the wrong place, the data architecture assessment we run typically takes 2-3 weeks and ends with a written recommendation that names the format, the catalog, and the migration path if one is needed. Worth doing before the platform RFP, not after.

Mukesh V

Data Engineer

Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.
