Algoscale

Watermark Bugs in Fabric Incremental Loads

A watermark incremental load in Microsoft Fabric silently duplicated 3 months of Gold-layer data. The fix: idempotent MERGE plus a row-count assertion.

Mukesh V

Data Engineer

The pipeline ran perfectly for months. Then we found three months of duplicate records sitting silently in the Gold layer. No alerts. No errors. Just wrong numbers that nobody questioned because they were close enough to reasonable.

That last part is the dangerous one. A pipeline that crashes gets fixed in an hour. A pipeline that quietly inflates a KPI by a few percent can run for a quarter before someone notices the board deck doesn’t reconcile with finance — and by then you’re not debugging code, you’re explaining to leadership why the number they’ve been steering on was wrong. This is the postmortem on an embarrassingly small bug in a Microsoft Fabric pipeline, and the two-line change in thinking that makes the entire class of bug impossible.

The bug: a watermark and an inclusive boundary

The load was a textbook watermark-based incremental, the pattern every data engineering team reaches for first. Store the high-water mark from the last run, pull everything newer next time:

SELECT *
FROM source
WHERE modified_date > @last_watermark

Looks correct. Passed review. Worked perfectly in dev, where the test data had nicely spaced timestamps. Then it went to production, where real systems write dozens of records that share the exact same modified_date down to the second.

Here’s the failure, concretely. Run N pulls everything up to its high-water mark and stores @last_watermark = '2026-02-01 03:00:00'. But several records were modified at exactly 03:00:00. Depending on how the watermark is captured and compared — inclusive boundaries, a stored precision that’s coarser than the source’s, or a watermark set to the job’s start time while rows land mid-run — those exact-boundary records get selected by run N and again by run N+1. The boundary row appears twice.

And because the load was a plain SELECT appended into the Gold Delta table, two selections meant two rows. Append writes are not idempotent. Run the same boundary record twice, get two copies. Do that on every run for three months and you’ve quietly doubled a sliver of every day’s data.

Why “just fix the operator” is the wrong fix

The tempting fix is to fight the boundary: switch > to >= (no — that re-pulls the boundary every single run, making it worse), or add millisecond precision, or store the watermark as max(modified_date) - 1 second with overlap, or track exact IDs already seen. People build elaborate watermark machinery to get exactly-once selection.

It’s a losing game. Watermark logic looks simple and is deceptively fragile, because it has to survive every one of these in production:

  • Equal timestamps at the boundary (the bug above).
  • Clock/precision mismatches — source writes microseconds, your watermark column stores seconds.
  • Late-arriving data — a record with an old modified_date committed after the watermark already moved past it.
  • Timezone and DST drift between the source clock and the pipeline clock.
  • Restarts and reruns — a backfill that re-reads a window you already loaded.

You cannot make a SELECT-and-append robust against all of those. Every fix adds a special case, and the special cases interact. The operator genuinely matters — but chasing the perfect operator is treating the symptom.

The real fix is to stop caring whether a row gets selected once or twice, by making the write idempotent.

Fix part 1: MERGE makes the write idempotent

We replaced the append with a MERGE INTO against the Gold Delta table, keyed on the business ID:

MERGE INTO gold.sales AS g
USING staged_batch AS s
ON g.sales_id = s.sales_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Now whether a boundary record arrives once or twice — or ten times across reruns — the result is identical: exactly one row per business key, holding the latest version. The pipeline became idempotent. Re-running a window is safe. A double-selected boundary row updates itself to the same value instead of spawning a duplicate. The watermark can be sloppy and the data stays correct, which is exactly the property you want, because the watermark will be sloppy eventually.

One real caveat worth stating, because it bit us in testing: MERGE errors if the source batch contains more than one row per key. If your extract itself has duplicates within a single run (very common with CDC feeds), you have to dedup the staging set first — typically a ROW_NUMBER() OVER (PARTITION BY sales_id ORDER BY modified_date DESC) and keep rank 1 — before the merge. MERGE guarantees exactly-once into the target; it assumes you’ve already collapsed the batch to one row per key.

Fix part 2: a row-count assertion that fails loud

Idempotent writes stop new duplicates. They don’t tell you the day the assumption breaks anyway. So we added a post-load assertion in the Fabric notebook — a hard gate that compares what the window should have produced against what landed:

src = spark.sql("SELECT COUNT(*) FROM staged_batch").collect()[0][0]
gold = spark.sql("""
    SELECT COUNT(*) FROM gold.sales
    WHERE modified_date >= '{window_start}'
      AND modified_date <  '{window_end}'
""").collect()[0][0]

assert src == gold, f"Row-count mismatch: source={src}, gold={gold}"

If the counts don’t reconcile for the window, the pipeline fails the run instead of writing silently. The whole reason the original bug survived three months is that nothing ever said no. An assertion turns a silent contamination into a red pipeline and a notification at 6am — which is annoying exactly once and then saves you a quarter.

A note on rigor: a raw COUNT(*) equality is the coarse version and it works when staging is already deduped per the merge requirement above. For higher-stakes tables we go further — reconcile a SUM() of the money column, or compare distinct-key counts — because two offsetting errors can leave the row count right while the values are wrong. Pick the invariant that actually protects the KPI, not just the one that’s easy to write.

The general lesson

Two ideas, both larger than this one bug:

Prefer idempotency over clever incremental logic. The fragile part of almost every incremental pipeline is the attempt to compute exactly the new rows. The moment your write is idempotent — MERGE on a business key, or a deterministic overwrite of a partition — that whole problem evaporates. You can afford to re-read an overlapping window on purpose, which also makes late-arriving data and reruns non-events. Idempotency converts a correctness problem into a performance footnote.

Assert your invariants, loudly. Every pipeline has things that must be true — row counts reconcile, totals tie out, no nulls in the join key, no key appears twice. Encode them as assertions that fail the run. “Close enough to reasonable” is not a data quality strategy; it’s the exact condition under which silent bugs live for months. A pipeline should be unable to publish numbers it can’t defend.

This is the kind of failure we find constantly when we audit a Microsoft Fabric implementation — not dramatic outages, just a watermark and an append that looked right in review and quietly drifted in production. The pattern of the fix is always the same: make the write idempotent, then make the pipeline prove its own output before it trusts it.

Have you hit subtle data quality bugs in incremental loads? The watermark is usually innocent. The operator, and the append behind it, usually aren’t.

Mukesh V

Data Engineer

Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.

Related reading

More on this topic

Pick your starting point

Two quick diagnostics for the two questions we get most

No sales calls required to get real answers. Both tools return dedicated output in under 5 minutes.