Algoscale

Serverless MDM: Lambda + Postgres on AWS

A production MDM pattern with Lambda + RDS PostgreSQL. Multi-ERP canonicalisation, ledger-hit caching, sub-50ms enrichment - without Profisee or Tamr.

Mukesh V

Data Engineer

Enterprise MDM tools - Profisee, Tamr, Reltio, Informatica - start in the low-six figures per year and scale up from there. They earn their price when your requirement is genuinely multi-domain, regulated-audit-trail, stewardship-UI, ML-auto-tuning master data management at tens of millions of records.

Most enterprises don’t have that requirement. They have one domain (customer, or product), one acute identity-resolution problem (three CRMs with overlapping accounts), and 10-50 million records to reconcile. For that shape of problem, a serverless MDM ledger built on AWS Lambda + RDS PostgreSQL costs roughly a tenth as much, runs inside the ETL pipelines rather than alongside them, and - when built the way this post describes - returns canonical IDs in under 50ms per row at production scale.

This is the pattern we built for the same fire and security systems integrator described in the AWS data-warehouse modernization case study - 14+ subsidiary ERPs feeding a shared customer ledger, ~11M enrichment calls per day, running on infrastructure that bills around $380 per month. Here is the architecture and the three design decisions that make it work.

Why roll your own (and when not to)

The serverless MDM pattern fits when:

  • You have one or two domains. Customer identity is the common one. Product and vendor are close seconds. If you need five separate MDM domains with cross-references between them, commercial tools earn their keep.
  • You’re in the 10K - 50M master record range. Above that, commercial ML auto-tuning and distributed processing start to matter. Below 10K you barely need a ledger - a dbt model does the job.
  • Your team owns ETL. MDM works best when it’s an embedded call inside the pipeline, not a separate system with its own operators.
  • You want vendor independence. The entire stack here is AWS primitives + open-source - no Profisee license, no Informatica account, no Tamr contract.

It does not fit when:

  • You need a stewardship UI out of the box - Profisee and Reltio ship one; you’d build yours in Retool or Streamlit
  • You need vendor-supported audit trails for a regulator - rolling your own puts that ownership on your team
  • You’re operating at 100M+ records with sub-100ms requirements across multiple domains
  • You don’t have the engineering capacity to own the matching logic long-term

If your shop is the AWS/Azure kind and identity resolution is what’s blocking a consolidated data warehouse, keep reading.

Architecture, top-down

Source ERPs / CRMs  (Business Central, Dynamics 365, Salesforce, NetSuite, ...)

        │  AWS DMS (CDC)

S3 landing zone  ←─ Glue Catalog


Glue / PySpark ETL

        │  HTTP invoke, batch of up to 500 rows

AWS Lambda  (MDM match/enrich service)

        ├──► ElastiCache Redis  (hot-path lookup, 24h TTL)

        └──► RDS Proxy ──► RDS PostgreSQL  (the ledger)

                                ├── golden_records
                                ├── source_records
                                ├── resolutions
                                ├── survivorship_rules
                                └── audit_log


ETL writes enriched row back to Silver zone  (golden_id attached)

Two architectural decisions are doing the heavy lifting here - the Postgres ledger schema (next section) and the cache-in-front pattern (further down). Everything else is plumbing.

The Postgres ledger schema

Postgres, not DynamoDB. The MDM workload is heavy on JOINs (customer + address + phone + tax ID, all reconciled together), fuzzy string search (trigram and GIN indexes), and transactional writes when resolutions are committed. DynamoDB is great for key-value; it’s the wrong choice when 70% of your queries involve multi-column fuzzy matching.

The core tables:

-- Canonical customer. One row per real-world entity.
CREATE TABLE golden_records (
  golden_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  canonical_name   TEXT NOT NULL,
  canonical_name_normalized TEXT NOT NULL,  -- post 7-phase canonicalisation
  address_line1    TEXT,
  city             TEXT,
  state            TEXT,
  postal_code      TEXT,
  country_code     CHAR(2),
  tax_id_canonical TEXT,  -- EIN / VAT / equivalent
  survivorship_score INTEGER NOT NULL DEFAULT 0,
  created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- One row per occurrence of the entity in a source system.
-- Many source_records → one golden_record.
CREATE TABLE source_records (
  source_record_id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  golden_id           UUID NOT NULL REFERENCES golden_records(golden_id),
  source_system       TEXT NOT NULL,  -- 'bc-integration', 'd365-fire', etc
  source_primary_key  TEXT NOT NULL,  -- the natural key in the source
  raw_name            TEXT NOT NULL,
  raw_address         JSONB,
  match_confidence    NUMERIC(4,3),   -- 0.000 - 1.000
  matched_on          TEXT[],         -- ['canonical_name', 'tax_id']
  first_seen_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_seen_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (source_system, source_primary_key)
);

-- Survivorship rules per field - decides which source wins on merge.
CREATE TABLE survivorship_rules (
  field_name        TEXT PRIMARY KEY,  -- 'address_line1', 'tax_id', etc
  priority_order    TEXT[] NOT NULL    -- ['bc-integration', 'd365-fire', 'onestream']
);

-- Append-only record of every resolution decision for audit.
CREATE TABLE audit_log (
  id                BIGSERIAL PRIMARY KEY,
  source_record_id  UUID NOT NULL,
  golden_id         UUID NOT NULL,
  action            TEXT NOT NULL,   -- 'matched', 'new', 'split', 'merged'
  confidence        NUMERIC(4,3),
  phase_matched     SMALLINT,        -- which of the 7 phases triggered the match
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Two index decisions that make this fast:

-- Trigram index for fuzzy name search. The primary lookup.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_golden_canonical_name_trgm
  ON golden_records USING gin (canonical_name_normalized gin_trgm_ops);

-- Tax-ID hard match. Hits this first; falls through to trigram if miss.
CREATE UNIQUE INDEX idx_golden_tax_id
  ON golden_records (tax_id_canonical)
  WHERE tax_id_canonical IS NOT NULL;

Tax ID (or VAT/equivalent) is the one true hard identifier. If two rows share a tax ID, they’re the same company - period. About 60% of incoming source records carry one; for those the match is deterministic and the trigram index never runs. The remaining 40% - mostly sub-divisions, DBA names, or informal B2B sellers without a registered tax ID - flow through the 7-phase canonicalisation.

The 7-phase name canonicalisation

Source name columns are a mess. From one morning’s batch: ACME MANUFACTURING INC., Acme Mfg., Acme Manufacturing, Inc, ACME MFG (formerly Widgets Ltd), Acmé Manufacturing. All the same company. None match on equality.

The first six phases run as a pipeline of pure functions, each returning a progressively more canonical string; phase 7 is a fuzzy matcher rather than a transform. Matching is tried against the ledger after each phase. If a phase hits a match with confidence above threshold, we stop; otherwise the next phase runs.

from typing import Callable
import unicodedata, re

PHASES: list[Callable] = [  # phases 1-6 are str -> str; phase 7 returns a match result
    phase_1_normalize_whitespace_case,
    phase_2_strip_legal_suffixes,
    phase_3_expand_abbreviations,
    phase_4_strip_dba_parenthetical,
    phase_5_unicode_nfd_accent_strip,
    phase_6_token_sort,
    phase_7_fuzzy_match,  # returns a match result, not a string
]
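The driver that walks those phases can be sketched as a generic loop over the str -> str phases, stopping at the first confident ledger hit. `try_match` stands in for the exact/trigram lookup; the names are illustrative:

```python
def resolve(raw_name, phases, try_match, threshold=0.88):
    """Run canonicalisation phases in order; stop at the first confident match.

    phases: the str -> str transforms (phases 1-6); try_match(candidate) is
    assumed to query the ledger and return {'golden_id', 'confidence'} or None.
    """
    s = raw_name
    for i, phase in enumerate(phases, start=1):
        s = phase(s)
        hit = try_match(s)
        if hit and hit["confidence"] >= threshold:
            return {"golden_id": hit["golden_id"], "phase_matched": i, "canonical": s}
    # No deterministic hit: phase 7 fuzzy matching (or the review queue) takes over.
    return {"golden_id": None, "phase_matched": None, "canonical": s}
```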

Phase 1 - whitespace + case

def phase_1_normalize_whitespace_case(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

Collapses runs of whitespace, strips leading/trailing, lowercases. ACME MFG becomes acme mfg.

Phase 2 - strip legal suffixes

LEGAL_SUFFIXES = {
    "inc", "inc.", "incorporated",
    "llc", "l.l.c.", "llc.",
    "ltd", "ltd.", "limited",
    "corp", "corp.", "corporation",
    "co", "co.", "company",
    "gmbh", "ag", "sa", "s.a.",
    "pvt ltd", "pvt. ltd.", "private limited",
    "sa de cv", "s.a. de c.v.",
    "plc",
}

def phase_2_strip_legal_suffixes(s: str) -> str:
    tokens = s.split(",")[0].split()  # drop anything after the first comma
    suffixes = {x.replace(".", "") for x in LEGAL_SUFFIXES}  # dot-free, so "l.l.c." matches "llc"
    changed = True
    while changed:
        changed = False
        for n in (3, 2, 1):  # longest phrase first, so "sa de cv" strips whole
            if len(tokens) >= n and " ".join(tokens[-n:]).replace(".", "") in suffixes:
                del tokens[-n:]; changed = True; break
    return " ".join(tokens)

acme manufacturing, inc becomes acme manufacturing. The comma split is a cheap trick - most ", Inc" / ", LLC" variants sit after a comma. Comparing dot-free forms lets l.l.c. and sa de cv match their set entries; the trailing loop tries the longest phrase first so multi-word suffixes strip whole.

Phase 3 - expand abbreviations

ABBREV_MAP = {
    "mfg": "manufacturing",
    "mfgr": "manufacturing",
    "mfrs": "manufacturers",
    "intl": "international",
    "inds": "industries",
    "assn": "association",
    "tech": "technology",
    "&": "and",
    "n.": "north",
    "s.": "south",
}

def phase_3_expand_abbreviations(s: str) -> str:
    # Look up the raw token first so dotted keys like "n." match,
    # then fall back to the dot-stripped form ("mfg." -> "mfg").
    return " ".join(
        ABBREV_MAP.get(tok, ABBREV_MAP.get(tok.rstrip("."), tok)) for tok in s.split()
    )

acme mfg becomes acme manufacturing. The abbreviation map started at ~40 entries and grew to ~200 over 6 months of production - expand as you see new ones in the audit log.

Phase 4 - strip DBA / parenthetical

PAREN_PATTERN = re.compile(r"\s*\([^)]*\)\s*")
DBA_PATTERN = re.compile(r"\s+(?:dba|d/b/a|aka|formerly)\s+.*$", re.IGNORECASE)

def phase_4_strip_dba_parenthetical(s: str) -> str:
    s = PAREN_PATTERN.sub(" ", s)
    s = DBA_PATTERN.sub("", s)
    return re.sub(r"\s+", " ", s).strip()

acme manufacturing (formerly widgets ltd) becomes acme manufacturing. Preserving the (formerly ...) value in a separate field is sometimes valuable; we keep it in the source_records table's raw JSONB, not in the canonical record.

Phase 5 - unicode NFD + accent strip

def phase_5_unicode_nfd_accent_strip(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

acmé manufacturing becomes acme manufacturing. Unicode normalisation decomposes accented characters into base+diacritic; filtering out combining-mark category (Mn) strips the diacritics while keeping the base letter.

Phase 6 - token sort

def phase_6_token_sort(s: str) -> str:
    return " ".join(sorted(s.split()))

acme manufacturing and manufacturing acme are almost certainly the same company with reordered words (a common CRM data-entry variant, especially from phone-based data capture). Sorting tokens makes them compare-equal. This is the last deterministic phase before fuzzy matching.

Phase 7 - fuzzy match

from rapidfuzz import fuzz

def phase_7_fuzzy_match(normalized: str, candidates: list[dict]) -> dict | None:
    best = None
    best_score = 0.0
    for cand in candidates:
        score = fuzz.WRatio(normalized, cand["canonical_name_normalized"])
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= 88 else None

WRatio is rapidfuzz's weighted composite scorer - it combines plain ratio, partial-ratio, and the token-sort/token-set variants - with a good default balance for short business-name strings. Threshold 88 (out of 100) catches ~95% of true matches with under 1% false positives in our production data; below that, the match goes to the manual review queue instead of auto-resolving.

Each phase’s match is recorded in audit_log.phase_matched. Over 11M enrichment calls/day, the distribution looks like:

Phase                                            % of matches   Latency
Exact tax ID (pre-phase 1)                       58%            3ms
Phases 1-3 (deterministic: case/suffix/abbrev)   31%            8ms
Phases 4-6 (DBA/accent/reorder)                  7%             14ms
Phase 7 (fuzzy)                                  3%             35ms
Manual review queue                              1%             async

The “easy” 89% of matches never touch fuzzy logic.

The Lambda function

Stack: Python 3.12 Lambda, 1024MB, 10s timeout. Cold-start latency matters because ETL calls this from across the pipeline - provisioned concurrency of 5 keeps 95% of invocations warm.

# handler.py (skeleton)
import os
import asyncpg, redis
from canonicaliser import canonicalise
from matcher import match

RDS_PROXY_ENDPOINT = os.environ["RDS_PROXY_ENDPOINT"]
CACHE_ENDPOINT = os.environ["CACHE_ENDPOINT"]

pool = None
cache = None

async def get_pool():
    global pool
    if pool is None:
        pool = await asyncpg.create_pool(
            host=RDS_PROXY_ENDPOINT,
            database="mdm",
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            max_size=3,
        )
    return pool

def get_cache():
    global cache
    if cache is None:
        cache = redis.Redis(host=CACHE_ENDPOINT, port=6379)
    return cache

async def handle(event, context):
    records = event["records"]  # batch of up to 500
    responses = []
    pool = await get_pool()
    cache_client = get_cache()

    for rec in records:
        canonical = canonicalise(rec["raw_name"])
        # Cache check (see next section)
        cached_id = cache_client.get(f"mdm:{canonical}")
        if cached_id:
            responses.append({
                "source_primary_key": rec["source_primary_key"],
                "golden_id": cached_id.decode(),
                "action": "cache_hit",
                "confidence": 1.0,
            })
            continue
        # Cache miss: hit the ledger
        result = await match(pool, rec, canonical)
        if result.get("golden_id"):  # don't cache manual-review / no-match rows
            cache_client.setex(f"mdm:{canonical}", 86400, result["golden_id"])
        responses.append(result)

    return {"resolved": responses}


import asyncio

# Lambda's Python runtime can't invoke a coroutine directly; reuse one event
# loop across warm invocations so the asyncpg pool stays bound to it.
_loop = asyncio.new_event_loop()

def handler(event, context):
    return _loop.run_until_complete(handle(event, context))

Connection pooling via RDS Proxy

Lambda’s concurrency model is hostile to direct Postgres connections - a spike to 200 concurrent invocations can exhaust even a generous max_connections. RDS Proxy sits between Lambda and RDS, multiplexing connections so each Lambda invocation borrows an existing connection from a pool. Without RDS Proxy, this pattern does not scale past a few hundred invocations per second.

Batch shape

The contract:

// Request
{
  "records": [
    {"source_system": "bc-integration", "source_primary_key": "CUST-001234",
     "raw_name": "ACME MANUFACTURING INC.", "tax_id": "12-3456789",
     "raw_address": {"line1": "100 Main St", "city": "Austin", ...}},
    ... up to 500 rows
  ]
}

// Response
{
  "resolved": [
    {"source_primary_key": "CUST-001234",
     "golden_id": "b3a8f2c0-7e1a-4d9f-8e63-1f2a3b4c5d6e",
     "action": "matched",
     "confidence": 0.940,
     "phase_matched": 2}
  ]
}

Batching at 500 rows keeps each Lambda invocation busy for ~3-4 seconds - long enough to amortise cold-start and pool overhead, short enough that API Gateway doesn't time out.
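Client-side, the Glue job just chunks its rows into that request shape. A sketch of the batching step (illustrative; the real job invokes the Lambda once per payload):

```python
def batched_payloads(rows: list[dict], size: int = 500):
    """Yield request bodies matching the contract above, 500 rows max each."""
    for i in range(0, len(rows), size):
        yield {"records": rows[i:i + size]}
```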

The ledger-hit caching pattern

The design decision that takes this from “works” to “production-scale” is caching.

Without caching. Every ETL row hits the Lambda, Lambda hits RDS via Proxy, Postgres runs the trigram/GIN query, returns. Typical p95: 85-110ms. At 11M calls/day spread across 16 hours, that’s a peak of ~200 calls/second - well within what RDS can handle, but the tail latency creates backpressure in the ETL orchestrator and 2-hour pipelines stretch to 4.

With caching. 80%+ of enrichment calls within any 24-hour window are repeat lookups of the same canonicalised name. ElastiCache Redis sits in front of the ledger with:

  • Cache key: mdm:<canonicalised_name> (i.e., the Phase 6 output)
  • Cache value: the golden_id UUID
  • TTL: 24 hours

Hit rate in production: 82%. Those 82% of calls complete in 6-12ms end-to-end (network + Redis roundtrip). The remaining 18% hit the ledger and take the 85-110ms path - but 82% at 8ms + 18% at 100ms yields a weighted average of ~25ms, p95 of ~42ms, across the full workload.
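The weighted-average figure is just arithmetic over the two paths:

```python
hit_rate = 0.82
hit_ms, miss_ms = 8, 100  # cache path vs ledger path, end-to-end

avg_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
# 0.82 * 8 + 0.18 * 100 = 24.56, i.e. the ~25ms weighted average quoted above
```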

Cache invalidation without the complexity

Caches without invalidation lie. The risk: a write to the ledger updates a golden record, cache still returns the stale ID, ETL pipeline writes wrong data downstream.

The pattern we use - the only one simple enough to trust:

-- Postgres trigger on golden_records write
CREATE OR REPLACE FUNCTION notify_mdm_cache_invalidate() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'mdm_cache_invalidate',
    json_build_object(
      'golden_id', NEW.golden_id,
      'canonical_name', NEW.canonical_name_normalized
    )::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_golden_records_cache_invalidate
  AFTER INSERT OR UPDATE ON golden_records
  FOR EACH ROW EXECUTE FUNCTION notify_mdm_cache_invalidate();

A tiny ECS listener holds a LISTEN mdm_cache_invalidate on Postgres. On notification it deletes the matching cache key (and any aliases stored against it). Total time from ledger write to cache invalidation: ~50ms.

If you prefer not to run the listener: SNS → SQS → Lambda also works, at the cost of about 2-3 seconds of invalidation lag - still well under any ETL pipeline’s cycle time.
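Whichever transport delivers the notification, the consumer's job is small: parse the trigger's JSON payload and delete the matching Redis key. A sketch of that mapping (assumes the mdm:<canonical_name> key shape from the handler; the actual delete call is elided):

```python
import json

def invalidation_key(payload: str) -> str:
    """Map a pg_notify payload from the trigger above to the cache key to drop."""
    body = json.loads(payload)
    return f"mdm:{body['canonical_name']}"
```

The ECS listener (or the SQS-fed Lambda) runs something like `redis_client.delete(invalidation_key(payload))` for each notification, plus the same delete for any aliases stored against the golden_id.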

As a drop-in replacement for Profisee or Tamr

The Profisee / Tamr integration surface is an HTTP match API: “here is a customer record, give me the canonical ID.” Our Lambda exposes the same contract. Migrating an existing Profisee integration looks like:

  1. Stand up the Lambda + ledger alongside Profisee (parallel, zero traffic)
  2. Backfill the ledger from Profisee’s existing golden records (one-time Python script)
  3. Route traffic from one source system at a time through the Lambda (A/B test)
  4. Compare Lambda vs Profisee match decisions in a shadow log; converge the thresholds
  5. Once parity is >99%, cut Profisee off that source; repeat for the next source
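Step 4's parity check is a one-liner over the shadow log. A sketch, assuming a hypothetical log shape of (profisee_golden_id, lambda_golden_id) pairs:

```python
def parity_rate(shadow_log: list[tuple[str, str]]) -> float:
    """Fraction of records where both systems resolved to the same entity."""
    if not shadow_log:
        return 0.0
    return sum(a == b for a, b in shadow_log) / len(shadow_log)
```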

The migration took 6 months for the integrator customer. The Profisee license came up for renewal and did not get renewed.

What you trade vs the commercial tools

Honest comparison:

Capability                                              Serverless MDM (this post)                  Profisee / Tamr / Reltio
Multi-domain (customer + product + vendor + contract)   One ledger per domain, separate deploys     Single platform
Data stewardship UI                                     Build in Retool / Streamlit (~1 week)       Shipped out of the box
ML auto-tuning of thresholds                            Manual tuning against the audit log         Vendor auto-tunes
Vendor-supported audit for regulator                    You own it; no vendor-signed report         Vendor signs audit
Entity-level 360 lineage                                Custom queries on audit_log                 Built-in dashboard
Cost at 10M records, 10M calls/day                      ~$400/month AWS infra                       $150K-500K/year license
Time to first production load                           2-3 months                                  6-9 months (implementation partner)
Engineering ownership                                   Yours                                       Shared with vendor

The capabilities are not equivalent. But when the missing capabilities aren’t actually requirements, the cost gap is worth a hard look.

Production numbers

From the security integrator engagement, steady state after 9 months:

  • ~11M enrichment calls/day, peaking around 250/sec during the nightly ETL window
  • Ledger size: 2.3M golden customer records, 7.8M source records
  • Cache hit rate: 82% (24h TTL)
  • Latency: p50 8ms, p95 42ms, p99 78ms
  • Infra cost: Lambda ~$120/month + RDS db.t4g.medium ~$80 + ElastiCache cache.t4g.micro ~$15 + RDS Proxy ~$15 + misc ~$150 = ~$380/month
  • Manual review queue: ~1% of calls, handled daily by a single data steward in a Streamlit app

Compare that to the Profisee renewal quote it replaced, and the conversation with the CFO gets short.

Where to start

If you’re on AWS and evaluating Profisee / Tamr / Informatica for a single-domain customer-identity problem, run this pattern as a prototype first - a 3-week build covering the ledger plus canonicalisation phases 1-3. That gets you 89% of the matches at ~15ms latency, on a few hundred dollars of monthly infra.

If it clears your accuracy bar, you can push to phase 7 + caching later (another 3-4 weeks). If it doesn’t, you’ve spent 3 weeks on something you can keep as your staging / dev-tier MDM even after buying the vendor tool.

Either way, Algoscale’s data warehouse consulting practice ships this MDM pattern as part of every modernization engagement - the ledger and the warehouse get built together so identity resolution is baked in, not bolted on. It’s how the 180-day M&A data consolidation playbook resolves identity across acquired entities at Phase 3, and how the hybrid row-level security architecture knows which Subsidiary to filter on. When you need deeper data integration consulting on top of the warehouse - CDC, streaming, reverse ETL to operational systems - the same team owns that surface. MDM isn’t a separate purchase on the stack; it’s a component in it.

Mukesh V

Data Engineer

Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.
