Algoscale

Serverless MDM: Lambda + Postgres on AWS

A production MDM pattern with Lambda + RDS PostgreSQL. Multi-ERP canonicalisation, ledger-hit caching, sub-50ms enrichment - without Profisee or Tamr.

Mukesh V

Data Engineer

Enterprise MDM tools - Profisee, Tamr, Reltio, Informatica - start in the low-six figures per year and scale up from there. They earn their price when your requirement is genuinely multi-domain, regulated-audit-trail, stewardship-UI, ML-auto-tuning master data management at tens of millions of records.

Most enterprises don’t have that requirement. They have one domain (customer, or product), one acute identity-resolution problem (three CRMs with overlapping accounts), and 10-50 million records to reconcile. For that shape of problem, a serverless MDM ledger built on AWS Lambda + RDS PostgreSQL costs roughly a tenth as much, runs inside the ETL pipelines rather than alongside them, and - when built the way this post describes - returns canonical IDs in under 50ms per row at production scale.

This is the pattern we built for the same fire and security systems integrator described in the AWS data-warehouse modernization case study - 14+ subsidiary ERPs feeding a shared customer ledger, ~11M enrichment calls per day, running on infrastructure that bills around $380 per month. Here is the architecture and the three design decisions that make it work.

Why roll your own (and when not to)

The serverless MDM pattern fits when:

  • You have one or two domains. Customer identity is the common one. Product and vendor are close seconds. If you need five separate MDM domains with cross-references between them, commercial tools earn their keep.
  • You’re in the 10K - 50M master record range. Above that, commercial ML auto-tuning and distributed processing start to matter. Below 10K you barely need a ledger - a dbt model does the job.
  • Your team owns ETL. MDM works best when it’s an embedded call inside the pipeline, not a separate system with its own operators.
  • You want vendor independence. The entire stack here is AWS primitives + open-source - no Profisee license, no Informatica account, no Tamr contract.

It does not fit when:

  • You need a stewardship UI out of the box - Profisee and Reltio ship one; you’d build yours in Retool or Streamlit
  • You need vendor-supported audit trails for a regulator - rolling your own puts that ownership on your team
  • You’re operating at 100M+ records with sub-100ms requirements across multiple domains
  • You don’t have the engineering capacity to own the matching logic long-term

If your shop is the AWS/Azure kind and identity resolution is what’s blocking a consolidated data warehouse, keep reading.

Architecture, top-down

Source ERPs / CRMs  (Business Central, Dynamics 365, Salesforce, NetSuite, ...)

        │  AWS DMS (CDC)

S3 landing zone  ←─ Glue Catalog


Glue / PySpark ETL

        │  HTTP invoke, batch of up to 500 rows

AWS Lambda  (MDM match/enrich service)

        ├──► ElastiCache Redis  (hot-path lookup, 24h TTL)

        └──► RDS Proxy ──► RDS PostgreSQL  (the ledger)

                                ├── golden_records
                                ├── source_records
                                ├── resolutions
                                ├── survivorship_rules
                                └── audit_log


ETL writes enriched row back to Silver zone  (golden_id attached)

Two architectural decisions are doing the heavy lifting here - the Postgres ledger schema (next section) and the cache-in-front pattern (further down). Everything else is plumbing.

The Postgres ledger schema

Postgres, not DynamoDB. The MDM workload is heavy on JOINs (customer + address + phone + tax ID, all reconciled together), fuzzy string search (trigram and GIN indexes), and transactional writes when resolutions are committed. DynamoDB is great for key-value; it’s the wrong choice when 70% of your queries involve multi-column fuzzy matching.

The core tables:

-- Canonical customer. One row per real-world entity.
CREATE TABLE golden_records (
  golden_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  canonical_name   TEXT NOT NULL,
  canonical_name_normalized TEXT NOT NULL,  -- post 7-phase canonicalisation
  address_line1    TEXT,
  city             TEXT,
  state            TEXT,
  postal_code      TEXT,
  country_code     CHAR(2),
  tax_id_canonical TEXT,  -- EIN / VAT / equivalent
  survivorship_score INTEGER NOT NULL DEFAULT 0,
  created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- One row per occurrence of the entity in a source system.
-- Many source_records → one golden_record.
CREATE TABLE source_records (
  source_record_id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  golden_id           UUID NOT NULL REFERENCES golden_records(golden_id),
  source_system       TEXT NOT NULL,  -- 'bc-integration', 'd365-fire', etc
  source_primary_key  TEXT NOT NULL,  -- the natural key in the source
  raw_name            TEXT NOT NULL,
  raw_address         JSONB,
  match_confidence    NUMERIC(4,3),   -- 0.000 - 1.000
  matched_on          TEXT[],         -- ['canonical_name', 'tax_id']
  first_seen_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_seen_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (source_system, source_primary_key)
);

-- Survivorship rules per field - decides which source wins on merge.
CREATE TABLE survivorship_rules (
  field_name        TEXT PRIMARY KEY,  -- 'address_line1', 'tax_id', etc
  priority_order    TEXT[] NOT NULL    -- ['bc-integration', 'd365-fire', 'onestream']
);

-- Append-only record of every resolution decision for audit.
CREATE TABLE audit_log (
  id                BIGSERIAL PRIMARY KEY,
  source_record_id  UUID NOT NULL,
  golden_id         UUID NOT NULL,
  action            TEXT NOT NULL,   -- 'matched', 'new', 'split', 'merged'
  confidence        NUMERIC(4,3),
  phase_matched     SMALLINT,        -- which of the 7 phases triggered the match
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Two index decisions that make this fast:

-- Trigram index for fuzzy name search. The primary lookup.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_golden_canonical_name_trgm
  ON golden_records USING gin (canonical_name_normalized gin_trgm_ops);

-- Tax-ID hard match. Hits this first; falls through to trigram if miss.
CREATE UNIQUE INDEX idx_golden_tax_id
  ON golden_records (tax_id_canonical)
  WHERE tax_id_canonical IS NOT NULL;

Tax ID (or VAT/equivalent) is the one true hard identifier. If two rows share a tax ID, they’re the same company - period. About 60% of incoming source records carry one; for those the match is deterministic and the trigram index never runs. The remaining 40% - mostly sub-divisions, DBA names, or informal B2B sellers without a registered tax ID - flow through the 7-phase canonicalisation.

The 7-phase name canonicalisation

Source name columns are a mess. From one morning’s batch: ACME MANUFACTURING INC., Acme Mfg., Acme Manufacturing, Inc, ACME MFG (formerly Widgets Ltd), Acmé Manufacturing. All the same company. None match on equality.

The first six phases run as a pipeline of pure functions, each returning a progressively more canonical string; phase 7 is a fuzzy matcher rather than a transform. Matching is tried against the ledger after each phase. If a phase hits a match with confidence above threshold, we stop; otherwise the next phase runs.

from typing import Callable
import unicodedata, re

PHASES: list[Callable] = [  # phases 1-6 are str -> str; phase 7 returns a match result
    phase_1_normalize_whitespace_case,
    phase_2_strip_legal_suffixes,
    phase_3_expand_abbreviations,
    phase_4_strip_dba_parenthetical,
    phase_5_unicode_nfd_accent_strip,
    phase_6_token_sort,
    phase_7_fuzzy_match,  # returns a match result, not a string
]
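The driver that walks those phases can be sketched as a generic loop over the str -> str phases, stopping at the first confident ledger hit. `try_match` stands in for the exact/trigram lookup; the names are illustrative:

```python
def resolve(raw_name, phases, try_match, threshold=0.88):
    """Run canonicalisation phases in order; stop at the first confident match.

    phases: the str -> str transforms (phases 1-6); try_match(candidate) is
    assumed to query the ledger and return {'golden_id', 'confidence'} or None.
    """
    s = raw_name
    for i, phase in enumerate(phases, start=1):
        s = phase(s)
        hit = try_match(s)
        if hit and hit["confidence"] >= threshold:
            return {"golden_id": hit["golden_id"], "phase_matched": i, "canonical": s}
    # No deterministic hit: phase 7 fuzzy matching (or the review queue) takes over.
    return {"golden_id": None, "phase_matched": None, "canonical": s}
```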

Phase 1 - whitespace + case

def phase_1_normalize_whitespace_case(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

Collapses runs of whitespace, strips leading/trailing, lowercases. ACME MFG becomes acme mfg.

Phase 2 - strip legal suffixes

LEGAL_SUFFIXES = {
    "inc", "inc.", "incorporated",
    "llc", "l.l.c.", "llc.",
    "ltd", "ltd.", "limited",
    "corp", "corp.", "corporation",
    "co", "co.", "company",
    "gmbh", "ag", "sa", "s.a.",
    "pvt ltd", "pvt. ltd.", "private limited",
    "sa de cv", "s.a. de c.v.",
    "plc",
}

def phase_2_strip_legal_suffixes(s: str) -> str:
    tokens = s.split(",")[0].split()  # drop anything after the first comma
    suffixes = {x.replace(".", "") for x in LEGAL_SUFFIXES}  # dot-free, so "l.l.c." matches "llc"
    changed = True
    while changed:
        changed = False
        for n in (3, 2, 1):  # longest phrase first, so "sa de cv" strips whole
            if len(tokens) >= n and " ".join(tokens[-n:]).replace(".", "") in suffixes:
                del tokens[-n:]; changed = True; break
    return " ".join(tokens)

acme manufacturing, inc becomes acme manufacturing. The comma split is a cheap trick - most ", Inc" / ", LLC" variants sit after a comma. Comparing dot-free forms lets l.l.c. and sa de cv match their set entries; the trailing loop tries the longest phrase first so multi-word suffixes strip whole.

Phase 3 - expand abbreviations

ABBREV_MAP = {
    "mfg": "manufacturing",
    "mfgr": "manufacturing",
    "mfrs": "manufacturers",
    "intl": "international",
    "inds": "industries",
    "assn": "association",
    "tech": "technology",
    "&": "and",
    "n.": "north",
    "s.": "south",
}

def phase_3_expand_abbreviations(s: str) -> str:
    # Look up the raw token first so dotted keys like "n." match,
    # then fall back to the dot-stripped form ("mfg." -> "mfg").
    return " ".join(
        ABBREV_MAP.get(tok, ABBREV_MAP.get(tok.rstrip("."), tok)) for tok in s.split()
    )

acme mfg becomes acme manufacturing. The abbreviation map started at ~40 entries and grew to ~200 over 6 months of production - expand as you see new ones in the audit log.

Phase 4 - strip DBA / parenthetical

PAREN_PATTERN = re.compile(r"\s*\([^)]*\)\s*")
DBA_PATTERN = re.compile(r"\s+(?:dba|d/b/a|aka|formerly)\s+.*$", re.IGNORECASE)

def phase_4_strip_dba_parenthetical(s: str) -> str:
    s = PAREN_PATTERN.sub(" ", s)
    s = DBA_PATTERN.sub("", s)
    return re.sub(r"\s+", " ", s).strip()

acme manufacturing (formerly widgets ltd) becomes acme manufacturing. Preserving the (formerly ...) value in a separate field is sometimes valuable; we keep it in the source_records table's raw JSONB, not in the canonical record.

Phase 5 - unicode NFD + accent strip

def phase_5_unicode_nfd_accent_strip(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

acmé manufacturing becomes acme manufacturing. Unicode normalisation decomposes accented characters into base+diacritic; filtering out combining-mark category (Mn) strips the diacritics while keeping the base letter.

Phase 6 - token sort

def phase_6_token_sort(s: str) -> str:
    return " ".join(sorted(s.split()))

acme manufacturing and manufacturing acme are almost certainly the same company with reordered words (a common CRM data-entry variant, especially from phone-based data capture). Sorting tokens makes them compare-equal. This is the last deterministic phase before fuzzy matching.

Phase 7 - fuzzy match

from rapidfuzz import fuzz

def phase_7_fuzzy_match(normalized: str, candidates: list[dict]) -> dict | None:
    best = None
    best_score = 0.0
    for cand in candidates:
        score = fuzz.WRatio(normalized, cand["canonical_name_normalized"])
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= 88 else None

WRatio is rapidfuzz's weighted composite scorer - it combines plain ratio, partial-ratio, and the token-sort/token-set variants - with a good default balance for short business-name strings. Threshold 88 (out of 100) catches ~95% of true matches with under 1% false positives in our production data; below that, the match goes to the manual review queue instead of auto-resolving.

Each phase’s match is recorded in audit_log.phase_matched. Over 11M enrichment calls/day, the distribution looks like:

Phase                                            % of matches   Latency
Exact tax ID (pre-phase 1)                       58%            3ms
Phases 1-3 (deterministic: case/suffix/abbrev)   31%            8ms
Phases 4-6 (DBA/accent/reorder)                  7%             14ms
Phase 7 (fuzzy)                                  3%             35ms
Manual review queue                              1%             async

The “easy” 89% of matches never touch fuzzy logic.

The Lambda function

Stack: Python 3.12 Lambda, 1024MB, 10s timeout. Cold-start latency matters because ETL calls this from across the pipeline - provisioned concurrency of 5 keeps 95% of invocations warm.

# handler.py (skeleton)
import os
import asyncpg, redis
from canonicaliser import canonicalise
from matcher import match

RDS_PROXY_ENDPOINT = os.environ["RDS_PROXY_ENDPOINT"]
CACHE_ENDPOINT = os.environ["CACHE_ENDPOINT"]

pool = None
cache = None

async def get_pool():
    global pool
    if pool is None:
        pool = await asyncpg.create_pool(
            host=RDS_PROXY_ENDPOINT,
            database="mdm",
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            max_size=3,
        )
    return pool

def get_cache():
    global cache
    if cache is None:
        cache = redis.Redis(host=CACHE_ENDPOINT, port=6379)
    return cache

async def handle(event, context):
    records = event["records"]  # batch of up to 500
    responses = []
    pool = await get_pool()
    cache_client = get_cache()

    for rec in records:
        canonical = canonicalise(rec["raw_name"])
        # Cache check (see next section)
        cached_id = cache_client.get(f"mdm:{canonical}")
        if cached_id:
            responses.append({
                "source_primary_key": rec["source_primary_key"],
                "golden_id": cached_id.decode(),
                "action": "cache_hit",
                "confidence": 1.0,
            })
            continue
        # Cache miss: hit the ledger
        result = await match(pool, rec, canonical)
        if result.get("golden_id"):  # don't cache manual-review / no-match rows
            cache_client.setex(f"mdm:{canonical}", 86400, result["golden_id"])
        responses.append(result)

    return {"resolved": responses}


import asyncio

# Lambda's Python runtime can't invoke a coroutine directly; reuse one event
# loop across warm invocations so the asyncpg pool stays bound to it.
_loop = asyncio.new_event_loop()

def handler(event, context):
    return _loop.run_until_complete(handle(event, context))

Connection pooling via RDS Proxy

Lambda’s concurrency model is hostile to direct Postgres connections - a spike to 200 concurrent invocations can exhaust even a generous max_connections. RDS Proxy sits between Lambda and RDS, multiplexing connections so each Lambda invocation borrows an existing connection from a pool. Without RDS Proxy, this pattern does not scale past a few hundred invocations per second.

Batch shape

The contract:

// Request
{
  "records": [
    {"source_system": "bc-integration", "source_primary_key": "CUST-001234",
     "raw_name": "ACME MANUFACTURING INC.", "tax_id": "12-3456789",
     "raw_address": {"line1": "100 Main St", "city": "Austin", ...}},
    ... up to 500 rows
  ]
}

// Response
{
  "resolved": [
    {"source_primary_key": "CUST-001234",
     "golden_id": "b3a8f2c0-7e1a-4d9f-8e63-1f2a3b4c5d6e",
     "action": "matched",
     "confidence": 0.940,
     "phase_matched": 2}
  ]
}

Batching at 500 rows keeps each Lambda invocation busy for ~3-4 seconds - long enough to amortise cold-start and pool overhead, short enough that API Gateway doesn't time out.
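Client-side, the Glue job just chunks its rows into that request shape. A sketch of the batching step (illustrative; the real job invokes the Lambda once per payload):

```python
def batched_payloads(rows: list[dict], size: int = 500):
    """Yield request bodies matching the contract above, 500 rows max each."""
    for i in range(0, len(rows), size):
        yield {"records": rows[i:i + size]}
```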

The ledger-hit caching pattern

The design decision that takes this from “works” to “production-scale” is caching.

Without caching. Every ETL row hits the Lambda, Lambda hits RDS via Proxy, Postgres runs the trigram/GIN query, returns. Typical p95: 85-110ms. At 11M calls/day spread across 16 hours, that’s a peak of ~200 calls/second - well within what RDS can handle, but the tail latency creates backpressure in the ETL orchestrator and 2-hour pipelines stretch to 4.

With caching. 80%+ of enrichment calls within any 24-hour window are repeat lookups of the same canonicalised name. ElastiCache Redis sits in front of the ledger with:

  • Cache key: mdm:<canonicalised_name> (i.e., the Phase 6 output)
  • Cache value: the golden_id UUID
  • TTL: 24 hours

Hit rate in production: 82%. Those 82% of calls complete in 6-12ms end-to-end (network + Redis roundtrip). The remaining 18% hit the ledger and take the 85-110ms path - but 82% at 8ms + 18% at 100ms yields a weighted average of ~25ms, p95 of ~42ms, across the full workload.
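The weighted-average figure is just arithmetic over the two paths:

```python
hit_rate = 0.82
hit_ms, miss_ms = 8, 100  # cache path vs ledger path, end-to-end

avg_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
# 0.82 * 8 + 0.18 * 100 = 24.56, i.e. the ~25ms weighted average quoted above
```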

Cache invalidation without the complexity

Caches without invalidation lie. The risk: a write to the ledger updates a golden record, cache still returns the stale ID, ETL pipeline writes wrong data downstream.

The pattern we use - the only one simple enough to trust:

-- Postgres trigger on golden_records write
CREATE OR REPLACE FUNCTION notify_mdm_cache_invalidate() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'mdm_cache_invalidate',
    json_build_object(
      'golden_id', NEW.golden_id,
      'canonical_name', NEW.canonical_name_normalized
    )::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_golden_records_cache_invalidate
  AFTER INSERT OR UPDATE ON golden_records
  FOR EACH ROW EXECUTE FUNCTION notify_mdm_cache_invalidate();

A tiny ECS listener holds a LISTEN mdm_cache_invalidate on Postgres. On notification it deletes the matching cache key (and any aliases stored against it). Total time from ledger write to cache invalidation: ~50ms.

If you prefer not to run the listener: SNS → SQS → Lambda also works, at the cost of about 2-3 seconds of invalidation lag - still well under any ETL pipeline’s cycle time.
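Whichever transport delivers the notification, the consumer's job is small: parse the trigger's JSON payload and delete the matching Redis key. A sketch of that mapping (assumes the mdm:<canonical_name> key shape from the handler; the actual delete call is elided):

```python
import json

def invalidation_key(payload: str) -> str:
    """Map a pg_notify payload from the trigger above to the cache key to drop."""
    body = json.loads(payload)
    return f"mdm:{body['canonical_name']}"
```

The ECS listener (or the SQS-fed Lambda) runs something like `redis_client.delete(invalidation_key(payload))` for each notification, plus the same delete for any aliases stored against the golden_id.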

As a drop-in replacement for Profisee or Tamr

The Profisee / Tamr integration surface is an HTTP match API: “here is a customer record, give me the canonical ID.” Our Lambda exposes the same contract. Migrating an existing Profisee integration looks like:

  1. Stand up the Lambda + ledger alongside Profisee (parallel, zero traffic)
  2. Backfill the ledger from Profisee’s existing golden records (one-time Python script)
  3. Route traffic from one source system at a time through the Lambda (A/B test)
  4. Compare Lambda vs Profisee match decisions in a shadow log; converge the thresholds
  5. Once parity is >99%, cut Profisee off that source; repeat for the next source
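Step 4's parity check is a one-liner over the shadow log. A sketch, assuming a hypothetical log shape of (profisee_golden_id, lambda_golden_id) pairs:

```python
def parity_rate(shadow_log: list[tuple[str, str]]) -> float:
    """Fraction of records where both systems resolved to the same entity."""
    if not shadow_log:
        return 0.0
    return sum(a == b for a, b in shadow_log) / len(shadow_log)
```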

The migration took 6 months for the integrator customer. The Profisee license came up for renewal and did not get renewed.

What you trade vs the commercial tools

Honest comparison:

Capability                                              Serverless MDM (this post)                  Profisee / Tamr / Reltio
Multi-domain (customer + product + vendor + contract)   One ledger per domain, separate deploys     Single platform
Data stewardship UI                                     Build in Retool / Streamlit (~1 week)       Shipped out of the box
ML auto-tuning of thresholds                            Manual tuning against the audit log         Vendor auto-tunes
Vendor-supported audit for regulator                    You own it; no vendor-signed report         Vendor signs audit
Entity-level 360 lineage                                Custom queries on audit_log                 Built-in dashboard
Cost at 10M records, 10M calls/day                      ~$400/month AWS infra                       $150K-500K/year license
Time to first production load                           2-3 months                                  6-9 months (implementation partner)
Engineering ownership                                   Yours                                       Shared with vendor

The capabilities are not equivalent. But when the missing capabilities aren’t actually requirements, the cost gap is worth a hard look.

Production numbers

From the security integrator engagement, steady state after 9 months:

  • ~11M enrichment calls/day, peaking around 250/sec during the nightly ETL window
  • Ledger size: 2.3M golden customer records, 7.8M source records
  • Cache hit rate: 82% (24h TTL)
  • Latency: p50 8ms, p95 42ms, p99 78ms
  • Infra cost: Lambda ~$120/month + RDS db.t4g.medium ~$80 + ElastiCache cache.t4g.micro ~$15 + RDS Proxy ~$15 + misc ~$150 = ~$380/month
  • Manual review queue: ~1% of calls, handled daily by a single data steward in a Streamlit app

Compare that to the Profisee renewal quote it replaced, and the conversation with the CFO gets short.

Where to start

If you’re on AWS and evaluating Profisee / Tamr / Informatica for a single-domain customer-identity problem, run this pattern as a prototype first - a 3-week build covering the ledger plus canonicalisation phases 1-3. That gets you 89% of the matches at ~15ms latency, on a few hundred dollars of monthly infra.

If it clears your accuracy bar, you can push to phase 7 + caching later (another 3-4 weeks). If it doesn’t, you’ve spent 3 weeks on something you can keep as your staging / dev-tier MDM even after buying the vendor tool.

Either way, Algoscale’s data warehouse consulting practice ships this MDM pattern as part of every modernization engagement - the ledger and the warehouse get built together so identity resolution is baked in, not bolted on. It’s how the 180-day M&A data consolidation playbook resolves identity across acquired entities at Phase 3, and how the hybrid row-level security architecture knows which Subsidiary to filter on. When you need deeper data integration consulting on top of the warehouse - CDC, streaming, reverse ETL to operational systems - the same team owns that surface. MDM isn’t a separate purchase on the stack; it’s a component in it.

Mukesh V

Data Engineer

Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure — MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.
