Serverless MDM: Lambda + Postgres on AWS
A production MDM pattern with Lambda + RDS PostgreSQL. Multi-ERP canonicalisation, ledger-hit caching, sub-50ms enrichment - without Profisee or Tamr.
Enterprise MDM tools - Profisee, Tamr, Reltio, Informatica - start in the low six figures per year and scale up from there. They earn their price when your requirement is genuinely multi-domain, regulated-audit-trail, stewardship-UI, ML-auto-tuning master data management at tens of millions of records.
Most enterprises don’t have that requirement. They have one domain (customer, or product), one acute identity-resolution problem (three CRMs with overlapping accounts), and 10-50 million records to reconcile. For that shape of problem, a serverless MDM ledger built on AWS Lambda + RDS PostgreSQL costs roughly a tenth as much, runs inside the ETL pipelines rather than alongside them, and - when built the way this post describes - returns canonical IDs in under 50ms per row at production scale.
This is the pattern we built for the same fire and security systems integrator described in the AWS data-warehouse modernization case study - 14+ subsidiary ERPs feeding a shared customer ledger, ~11M enrichment calls per day, running on infrastructure that bills around $380 per month. Here is the architecture and the three design decisions that make it work.
Why roll your own (and when not to)
The serverless MDM pattern fits when:
- You have one or two domains. Customer identity is the common one. Product and vendor are close seconds. If you need five separate MDM domains with cross-references between them, commercial tools earn their keep.
- You’re in the 10K-50M master-record range. Above that, commercial ML auto-tuning and distributed processing start to matter. Below 10K you barely need a ledger - a dbt model does the job.
- Your team owns ETL. MDM works best when it’s an embedded call inside the pipeline, not a separate system with its own operators.
- You want vendor independence. The entire stack here is AWS primitives + open-source - no Profisee license, no Informatica account, no Tamr contract.
It does not fit when:
- You need a stewardship UI out of the box - Profisee and Reltio ship one; you’d build yours in Retool or Streamlit
- You need vendor-supported audit trails for a regulator - rolling your own puts that ownership on your team
- You’re operating at 100M+ records with sub-100ms requirements across multiple domains
- You don’t have the engineering capacity to own the matching logic long-term
If your shop is the AWS/Azure kind and identity resolution is what’s blocking a consolidated data warehouse, keep reading.
Architecture, top-down
Source ERPs / CRMs (Business Central, Dynamics 365, Salesforce, NetSuite, ...)
│
│ AWS DMS (CDC)
▼
S3 landing zone ←─ Glue Catalog
│
▼
Glue / PySpark ETL
│
│ HTTP invoke, batch of up to 500 rows
▼
AWS Lambda (MDM match/enrich service)
│
├──► ElastiCache Redis (hot-path lookup, 24h TTL)
│
└──► RDS Proxy ──► RDS PostgreSQL (the ledger)
│
├── golden_records
├── source_records
├── resolutions
├── survivorship_rules
└── audit_log
│
▼
ETL writes enriched row back to Silver zone (golden_id attached)
Two architectural decisions are doing the heavy lifting here - the Postgres ledger schema (next section) and the cache-in-front pattern (further down). Everything else is plumbing.
The Postgres ledger schema
Postgres, not DynamoDB. The MDM workload is heavy on JOINs (customer + address + phone + tax ID, all reconciled together), fuzzy string search (trigram and GIN indexes), and transactional writes when resolutions are committed. DynamoDB is great for key-value; it’s the wrong choice when 70% of your queries involve multi-column fuzzy matching.
The core tables:
-- Canonical customer. One row per real-world entity.
CREATE TABLE golden_records (
golden_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_name TEXT NOT NULL,
canonical_name_normalized TEXT NOT NULL, -- post 7-phase canonicalisation
address_line1 TEXT,
city TEXT,
state TEXT,
postal_code TEXT,
country_code CHAR(2),
tax_id_canonical TEXT, -- EIN / VAT / equivalent
survivorship_score INTEGER NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- One row per occurrence of the entity in a source system.
-- Many source_records → one golden_record.
CREATE TABLE source_records (
source_record_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
golden_id UUID NOT NULL REFERENCES golden_records(golden_id),
source_system TEXT NOT NULL, -- 'bc-integration', 'd365-fire', etc
source_primary_key TEXT NOT NULL, -- the natural key in the source
raw_name TEXT NOT NULL,
raw_address JSONB,
match_confidence NUMERIC(4,3), -- 0.000 - 1.000
matched_on TEXT[], -- ['canonical_name', 'tax_id']
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (source_system, source_primary_key)
);
-- Survivorship rules per field - decides which source wins on merge.
CREATE TABLE survivorship_rules (
field_name TEXT PRIMARY KEY, -- 'address_line1', 'tax_id', etc
priority_order TEXT[] NOT NULL -- ['bc-integration', 'd365-fire', 'onestream']
);
-- Append-only record of every resolution decision for audit.
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
source_record_id UUID NOT NULL,
golden_id UUID NOT NULL,
action TEXT NOT NULL, -- 'matched', 'new', 'split', 'merged'
confidence NUMERIC(4,3),
phase_matched SMALLINT, -- which of the 7 phases triggered the match
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Two index decisions that make this fast:
-- Trigram index for fuzzy name search. The primary lookup.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_golden_canonical_name_trgm
ON golden_records USING gin (canonical_name_normalized gin_trgm_ops);
-- Tax-ID hard match. Hits this first; falls through to trigram if miss.
CREATE UNIQUE INDEX idx_golden_tax_id
ON golden_records (tax_id_canonical)
WHERE tax_id_canonical IS NOT NULL;
Tax ID (or VAT/equivalent) is the one true hard identifier. If two rows share a tax ID, they’re the same company - period. About 60% of incoming source records carry one; for those the match is deterministic and the trigram index never runs. The remaining 40% - mostly sub-divisions, DBA names, or informal B2B sellers without a registered tax ID - flow through the 7-phase canonicalisation.
The 7-phase name canonicalisation
Source name columns are a mess. From one morning’s batch: ACME MANUFACTURING INC., Acme Mfg., Acme Manufacturing, Inc, ACME MFG (formerly Widgets Ltd), Acmé Manufacturing. All the same company. None match on equality.
The 7 phases run as a pipeline of pure functions. Phases 1-6 each return a progressively more canonical string; phase 7 returns a match result instead. Matching is tried against the ledger after each phase - if a phase hits a match with confidence above threshold, we stop; otherwise the next phase runs.
from typing import Callable
import unicodedata, re
PHASES: list[Callable[[str], str]] = [
phase_1_normalize_whitespace_case,
phase_2_strip_legal_suffixes,
phase_3_expand_abbreviations,
phase_4_strip_dba_parenthetical,
phase_5_unicode_nfd_accent_strip,
phase_6_token_sort,
phase_7_fuzzy_match, # returns a match result, not a string
]
Phase 1 - whitespace + case
def phase_1_normalize_whitespace_case(s: str) -> str:
return re.sub(r"\s+", " ", s).strip().lower()
Collapses runs of whitespace, strips leading/trailing, lowercases. ACME MFG becomes acme mfg.
Phase 2 - strip legal suffixes
LEGAL_SUFFIXES = {
"inc", "inc.", "incorporated",
"llc", "l.l.c.", "llc.",
"ltd", "ltd.", "limited",
"corp", "corp.", "corporation",
"co", "co.", "company",
"gmbh", "ag", "sa", "s.a.",
"pvt ltd", "pvt. ltd.", "private limited",
"sa de cv", "s.a. de c.v.",
"plc",
}
# Single tokens derived from the suffix list, dots stripped, so multi-word
# suffixes like "pvt ltd" or "s.a. de c.v." fall off token by token.
SUFFIX_TOKENS = {t.replace(".", "") for suffix in LEGAL_SUFFIXES for t in suffix.split()}

def phase_2_strip_legal_suffixes(s: str) -> str:
    tokens = s.split(",")[0].split()  # drop anything after the first comma
    while tokens and tokens[-1].lower().replace(".", "") in SUFFIX_TOKENS:
        tokens.pop()
    return " ".join(tokens)
acme manufacturing, inc becomes acme manufacturing. Comma-split is a cheap trick - most , Inc, , LLC variants sit after a comma.
Phase 3 - expand abbreviations
ABBREV_MAP = {
"mfg": "manufacturing",
"mfgr": "manufacturing",
"mfrs": "manufacturers",
"intl": "international",
"inds": "industries",
"assn": "association",
"tech": "technology",
"&": "and",
"n.": "north",
"s.": "south",
}
def phase_3_expand_abbreviations(s: str) -> str:
    # Look up the raw token first (catches dotted keys like "n."), then the
    # dot-stripped form (so "mfg." still hits the "mfg" key).
    return " ".join(
        ABBREV_MAP.get(tok, ABBREV_MAP.get(tok.rstrip("."), tok))
        for tok in s.split()
    )
acme mfg becomes acme manufacturing. The abbreviation map started at ~40 entries and grew to ~200 over 6 months of production - expand as you see new ones in the audit log.
Phase 4 - strip DBA / parenthetical
PAREN_PATTERN = re.compile(r"\s*\([^)]*\)\s*")
DBA_PATTERN = re.compile(r"\s+(?:dba|d/b/a|aka|formerly)\s+.*$", re.IGNORECASE)
def phase_4_strip_dba_parenthetical(s: str) -> str:
s = PAREN_PATTERN.sub(" ", s)
s = DBA_PATTERN.sub("", s)
return re.sub(r"\s+", " ", s).strip()
acme manufacturing (formerly widgets ltd) becomes acme manufacturing. Preserving the (formerly ...) value in a separate field is sometimes valuable; we do it in the source_records table’s raw JSONB, not the canonical.
Phase 5 - unicode NFD + accent strip
def phase_5_unicode_nfd_accent_strip(s: str) -> str:
decomposed = unicodedata.normalize("NFD", s)
return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
acmé manufacturing becomes acme manufacturing. Unicode normalisation decomposes accented characters into base+diacritic; filtering out combining-mark category (Mn) strips the diacritics while keeping the base letter.
Phase 6 - token sort
def phase_6_token_sort(s: str) -> str:
return " ".join(sorted(s.split()))
acme manufacturing inc and manufacturing acme inc are almost certainly the same company with reordered words (a common CRM data-entry variant, especially from phone-based data capture). Sorting tokens makes them compare-equal. This is the last deterministic phase before fuzzy matching.
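As a sanity check, here are phases 1-6 collapsed into one function and run against the messy examples from the start of this section - compact restatements of the code above, with the suffix and abbreviation sets trimmed to what these examples need:

```python
import re
import unicodedata

LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh", "plc"}
ABBREV_MAP = {"mfg": "manufacturing", "intl": "international", "&": "and"}

def canonicalise(s: str) -> str:
    # Phase 1: collapse whitespace, lowercase
    s = re.sub(r"\s+", " ", s).strip().lower()
    # Phase 2: strip legal suffixes (drop anything after the first comma)
    tokens = s.split(",")[0].split()
    while tokens and tokens[-1].rstrip(".") in LEGAL_SUFFIXES:
        tokens.pop()
    s = " ".join(tokens)
    # Phase 3: expand abbreviations
    s = " ".join(ABBREV_MAP.get(t.rstrip("."), t) for t in s.split())
    # Phase 4: strip DBA / parenthetical
    s = re.sub(r"\s*\([^)]*\)\s*", " ", s)
    s = re.sub(r"\s+(?:dba|d/b/a|aka|formerly)\s+.*$", "", s, flags=re.IGNORECASE)
    s = re.sub(r"\s+", " ", s).strip()
    # Phase 5: unicode NFD + accent strip
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if unicodedata.category(c) != "Mn")
    # Phase 6: token sort
    return " ".join(sorted(s.split()))

variants = [
    "ACME MANUFACTURING INC.",
    "Acme Mfg.",
    "Acme Manufacturing, Inc",
    "ACME MFG (formerly Widgets Ltd)",
    "Acmé Manufacturing",
]
assert {canonicalise(v) for v in variants} == {"acme manufacturing"}
```

All five spellings from the morning batch collapse to `acme manufacturing` without ever touching the fuzzy matcher.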
Phase 7 - fuzzy match
from rapidfuzz import fuzz
def phase_7_fuzzy_match(normalized: str, candidates: list[dict]) -> dict | None:
    best: dict | None = None
    best_score = 0.0
    for cand in candidates:
        score = fuzz.WRatio(normalized, cand["canonical_name_normalized"])
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= 88 else None
WRatio is rapidfuzz’s composite scorer - a weighted blend of simple ratio, partial ratio, and the token-sort/token-set variants - with a good default balance for short business-name strings. A threshold of 88 (out of 100) captures ~95% of true matches with under 1% false positives in our production data; anything below it goes to the manual review queue instead of auto-resolving.
Each phase’s match is recorded in audit_log.phase_matched. Over 11M enrichment calls/day, the distribution looks like:
| Phase | % of matches | Latency |
|---|---|---|
| Exact tax ID (pre-phase 1) | 58% | 3ms |
| Phase 1-3 (deterministic, case/suffix/abbrev) | 31% | 8ms |
| Phase 4-6 (DBA/accent/reorder) | 7% | 14ms |
| Phase 7 (fuzzy) | 3% | 35ms |
| Manual review queue | 1% | async |
The “easy” 89% of matches never touch fuzzy logic.
The Lambda function
Stack: Python 3.12 Lambda, 1024MB, 10s timeout. Cold-start latency matters because ETL calls this from across the pipeline - provisioned concurrency of 5 keeps 95% of invocations warm.
# handler.py (skeleton)
import os
import asyncpg, redis
from canonicaliser import canonicalise
from matcher import match
RDS_PROXY_ENDPOINT = os.environ["RDS_PROXY_ENDPOINT"]
CACHE_ENDPOINT = os.environ["CACHE_ENDPOINT"]
pool = None
cache = None
async def get_pool():
global pool
if pool is None:
pool = await asyncpg.create_pool(
host=RDS_PROXY_ENDPOINT,
database="mdm",
user=os.environ["DB_USER"],
password=os.environ["DB_PASSWORD"],
max_size=3,
)
return pool
def get_cache():
global cache
if cache is None:
cache = redis.Redis(host=CACHE_ENDPOINT, port=6379)
return cache
async def handle(event, context):
records = event["records"] # batch of up to 500
responses = []
pool = await get_pool()
cache_client = get_cache()
for rec in records:
canonical = canonicalise(rec["raw_name"])
# Cache check (see next section)
cached_id = cache_client.get(f"mdm:{canonical}")
if cached_id:
responses.append({
"source_primary_key": rec["source_primary_key"],
"golden_id": cached_id.decode(),
"action": "cache_hit",
"confidence": 1.0,
})
continue
# Cache miss: hit the ledger
result = await match(pool, rec, canonical)
cache_client.setex(f"mdm:{canonical}", 86400, result["golden_id"])
responses.append(result)
    return {"resolved": responses}

def handler(event, context):
    # Lambda's Python runtime invokes a sync entrypoint; drive the async core.
    import asyncio
    return asyncio.run(handle(event, context))
Connection pooling via RDS Proxy
Lambda’s concurrency model is hostile to direct Postgres connections - a spike to 200 concurrent invocations can exhaust even a generous max_connections. RDS Proxy sits between Lambda and RDS, multiplexing connections so each Lambda invocation borrows an existing connection from a pool. Without RDS Proxy, this pattern does not scale past a few hundred invocations per second.
Batch shape
The contract:
// Request
{
"records": [
{"source_system": "bc-integration", "source_primary_key": "CUST-001234",
"raw_name": "ACME MANUFACTURING INC.", "tax_id": "12-3456789",
"raw_address": {"line1": "100 Main St", "city": "Austin", ...}},
... up to 500 rows
]
}
// Response
{
"resolved": [
{"source_primary_key": "CUST-001234",
"golden_id": "b3a8f2c0-7e1a-4d9f-8e63-1f2a3b4c5d6e",
"action": "matched",
"confidence": 0.940,
"phase_matched": 2}
]
}
Batching to 500 keeps each Lambda invocation busy for ~3-4 seconds - enough to amortise cold-start cost and pool overhead, small enough that API Gateway doesn’t time out.
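On the Glue/PySpark side the calling pattern is plain chunking - a sketch with stdlib HTTP, where the request and response fields follow the contract above but the function names and endpoint are illustrative:

```python
import json
import urllib.request
from typing import Iterator

BATCH_SIZE = 500

def chunked(rows: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield fixed-size batches; the last batch may be short."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def enrich(rows: list[dict], endpoint: str) -> list[dict]:
    """POST each batch to the MDM Lambda and collect the resolved rows."""
    resolved: list[dict] = []
    for batch in chunked(rows):
        req = urllib.request.Request(
            endpoint,
            data=json.dumps({"records": batch}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            resolved.extend(json.load(resp)["resolved"])
    return resolved
```

In the real pipeline this runs from a mapPartitions-style step so batches are built per Spark partition rather than collected to the driver.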
The ledger-hit caching pattern
The design decision that takes this from “works” to “production-scale” is caching.
Without caching. Every ETL row hits the Lambda, Lambda hits RDS via Proxy, Postgres runs the trigram/GIN query, returns. Typical p95: 85-110ms. At 11M calls/day spread across 16 hours, that’s a peak of ~200 calls/second - well within what RDS can handle, but the tail latency creates backpressure in the ETL orchestrator and 2-hour pipelines stretch to 4.
With caching. 80%+ of enrichment calls within any 24-hour window are repeat lookups of the same canonicalised name. ElastiCache Redis sits in front of the ledger with:
- Cache key: mdm:&lt;canonicalised_name&gt; (i.e., the Phase 6 output)
- Cache value: the golden_id UUID
- TTL: 24 hours
Hit rate in production: 82%. Those 82% of calls complete in 6-12ms end-to-end (network + Redis roundtrip). The remaining 18% hit the ledger and take the 85-110ms path - but 82% at 8ms + 18% at 100ms yields a weighted average of ~25ms, p95 of ~42ms, across the full workload.
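The weighted-average claim is just arithmetic, spelled out:

```python
hit_rate = 0.82
cache_ms, ledger_ms = 8, 100                      # representative per-path latencies
weighted = hit_rate * cache_ms + (1 - hit_rate) * ledger_ms
assert round(weighted) == 25                      # ~25ms across the full workload
```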
Cache invalidation without the complexity
Caches without invalidation lie. The risk: a write to the ledger updates a golden record, cache still returns the stale ID, ETL pipeline writes wrong data downstream.
The pattern we use - the only one simple enough to trust:
-- Postgres trigger on golden_records write
CREATE OR REPLACE FUNCTION notify_mdm_cache_invalidate() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify(
'mdm_cache_invalidate',
json_build_object(
'golden_id', NEW.golden_id,
'canonical_name', NEW.canonical_name_normalized
)::text
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_golden_records_cache_invalidate
AFTER INSERT OR UPDATE ON golden_records
FOR EACH ROW EXECUTE FUNCTION notify_mdm_cache_invalidate();
A tiny ECS listener holds a LISTEN mdm_cache_invalidate on Postgres. On notification it deletes the matching cache key (and any aliases stored against it). Total time from ledger write to cache invalidation: ~50ms.
If you prefer not to run the listener: SNS → SQS → Lambda also works, at the cost of about 2-3 seconds of invalidation lag - still well under any ETL pipeline’s cycle time.
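The ECS listener itself is small. A sketch using asyncpg's LISTEN support - names are illustrative, the third-party import is deferred so the pure payload-to-key mapping stays testable on its own, and alias cleanup is elided:

```python
import asyncio
import json

def invalidation_key(payload: str) -> str:
    """Map a pg_notify JSON payload to the Redis key the listener deletes."""
    return f"mdm:{json.loads(payload)['canonical_name']}"

async def run_listener(dsn: str, redis_client) -> None:
    """Hold LISTEN mdm_cache_invalidate and delete cache keys on notify."""
    import asyncpg  # deferred: only needed at runtime, not for the sketch

    conn = await asyncpg.connect(dsn)

    def on_notify(connection, pid, channel, payload):
        # Aliases stored against the golden_id would be cleaned here too.
        redis_client.delete(invalidation_key(payload))

    await conn.add_listener("mdm_cache_invalidate", on_notify)
    while True:
        await asyncio.sleep(60)  # keep the LISTEN connection alive
```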
As a drop-in replacement for Profisee or Tamr
The Profisee / Tamr integration surface is an HTTP match API: “here is a customer record, give me the canonical ID.” Our Lambda exposes the same contract. Migrating an existing Profisee integration looks like:
- Stand up the Lambda + ledger alongside Profisee (parallel, zero traffic)
- Backfill the ledger from Profisee’s existing golden records (one-time Python script)
- Route traffic from one source system at a time through the Lambda (A/B test)
- Compare Lambda vs Profisee match decisions in a shadow log; converge the thresholds
- Once parity is >99%, cut Profisee off that source; repeat for the next source
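The parity check in steps 4-5 reduces to joining the two shadow logs on source key and counting agreement - a sketch that assumes the backfill left a 1:1 golden-ID mapping between the two systems:

```python
def parity(lambda_log: dict[str, str], profisee_log: dict[str, str]) -> float:
    """Fraction of shared source keys where both systems resolved to the same golden ID.

    Each log maps source_primary_key -> golden_id; keys seen by only one system
    don't count toward parity.
    """
    shared = lambda_log.keys() & profisee_log.keys()
    if not shared:
        return 0.0
    agree = sum(1 for k in shared if lambda_log[k] == profisee_log[k])
    return agree / len(shared)
```

Disagreements are the interesting output: each one is either a threshold to converge or a genuine match the other system missed.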
The migration took 6 months for the integrator customer. The Profisee license came up for renewal and did not get renewed.
What you trade vs the commercial tools
Honest comparison:
| Capability | Serverless MDM (this post) | Profisee / Tamr / Reltio |
|---|---|---|
| Multi-domain (customer + product + vendor + contract) | One ledger per domain, separate deploys | Single platform |
| Data stewardship UI | Build in Retool / Streamlit (~1 week) | Shipped out of the box |
| ML auto-tuning of thresholds | You tune manually against the audit log | Vendor auto-tunes |
| Vendor-supported audit for regulator | You own it; vendor-signed report not available | Vendor signs audit |
| Entity-level 360 lineage | Custom queries on audit_log | Built-in dashboard |
| Cost at 10M records, 10M calls/day | ~$400/month AWS infra | $150K-500K/year license |
| Time to first production load | 2-3 months | 6-9 months (implementation partner) |
| Engineering ownership | Yours | Shared with vendor |
The capabilities are not equivalent. But when the missing capabilities aren’t actually requirements, the cost gap is worth a hard look.
Production numbers
From the security integrator engagement, steady state after 9 months:
- ~11M enrichment calls/day, peaking around 250/sec during the nightly ETL window
- Ledger size: 2.3M golden customer records, 7.8M source records
- Cache hit rate: 82% (24h TTL)
- Latency: p50 8ms, p95 42ms, p99 78ms
- Infra cost: Lambda ~$120/month + RDS db.t4g.medium ~$80 + ElastiCache cache.t4g.micro ~$15 + RDS Proxy ~$15 + misc ~$150 = ~$380/month
- Manual review queue: ~1% of calls, handled daily by a single data steward in a Streamlit app
Set that against the Profisee renewal quote it replaced, and the conversation with the CFO was a short one.
Where to start
If you’re on AWS and evaluating Profisee / Tamr / Informatica for a single-domain customer-identity problem, run this pattern as a prototype first - a 3-week build covering the ledger plus canonicalisation phases 1-3. That gets you 89% of the matches at ~15ms latency, on a few hundred dollars of monthly infra.
If it clears your accuracy bar, you can push to phase 7 + caching later (another 3-4 weeks). If it doesn’t, you’ve spent 3 weeks on something you can keep as your staging / dev-tier MDM even after buying the vendor tool.
Either way, Algoscale’s data warehouse consulting practice ships this MDM pattern as part of every modernization engagement - the ledger and the warehouse get built together so identity resolution is baked in, not bolted on. It’s how the 180-day M&A data consolidation playbook resolves identity across acquired entities at Phase 3, and how the hybrid row-level security architecture knows which subsidiary to filter on. When you need deeper data integration consulting on top of the warehouse - CDC, streaming, reverse ETL to operational systems - the same team owns that surface. MDM isn’t a separate purchase on the stack; it’s a component in it.
Data Engineer
Mukesh is a Data Engineer at Algoscale building the deep-plumbing pieces of enterprise data platforms across AWS and Azure - MDM ledgers, CDC pipelines, Lake Formation access controls, Fabric semantic models. Writes from the production side of the stack.