Insurance Claims Classification with LLMs

A claims intake queue at a major insurance carrier opens on a Monday morning with several thousand documents that arrived over the weekend — first-notice-of-loss forms, adjuster narratives, medical records, repair estimates, police reports, photos of damage, witness statements, prior-policy disclosures, lien releases, and the dozens of supplementary artifacts that attach to any non-trivial claim. Before any downstream work can happen — coverage check, fraud screen, adjuster assignment, payment authorization — every document has to be classified, sorted, and routed. A human reviewer can do this. A team of human reviewers can do this. Sixty human reviewers, distributed across three time zones, can almost keep up with the inflow. What sixty reviewers cannot do is agree with each other on what counts as a “supplementary medical disclosure” versus a “treatment record” versus a “billing statement” — and the disagreement quietly shapes every downstream metric the carrier reports on cycle time, leakage, and reserve adequacy.

This is the problem we set out to solve on a recent engagement: an AI system that classifies incoming claims documents across 61 distinct labels, with the constraint that a single document can legitimately belong to several categories at once. Single-label classification — “pick one bucket” — is the textbook version that makes for clean Kaggle notebooks. Multi-label classification is the version that matches the actual document mix in a real claims operation, and it changes almost every assumption a team brings to the problem.

The interesting story of this project is not the model. It is the discipline around measurement, the architectural calls about where data lives, the deliberate bake-off between a classical machine-learning track and an LLM-based track, and the operational scaffolding that took an experiment into a production-grade triage system. The model was almost the easy part.

Below is the architecture and the engineering discipline that got the system to a place where it can be deployed inside a regulated industry’s existing generative AI stack — auditable, reproducible, and honest about what it does and doesn’t know.

Multi-label changes the problem, and most teams underweight that

The reflex when an ML team sees “classify documents into 61 categories” is to reach for a softmax head and an argmax decision rule. That reflex works for single-label problems and quietly produces wrong answers everywhere else.

A claims document that contains both a billing statement and a treatment record is both labels at once. The supervised signal for this document is two labels, not one. The loss has to be binary cross-entropy over an independent sigmoid per label, not categorical cross-entropy over a softmax across labels. The decision rule has to be a per-label threshold, not an argmax. And the evaluation metrics have to be multi-label-native (example-based F1, micro-F1, macro-F1, Hamming loss, exact-match accuracy), because single-label accuracy reports a number that does not mean what the team thinks it means.
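To make the loss distinction concrete, here is a minimal sketch in plain Python. The three label names and the logits are made up for illustration; a real model head would emit one logit per label across all 61.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs: list[float]) -> list[float]:
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(probs: list[float], targets: list[int]) -> float:
    """Mean per-label binary cross-entropy: each label is its own yes/no problem."""
    eps = 1e-12
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical document that is both a treatment record and a billing statement.
labels = ["treatment_record", "billing_statement", "police_report"]
logits = [2.0, 1.8, -3.0]
targets = [1, 1, 0]

sigmoid_probs = [sigmoid(z) for z in logits]   # independent per label
softmax_probs = softmax(logits)                # forced to sum to 1
loss = binary_cross_entropy(sigmoid_probs, targets)
```

With these illustrative logits, softmax makes the two true labels compete for the same probability mass (roughly 0.55 and 0.45), while the independent sigmoids can score both of them high (roughly 0.88 and 0.86).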

Three downstream consequences fall out of this and shape the rest of the architecture:

Argmax fails silently. If the model assigns 0.62 to “treatment record” and 0.58 to “billing statement” on a document that genuinely is both, argmax returns the first and quietly drops the second. The downstream router never sees the billing label, the routing rule for billing never fires, and the claim takes a slower path. In a single-label setting a wrong pick shows up as a visible accuracy miss. In multi-label it’s a hidden defect that compounds across the queue every shift.
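The failure is easy to reproduce with the scores quoted above. This sketch uses hypothetical label identifiers and a placeholder threshold of 0.5 for every label:

```python
# Scores for one document that is genuinely both labels (third label added for contrast).
scores = {"treatment_record": 0.62, "billing_statement": 0.58, "police_report": 0.03}

def argmax_decision(scores: dict[str, float]) -> list[str]:
    """Single-label rule: keep only the top-scoring label."""
    return [max(scores, key=scores.get)]

def threshold_decision(
    scores: dict[str, float], thresholds: dict[str, float], default: float = 0.5
) -> list[str]:
    """Multi-label rule: keep every label whose score clears its own threshold."""
    return sorted(
        label for label, s in scores.items() if s >= thresholds.get(label, default)
    )

# Placeholder thresholds; in practice each label gets its own tuned value.
thresholds = {"treatment_record": 0.5, "billing_statement": 0.5, "police_report": 0.5}

print(argmax_decision(scores))                 # billing label is dropped
print(threshold_decision(scores, thresholds))  # both labels survive
```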

Threshold selection is per-label, not global. A label that occurs in 20% of documents needs a different decision threshold from a label that occurs in 0.5%. Tuning each label’s threshold independently, typically by maximizing F1 on a held-out validation slice, is how you get the precision/recall mix the business actually wants per category. Where each threshold lands is empirical: a rare label may need a higher threshold to control false positives, or a lower one if the model systematically under-scores it, and a single global cutoff cannot accommodate both.
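A minimal sketch of per-label threshold tuning, assuming a simple grid search that maximizes F1 on validation data. The label names, scores, and grid are illustrative only:

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, grid=None) -> tuple[float, float]:
    """Return (threshold, f1) with the highest F1 for one label; ties keep the lowest threshold."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        f1 = f1_score(y_true, [1 if s >= t else 0 for s in scores])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

def tune_thresholds(val_truth: dict, val_scores: dict) -> dict[str, float]:
    """Tune every label independently on the validation slice."""
    return {
        label: best_threshold(val_truth[label], val_scores[label])[0]
        for label in val_truth
    }

# Toy validation data: one common label, one rare label.
val_truth = {
    "billing_statement": [1, 1, 0, 1, 0, 0],
    "lien_release": [0, 0, 0, 0, 1, 0],
}
val_scores = {
    "billing_statement": [0.9, 0.7, 0.4, 0.35, 0.2, 0.1],
    "lien_release": [0.05, 0.1, 0.02, 0.3, 0.22, 0.01],
}
tuned = tune_thresholds(val_truth, val_scores)
```

On this toy data the two labels settle on different cutoffs, and a global 0.5 would never fire on the rare label at all, since its highest score is 0.3.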

Evaluation has to honor co-occurrence. Two labels that frequently appear together in real-world documents need to be evaluated together — splitting them across slices will hide the model’s actual confusion behavior. More on this below; it’s where teams accidentally make their numbers look better than the field reality.

Treating evaluation as the hard problem, not the model

The recurring theme of this project, and the one we’d encourage any team doing high-stakes classification work to internalize, is that the discipline around measurement mattered more than any single model choice. The framing that anchored the whole engagement:

Pick the metric first. Freeze the evaluation set. Then talk about models.

Four specific calls under that framing did most of the heavy lifting:

Macro-F1 as the honest headline metric. Micro-F1 weights every prediction equally, which means a model that does well on the common labels and terribly on the rare ones can look great on micro-F1 while being unusable in production — because the rare labels are usually the ones that change a downstream routing decision. Macro-F1 weights every label equally, which forces the model to earn its score on the hard parts of the distribution. Without a known production label distribution to weight against, macro-F1 is the metric that tells the truth. We kept micro-F1 and example-based F1 as multi-label-native complements, and exact-match accuracy as a hard-mode reference, but macro-F1 was the number that governed the build.
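The gap between the two averages shows up even with two labels. The counts below are invented for illustration: one common label the model handles well, one rare label it handles badly.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Average of per-label F1: every label counts equally."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """F1 over pooled counts: every individual decision counts equally."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)

# Hypothetical (tp, fp, fn) per label.
counts_by_label = {
    "billing_statement": (90, 5, 10),  # common, F1 about 0.92
    "lien_release": (1, 2, 4),         # rare, F1 = 0.25
}
macro = macro_f1(counts_by_label)  # about 0.59
micro = micro_f1(counts_by_label)  # about 0.90
```

Micro-F1 stays near 0.90 because the common label dominates the pooled counts; macro-F1 drops to about 0.59 because the rare label's 0.25 carries half the weight.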

A frozen evaluation set, locked from day one. The temptation when a model underperforms is to “improve the eval set” — re-label a few examples, drop a confusing slice, swap in new documents. Every one of those edits silently inflates the metric and destroys reproducibility. We froze the evaluation set before training started, version-controlled it inside the lakehouse, and treated any change to it as a separate code change with its own review. The evaluation set is not where you make the model look good. It’s where you find out whether the model is good.
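One lightweight way to make a freeze enforceable is to record a content fingerprint and fail loudly when it changes. This is a sketch under assumptions, not a description of the engagement's tooling: the example records and field names are hypothetical, and it assumes the evaluation set is JSON-serializable.

```python
import hashlib
import json

def fingerprint(examples: list[dict]) -> str:
    """SHA-256 over canonical JSON lines; independent of example order."""
    canonical = sorted(
        json.dumps(ex, sort_keys=True, separators=(",", ":")) for ex in examples
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

eval_set = [
    {"doc_id": "d1", "labels": ["billing_statement", "treatment_record"]},
    {"doc_id": "d2", "labels": ["police_report"]},
]
frozen = fingerprint(eval_set)  # store this alongside the reported metric

# A quiet relabel changes the fingerprint, so the edit cannot go unnoticed.
relabeled = [
    {"doc_id": "d1", "labels": ["billing_statement"]},
    {"doc_id": "d2", "labels": ["police_report"]},
]
changed = fingerprint(relabeled) != frozen
```

Checking the stored fingerprint at the start of every scoring run turns "we agreed not to touch it" into something a reviewer can verify.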

Per-label thresholds tuned on a separate validation split. Tuning thresholds on the test set leaks information and produces a number that won’t replicate. A held-out validation slice — disjoint from both training and final evaluation — was used to select the per-label thresholds; the test set was touched once, at the end, to report the honest number.
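Keeping the three splits disjoint is simpler when assignment is a pure function of a stable identifier. The sketch below assumes a hash-based assignment with hypothetical 10% validation and 10% test shares; the source does not say how the splits were actually drawn. If segments from the same claim packet could leak across splits, hashing a packet-level ID rather than a segment-level ID is the safer assumption.

```python
import hashlib

def assign_split(doc_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an ID to exactly one of train / validation / test."""
    bucket = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"

# Hypothetical packet IDs; the same ID always lands in the same split.
splits = {f"packet-{i:04d}": assign_split(f"packet-{i:04d}") for i in range(1000)}
```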

A random-sample evaluation set alongside the curated gold set. This one is non-obvious and matters. A curated gold set, hand-selected to cover all 61 labels, gives you confidence on rare classes but distorts the precision number: in the real document stream the rare classes are rare, and a model that fires on rare classes too eagerly will rack up false positives that the curated gold set never sees. A random evaluation set drawn from production-mirror inflow, even if it under-samples the rare labels, gives you an honest read on what real-world precision will look like. Using both sets together, curated gold for recall coverage and random sample for precision honesty, was the discipline that produced a metric the deployment team could defend.
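The distortion follows directly from how precision depends on prevalence. For a fixed recall and false-positive rate, expected precision is recall × prevalence / (recall × prevalence + FPR × (1 − prevalence)). The rates and prevalences below are invented to illustrate the effect:

```python
def expected_precision(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Precision implied by a label's recall, false-positive rate, and prevalence."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical model behavior on a rare label: 90% recall, 2% false-positive rate.
curated = expected_precision(0.9, 0.02, prevalence=0.10)      # label over-sampled in gold set
production = expected_precision(0.9, 0.02, prevalence=0.005)  # label at its real-world rate

print(f"curated gold precision: {curated:.2f}")    # about 0.83
print(f"random-sample precision: {production:.2f}")  # about 0.18
```

Nothing about the model changed between the two lines; only the share of documents that carry the label did.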

The single most important takeaway here is that the discipline can’t wait for the tooling. Frozen evaluation sets and one agreed metric were treated as non-negotiable before any MLflow tracking, any experiment-management UI, any model registry was set up. The right metric on a CSV in a Git repo beats the wrong metric in a fancy MLOps platform every time.

Smart data slicing — confusion matrices over label co-occurrence

For phased delivery — we couldn’t ship all 61 labels at once — labels were grouped into delivery tiers using F1 thresholds, easiest first. The obvious slicing is alphabetical or by frequency. The slicing that actually works is by confusion-matrix and co-occurrence analysis.

The mechanism: build a confusion matrix and a co-occurrence matrix across the full label set. Two labels with high confusion (the model swaps them often) or high co-occurrence (they routinely appear together in the same document) belong in the same delivery slice. If they’re split across slices, the team ships slice 1, reports a high F1, ships slice 2 a quarter later, and only then discovers that the slice-1 number was inflated because the slice-1 evaluation never included the labels the model confuses them with. The honest failure modes only emerge when confusing labels are evaluated together.

This sounds procedural and turns out to be one of the highest-leverage decisions a multi-label team makes. The default-of-record for slicing should be: cluster confusing/co-occurring labels into the same slice, even if it means a less even cardinality per slice.
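A minimal sketch of the slicing mechanism, assuming co-occurrence is counted from gold label sets and confused pairs come from a separate confusion analysis. The documents, the confused pair, and the minimum count of 2 are all illustrative; linked labels are merged with a small union-find.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs: list[set[str]]) -> Counter:
    """Count how often each unordered pair of labels shares a document."""
    pairs = Counter()
    for labels in docs:
        for a, b in combinations(sorted(labels), 2):
            pairs[(a, b)] += 1
    return pairs

def cluster_labels(labels: list[str], linked_pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Merge labels connected by any link into the same delivery slice."""
    parent = {label: label for label in labels}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in linked_pairs:
        parent[find(a)] = find(b)
    groups: dict[str, set[str]] = {}
    for label in labels:
        groups.setdefault(find(label), set()).add(label)
    return sorted(sorted(g) for g in groups.values())

docs = [
    {"treatment_record", "billing_statement"},
    {"treatment_record", "billing_statement"},
    {"police_report"},
    {"repair_estimate", "photo_of_damage"},
    {"repair_estimate"},
    {"witness_statement"},
]
confused_pairs = [("police_report", "witness_statement")]  # hypothetical confusion finding

all_labels = sorted(set().union(*docs))
linked = [pair for pair, n in cooccurrence_counts(docs).items() if n >= 2] + confused_pairs
slices = cluster_labels(all_labels, linked)
```

Here the billing and treatment labels land in one slice because they co-occur twice, the police report and witness statement land together because of the confusion link, and the remaining labels stay on their own. The resulting slices are uneven in size, which is the trade-off the paragraph above accepts.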

Medallion architecture, but for unstructured documents

The lakehouse design for this project lived on Databricks Unity Catalog, in a dedicated ML catalog separate from the operational analytics catalog. Bronze → silver → gold, the standard medallion shape — but two engineering calls deserve more attention than they usually get in introductory write-ups.

Document binaries live in Unity Catalog Volumes, not Delta tables. The reflexive instinct is to stuff the PDF bytes (or a base64 string) into a column and let Delta handle storage. This works in a notebook and falls over at production scale: Delta tables are backed by columnar Parquet files tuned for scanning and filtering tabular data, large binaries blow up the file sizes and the query planner, and the access patterns for documents (open, read, OCR) don’t match the access patterns for tabular data (scan, filter, aggregate). Volumes give you file-style access to object storage with governance, lineage, and ACLs in one place. Documents stay where they belong; the Delta tables hold metadata, segmentation outputs, predictions, and labels.
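The split can be sketched as a metadata record that points at the binary instead of containing it. The `/Volumes/<catalog>/<schema>/<volume>/...` path shape is the Unity Catalog convention; the catalog, schema, and volume names and the record fields below are hypothetical.

```python
import hashlib
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class DocumentRecord:
    """One metadata row; the PDF bytes themselves stay in the Volume."""
    doc_id: str
    volume_path: str
    sha256: str
    size_bytes: int
    content_type: str

def make_record(
    doc_id: str,
    payload: bytes,
    catalog: str = "ml",              # hypothetical names
    schema: str = "claims",
    volume: str = "raw_documents",
) -> DocumentRecord:
    path = f"/Volumes/{catalog}/{schema}/{volume}/{doc_id}.pdf"
    return DocumentRecord(
        doc_id=doc_id,
        volume_path=path,
        sha256=hashlib.sha256(payload).hexdigest(),  # integrity check for audits
        size_bytes=len(payload),
        content_type="application/pdf",
    )

record = make_record("claim-001", b"%PDF-1.7 placeholder bytes")
row = asdict(record)  # what would be written to the metadata table
```

The row stays a few hundred bytes regardless of how large the document is, so scans over the metadata table never pay for the binaries.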

Model metadata lives in the gold layer. Every prediction the system produces is written with the confidence score, the model version, the run ID, the segmentation version, and the inference timestamp. This sounds like operational overhead until the first regulatory audit asks the question “which model produced this routing decision on July 14?” — and the answer can be a deterministic query against the gold layer instead of a forensic spelunking exercise. In a regulated industry, this is the difference between a system that ships and a system that gets blocked at the governance review.
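The audit question becomes a plain query once every prediction row carries its provenance. In this sketch an in-memory SQLite table stands in for the gold-layer table; the table name, column names, version strings, and the year on the date are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        doc_id TEXT, label TEXT, confidence REAL,
        model_version TEXT, run_id TEXT,
        segmentation_version TEXT, inferred_at TEXT
    )
    """
)
rows = [
    ("claim-001-seg-2", "billing_statement", 0.81, "clf-2.3.1", "run-0187", "seg-1.4", "2025-07-14T09:15:00"),
    ("claim-001-seg-2", "treatment_record", 0.64, "clf-2.3.1", "run-0187", "seg-1.4", "2025-07-14T09:15:00"),
    ("claim-001-seg-2", "billing_statement", 0.77, "clf-2.2.0", "run-0150", "seg-1.3", "2025-06-02T11:40:00"),
    ("claim-002-seg-1", "police_report", 0.93, "clf-2.3.1", "run-0187", "seg-1.4", "2025-07-14T09:16:00"),
]
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# "Which model produced this routing decision on July 14?"
audit = conn.execute(
    """
    SELECT DISTINCT model_version, run_id, segmentation_version
    FROM predictions
    WHERE doc_id = ? AND date(inferred_at) = ?
    """,
    ("claim-001-seg-2", "2025-07-14"),
).fetchall()
```

The query returns a single provenance tuple for that document and day, even though the same document was also scored by an earlier model version in June.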

The data engineering work to do this cleanly — Volumes for binaries, Delta for metadata, model version stamped on every prediction, lineage from raw document through final routing — is the unglamorous foundation that makes everything else auditable.

Classical ML versus LLM — a deliberate bake-off

The framing question on every modern document-classification engagement is: do we use a supervised classical model (a linear model over TF-IDF, gradient boosting over hand-crafted features, a transformer encoder fine-tuned on the labels) or an LLM (zero-shot, few-shot, or a fine-tuned prompt-based classifier)?

The honest answer is: you don’t know until you measure. The fashionable answer in 2026 is “LLM.” The fashionable answer is sometimes right and sometimes catastrophically wrong, and the only way to find out which applies to your problem is to run both tracks in parallel against the same frozen evaluation set, with the same metric.

We structured the engagement as two parallel model tracks:

Classical track. Document segmentation → text extraction → feature engineering → multi-label model head. Inference is cheap, fast, and deterministic. The model is small enough to retrain on a single node and ship to a serverless endpoint without GPU. Failure modes are familiar; debugging is straightforward.
LLM track. Document segmentation → prompt construction → LLM inference → structured output parsing → per-label thresholding on the LLM’s confidence proxy. Inference is more expensive and slower; the model can reason about labels it hasn’t explicitly seen in training; debugging shifts from feature analysis to prompt analysis and output-parsing failure.
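The last two steps of the LLM track, structured output parsing and per-label thresholding, can be sketched without any model call. This assumes the prompt asks for a JSON object mapping label names to confidences between 0 and 1; that output contract, the label set, and the thresholds are assumptions, not the engagement's actual format.

```python
import json

ALLOWED_LABELS = {"billing_statement", "treatment_record", "police_report"}

def parse_llm_output(raw: str, allowed: set[str]) -> dict[str, float]:
    """Parse label -> confidence JSON, dropping unknown labels and malformed values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # treat unparseable output as "no labels"; log it in practice
    if not isinstance(data, dict):
        return {}
    scores: dict[str, float] = {}
    for label, value in data.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if label in allowed and is_number and 0.0 <= value <= 1.0:
            scores[label] = float(value)
    return scores

def apply_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Keep labels whose confidence proxy clears that label's threshold."""
    return sorted(l for l, s in scores.items() if s >= thresholds.get(l, 0.5))

raw_response = '{"billing_statement": 0.81, "treatment_record": 0.4, "made_up_label": 0.9}'
scores = parse_llm_output(raw_response, ALLOWED_LABELS)
predicted = apply_thresholds(scores, {"billing_statement": 0.5, "treatment_record": 0.5})
```

The hallucinated label is discarded at parse time rather than reaching the router, which is the kind of output-parsing failure the LLM track has to account for explicitly.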

The two tracks shared a preprocessing layer — document segmentation — which is itself one of the higher-leverage pieces of engineering on the project. A 40-page claim packet is not one document; it’s a stitched-together envelope of distinct artifacts, each of which gets its own classification. Segmenting the packet before classification cleans the input distribution dramatically and makes both tracks materially better.

A dedicated model-comparison stage then consumed the outputs of both tracks against the frozen evaluation set, producing a side-by-side report on macro-F1, per-label F1, latency, cost-per-document, and qualitative failure-mode analysis. The selection was evidence-based, not narrative-based. The deliverable was not a model. It was a model selection report with the methodology documented well enough that the carrier’s internal review board could replicate it.

Worth saying out loud: the bake-off is the right answer even when you’re confident which track will win. The losing track produces baseline numbers that make the winning track’s claims defensible. A win without a baseline is a marketing number, not an engineering result.

From experiment to operations — phased delivery and SME loops

The work was structured across four delivery phases, each with explicit exit criteria:

Baseline iteration and model-assisted gold collection. Establish a baseline on each track using whatever labels exist. Use the baseline to surface candidate examples for the labelers — model-assisted labeling at the front end can roughly double the throughput of a labeling team without sacrificing quality, provided the labelers review the model’s suggestion rather than rubber-stamping it.
Training, evaluation, and model selection. Both tracks built out, frozen evaluation set scored, per-label thresholds tuned, model-comparison report delivered. MLflow tracking comes in here once the volume of experiments warrants it — at five experiments a CSV is fine; at five hundred it’s malpractice.
End-to-end deployment with SME feedback loops. The model is deployed behind the routing layer, and a sample of its classifications is flagged in parallel for SME (subject-matter expert) review. The feedback loop is explicit: the SMEs are not labeling new training data; they are flagging production drift. Drift detected in this loop triggers a retraining cycle.
Observability and continuous improvement. Per-label precision and recall tracked continuously, threshold drift monitored, segmentation drift monitored. New labels added under controlled rollout — same discipline of confusion-matrix-driven slicing applies to each new tranche.
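The monitoring in the last two phases can be sketched as a comparison between baseline per-label precision and the precision observed in the SME review sample. The baseline values, the review records, and the 0.05 tolerance are hypothetical; a production check would also want a minimum number of reviews per label before raising a flag.

```python
from collections import defaultdict

def precision_by_label(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of predicted labels the SME agreed with, per label."""
    agreed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for label, sme_agrees in reviews:
        total[label] += 1
        agreed[label] += int(sme_agrees)
    return {label: agreed[label] / total[label] for label in total}

def drifted_labels(
    baseline: dict[str, float], observed: dict[str, float], tolerance: float = 0.05
) -> list[str]:
    """Labels whose observed precision fell more than `tolerance` below baseline."""
    return sorted(
        label
        for label, p in observed.items()
        if label in baseline and baseline[label] - p > tolerance
    )

baseline = {"billing_statement": 0.92, "lien_release": 0.80}
reviews = (
    [("billing_statement", True)] * 9 + [("billing_statement", False)]
    + [("lien_release", True)] * 3 + [("lien_release", False)] * 2
)
observed = precision_by_label(reviews)
drifted = drifted_labels(baseline, observed)
```

Only the label whose precision dropped well below its baseline is flagged; a small dip within tolerance on the common label does not trigger a retraining cycle.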

The pattern that makes this work in a regulated context, and which the insurance services team had to defend in the governance review, is that every prediction is reproducible. The same input, model version, segmentation version, and threshold version produce the same output, every time. The randomness in the pipeline is contained to training; inference is deterministic. That property is what makes the audit trail meaningful instead of theatrical.

The open question: batch versus real-time scoring

Worth flagging honestly, because most teams will face the same trade-off: the architectural call between batch scoring (a nightly or micro-batch job over accumulated intake) and real-time scoring (per-document inference at the moment of intake) was deliberately left open through the build phase, and it deserves careful thought.

Batch is cheaper, easier to operate, easier to monitor, and adequate for downstream consumers that aggregate by day or shift. Real-time is more expensive, harder to operate, but lets the routing layer act on the classification within seconds of intake — which matters if the carrier’s SLA on first-touch is measured in minutes. Most teams pick real-time because it sounds more impressive and then discover that 90% of their downstream consumers are happy with batch. A minority pick batch because it’s cheaper and then discover that the routing SLA can’t be hit without sub-hour classification.

The honest answer is that the right choice depends on the SLA the routing layer commits to downstream, the cost envelope, and the variance in inflow volume. A hybrid — batch as the default, with a real-time fast-path for documents flagged as high-priority on intake — is often the architecturally sound landing place but should be designed in rather than retrofitted. The discipline here is the same as the rest of the project: don’t pick the answer that sounds best; pick the answer that the SLA and the cost model jointly support, and measure the trade-off in production.
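The hybrid can be expressed as a small routing rule. This sketch assumes two inputs that the text names only in general terms: a high-priority flag set at intake and a first-touch SLA in minutes. It also treats the batch interval as the worst-case wait, ignoring processing time; the function and parameter names are hypothetical.

```python
def choose_scoring_path(
    high_priority: bool,
    first_touch_sla_minutes: float,
    batch_interval_minutes: float,
) -> str:
    """Default to batch; use the real-time fast-path when flagged or when batch cannot meet the SLA."""
    if high_priority:
        return "real_time"
    if batch_interval_minutes > first_touch_sla_minutes:
        return "real_time"
    return "batch"

# Hourly micro-batch against a four-hour SLA: batch is adequate.
routine = choose_scoring_path(False, first_touch_sla_minutes=240, batch_interval_minutes=60)
# Same SLA, but the document was flagged at intake.
flagged = choose_scoring_path(True, first_touch_sla_minutes=240, batch_interval_minutes=60)
# Nightly batch against a 30-minute SLA: batch cannot meet it.
tight_sla = choose_scoring_path(False, first_touch_sla_minutes=30, batch_interval_minutes=1440)
```

Writing the rule down early makes the SLA and the batch cadence explicit inputs, which is what "designed in rather than retrofitted" amounts to in practice.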

What governance buys you in a regulated industry

The last theme, and the one that earned the engagement its standing with the carrier’s risk and compliance team, is that governance is not an overhead on the system — it is the system. Every artifact in the pipeline — segmentation version, model version, threshold version, evaluation snapshot, prediction confidence — is versioned, addressable, and queryable. Every prediction the system writes carries enough metadata for a downstream investigator to reconstruct exactly which version of which model produced it, against which evaluation set, with what calibration.

In an unregulated context this is good engineering hygiene. In a regulated context it is the prerequisite to being allowed to ship at all. State insurance commissioners do not accept “we trust the model” as an answer. They accept reproducible audit trails, frozen evaluation sets, documented selection methodology, and SME-in-the-loop oversight. The carrier could deploy the system because every one of those was designed into the architecture from the first week, not bolted on for the compliance review.

Faster, more consistent claims triage. Lower manual review cost. A queue that gets smaller during the night shift instead of bigger. An audit trail that survives a regulatory examination without theatrics. Those are the commercial outcomes. The architecture above is what makes them defensible — not the model itself, but the discipline of measurement, the lakehouse design that respects unstructured data, the deliberate bake-off, and the operational scaffolding that turned an experiment into a system the business could rely on.

The model was almost the easy part. The discipline around it was the work.

Neeraj Agarwal

Founder & CEO, Algoscale

June 16, 2026

Neeraj has led AI and data engagements for Fortune 500 clients across finance, healthcare, and retail. He writes about what actually ships — not what looks good in a slide.
