Schema-first AI extraction: making LLMs useful for messy documents

The most common request for LLMs in business is “extract the data from this document.” Invoices, contracts, medical records, compliance forms, supplier catalogues — all full of useful information buried in inconsistent formatting, scanned PDFs, and human error.

LLMs can extract structured data from messy documents better than traditional OCR + rules-based approaches. But they also hallucinate fields, miss data that falls outside their training distribution, and quietly invent plausible-looking values for missing information.

TL;DR

Build extraction in five layers:

Schema — define the data structure, field types, validation rules, and confidence thresholds
Extraction — prompt the model with the schema and the document, request structured output
Validation — check extracted values against the schema rules, flag failures
Confidence — attach a confidence score to each extracted field, not just the overall result
Review queue — route low-confidence or validation-failed extractions to human review

The key insight: validation and confidence are not optional post-processing steps. They are part of the extraction design. If you cannot validate a field, you should not extract it. [1][3]

What the tutorials skip

Structured output is not reliable structured output. Most providers support JSON mode, tool-use, or constrained decoding. These improve reliability but do not guarantee correctness. The model can still output valid JSON with wrong values, or miss fields entirely. Schema validation catches the second problem, not the first. [1][2]
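The gap between the two failure modes can be sketched in a few lines. This is a minimal illustration, not a provider API: the `REQUIRED_FIELDS` schema, the `structural_errors` helper, and the sample payloads are all hypothetical.

```python
import json

# Hypothetical minimal schema: field name -> expected JSON type
REQUIRED_FIELDS = {"invoice_number": str, "invoice_date": str, "total": float}


def structural_errors(raw: str) -> list[str]:
    """Return schema-level problems: unparseable JSON, missing fields, wrong types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"wrong type for {name}")
    return errors


# Suppose the source document says 2025-01-05 but the model returned 2023-01-05.
wrong_value = '{"invoice_number": "INV-1042", "invoice_date": "2023-01-05", "total": 129.5}'
missing_field = '{"invoice_number": "INV-1042", "total": 129.5}'

print(structural_errors(missing_field))  # reports the missing invoice_date
print(structural_errors(wrong_value))    # empty list: the wrong date passes unnoticed
```

The wrong date only shows up when you compare against ground truth or apply business rules, which is why schema checks alone are not enough.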

Confidence is per-field, not per-document. A document might contain a clearly printed invoice number (high confidence) alongside a faintly stamped date (low confidence). Reporting document-level confidence hides the uncertainty. Report confidence at the field level and use it to drive review routing. [1]
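A small sketch of why the document-level number misleads. The `ExtractedField` type, the confidence values, and the 0.8 threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, however your pipeline derives it


def fields_needing_review(fields: list[ExtractedField], threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


fields = [
    ExtractedField("invoice_number", "INV-1042", 0.98),  # clearly printed
    ExtractedField("invoice_date", "2025-03-14", 0.55),  # faintly stamped
    ExtractedField("total", "129.50", 0.95),
]

# A document-level average looks acceptable and hides the weak date field.
document_average = sum(f.confidence for f in fields) / len(fields)
print(round(document_average, 2))
print(fields_needing_review(fields))
```

Here the average clears the threshold while the date field does not, so routing on the average would skip a review that the field-level view requests.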

Source documents degrade extraction quality. A scanned PDF at 150 DPI with handwritten annotations, watermarks, and staple shadows will produce worse extraction than a clean digital file. Measure extraction quality by source quality tier and set the review threshold for each tier accordingly.
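One way to express tiered thresholds, as a sketch. The tier names and threshold values below are assumptions for illustration; derive real values from measured accuracy per tier.

```python
from enum import Enum


class SourceTier(Enum):
    DIGITAL = "digital"              # born-digital PDF with a text layer
    CLEAN_SCAN = "clean_scan"        # good-quality scan, no annotations
    DEGRADED_SCAN = "degraded_scan"  # low DPI, handwriting, watermarks


# Hypothetical thresholds: the worse the source, the more confidence
# you demand before skipping human review.
REVIEW_THRESHOLDS = {
    SourceTier.DIGITAL: 0.80,
    SourceTier.CLEAN_SCAN: 0.90,
    SourceTier.DEGRADED_SCAN: 0.97,
}


def needs_review(confidence: float, tier: SourceTier) -> bool:
    """True when a field's confidence is below the threshold for its source tier."""
    return confidence < REVIEW_THRESHOLDS[tier]


print(needs_review(0.85, SourceTier.DIGITAL))        # accepted for a clean digital file
print(needs_review(0.85, SourceTier.DEGRADED_SCAN))  # same score goes to review for a poor scan
```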

The schema is a moving target. Business rules change. New fields get added. Field definitions get refined. Version the extraction schema the same way you version prompts — with eval scores, review, and rollback. [4]
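A minimal sketch of versioning with an eval gate and rollback. `SchemaVersion`, `SchemaRegistry`, and the 0.01 regression tolerance are hypothetical names and values, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaVersion:
    version: str
    fields: tuple[str, ...]
    eval_accuracy: float  # accuracy on the labelled test set for this version


class SchemaRegistry:
    """Keeps a history of promoted schema versions so the last one can be rolled back."""

    def __init__(self) -> None:
        self._history: list[SchemaVersion] = []

    @property
    def active(self) -> Optional[SchemaVersion]:
        return self._history[-1] if self._history else None

    def promote(self, candidate: SchemaVersion, max_regression: float = 0.01) -> bool:
        """Promote the candidate unless its eval score regresses beyond the tolerance."""
        current = self.active
        if current is not None and candidate.eval_accuracy < current.eval_accuracy - max_regression:
            return False
        self._history.append(candidate)
        return True

    def rollback(self) -> Optional[SchemaVersion]:
        """Drop the active version and return to the previous one, if there is one."""
        if len(self._history) > 1:
            self._history.pop()
        return self.active


registry = SchemaRegistry()
registry.promote(SchemaVersion("1.0", ("invoice_number", "invoice_date"), 0.93))
accepted = registry.promote(SchemaVersion("1.1", ("invoice_number", "invoice_date", "po_number"), 0.85))
print(accepted, registry.active.version)  # the regressing version is rejected
```

In practice the eval score would come from a CI job that runs the candidate schema against the labelled test set.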

Where teams misuse extraction

Extracting everything. Just because you can extract 40 fields from an invoice does not mean you should. Each field adds validation rules, confidence thresholds, and edge cases. Start with the 5–10 fields you actually use in your business process. Add fields only when the operational value justifies the maintenance cost. [1]

No validation for optional fields. Required fields get validation. Optional fields get ignored — until someone relies on an optional field that the model left blank or silently filled with a guess. Validate optional fields too, or treat them as required and route blanks to review.
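As a sketch, an optional field can go through the same gate as a required one, with blanks treated as a review reason rather than a silent pass. The `check_optional` helper and the `PO-` number pattern are hypothetical.

```python
import re
from typing import Callable, Optional


def check_optional(name: str, value: Optional[str], validator: Callable[[str], bool]) -> Optional[str]:
    """Return a review reason, or None when the optional field is acceptable."""
    if value is None or value.strip() == "":
        # Blank is not automatically fine: someone should confirm the value is truly absent.
        return f"{name}: blank, confirm the document has no value"
    if not validator(value.strip()):
        return f"{name}: present but failed validation"
    return None


def is_po_number(value: str) -> bool:
    """Hypothetical purchase-order format: 'PO-' followed by six digits."""
    return re.fullmatch(r"PO-\d{6}", value) is not None


print(check_optional("po_number", "PO-123456", is_po_number))
print(check_optional("po_number", "", is_po_number))
print(check_optional("po_number", "PO-12", is_po_number))
```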

Trusting extraction without ground truth. Building an extraction pipeline without a labelled test set (documents with known correct values) means you cannot measure accuracy. Start with 50–100 labelled documents, measure per-field accuracy, and target 90%+ before putting extraction in front of users. [3]
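Per-field accuracy against a labelled set can be computed with a few lines. The data layout (document ID mapped to field values) and the tiny four-document sample are assumptions for illustration; exact string equality is the simplest possible match rule and may be too strict for free-text fields.

```python
from collections import defaultdict


def per_field_accuracy(predictions: dict, ground_truth: dict) -> dict:
    """Share of labelled documents in which each field was extracted exactly right."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field_name, expected in truth.items():
            total[field_name] += 1
            if predicted.get(field_name) == expected:
                correct[field_name] += 1
    return {name: correct[name] / total[name] for name in total}


def fields_below_target(accuracy: dict, target: float = 0.9) -> list:
    """Fields that are not yet accurate enough to put in front of users."""
    return sorted(name for name, score in accuracy.items() if score < target)


ground_truth = {
    "doc-1": {"invoice_number": "INV-1", "invoice_date": "2025-01-05"},
    "doc-2": {"invoice_number": "INV-2", "invoice_date": "2025-02-11"},
    "doc-3": {"invoice_number": "INV-3", "invoice_date": "2025-03-20"},
    "doc-4": {"invoice_number": "INV-4", "invoice_date": "2025-04-02"},
}
predictions = {
    "doc-1": {"invoice_number": "INV-1", "invoice_date": "2025-01-05"},
    "doc-2": {"invoice_number": "INV-2", "invoice_date": "2025-02-11"},
    "doc-3": {"invoice_number": "INV-3", "invoice_date": "2025-03-02"},  # wrong day
    "doc-4": {"invoice_number": "INV-4", "invoice_date": "2025-04-02"},
}

accuracy = per_field_accuracy(predictions, ground_truth)
print(accuracy)
print(fields_below_target(accuracy))
```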

Practical schema design

Field definition template

field_name: invoice_date
type: date
format: YYYY-MM-DD
required: true
validation:
  - rule: must be in the past 5 years
  - rule: must not be a future date
confidence_threshold: 0.8
review_on_failure: true
alternatives:
  - field_name: invoice_date_approximate
    prompt: "If exact date is not visible, provide best estimate with confidence"
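The validation rules in the template could be enforced along these lines. This is a sketch: `validate_invoice_date` is a hypothetical helper, and `today` is passed in explicitly so the check is reproducible in tests.

```python
import re
from datetime import date


def validate_invoice_date(raw: str, today: date, max_age_years: int = 5) -> list[str]:
    """Return the list of failed rules for an extracted invoice_date (empty means valid)."""
    # Enforce the strict YYYY-MM-DD shape before parsing.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
        return ["format: expected YYYY-MM-DD"]
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        return ["format: not a real calendar date"]

    failures = []
    if parsed > today:
        failures.append("must not be a future date")
    try:
        oldest = today.replace(year=today.year - max_age_years)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        oldest = today.replace(year=today.year - max_age_years, day=28)
    if parsed < oldest:
        failures.append(f"must be in the past {max_age_years} years")
    return failures


today = date(2026, 5, 28)
print(validate_invoice_date("2025-03-14", today))
print(validate_invoice_date("2027-01-01", today))
print(validate_invoice_date("14/03/2025", today))
```

Any non-empty result would trigger `review_on_failure` for the field.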

Validation rules by field type

Field type	Validation rules
Date	Past/future check, format check, reasonableness (not 1900-01-01)
Currency	Positive number, reasonable range, decimal precision
Text (enum)	Must match one of known values, fuzzy match for typos
Phone/email	Format regex, rejection of known test or placeholder values
Name	Non-empty, no obvious gibberish (single character, all numbers)
Reference number	Pattern match (alphanumeric, length check, checksum if available)
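A few of these rules, sketched with the standard library. The amount ceiling, the status values, the fuzzy-match cutoff, and the `INV-` pattern are all illustrative assumptions; real rules depend on your documents.

```python
import difflib
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def validate_currency(raw: str, max_amount: Decimal = Decimal("1000000")) -> bool:
    """Positive, within a plausible range, and at most two decimal places."""
    try:
        amount = Decimal(raw)
    except InvalidOperation:
        return False
    if not amount.is_finite():
        return False
    two_decimals_or_fewer = amount.as_tuple().exponent >= -2
    return Decimal("0") < amount <= max_amount and two_decimals_or_fewer


def match_enum(raw: str, allowed: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a possibly misspelled value onto a known one, or return None if nothing is close."""
    candidates = difflib.get_close_matches(raw.strip().upper(), allowed, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


def validate_reference(raw: str) -> bool:
    """Hypothetical pattern: 'INV-' followed by 4 to 8 digits."""
    return re.fullmatch(r"INV-\d{4,8}", raw) is not None


STATUSES = ["PAID", "UNPAID", "OVERDUE"]
print(validate_currency("129.50"), validate_currency("-5"), validate_currency("129.505"))
print(match_enum("Overdeu", STATUSES), match_enum("REFUNDED", STATUSES))
print(validate_reference("INV-1042"), validate_reference("1042"))
```

A fuzzy match is a correction, not a confirmation, so a corrected value is a reasonable candidate for lowered confidence or review.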

Confidence scoring

Assign per-field confidence based on:

Model confidence in the structured output
Document quality score (DPI, scan quality, text extractability)
Field type difficulty (structured fields are easier than free-text fields)
Historical accuracy for this field type and document source

Route a document to human review if any required field has confidence below threshold OR any validation rule fails. [1][3]
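The routing rule above can be sketched as follows. How the signals combine into one score is not specified here, so this example assumes the most conservative option, taking the minimum, and folds field-type difficulty into the historical accuracy figure for brevity. `FieldResult` and `route` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FieldResult:
    name: str
    required: bool
    model_confidence: float     # confidence reported or derived for the model output
    document_quality: float     # score for DPI, scan quality, text extractability
    historical_accuracy: float  # past accuracy for this field type and document source
    validation_failures: list[str] = field(default_factory=list)
    threshold: float = 0.8

    @property
    def confidence(self) -> float:
        # Assumption: the weakest signal bounds the overall confidence.
        return min(self.model_confidence, self.document_quality, self.historical_accuracy)


def route(results: list[FieldResult]) -> str:
    """Send the document to review on any validation failure or weak required field."""
    for result in results:
        if result.validation_failures:
            return "human_review"
        if result.required and result.confidence < result.threshold:
            return "human_review"
    return "auto_accept"


clean = [FieldResult("invoice_number", True, 0.98, 0.95, 0.97)]
weak_date = clean + [FieldResult("invoice_date", True, 0.90, 0.60, 0.92)]
bad_total = clean + [FieldResult("total", True, 0.99, 0.95, 0.96, ["must be positive"])]

print(route(clean), route(weak_date), route(bad_total))
```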

Decision framework

Question	Approach
Is the document a clean digital file?	Auto-extract with 10% random sample for review
Is it a scanned PDF with annotations?	Extract with low-confidence defaults, route all to review
Are extracted values used in financial reporting?	Mandatory human review for all values above a threshold
Is the schema stable?	Version the schema with CI eval
Is the schema changing weekly?	Fix the business process before automating extraction
Do you have labelled test documents?	Measure extraction accuracy by field before go-live
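For the 10% random sample on clean digital files, one option is to derive the sample from a hash of the document ID, so the same document always gets the same decision and the audit trail is reproducible. This is a sketch; `in_review_sample` and the ID format are hypothetical.

```python
import hashlib


def in_review_sample(document_id: str, rate: float = 0.10) -> bool:
    """Deterministically place roughly `rate` of documents into the review sample."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the interval [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


document_ids = [f"doc-{n}" for n in range(10_000)]
sampled = [doc_id for doc_id in document_ids if in_review_sample(doc_id)]
print(len(sampled) / len(document_ids))  # close to the requested rate
```

Hash-based sampling also means a reprocessed document does not randomly drop out of the review set.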

Methodology

Date checked: 2026-05-28
Sources consulted: OpenAI structured outputs documentation, Anthropic tool-use documentation, Google GenAI structured output documentation, Pydantic validation library
Assumptions: The reader has access to at least one LLM provider with structured-output or tool-use API support. The extraction patterns assume English-language business documents. Validation rules are illustrative; real schemas must be tailored to specific document types and business processes.
Limitations: This guide covers schema design and validation patterns for document extraction. It does not cover OCR pre-processing, multi-page table extraction, handwriting recognition, or real-time streaming extraction. Provider-specific JSON mode and tool-use APIs evolve rapidly — verify against current provider documentation.
Jurisdiction: Global. No jurisdiction-specific regulatory advice. Extraction of personal data may be subject to GDPR, HIPAA, or equivalent regulations depending on document content and jurisdiction.

Source list

[1] OpenAI structured outputs documentation — https://platform.openai.com/docs/guides/structured-outputs (accessed 2026-05-28)
[2] Anthropic tool-use structured output — https://docs.anthropic.com/en/docs/build-with-claude/tool-use (accessed 2026-05-28)
[3] Google GenAI structured output — https://ai.google.dev/gemini-api/docs/structured-outputs (accessed 2026-05-28)
[4] Pydantic validation library — https://docs.pydantic.dev/ (accessed 2026-05-28)

What this page cannot tell you

This page cannot tell you which provider’s structured-output implementation is most reliable for your specific document type. Provider APIs evolve rapidly — test against your actual documents with labelled ground truth before committing to a pipeline. Extraction accuracy varies significantly by document quality tier, language, and domain-specific terminology.

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: editorial review — corrected writtenBy field, converted Editor’s Notes to proper <aside> format (3 cards), added Methodology section, added Trust Stack, added slugified heading IDs, added access dates to source list, added in-text citations, removed self-referencing related guide link, updated all dates to 2026-05-28
2026-05-24: first published