Schema-first AI extraction: making LLMs useful for messy documents
The most common request for LLMs in business is “extract the data from this document.” Invoices, contracts, medical records, compliance forms, supplier catalogues — all full of useful information buried in inconsistent formatting, scanned PDFs, and human error.
LLMs can extract structured data from messy documents better than traditional OCR + rules-based approaches. But they also hallucinate fields, miss data that falls outside their training distribution, and quietly invent plausible-looking values for missing information.
Editor’s Note: An LLM that silently fills in a missing date with a plausible date is worse than one that leaves the field blank. The first creates a data quality problem that is hard to detect. The second creates a null that the validation layer catches.
Editor’s Note: Schema-first means defining what valid data looks like before you show the document to the model. If you cannot write validation rules for the extracted fields, you cannot trust the extraction.
Quick answer
Build extraction in five layers:
- Schema — define the data structure, field types, validation rules, and confidence thresholds
- Extraction — prompt the model with the schema and the document, request structured output
- Validation — check extracted values against the schema rules, flag failures
- Confidence — attach a confidence score to each extracted field, not just the overall result
- Review queue — route low-confidence or validation-failed extractions to human review
The key insight: validation and confidence are not optional post-processing steps. They are part of the extraction design. If you cannot validate a field, you should not extract it.
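As a rough illustration, the five layers can be wired together as below. This is a minimal sketch, not a reference implementation: `FieldSpec`, `Extracted`, `run_pipeline`, and `stub_extract` are hypothetical names, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FieldSpec:
    # Layer 1 (schema): what a valid field looks like, decided before extraction.
    name: str
    validate: Callable[[str], bool]
    confidence_threshold: float = 0.8


@dataclass
class Extracted:
    # Layers 2 and 4: an extracted value carries its own confidence score.
    value: Optional[str]
    confidence: float


def is_positive_amount(text: str) -> bool:
    try:
        return float(text) > 0
    except ValueError:
        return False


def run_pipeline(schema: List[FieldSpec], extract: Callable, document: str) -> Dict:
    results = extract(document, schema)  # Layer 2: extraction
    reasons = []
    for spec in schema:
        item = results.get(spec.name)
        if item is None or item.value is None:
            reasons.append(f"{spec.name}: missing")
        elif not spec.validate(item.value):  # Layer 3: validation
            reasons.append(f"{spec.name}: failed validation")
        elif item.confidence < spec.confidence_threshold:  # Layer 4: confidence
            reasons.append(f"{spec.name}: low confidence")
    # Layer 5: anything with a reason goes to the human review queue.
    return {"status": "review" if reasons else "accepted", "reasons": reasons}


def stub_extract(document: str, schema: List[FieldSpec]) -> Dict[str, Extracted]:
    # Hypothetical stand-in for a model call that returns structured output.
    return {
        "invoice_number": Extracted("INV1042", 0.97),
        "total": Extracted("118.50", 0.55),
    }


schema = [
    FieldSpec("invoice_number", str.isalnum),
    FieldSpec("total", is_positive_amount),
]
outcome = run_pipeline(schema, stub_extract, "raw document text")
```

Here the total is well-formed but low-confidence, so the document is routed to review rather than accepted.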
What the tutorials skip
Structured output is not the same as correct output. Most providers support JSON mode, tool use, or constrained decoding. These improve reliability but do not guarantee correctness. The model can still output valid JSON with wrong values, or miss fields entirely. Schema validation catches missing fields. It does not catch a well-formed value that is simply wrong.
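The gap between valid structure and correct values can be shown with a small structural check. This is a sketch under assumptions: `REQUIRED_FIELDS` and `structural_errors` are hypothetical names, and the two payloads are invented examples of model output.

```python
import json
from typing import List

# Hypothetical schema: field name -> expected JSON type.
REQUIRED_FIELDS = {"invoice_number": str, "invoice_date": str, "total": float}


def structural_errors(raw: str) -> List[str]:
    """Check that the output parses and has the expected fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data or data[name] is None:
            errors.append(f"{name}: missing")
        elif not isinstance(data[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


# The model dropped a field: the structural check catches this.
missing_field = '{"invoice_number": "INV-1042", "total": 118.5}'

# Suppose the document actually says 2024-03-04. This payload is well-formed,
# so the structural check passes even though the date is wrong.
wrong_value = '{"invoice_number": "INV-1042", "invoice_date": "2021-03-04", "total": 118.5}'
```

Catching the second payload needs field-level rules, confidence scores, or comparison against labelled data.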
Confidence is per-field, not per-document. A document might contain a clearly printed invoice number (high confidence) alongside a faintly stamped date (low confidence). Reporting document-level confidence hides the uncertainty. Report confidence at the field level and use it to drive review routing.
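A short worked example, with invented scores and an assumed threshold of 0.8, shows how an average can hide one weak field:

```python
from statistics import mean

# Hypothetical per-field confidence scores for one invoice.
field_confidence = {
    "invoice_number": 0.98,
    "supplier_name": 0.95,
    "total": 0.96,
    "invoice_date": 0.41,  # the faintly stamped date
}
THRESHOLD = 0.8  # assumed review threshold

# Document-level view: the average clears the threshold.
document_confidence = mean(field_confidence.values())

# Field-level view: the weak field is still visible and can drive routing.
fields_for_review = sorted(
    name for name, score in field_confidence.items() if score < THRESHOLD
)
```

The average is about 0.83, which would pass, while the field-level view still flags `invoice_date`.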
Source documents degrade extraction quality. A scanned PDF at 150 DPI with handwritten annotations, watermarks, and staple shadows will produce worse extraction than a clean digital file. Measure extraction quality by source quality tier and set review thresholds for each tier accordingly.
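One way to express tiered thresholds is a small lookup keyed by source quality. The tier names, the 300 DPI cut-off, and the threshold values below are all assumptions for illustration; real values should come from measured accuracy per tier.

```python
# Hypothetical tiers: worse sources must clear a higher confidence bar
# before a field is accepted without review.
REVIEW_THRESHOLD_BY_TIER = {
    "digital": 0.80,
    "clean_scan": 0.90,
    "degraded_scan": 0.97,
}


def quality_tier(is_born_digital: bool, dpi: int, has_annotations: bool) -> str:
    """Assign a coarse source quality tier (assumed rules, not a standard)."""
    if is_born_digital:
        return "digital"
    if dpi >= 300 and not has_annotations:
        return "clean_scan"
    return "degraded_scan"


def field_needs_review(confidence: float, tier: str) -> bool:
    return confidence < REVIEW_THRESHOLD_BY_TIER[tier]


tier = quality_tier(is_born_digital=False, dpi=150, has_annotations=True)
```

With these numbers, a field scored 0.92 is accepted from a digital file but reviewed when it comes from a degraded scan.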
The schema is a moving target. Business rules change. New fields get added. Field definitions get refined. Version the extraction schema the same way you version prompts — with eval scores, review, and rollback.
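A minimal sketch of schema versioning with an eval gate follows. `SchemaVersion` and `choose_active` are hypothetical names, and the accuracy figures are invented; the point is only that each version carries its eval score and that an older version remains available as a rollback target.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SchemaVersion:
    version: str
    fields: Tuple[str, ...]
    eval_accuracy: float  # measured on the labelled test set


def choose_active(history: List[SchemaVersion], min_accuracy: float) -> SchemaVersion:
    """Pick the newest version that meets the bar; older ones act as rollbacks."""
    for candidate in reversed(history):
        if candidate.eval_accuracy >= min_accuracy:
            return candidate
    raise ValueError("no schema version meets the accuracy bar")


history = [
    SchemaVersion("1.0.0", ("invoice_number", "invoice_date", "total"), 0.94),
    # A new field was added and accuracy regressed, so this version is skipped.
    SchemaVersion("1.1.0", ("invoice_number", "invoice_date", "total", "po_number"), 0.87),
]
active = choose_active(history, min_accuracy=0.90)
```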
Where teams misuse extraction
Extracting everything. Just because you can extract 40 fields from an invoice does not mean you should. Each field adds validation rules, confidence thresholds, and edge cases. Start with the 5–10 fields you actually use in your business process. Add fields only when the operational value justifies the maintenance cost.
No validation for optional fields. Required fields get validation. Optional fields get ignored — until someone relies on an optional field that the model left blank or silently filled with a guess. Validate optional fields too, or treat them as required and route blanks to review.
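The two options for optional fields can be sketched in one function. The field (a contact email), the function name, and the deliberately loose regex are assumptions for illustration.

```python
import re
from typing import Optional

# Loose format check, not a full email grammar (an assumption for this sketch).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_optional_email(value: Optional[str], route_blanks_to_review: bool = False) -> str:
    """Validate an optional field instead of ignoring it.

    Returns "valid", "accepted_blank", or "review".
    """
    if value is None or value.strip() == "":
        # Option 2 from the text: treat the field as required and review blanks.
        return "review" if route_blanks_to_review else "accepted_blank"
    # Option 1 from the text: a present optional value still gets validated.
    return "valid" if EMAIL_PATTERN.match(value.strip()) else "review"
```

A blank is only accepted when that was an explicit choice, and a filled value never skips validation just because the field is optional.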
Trusting extraction without ground truth. Building an extraction pipeline without a labelled test set (documents with known correct values) means you cannot measure accuracy. Start with 50–100 labelled documents, measure per-field accuracy, and target 90%+ before putting extraction in front of users.
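Per-field accuracy against a labelled set can be computed in a few lines. The function name and the two-document data set are hypothetical; a real test set would hold the 50–100 documents mentioned above.

```python
from collections import defaultdict
from typing import Dict


def per_field_accuracy(
    labelled: Dict[str, Dict[str, str]], predicted: Dict[str, Dict[str, str]]
) -> Dict[str, float]:
    """Share of labelled documents where each field matches exactly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_id, truth in labelled.items():
        guess = predicted.get(doc_id, {})
        for name, expected in truth.items():
            total[name] += 1
            if guess.get(name) == expected:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


labelled = {
    "doc-1": {"invoice_number": "INV-1", "invoice_date": "2025-01-10"},
    "doc-2": {"invoice_number": "INV-2", "invoice_date": "2025-02-03"},
}
predicted = {
    "doc-1": {"invoice_number": "INV-1", "invoice_date": "2025-01-10"},
    "doc-2": {"invoice_number": "INV-2", "invoice_date": "2025-02-08"},  # misread digit
}
accuracy = per_field_accuracy(labelled, predicted)
below_target = sorted(name for name, score in accuracy.items() if score < 0.90)
```

Exact string match is the strictest comparison; fields such as names or addresses may need normalisation before comparing.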
Practical schema design
Field definition template
field_name: invoice_date
type: date
format: YYYY-MM-DD
required: true
validation:
  - rule: must be in the past 5 years
  - rule: must not be a future date
confidence_threshold: 0.8
review_on_failure: true
alternatives:
  - field_name: invoice_date_approximate
    prompt: "If exact date is not visible, provide best estimate with confidence"
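The two validation rules in the template could be enforced along these lines. This is a sketch: `validate_invoice_date` is a hypothetical function, `today` is passed in so the check is repeatable, and five years is approximated as 5 × 365 days.

```python
import re
from datetime import date, timedelta
from typing import List, Optional

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_invoice_date(raw: Optional[str], today: date) -> List[str]:
    """Return the list of failed rules for invoice_date (empty means valid)."""
    if raw is None:
        return ["invoice_date is required"]
    if not ISO_DATE.fullmatch(raw):
        return ["invoice_date must use YYYY-MM-DD"]
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        # Right shape but not a real calendar date, e.g. month 13.
        return ["invoice_date must use YYYY-MM-DD"]
    failures = []
    if parsed > today:
        failures.append("must not be a future date")
    if parsed < today - timedelta(days=5 * 365):  # approximate five-year window
        failures.append("must be in the past 5 years")
    return failures
```

Any non-empty result would trigger `review_on_failure` from the template.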
Validation rules by field type
| Field type | Validation rules |
|---|---|
| Date | Past/future check, format check, reasonableness (not 1900-01-01) |
| Currency | Positive number, reasonable range, decimal precision |
| Text (enum) | Must match one of known values, fuzzy match for typos |
| Phone/email | Format regex, check for known test or placeholder values |
| Name | Non-empty, no obvious gibberish (single character, all numbers) |
| Reference number | Pattern match (alphanumeric, length check, checksum if available) |
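Three of the rows above, sketched with the standard library. The amount ceiling, the fuzzy-match cutoff of 0.8, the status values, and the `INV-` reference pattern are assumptions chosen for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from difflib import get_close_matches
from typing import List, Optional


def valid_currency(raw: str, max_amount: Decimal = Decimal("1000000")) -> bool:
    """Positive, within a reasonable range, at most two decimal places."""
    try:
        amount = Decimal(raw)
    except InvalidOperation:
        return False
    if not amount.is_finite():  # rejects NaN and Infinity
        return False
    return 0 < amount <= max_amount and amount.as_tuple().exponent >= -2


def match_enum(raw: str, allowed: List[str]) -> Optional[str]:
    """Map a value to a known enum member, tolerating small typos."""
    matches = get_close_matches(raw.strip().upper(), allowed, n=1, cutoff=0.8)
    return matches[0] if matches else None


# Hypothetical reference format: "INV-" followed by six digits.
REFERENCE_PATTERN = re.compile(r"INV-\d{6}")


def valid_reference(raw: str) -> bool:
    return REFERENCE_PATTERN.fullmatch(raw) is not None


STATUSES = ["PAID", "UNPAID", "OVERDUE"]  # hypothetical enum values
```

Fuzzy matching trades strictness for recall, so a corrected enum value is a reasonable candidate for a lower confidence score than an exact match.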
Confidence scoring
Assign per-field confidence based on:
- Model confidence in the structured output
- Document quality score (DPI, scan quality, text extractability)
- Field type difficulty (structured fields are easier than free-text fields)
- Historical accuracy for this field type and document source
Route a document to human review if any required field has confidence below threshold OR any validation rule fails.
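The routing rule can be sketched as follows. How the signals are combined is an assumption: this example takes the minimum of three of them as a conservative choice, which is not something the guide prescribes, and it leaves out field type difficulty for brevity. `FieldOutcome` and `route_to_review` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldOutcome:
    name: str
    required: bool
    model_confidence: float     # reported by or derived from the model
    document_quality: float     # e.g. from DPI and text extractability
    historical_accuracy: float  # measured for this field type and source
    validation_passed: bool

    @property
    def confidence(self) -> float:
        # Assumed combination rule: the weakest signal sets the score.
        return min(self.model_confidence, self.document_quality, self.historical_accuracy)


def route_to_review(fields: List[FieldOutcome], threshold: float = 0.8) -> bool:
    """True if any required field is below threshold OR any validation failed."""
    return any(
        (f.required and f.confidence < threshold) or not f.validation_passed
        for f in fields
    )


clean = [FieldOutcome("total", True, 0.95, 0.90, 0.93, True)]
weak_scan = [FieldOutcome("total", True, 0.95, 0.60, 0.93, True)]
bad_optional = [FieldOutcome("contact_email", False, 0.99, 0.99, 0.99, False)]
```

A weighted average or a calibrated model are alternatives to the minimum; whichever is used should be checked against the labelled test set.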
Decision framework
| Question | Approach |
|---|---|
| Is the document a clean digital file? | Auto-extract with 10% random sample for review |
| Is it a scanned PDF with annotations? | Extract with low confidence defaults, route all to review |
| Are extracted values used in financial reporting? | Mandatory human review for all values above a threshold |
| Is the schema stable? | Version the schema with CI eval |
| Is the schema changing weekly? | Fix the business process before automating extraction |
| Do you have labelled test documents? | Measure extraction accuracy by field before go-live |
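For the 10% random sample in the first row, one option is to derive the sample from a hash of the document ID, so the same document always gets the same decision and the sample can be audited later. This is a sketch; `in_review_sample` and the ID format are hypothetical.

```python
import hashlib


def in_review_sample(document_id: str, rate: float = 0.10) -> bool:
    """Deterministically place roughly `rate` of documents in the review sample."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = [doc for doc in (f"doc-{i}" for i in range(10_000)) if in_review_sample(doc)]
share = len(sampled) / 10_000
```

Because the decision depends only on the ID, re-running the pipeline does not reshuffle which clean documents a reviewer sees.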
Methodology and sources
This guide draws on structured-output documentation from major LLM providers, document AI platform documentation, validation library practices, and operational experience from production extraction pipelines.
- OpenAI structured outputs documentation: https://platform.openai.com/docs/guides/structured-outputs — checked 2026-05-24
- Anthropic tool-use structured output: https://docs.anthropic.com/en/docs/build-with-claude/tool-use — checked 2026-05-24
- Google GenAI structured output: https://ai.google.dev/gemini-api/docs/structured-outputs — checked 2026-05-24
- Pydantic validation library: https://docs.pydantic.dev/ — checked 2026-05-24
Change log
2026-05-24 — First published version.