theLLMs

Last checked: 2026-05-24

Scope: Global. Structured-output documentation, document AI provider docs, and validation library documentation checked on 2026-05-24. Provider-specific JSON mode and tool-use APIs evolve rapidly.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Schema-first AI extraction: making LLMs useful for messy documents

The most common request for LLMs in business is “extract the data from this document.” Invoices, contracts, medical records, compliance forms, supplier catalogues — all full of useful information buried in inconsistent formatting, scanned PDFs, and human error.

LLMs can extract structured data from messy documents better than traditional OCR + rules-based approaches. But they also hallucinate fields, miss data that falls outside their training distribution, and quietly invent plausible-looking values for missing information.

Editor’s Note: An LLM that silently fills in a missing date with a plausible date is worse than one that leaves the field blank. The first creates a data quality problem that is hard to detect. The second creates a null that the validation layer catches.

Editor’s Note: Schema-first means defining what valid data looks like before you show the document to the model. If you cannot write validation rules for the extracted fields, you cannot trust the extraction.

Quick answer

Build extraction in five layers:

  1. Schema — define the data structure, field types, validation rules, and confidence thresholds
  2. Extraction — prompt the model with the schema and the document, request structured output
  3. Validation — check extracted values against the schema rules, flag failures
  4. Confidence — attach a confidence score to each extracted field, not just the overall result
  5. Review queue — route low-confidence or validation-failed extractions to human review

The key insight: validation and confidence are not optional post-processing steps. They are part of the extraction design. If you cannot validate a field, you should not extract it.
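The five layers can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `FieldSpec`, `FieldResult`, and `needs_review` are hypothetical names, the extraction step is replaced by a hard-coded result, and the 0.8 default threshold is only a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FieldSpec:
    """Schema layer: one field with its validation rule and threshold."""
    name: str
    validate: Callable[[Any], bool]
    confidence_threshold: float = 0.8  # placeholder value


@dataclass
class FieldResult:
    """Extraction layer output: a value plus a per-field confidence."""
    value: Any
    confidence: float


def needs_review(schema, extracted):
    """Validation, confidence, and review layers: list reasons to review."""
    reasons = []
    for spec in schema:
        result = extracted.get(spec.name)
        if result is None or result.value is None:
            reasons.append(f"{spec.name}: missing")
        elif not spec.validate(result.value):
            reasons.append(f"{spec.name}: failed validation")
        elif result.confidence < spec.confidence_threshold:
            reasons.append(f"{spec.name}: low confidence")
    return reasons


# Hypothetical schema and a stand-in for what the model returned.
schema = [
    FieldSpec("invoice_number", lambda v: isinstance(v, str) and v.strip() != ""),
    FieldSpec("total", lambda v: isinstance(v, (int, float)) and v > 0),
]
extracted = {
    "invoice_number": FieldResult("INV-1042", 0.97),
    "total": FieldResult(-12.5, 0.91),
}
reasons = needs_review(schema, extracted)
print(reasons)
```

An empty list means the document can proceed automatically; anything else goes to the review queue with the reasons attached.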

What the tutorials skip

Structured output is not the same as reliable output. Most providers support JSON mode, tool use, or constrained decoding. These improve reliability but do not guarantee correctness. The model can still output valid JSON with wrong values, or omit fields entirely. Schema validation catches the omitted fields. It catches a wrong value only when that value also breaks a rule.
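To make the distinction concrete, here is a small sketch using only the Python standard library. The field names and the two checks are assumptions chosen for illustration; a real pipeline would likely use a validation library instead.

```python
import json
from datetime import date

# Hypothetical structural schema: field name -> expected JSON type.
REQUIRED_TYPES = {"invoice_number": str, "invoice_date": str, "total": float}


def structural_errors(record):
    """Catches missing fields and wrong types, nothing more."""
    errors = []
    for name, expected in REQUIRED_TYPES.items():
        if name not in record:
            errors.append(f"{name}: missing")
        elif not isinstance(record[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


def value_errors(record):
    """Catches values that are well-formed but break a rule."""
    errors = []
    try:
        date.fromisoformat(record["invoice_date"])
    except ValueError:
        errors.append("invoice_date: not a real calendar date")
    if record["total"] <= 0:
        errors.append("total: must be positive")
    return errors


# Both strings parse as valid JSON.
complete = json.loads(
    '{"invoice_number": "INV-1042", "invoice_date": "2025-02-30", "total": 118.40}'
)
partial = json.loads('{"invoice_number": "INV-1042", "total": 118.40}')

print(structural_errors(complete))  # structure is fine
print(value_errors(complete))       # but 30 February does not exist
print(structural_errors(partial))   # the omitted field is caught here
```

Note that neither check would flag a plausible but incorrect date such as a misread day of the month. Only comparison against labelled ground truth measures that kind of error.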

Confidence is per-field, not per-document. A document might contain a clearly printed invoice number (high confidence) alongside a faintly stamped date (low confidence). Reporting document-level confidence hides the uncertainty. Report confidence at the field level and use it to drive review routing.

Source documents degrade extraction quality. A scanned PDF at 150 DPI with handwritten annotations, watermarks, and staple shadows will produce worse extraction than a clean digital file. Measure extraction quality by source quality tier and set the review threshold for each tier accordingly.
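One way to express tiered thresholds is a lookup keyed by source quality. The tier names, the 300 DPI cut-off, and the threshold values below are all assumptions for illustration; real values should come from measured accuracy per tier.

```python
# Hypothetical review thresholds: worse sources need higher confidence
# before a field is accepted without review.
REVIEW_THRESHOLDS = {
    "digital": 0.80,
    "clean_scan": 0.90,
    "degraded_scan": 0.97,
}


def quality_tier(has_text_layer: bool, dpi: int, has_annotations: bool) -> str:
    """Assign a coarse source quality tier (illustrative rules only)."""
    if has_text_layer:
        return "digital"
    if dpi >= 300 and not has_annotations:
        return "clean_scan"
    return "degraded_scan"


def below_threshold(confidence: float, tier: str) -> bool:
    """True when a field's confidence is too low for its source tier."""
    return confidence < REVIEW_THRESHOLDS[tier]


tier = quality_tier(has_text_layer=False, dpi=150, has_annotations=True)
print(tier, below_threshold(0.92, tier))
```

The same 0.92 confidence passes for a digital file and fails for a degraded scan, which is the intended effect.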

The schema is a moving target. Business rules change. New fields get added. Field definitions get refined. Version the extraction schema the same way you version prompts — with eval scores, review, and rollback.
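A versioned schema with an eval gate and rollback might look like the sketch below. `SchemaVersion`, `SchemaRegistry`, and the 0.01 regression allowance are hypothetical; in practice this role is usually played by version control plus a CI job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaVersion:
    version: str
    fields: tuple
    eval_accuracy: float  # measured on the labelled test set


class SchemaRegistry:
    """Keeps promoted versions in order so the last one can be rolled back."""

    def __init__(self):
        self._history = []

    @property
    def active(self):
        return self._history[-1] if self._history else None

    def promote(self, candidate, max_regression=0.01):
        """Promote only if accuracy does not regress beyond the allowance."""
        current = self.active
        if current and candidate.eval_accuracy < current.eval_accuracy - max_regression:
            return False
        self._history.append(candidate)
        return True

    def rollback(self):
        """Drop the active version and return to the previous one."""
        if len(self._history) < 2:
            raise ValueError("no earlier version to roll back to")
        self._history.pop()
        return self.active


registry = SchemaRegistry()
v1 = SchemaVersion("1.0", ("invoice_number", "invoice_date"), 0.93)
v2 = SchemaVersion("1.1", ("invoice_number", "invoice_date", "po_number"), 0.88)
v3 = SchemaVersion("1.2", ("invoice_number", "invoice_date", "po_number"), 0.94)

print(registry.promote(v1), registry.promote(v2), registry.promote(v3))
```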

Where teams misuse extraction

Extracting everything. Just because you can extract 40 fields from an invoice does not mean you should. Each field adds validation rules, confidence thresholds, and edge cases. Start with the 5–10 fields you actually use in your business process. Add fields only when the operational value justifies the maintenance cost.

No validation for optional fields. Required fields get validation. Optional fields get ignored — until someone relies on an optional field that the model left blank or silently filled with a guess. Validate optional fields too, or treat them as required and route blanks to review.
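A simple way to avoid silently ignoring optional fields is to give every optional value an explicit status. The sketch below assumes a hypothetical purchase-order format (`PO-` followed by six digits); the three status labels are also illustrative.

```python
import re


def is_po_number(value):
    """Hypothetical format rule for an optional purchase-order field."""
    return isinstance(value, str) and re.fullmatch(r"PO-\d{6}", value) is not None


def check_optional(value, validator):
    """Return an explicit status instead of letting blanks pass unnoticed."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "absent"
    return "valid" if validator(value) else "invalid"


for candidate in ["PO-004211", "", None, "PO-42"]:
    print(repr(candidate), check_optional(candidate, is_po_number))
```

Downstream code can then decide per field whether "absent" is acceptable or should be routed to review, rather than treating a blank as a pass.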

Trusting extraction without ground truth. Building an extraction pipeline without a labelled test set (documents with known correct values) means you cannot measure accuracy. Start with 50–100 labelled documents, measure per-field accuracy, and target 90%+ before putting extraction in front of users.
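Per-field accuracy against a labelled set is a short computation. The sketch below assumes exact-match scoring and uses a tiny made-up data set; the 0.90 target mirrors the figure in the text.

```python
from collections import defaultdict


def per_field_accuracy(predictions, labels):
    """Exact-match accuracy per field across paired documents."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, gold in zip(predictions, labels):
        for name, expected in gold.items():
            total[name] += 1
            if predicted.get(name) == expected:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up labelled documents and model outputs, for illustration only.
labels = [
    {"invoice_number": "INV-1", "invoice_date": "2025-01-10"},
    {"invoice_number": "INV-2", "invoice_date": "2025-02-11"},
    {"invoice_number": "INV-3", "invoice_date": "2025-03-12"},
    {"invoice_number": "INV-4", "invoice_date": "2025-04-13"},
]
predictions = [
    {"invoice_number": "INV-1", "invoice_date": "2025-01-10"},
    {"invoice_number": "INV-2", "invoice_date": "2025-02-17"},  # misread day
    {"invoice_number": "INV-3", "invoice_date": "2025-03-12"},
    {"invoice_number": "INV-4", "invoice_date": "2025-04-13"},
]

accuracy = per_field_accuracy(predictions, labels)
below_target = sorted(name for name, score in accuracy.items() if score < 0.90)
print(accuracy, below_target)
```

A missing prediction counts as wrong here, which keeps omitted fields from inflating the score.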

Practical schema design

Field definition template

field_name: invoice_date
type: date
format: YYYY-MM-DD
required: true
validation:
  - rule: must be in the past 5 years
  - rule: must not be a future date
confidence_threshold: 0.8
review_on_failure: true
alternatives:
  - field_name: invoice_date_approximate
    prompt: "If exact date is not visible, provide best estimate with confidence"
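The template above could be enforced in code roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, "5 years" is approximated as 5 × 365 days, and `today` is passed in explicitly so the check is reproducible.

```python
import re
from datetime import date, timedelta

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_invoice_date(text, today):
    """Apply the template's format and range rules; return failed rules."""
    if not isinstance(text, str) or not ISO_DATE.fullmatch(text):
        return ["format: expected YYYY-MM-DD"]
    try:
        parsed = date.fromisoformat(text)
    except ValueError:
        return ["format: not a real calendar date"]
    failures = []
    if parsed > today:
        failures.append("must not be a future date")
    if parsed < today - timedelta(days=5 * 365):  # approximate 5 years
        failures.append("must be in the past 5 years")
    return failures


def invoice_date_needs_review(text, confidence, today, threshold=0.8):
    """review_on_failure: any failed rule or low confidence triggers review."""
    return bool(validate_invoice_date(text, today)) or confidence < threshold


today = date(2026, 5, 24)
for candidate in ["2025-11-03", "2027-01-01", "2019-01-01", "03/11/2025"]:
    print(candidate, validate_invoice_date(candidate, today))
```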

Validation rules by field type

Field type | Validation rules
Date | Past/future check, format check, reasonableness (not 1900-01-01)
Currency | Positive number, reasonable range, decimal precision
Text (enum) | Must match one of known values, fuzzy match for typos
Phone/email | Format regex, test number check
Name | Non-empty, no obvious gibberish (single character, all numbers)
Reference number | Pattern match (alphanumeric, length check, checksum if available)
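A few of these rules can be sketched with the standard library. The upper bound on amounts, the list of document types, the fuzzy-match cutoff, and the reference pattern are all assumptions for illustration.

```python
import difflib
import re
from decimal import Decimal, InvalidOperation

DOC_TYPES = ["INVOICE", "CREDIT NOTE", "RECEIPT"]  # hypothetical enum values


def check_currency(text, max_amount=Decimal("1000000")):
    """Positive, within a plausible range, at most two decimal places."""
    try:
        amount = Decimal(text)
    except (InvalidOperation, TypeError, ValueError):
        return False
    if not amount.is_finite():
        return False
    return 0 < amount <= max_amount and amount.as_tuple().exponent >= -2


def match_enum(text, allowed, cutoff=0.8):
    """Return the closest allowed value, tolerating small typos, else None."""
    matches = difflib.get_close_matches(text.strip().upper(), allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def check_reference(text):
    """Pattern check: three capital letters, a dash, four to eight digits."""
    return re.fullmatch(r"[A-Z]{3}-\d{4,8}", text) is not None


print(check_currency("118.40"), check_currency("118.405"))
print(match_enum("Invoce", DOC_TYPES), match_enum("Statement", DOC_TYPES))
print(check_reference("INV-1042"), check_reference("inv1042"))
```

Fuzzy matching is best kept to enums whose values are far apart. For short codes that differ by one character, a typo-tolerant match can map one valid value onto another.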

Confidence scoring

Assign per-field confidence based on:

  • Model confidence in the structured output
  • Document quality score (DPI, scan quality, text extractability)
  • Field type difficulty (structured fields are easier than free-text fields)
  • Historical accuracy for this field type and document source

Route a document to human review if any required field has confidence below threshold OR any validation rule fails.
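The text does not say how the four signals should be combined, so the sketch below makes an assumption: it takes the minimum, so the weakest signal caps the score. A weighted or calibrated combination is equally plausible. All names and numbers are illustrative.

```python
def field_confidence(model_conf, doc_quality, type_ease, historical_accuracy):
    """Conservative combination (assumed): the weakest signal wins.

    All inputs are expected in [0, 1]; type_ease is higher for
    structured fields and lower for free-text fields.
    """
    return min(model_conf, doc_quality, type_ease, historical_accuracy)


def route(fields):
    """Review if any validation rule fails or a required field is below threshold."""
    for f in fields:
        if f["validation_failures"]:
            return "review"
        if f["required"] and f["confidence"] < f["threshold"]:
            return "review"
    return "auto"


date_conf = field_confidence(0.95, 0.60, 0.90, 0.92)  # faint stamp on a poor scan
fields = [
    {"name": "invoice_number", "required": True, "confidence": 0.97,
     "threshold": 0.8, "validation_failures": []},
    {"name": "invoice_date", "required": True, "confidence": date_conf,
     "threshold": 0.8, "validation_failures": []},
]
print(date_conf, route(fields))
```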

Decision framework

Question | Approach
Is the document a clean digital file? | Auto-extract with 10% random sample for review
Is it a scanned PDF with annotations? | Extract with low confidence defaults, route all to review
Are extracted values used in financial reporting? | Mandatory human review for all values above a threshold
Is the schema stable? | Version the schema with CI eval
Is the schema changing weekly? | Fix the business process before automating extraction
Do you have labelled test documents? | Measure extraction accuracy by field before go-live
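The first three rows of the framework can be sketched as a routing function. The sampling method is an assumption: it hashes a document identifier so the same document always gets the same decision, which is one common alternative to a random number generator. The function names and the 10,000 amount threshold are hypothetical.

```python
import hashlib


def in_review_sample(document_id: str, rate: float = 0.10) -> bool:
    """Deterministic sampling: hash the id and compare against the rate."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def review_policy(document_id, clean_digital, financial_amount=None,
                  amount_threshold=10_000):
    """Return "review" or "auto" following the framework rows above."""
    if financial_amount is not None and financial_amount > amount_threshold:
        return "review"  # financial values above the threshold
    if not clean_digital:
        return "review"  # annotated scans: route all to review
    return "review" if in_review_sample(document_id) else "auto"


sampled = sum(in_review_sample(f"doc-{i}") for i in range(10_000))
print(sampled / 10_000)
```

Hash-based sampling makes audits repeatable, because re-running the pipeline selects the same documents for review.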

Methodology and sources

This guide draws on structured-output documentation from major LLM providers, document AI platform documentation, validation library practices, and operational experience from production extraction pipelines.

Change log

2026-05-24 — First published version.

Source list