What model cards tell you — and what they do not

Most major model launches come with a model card: a short document that says what the model can do, how it was trained, how it was evaluated, and where its limitations lie.

The problem is that model cards are written by the people who built the model. They are useful, but they are not neutral. Reading one well means knowing what to treat as fact, what to treat as marketing, and what to treat as absent.

TL;DR

A model card gives you the basics: what the model is, what benchmarks it was tested on, and a rough safety summary. What it will not tell you is how well the model performs on your specific workload, what data it was actually trained on, whether the benchmark scores are reproducible, or how behaviour changes at the edges of its stated capabilities.

Treat model cards as starting points, not contracts.

What model cards typically contain

Most model cards follow a structure derived from the Hugging Face model card template or Google's Model Cards work. The sections you usually see:

Intended use and out-of-scope use. The provider says what the model is for and what it should not be used for. This is one of the most useful sections: if your use case is listed as out-of-scope, the card is telling you that the model was not tested for it.

Training data. Usually a high-level description: dataset names, size, language mix, and sometimes a filtering summary. The actual data is almost never included, and contamination with benchmark data is rarely disclosed.

Evaluation results. Benchmark names and scores. These are the most visible numbers on the card, and the most misleading if read in isolation. Benchmarks saturate, tasks overlap with training data, and leaderboard rankings tell you almost nothing about real-workload performance.

Limitations and biases. Nearly every card concedes that the model has limitations; what varies enormously is the detail. Some cards list specific known failure modes. Others say “may produce incorrect information” and call it done.

Safety and responsible-use notes. Red-teaming summaries, content-filtering details, and sometimes usage guidelines. The level of candour here is a signal in itself.

What model cards do not tell you

The gaps are predictable, and they matter:

Train-test contamination. Did the model see benchmark data during training? Some providers publish decontamination results. Most do not. Without them, a high benchmark score could simply mean the model memorised the answer set.
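
One rough way to probe for contamination yourself is word n-gram overlap between a benchmark item and whatever sample of the disclosed training corpus you can get hold of. The sketch below is a minimal illustration, not any provider's decontamination method. The function names, the 8-word window, and the sample strings are all assumptions chosen for the example.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the corpus sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    return len(item_grams & ngrams(corpus_sample, n)) / len(item_grams)


# Hypothetical strings for illustration only.
corpus_sample = (
    "Forum post: the question was, which planet in the solar system has the "
    "most confirmed moons as of the latest count? I think the answer is Saturn."
)
leaked_item = (
    "Which planet in the solar system has the most confirmed moons "
    "as of the latest count?"
)
clean_item = (
    "Name the river that flows through the capital city of the country "
    "directly north of Italy."
)

print(overlap_ratio(leaked_item, corpus_sample))  # 1.0: every 8-gram is in the sample
print(overlap_ratio(clean_item, corpus_sample))   # 0.0: no shared 8-grams
```

A high ratio only flags an item for manual review. Paraphrased or translated leaks will not show up in an exact n-gram match, so a low ratio is not proof of a clean benchmark.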

Real-workload performance. Benchmarks test general capabilities, not your documents, your prompts, or your users. A model that scores 90% on MMLU can still fail on a straightforward document-extraction task.

Reproducibility. Can you reproduce the benchmark results with the same model and configuration? Not always. Providers sometimes use undisclosed prompt formats, sampling parameters, or system prompts.
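
When you rerun a benchmark or your own eval, it helps to write down every setting the card leaves out. The sketch below records a run configuration and derives a short fingerprint from it, so that two scores are only compared when they were produced under identical settings. The class, field names, and values are hypothetical and do not correspond to any provider's API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Settings that can change a score but are often missing from a card."""
    model_id: str  # exact, dated version string rather than a moving alias
    system_prompt: str
    prompt_template: str
    temperature: float
    top_p: float
    max_tokens: int
    n_shots: int

    def fingerprint(self) -> str:
        """Stable short hash of the full configuration."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical values for illustration only.
baseline = EvalConfig(
    model_id="example-model-2026-05-01",
    system_prompt="You are a careful assistant.",
    prompt_template="Question: {question}\nAnswer:",
    temperature=0.0,
    top_p=1.0,
    max_tokens=256,
    n_shots=0,
)
five_shot = EvalConfig(**{**asdict(baseline), "n_shots": 5})

print(baseline.fingerprint(), five_shot.fingerprint())
print("comparable:", baseline.fingerprint() == five_shot.fingerprint())
```

Storing the fingerprint next to each score makes it obvious when a published number and your own number were never measuring the same thing.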

Pricing and latency. These usually live in the API docs rather than the model card, but they are the other half of any practical decision.

Edge-case behaviour. Model cards cover typical performance. They do not tell you how the model handles unusual inputs, ambiguous instructions, or competing constraints in a tool-calling setting.

Data provenance. The description of training data is almost never detailed enough to verify claims about language coverage, domain representation, or data quality.

Where teams misunderstand model cards

The common mistakes are predictable:

treating benchmark scores as the basis for a purchase decision;
assuming “limitations acknowledged” means “limitations understood”;
skipping the out-of-scope section because it looks like boilerplate;
comparing models by the number of benchmarks listed rather than the relevance of those benchmarks;
ignoring the date: a model card from six months ago may describe a model that the same API endpoint no longer serves.

The most expensive mistake is deploying a model based on its card without running your own eval first.
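
A first eval does not need to be elaborate. The sketch below runs a handful of labelled examples through a model and reports the pass rate along with the failures. `call_model` is a hypothetical stand-in for whichever client you actually use (here it is a stub with canned answers so the example runs offline), and exact-match scoring is an assumption that suits extraction tasks better than open-ended ones.

```python
from typing import Callable


def call_model(prompt: str) -> str:
    """Hypothetical stub; replace with a real API or local-model call."""
    canned = {
        "Extract the invoice number: 'Invoice INV-1042, due 1 June'": "INV-1042",
        "Extract the invoice number: 'Ref: INV-77 / PO 9931'": "PO 9931",
    }
    return canned.get(prompt, "")


def run_eval(examples: list[tuple[str, str]], model: Callable[[str], str]) -> dict:
    """Run each (prompt, expected) pair and collect failures for review."""
    failures = []
    for prompt, expected in examples:
        got = model(prompt).strip()
        if got != expected:
            failures.append({"prompt": prompt, "expected": expected, "got": got})
    passed = len(examples) - len(failures)
    return {
        "total": len(examples),
        "passed": passed,
        "pass_rate": passed / len(examples) if examples else 0.0,
        "failures": failures,
    }


# Hypothetical examples; in practice, use real samples from your workload.
examples = [
    ("Extract the invoice number: 'Invoice INV-1042, due 1 June'", "INV-1042"),
    ("Extract the invoice number: 'Ref: INV-77 / PO 9931'", "INV-77"),
]

report = run_eval(examples, call_model)
print(f"{report['passed']}/{report['total']} passed")
for failure in report["failures"]:
    print("FAILED:", failure["prompt"], "->", failure["got"])
```

Keeping the failures, not just the pass rate, is the point: the individual misses show which of the card's vague limitations actually bite on your data.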

Practical decision check

Before you rely on a model card:

Does the card date match the model version you are evaluating?
Are the benchmarks relevant to your workload?
Is training-data contamination disclosed?
Are limitations specific enough to test?
Have you run at least 50 of your own examples through the model?
Are the safety notes detailed enough for your risk profile?

If the answer to any of those is no, the card is not enough.
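
If you evaluate several models, it can help to record the answers in a form you can keep alongside the results. The sketch below encodes the six questions above and applies the "any no fails" rule. The class and function names are hypothetical, and the answers shown are made up for one imaginary model.

```python
from dataclasses import dataclass


@dataclass
class CardCheck:
    question: str
    answer: bool


def card_is_enough(checks: list[CardCheck]) -> tuple[bool, list[str]]:
    """Return (verdict, open questions); a single 'no' fails the card."""
    open_questions = [c.question for c in checks if not c.answer]
    return (not open_questions, open_questions)


# Hypothetical answers for one model under evaluation.
checks = [
    CardCheck("Card date matches the model version", True),
    CardCheck("Benchmarks are relevant to the workload", True),
    CardCheck("Training-data contamination is disclosed", False),
    CardCheck("Limitations are specific enough to test", True),
    CardCheck("At least 50 own examples have been run", False),
    CardCheck("Safety notes fit the risk profile", True),
]

enough, open_questions = card_is_enough(checks)
print("Card is enough:", enough)
for question in open_questions:
    print("Still open:", question)
```

The list of open questions doubles as a to-do list: each entry is something to test yourself or to ask the provider about before deployment.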

Methodology

Data checked: 2026-05-24
Sources consulted: Published model cards from OpenAI, Anthropic, Google, Meta, Mistral and Hugging Face; Hugging Face model card template; Google model card toolkit; NIST AI RMF guidance.
Assumptions: Model cards from different periods and providers vary in detail. This is a structural guide, not an audit of any specific card.
Limitations: This page cannot tell you whether a specific model card is accurate or complete. It cannot tell you whether a model is safe for your use case. It can only give you a framework for reading model cards as the partial, provider-produced documents they are.
Jurisdiction: Global. Model card practices vary by provider and jurisdiction; specific regulatory requirements for model documentation may apply under the EU AI Act or sector-specific rules.

Source list

Hugging Face model card guide — https://huggingface.co/docs/hub/en/model-cards (accessed 2026-05-24)
Google Model Cards — https://modelcards.withgoogle.com/about (accessed 2026-05-24)
NIST AI Risk Management Framework — https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-05-24)
OpenAI GPT-4 system card — https://cdn.openai.com/papers/gpt-4-system-card.pdf (accessed 2026-05-24)
Anthropic model card documentation — https://docs.anthropic.com/en/docs/about-claude/model-card (accessed 2026-05-24)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: full editorial review — added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections, fixed writtenBy label (editor)
2026-05-24: first published