What model cards tell you — and what they do not
Most major model launches come with a model card or a closely related document such as a system card. It describes what the model can do, how it was trained, how it was evaluated, and what its limitations are.
The problem is that model cards are written by the people who built the model. They are useful, but they are not neutral. Reading one well means knowing what to treat as fact, what to treat as marketing, and what to treat as absent.
Quick answer
A model card gives you the basics: what the model is, what benchmarks it was tested on, and a rough safety summary. What it will not tell you is how well the model performs on your specific workload, what data it was actually trained on, whether the benchmark scores are reproducible, or how behaviour changes at the edges of its stated capabilities.
Treat model cards as starting points, not contracts.
What model cards typically contain
Most model cards follow a structure derived from the Hugging Face model card template or from Google's Model Cards work. Neither is a formal standard, so sections vary by provider. The sections you usually see:
Intended use and out-of-scope use. The provider says what the model is for and what it should not be used for. This is one of the most useful sections: if your use case is listed as out-of-scope, the card is telling you that the provider did not design the model for it and most likely did not test it there.
Training data. Usually a high-level description: dataset names, size, language mix, and sometimes a filtering summary. The actual data is almost never included, and contamination with benchmark data is rarely disclosed.
Evaluation results. Benchmark names and scores. These are the most visible numbers on the card, and the most misleading if read in isolation. Benchmarks saturate, tasks overlap with training data, and leaderboard rankings tell you almost nothing about real-workload performance.
Limitations and biases. Known weaknesses and bias risks, described in widely varying detail. Some cards list specific known failure modes. Others say “may produce incorrect information” and call it done.
Safety and responsible-use notes. Red-teaming summaries, content-filtering details, and sometimes usage guidelines. The level of candour here is a signal in itself.
What model cards do not tell you
The gaps are predictable, and they matter:
Train-test contamination. Did the model see benchmark data during training? Some providers publish decontamination results. Most do not. Without them, a high benchmark score could simply mean the model memorised the answer set.
Real-workload performance. Benchmarks test general capabilities, not your documents, your prompts, or your users. A model that scores 90% on MMLU can still fail on a straightforward document-extraction task.
Reproducibility. Can you reproduce the benchmark results with the same model and configuration? Not always. Providers sometimes use undisclosed prompt formats, sampling parameters, or system prompts.
Pricing and latency. These usually live in the API docs, not in the model card. But they are the other half of any practical decision.
Edge-case behaviour. Model cards cover typical performance. They do not tell you how the model handles unusual inputs, ambiguous instructions, or competing constraints in a tool-calling setting.
Data provenance. The description of training data is almost never detailed enough to verify claims about language coverage, domain representation, or data quality.
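To make the train-test contamination gap concrete, here is a minimal sketch of an n-gram overlap check, one common way of looking for benchmark text inside a corpus. It assumes you have a text sample to compare against, which for a closed model's training data you usually do not. The function names, the default 8-token window, and the simple word tokenisation are illustrative choices, not any provider's documented method.

```python
from __future__ import annotations

import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    return len(item_grams & ngrams(corpus_sample, n)) / len(item_grams)


# Hypothetical benchmark item and corpus text, made up for this example.
item = "What is the capital of France? The capital of France is Paris."
corpus = (
    "Trivia dump: what is the capital of France? "
    "The capital of France is Paris. Next question."
)
print(overlap_ratio(item, corpus, n=5))
```

A ratio near 1 means the item appears almost verbatim in the sample. A low ratio does not rule out contamination, because paraphrased or translated copies would not match at the n-gram level.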
Where teams misunderstand model cards
The common mistakes are predictable:
- treating benchmark scores as enough to make a purchase decision;
- assuming “limitations acknowledged” means “limitations understood”;
- skipping the out-of-scope section because it looks like boilerplate;
- comparing models by the number of benchmarks listed rather than the relevance of those benchmarks;
- ignoring the date: a model card from six months ago may describe a model that has since been updated or retired behind the same API endpoint.
The most expensive mistake is deploying a model based on its card without running your own eval first.
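A first-pass eval of your own does not need much machinery. The sketch below assumes a hypothetical `call_model(prompt, config)` function that you would write around your provider's SDK. `RunConfig`, `run_eval`, and exact-match scoring are illustrative choices, not a recommended standard. The run records the system prompt, prompt template, and sampling settings alongside the score, so it can be repeated later with the same setup.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class RunConfig:
    """Everything needed to repeat a run; field names are illustrative."""
    model_id: str
    system_prompt: str
    prompt_template: str  # must contain "{input}" and no other braces
    temperature: float = 0.0
    max_tokens: int = 256

    def fingerprint(self) -> str:
        """Short stable hash so a result file can be matched to its config."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


def run_eval(
    examples: list[tuple[str, str]],
    call_model: Callable[[str, RunConfig], str],
    config: RunConfig,
) -> dict:
    """Score (input, expected) pairs by exact match after trimming whitespace."""
    failures = []
    for text, expected in examples:
        prompt = config.prompt_template.format(input=text)
        output = call_model(prompt, config).strip()
        if output != expected.strip():
            failures.append({"input": text, "expected": expected, "got": output})
    total = len(examples)
    return {
        "config": config.fingerprint(),
        "total": total,
        "accuracy": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
    }


def fake_model(prompt: str, config: RunConfig) -> str:
    # Stand-in for a real API call; replace with your provider's client.
    return "42" if "6 x 7" in prompt else "unknown"


config = RunConfig("example-model", "You are terse.", "Q: {input}\nA:")
report = run_eval(
    [("6 x 7", "42"), ("capital of France", "Paris")], fake_model, config
)
print(report["accuracy"], len(report["failures"]))
```

Exact match suits extraction and classification tasks. Free-text answers would need a different scorer, and the failures list is usually more informative than the accuracy figure.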
Practical decision check
Before you rely on a model card:
- Does the card date match the model version you are evaluating?
- Are the benchmarks relevant to your workload?
- Is training-data contamination disclosed?
- Are limitations specific enough to test?
- Have you run at least 50 of your own examples through the model?
- Are the safety notes detailed enough for your risk profile?
If the answer to any of those is no, the card is not enough.
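The checklist above can be kept as a small record per candidate model, so the gaps are written down rather than remembered. This is only one way to encode it. The `CardCheck` class and its field names are made up for this example, and the 50-example threshold is the one from the list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CardCheck:
    """One reviewer's answers to the six questions; field names are illustrative."""
    date_matches_version: bool
    benchmarks_relevant: bool
    contamination_disclosed: bool
    limitations_testable: bool
    own_examples_run: int  # how many of your own examples you have run
    safety_notes_sufficient: bool

    def gaps(self, min_examples: int = 50) -> list[str]:
        """Return the names of the checks that were answered no."""
        answers = {
            "date_matches_version": self.date_matches_version,
            "benchmarks_relevant": self.benchmarks_relevant,
            "contamination_disclosed": self.contamination_disclosed,
            "limitations_testable": self.limitations_testable,
            "own_examples_run": self.own_examples_run >= min_examples,
            "safety_notes_sufficient": self.safety_notes_sufficient,
        }
        return [name for name, ok in answers.items() if not ok]


# Hypothetical review: no contamination disclosure, only 12 own examples run.
check = CardCheck(True, True, False, True, 12, True)
print(check.gaps())  # an empty list would mean every question was answered yes
```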
What this page cannot tell you
This page cannot tell you whether a specific model card is accurate or complete. It cannot tell you whether a model is safe for your use case. It can only give you a framework for reading model cards as the partial, provider-produced documents they are.
Methodology and sources
Check date: 2026-05-24
What was checked: Published model cards from OpenAI, Anthropic, Google, Meta, Mistral and Hugging Face; Hugging Face model card template; Google Model Card Toolkit; NIST AI RMF guidance.
What the sources were used for: identifying standard sections, common gaps, and typical patterns in provider disclosure.
Assumptions and limits: Model cards from different periods and providers vary in detail. This is a structural guide, not an audit of any specific card.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief, with a critical-reading framework for model cards.
Source list
- Hugging Face model card guide — https://huggingface.co/docs/hub/en/model-cards
- Google Model Cards — https://modelcards.withgoogle.com/about
- NIST AI Risk Management Framework — https://www.nist.gov/itl/ai-risk-management-framework
- OpenAI GPT-4 System Card — https://cdn.openai.com/papers/gpt-4-system-card.pdf
- Anthropic model card examples — https://docs.anthropic.com/en/docs/about-claude/model-card