LLM glossary

How to use it

Decode LLM jargon, acronyms and infrastructure terms

Search by the exact term you saw in an article or API doc, or by the plain-English idea you are trying to understand. Definitions are deliberately short so you can get back to the guide that uses the term in context.

Terms used in the guides

Term	Description	Context
AI RMF (AI Risk Management Framework)	A voluntary framework from NIST that helps organisations identify, assess, and manage AI risks throughout a system's lifecycle.	Regulation & policy
API (Application Programming Interface)	A way for one program to talk to another. LLM APIs let your code send prompts and receive responses without running the model yourself.	Infrastructure
attention	A mechanism that lets a model focus on relevant parts of its input when predicting each token. Self-attention is the version transformers use to relate words within the same sequence.	Model architecture
audit log	An immutable, append-only record of the actions a system takes, including who did what, when, and with what data. Regulated LLM deployments are typically expected to log prompts, responses, and guardrail actions.	Safety & governance
batch API	An async API mode where you submit jobs that run when capacity is available — usually much cheaper but with turnaround times of hours, not seconds.	Operations & infra
benchmark	A standardised test used to measure and compare LLM performance on a specific task, such as maths, coding, or factual knowledge.	Evaluation
chain of thought	Prompting an LLM to show its step-by-step reasoning before answering. Reasoning models do this internally; regular models need to be prompted to write it out.	Generation & output
Chatbot Arena	A crowdsourced leaderboard where humans vote on which model gave the better answer in blind side-by-side comparisons.	Evaluation
chunking	Splitting a document into smaller pieces before embedding it, so a RAG system can retrieve the most relevant sections rather than whole documents.	RAG & retrieval
contamination	When benchmark test questions accidentally appear in a model's training data, making its scores unreliable. Also called leakage.	Evaluation
context window	The maximum number of tokens an LLM can process in a single request, covering both the input prompt and the output it generates. Larger windows (128K to 1M tokens) can fit entire documents at once, but they cost more and may slow responses.	Core concepts
DPO (Direct Preference Optimisation)	A simpler alternative to RLHF that tunes a model directly from pairs of preferred and rejected outputs, without needing a separate reward model.	Training & fine-tuning
embedding	A numerical representation of text (or images, audio) that captures meaning as coordinates, so similar content is close together in vector space.	RAG & retrieval
EU AI Act	European Union regulation that classifies AI systems by risk level and imposes requirements on high-risk applications, including some LLM use cases.	Regulation & policy
eval	Short for evaluation. A test or suite of tests that checks whether a model or AI feature produces acceptable outputs.	Evaluation
eval harness	The software infrastructure used to run models against datasets and calculate metrics. A widely used example is EleutherAI's lm-evaluation-harness.	Evaluation
fallback	A backup plan for when an LLM call fails or times out — such as switching to a different model or returning a cached response.	Operations & infra
FCA (Financial Conduct Authority)	The UK financial regulator that sets conduct standards for firms, including how AI-driven products treat consumers. The FCA Consumer Duty applies directly to LLM features that affect customer outcomes.	Regulation & policy
few-shot	A prompting technique where a few examples of the task (often three to eight) are provided in the prompt to guide the model before the actual query.	Core concepts
fine-tuning	Training a pre-trained model further on a smaller, task-specific or domain-specific dataset so it performs better on that task. It can update all of the weights (full fine-tuning) or only a small set of added weights (for example, LoRA).	Training & fine-tuning
function calling / tool use	Giving an LLM the ability to request that external code run — such as searching a database or sending an email — by outputting a structured call rather than free text.	Generation & output
GDPR (General Data Protection Regulation)	The EU data protection law, retained in UK law as UK GDPR, that governs how personal data is collected, stored, and processed. It is relevant whenever LLM applications handle user data.	Regulation & policy
GGUF	A file format for storing quantised models so they can run locally with tools like llama.cpp and Ollama.	Model architecture
golden dataset	A carefully curated set of prompts and expected outputs used to catch regressions when changing prompts, models, or pipeline logic.	Evaluation
GPU (Graphics Processing Unit)	The specialised hardware most LLMs run on. GPUs are optimised for the parallel matrix maths that makes inference and training fast.	Infrastructure
guardrail	A safety rule or filter that checks LLM inputs or outputs against a policy — for example, blocking personally identifiable information or harmful content.	Safety & governance
hallucination	When an LLM generates text that sounds plausible but is factually wrong, invented, or unsupported by the source material it was given. Not the same as a deliberate lie.	Core concepts
HELM (Holistic Evaluation of Language Models)	A Stanford framework that evaluates models across many dimensions — accuracy, fairness, bias, toxicity, and efficiency — not just one score.	Evaluation
HIPAA (Health Insurance Portability and Accountability Act)	US federal law that sets privacy and security standards for protected health information (PHI). LLM systems handling clinical data must meet HIPAA requirements for audit logging, access controls, and data isolation.	Regulation & policy
HumanEval	A coding benchmark where models must write Python functions from docstrings, with test cases checking correctness.	Evaluation
inference	The process of running a trained model to produce an answer. Every time you send a prompt and get a response, the model is doing inference.	Core concepts
jailbreak	A specific type of prompt that tricks an LLM into ignoring its safety training and complying with harmful requests.	Safety & governance
JSON mode	A provider setting that constrains the model to produce valid JSON. Less strict than full structured-output schema enforcement.	Generation & output
large language model (LLM)	An AI system trained on massive text datasets to predict and generate language. LLMs power chatbots, coding assistants, search, and content tools.	Core concepts
latency	How long it takes to get a response from an LLM — often split into time-to-first-token and total generation time.	Operations & infra
leaderboard	A public ranking of LLMs based on their benchmark scores or head-to-head evaluation results. Leaderboards can be misleading due to contamination, saturation, or non-standardised evaluation protocols.	Evaluation
LiteLLM	A proxy that translates between different LLM provider APIs so your application code only needs to speak one format.	Tools & platforms
llama.cpp	A C++ inference engine that runs quantised LLMs efficiently on consumer hardware, including CPUs and Apple Silicon.	Tools & platforms
LLM-as-a-judge	Using one LLM to score the outputs of another. Fast and scalable but can inherit the judge model's own biases and blind spots.	Evaluation
LoRA (Low-Rank Adaptation)	A lightweight fine-tuning method that trains a small set of new weights rather than changing the full model, making it cheaper and faster.	Training & fine-tuning
MCP (Model Context Protocol)	An open protocol that standardises how LLMs connect to external tools, data sources, and services, letting one client work with many servers.	Tools & platforms
mixture of experts (MoE)	A model design where only a fraction of the total parameters are active for any given input, making large models cheaper to run.	Model architecture
MMLU (Massive Multitask Language Understanding)	A benchmark that tests model knowledge across 57 subjects from law and medicine to history and maths.	Evaluation
model card	A document that describes a model's intended use, limitations, training data, and evaluation results. Think of it as a nutrition label for an AI model.	Evaluation
multimodal models	AI models that can process and understand multiple types of data, such as text, images, audio, and video, within a single framework.	Core concepts
observability	The practice of logging, tracing, and monitoring LLM calls so you can debug problems, track costs, and measure quality over time.	Operations & infra
Ollama	A desktop tool that makes it easy to download and run open-weight LLMs locally, without needing cloud infrastructure.	Tools & platforms
open weights	A model whose trained parameters are publicly downloadable, so you can run it yourself. Not the same as open-source — the training data and recipe may still be private.	Model architecture
OpenRouter	A unified API gateway that lets you access models from multiple providers through a single endpoint, with usage-based billing.	Tools & platforms
parameters	The internal weights of a model — think of them as the knobs the model tunes during training. More parameters usually means a more capable but more expensive model.	Model architecture
pass@k	A metric measuring the probability that at least one of k generated samples is correct. Commonly used for coding and reasoning benchmarks rather than simple accuracy.	Evaluation
PHI (Protected Health Information)	Any health data that can identify an individual (medical records, test results, insurance details) and is subject to HIPAA protections. LLM systems must never expose PHI to unsecured endpoints.	Safety & governance
PII (Personally Identifiable Information)	Data that can identify a specific person — names, email addresses, phone numbers, ID numbers — that should not be sent to external LLM APIs without careful handling.	Safety & governance
pre-training	The initial, most expensive stage of training where a model learns general language patterns from internet-scale text before being adapted for specific tasks.	Training & fine-tuning
prompt	The text you send to an LLM to get a response. A prompt can be a question, an instruction, a document to summarise, or a conversation history.	Core concepts
prompt caching	A provider feature that stores the attention states computed for repeated prompt prefixes (system prompts, documents) and reuses them, so tokens the model has already processed cost less and responses start sooner. Savings vary by provider; reductions of 50-90% on cached tokens are typical of published pricing.	Operations & infra
prompt injection	An attack where someone hides instructions inside user input to override the system prompt or make the LLM behave unexpectedly.	Safety & governance
quantisation (US: quantization)	Reducing the precision of a model's weights (e.g. from 16-bit to 4-bit) so it uses less memory and runs faster, usually with a small quality trade-off.	Model architecture
RAG (Retrieval-Augmented Generation)	A technique where an LLM looks up relevant documents or data before answering, rather than relying only on what it memorised during training.	RAG & retrieval
rate limit	A cap on how many API requests or tokens you can send in a given time window, enforced by the provider.	Operations & infra
reasoning model	An LLM trained to spend extra computation "thinking" through a problem before answering, often producing hidden internal reasoning tokens.	Generation & output
red teaming	Deliberately trying to make an AI system produce harmful, biased, or policy-violating outputs to find weaknesses before real users do.	Safety & governance
reranker	A second-pass model that re-scores retrieved documents after the initial vector search, improving the quality of what the LLM sees.	RAG & retrieval
RLHF (Reinforcement Learning from Human Feedback)	A training stage where human raters score model outputs and the model is tuned to prefer higher-scoring responses.	Training & fine-tuning
saturation	When a benchmark reaches a performance ceiling and can no longer distinguish between high-performing models. Saturation pushes researchers to create harder benchmarks, which eventually face the same problem.	Evaluation
SDK (Software Development Kit)	A library or set of tools that makes it easier to work with a specific service, such as an LLM provider's Python or JavaScript client.	Infrastructure
semantic search	Search that matches by meaning rather than exact keywords, usually powered by embeddings and vector similarity.	RAG & retrieval
SFT (Supervised Fine-Tuning)	Fine-tuning a model on example input-output pairs so it learns to follow instructions before any preference tuning.	Training & fine-tuning
SLA (Service Level Agreement)	A provider's formal commitment about uptime, latency, or support responsiveness — and what happens if they miss those targets.	Operations & infra
SOC 2 (System and Organization Controls 2)	An AICPA auditing framework under which an independent auditor reports on a service provider's security, availability, processing integrity, confidentiality, and privacy controls. Regulated industries often require the LLM platforms they use to hold a SOC 2 Type II report.	Regulation & policy
structured output	Forcing an LLM to respond in a specific format (usually JSON) that follows a defined schema, so another system can parse the result reliably.	Generation & output
system prompt	Instructions set by the developer that run before a user's message, used to set the model's tone, role, rules, or behaviour.	Core concepts
temperature	A setting that controls how random an LLM's output is. Low values (close to 0) are more focused and deterministic; high values produce more varied, creative results.	Generation & output
throughput	How many tokens or requests an LLM service can handle per second, important for production workloads.	Operations & infra
token	A chunk of text (often a word, part of a word, or punctuation) that an LLM reads or writes. Pricing, context limits, and rate limits are all measured in tokens.	Core concepts
top-k sampling	A sampling method that limits the model to picking from only the K most likely next tokens, ignoring everything below that threshold.	Generation & output
top-p (nucleus sampling)	A sampling method that only considers the smallest set of tokens whose cumulative probability reaches a threshold, filtering out long-tail unlikely tokens.	Generation & output
trace	A record of a single LLM interaction, including the prompt, response, timing, token counts, and any intermediate steps like tool calls or retrievals.	Operations & infra
training	The compute-heavy process of teaching an LLM to predict text by showing it enormous amounts of language data.	Core concepts
transformer	The neural-network architecture behind nearly all modern LLMs. It uses attention mechanisms to weigh the importance of different words in a sequence.	Model architecture
vector database	A database designed to store and search embeddings. Used in RAG systems to find the most relevant documents for a query.	RAG & retrieval
vLLM	A high-throughput inference server designed for serving LLMs in production, with features like continuous batching and PagedAttention.	Tools & platforms
VPC (Virtual Private Cloud)	An isolated network segment within a cloud provider where resources run without public internet exposure. Often used to deploy LLM applications that handle sensitive or regulated data.	Infrastructure
VRAM (Video RAM)	The memory on a GPU. Model size and quantisation determine how much VRAM you need to run an LLM locally.	Infrastructure
zero-shot	A prompting technique where the model is asked to solve a task without any examples in the prompt, so it must rely on what it learned during training.	Core concepts
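
The sampling terms in the table (temperature, top-k sampling, top-p) are easier to compare side by side. The sketch below is a minimal illustration in plain Python, not any provider's implementation; the five-word vocabulary and its logits are made-up values chosen only to show the effect of each setting.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution.
    Temperature must be > 0 here (real APIs usually treat 0 as greedy decoding)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most likely token indices and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of token indices whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(dist)
    weights = [dist[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["the", "a", "cat", "dog", "xylophone"]
logits = [2.0, 1.5, 1.0, 0.5, -3.0]

probs = softmax(logits, temperature=0.7)
rng = random.Random(0)  # seeded so repeated runs draw the same token
print("top-k (k=2):", [vocab[i] for i in top_k_filter(probs, 2)])
print("top-p (p=0.9):", [vocab[i] for i in top_p_filter(probs, 0.9)])
print("sampled:", vocab[sample(top_p_filter(probs, 0.9), rng)])
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is uncertain and fewer when it is confident.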

What this glossary covers

The glossary is organised around the concepts that appear most often across the guides.

Core concepts

Tokens, prompts, inference, context windows — the building blocks every LLM guide assumes you already know.
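
Because the context window covers both the prompt and the generated output, a request only fits if the two together stay under the limit. The sketch below shows that arithmetic; the function names and the 128,000-token window are illustrative assumptions, and real token counts come from the provider's tokeniser.

```python
def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def max_output_budget(prompt_tokens, context_window):
    """Tokens left for the response once the prompt is counted (never negative)."""
    return max(context_window - prompt_tokens, 0)

# Hypothetical numbers: a 128K-token window and a 120K-token prompt.
window = 128_000
prompt = 120_000
print(fits_context(prompt, 4_000, window))   # the request fits
print(fits_context(prompt, 10_000, window))  # the request does not fit
print(max_output_budget(prompt, window))     # tokens left for output
```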

Model architecture

Transformers, attention, parameters, quantisation, and the file formats that let you run models locally.
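
The link between parameters, quantisation, and VRAM comes down to simple arithmetic: the number of weights times the bits per weight. The sketch below estimates weight memory only, under the assumption of a hypothetical 7-billion-parameter model; it ignores the KV cache, activations, and the per-block overhead that real quantised formats add, so treat the result as a lower bound.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

# Hypothetical 7-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is why quantised models fit on consumer GPUs that could not hold the original.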

RAG & retrieval

Embeddings, vector databases, chunking, and reranking — the stack that lets LLMs answer from your documents.
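
The retrieval stack can be sketched end to end in a few lines. The example below chunks text into overlapping word windows and ranks chunks by cosine similarity. Note the main assumption: `toy_embed` is a word-count stand-in, so it matches keywords rather than meaning; a real system would call an embedding model and store the vectors in a vector database.

```python
import math
from collections import Counter

def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def toy_embed(text):
    """Toy stand-in for an embedding model: lowercase word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_n=2):
    """Return the top_n chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_n]

# Hypothetical document chunks, for illustration only.
docs = [
    "refunds are processed within five days",
    "our office is closed on public holidays",
    "shipping takes two days",
]
print(retrieve("how are refunds processed", docs, top_n=1))
```

A reranker would slot in after `retrieve`, re-scoring the shortlisted chunks with a more expensive model before they reach the LLM.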

Evaluation

Benchmarks, evals, golden datasets, and why LLM-as-a-judge is useful but fallible.
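
For the pass@k metric defined in the table, one common way to estimate it is to generate n samples per problem, count the c samples that pass the tests, and compute 1 - C(n-c, k) / C(n, k). This is the estimator popularised alongside HumanEval. The sketch below is a minimal single-problem version; a benchmark score would average it over all problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the ways to pick k samples that all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 10 samples generated, 3 passed.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```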

Safety & governance

Guardrails, red teaming, prompt injection, and the regulations shaping what AI products can and cannot do.
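
As a small illustration of a guardrail, the sketch below redacts email addresses from a prompt before it would be sent to an external API. The regular expression and the `[EMAIL]` placeholder are assumptions made for this example; production PII filtering covers many more data types and usually relies on dedicated detection tools rather than a single pattern.

```python
import re

# Simple pattern for email-like strings; deliberately loose and not RFC-complete.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def redact_emails(text):
    """Replace email addresses with a placeholder and report how many were found."""
    redacted, count = EMAIL_PATTERN.subn("[EMAIL]", text)
    return redacted, count

prompt = "Summarise the complaint from bob@example.com about his order."
safe_prompt, hits = redact_emails(prompt)
print(safe_prompt)
print("redactions:", hits)
```

Logging the redaction count alongside the request gives the audit log a record of each guardrail action.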

Operations & tools

Latency, rate limits, observability, and the open-source tools that run LLMs in production.
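
Fallback and latency measurement often live in the same wrapper around an LLM call. The sketch below tries providers in order and records how long the successful call took. The two provider functions are stand-ins that simulate a timeout and a cached reply; a real wrapper would call an SDK and would likely catch narrower exception types.

```python
import time

def call_with_fallback(prompt, providers, retries=1):
    """Try each (name, callable) in order; return (name, reply, latency_seconds)."""
    errors = []
    for name, call in providers:
        for attempt in range(retries + 1):
            start = time.perf_counter()
            try:
                reply = call(prompt)
                return name, reply, time.perf_counter() - start
            except Exception as exc:  # broad catch keeps the sketch short
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

def flaky_primary(prompt):
    """Hypothetical primary model that always times out."""
    raise TimeoutError("simulated timeout")

def cached_backup(prompt):
    """Hypothetical backup that returns a cached response."""
    return f"[cached] answer to: {prompt}"

name, reply, latency = call_with_fallback(
    "What is a token?",
    [("primary", flaky_primary), ("backup", cached_backup)],
)
print(name, reply, f"{latency:.6f}s")
```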

Methodology

Date checked: 2026-05-29
Sources consulted: Model provider documentation (OpenAI, Anthropic, Google, Meta), academic papers (arXiv), industry standards (NIST AI RMF, EU AI Act text), open-source tool documentation (llama.cpp, vLLM, Ollama), and widely referenced benchmarks (MMLU, HumanEval, Chatbot Arena, HELM)
Assumptions: Definitions are written for a technical-but-not-specialist audience who encounters these terms in LLM guides and API documentation. Terms are explained at the conceptual level, not the mathematical one.
Limitations: This glossary covers terms used across theLLMs guides. It is not exhaustive for adjacent fields (MLOps, classical NLP, computer vision). Provider-specific feature names may change; definitions describe concepts, not product roadmaps.
Jurisdiction: Global. EU AI Act and GDPR references reflect EU/UK regulation as of the check date. US regulatory references (NIST AI RMF) reflect current US federal guidance.

Trust Stack

AI draft model: gpt-5.4-mini
AI review model: deepseek-v4-pro
Human editorial review: No (automated editorial pipeline)
Last substantive check: 2026-05-29
Corrections policy: If you spot an error or missing term, contact us via the Contact page
Affiliation: theLLMs has no vendor affiliation, sponsorship, or commercial relationship with any AI provider mentioned

Source list

OpenAI Platform documentation — https://platform.openai.com/docs (accessed 2026-05-29)
Anthropic documentation — https://docs.anthropic.com (accessed 2026-05-29)
Google AI documentation — https://ai.google.dev/docs (accessed 2026-05-29)
Meta AI model cards and research — https://ai.meta.com/resources/ (accessed 2026-05-29)
NIST AI Risk Management Framework — https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-05-29)
EU AI Act (Regulation 2024/1689) — https://eur-lex.europa.eu/eli/reg/2024/1689 (accessed 2026-05-29)

Change log

2026-05-29: editorial review — added Methodology, Trust Stack, Source list, Change log, and Related guides sections; verified all 72 definitions against current provider documentation; no definition changes required
2026-05-25: initial glossary built with 72 terms across 11 categories