Glossary
Plain-English definitions for terms used across theLLMs guides. These are general explanations, not legal, financial or technical advice.
How to use it
Search by the exact term you saw in an article or API doc, or by the plain-English idea you are trying to understand. Definitions are deliberately short so you can get back to the guide that uses the term in context.
| Term | Description | Context |
|---|---|---|
| AI RMF (AI Risk Management Framework) | A voluntary framework from NIST that helps organisations identify, assess, and manage AI risks throughout a system's lifecycle. | Regulation & policy |
| API (Application Programming Interface) | A way for one program to talk to another. LLM APIs let your code send prompts and receive responses without running the model yourself. | Infrastructure |
| attention | A mechanism that lets a model focus on relevant parts of its input when predicting each token. Self-attention is the version transformers use to relate words within the same sequence. | Model architecture |
| batch API | An async API mode where you submit jobs that run when capacity is available — usually much cheaper but with turnaround times of hours, not seconds. | Operations & infra |
| benchmark | A standardised test used to measure and compare LLM performance on a specific task, such as maths, coding, or factual knowledge. | Evaluation |
| chain of thought | Prompting an LLM to show its step-by-step reasoning before answering. Reasoning models do this internally; regular models need to be prompted to write it out. | Generation & output |
| Chatbot Arena | A crowdsourced leaderboard where humans vote on which model gave the better answer in blind side-by-side comparisons. | Evaluation |
| chunking | Splitting a document into smaller pieces before embedding it, so a RAG system can retrieve the most relevant sections rather than whole documents. | RAG & retrieval |
| contamination | When benchmark test questions accidentally appear in a model's training data, making its scores unreliable. Also called leakage. | Evaluation |
| context window | The maximum number of tokens an LLM can consider at once, covering the input prompt and the output it generates. Bigger windows cost more and may slow responses. | Core concepts |
| DPO (Direct Preference Optimisation) | A simpler alternative to RLHF that tunes a model directly from pairs of preferred and rejected outputs, without needing a separate reward model. | Training & fine-tuning |
| embedding | A numerical representation of text (or images, audio) that captures meaning as coordinates, so similar content is close together in vector space. | RAG & retrieval |
| EU AI Act | European Union regulation that classifies AI systems by risk level and imposes requirements on high-risk applications, including some LLM use cases. | Regulation & policy |
| eval | Short for evaluation. A test or suite of tests that checks whether a model or AI feature produces acceptable outputs. | Evaluation |
| fallback | A backup plan for when an LLM call fails or times out — such as switching to a different model or returning a cached response. | Operations & infra |
| fine-tuning | Taking a pre-trained model and training it further on a smaller, task-specific dataset so it performs better on that task. | Training & fine-tuning |
| function calling / tool use | Giving an LLM the ability to request that external code run — such as searching a database or sending an email — by outputting a structured call rather than free text. | Generation & output |
| GDPR (General Data Protection Regulation) | The EU data protection law, mirrored in the UK as the UK GDPR, that governs how personal data is collected, stored, and processed. It applies whenever an LLM application handles personal data. | Regulation & policy |
| GGUF | A file format for storing quantised models so they can run locally with tools like llama.cpp and Ollama. | Model architecture |
| golden dataset | A carefully curated set of prompts and expected outputs used to catch regressions when changing prompts, models, or pipeline logic. | Evaluation |
| GPU (Graphics Processing Unit) | The specialised hardware most LLMs run on. GPUs are optimised for the parallel matrix maths that makes inference and training fast. | Infrastructure |
| guardrail | A safety rule or filter that checks LLM inputs or outputs against a policy — for example, blocking personally identifiable information or harmful content. | Safety & governance |
| hallucination | When an LLM generates text that sounds plausible but is factually wrong or invented, or that contradicts its training data or the sources it was given. Not the same as a deliberate lie. | Core concepts |
| HELM (Holistic Evaluation of Language Models) | A Stanford framework that evaluates models across many dimensions — accuracy, fairness, bias, toxicity, and efficiency — not just one score. | Evaluation |
| HumanEval | A coding benchmark where models must write Python functions from docstrings, with test cases checking correctness. | Evaluation |
| inference | The process of running a trained model to produce an answer. Every time you send a prompt and get a response, the model is doing inference. | Core concepts |
| jailbreak | A specific type of prompt that tricks an LLM into ignoring its safety training and complying with harmful requests. | Safety & governance |
| JSON mode | A provider setting that constrains the model to produce valid JSON. Less strict than full structured-output schema enforcement. | Generation & output |
| large language model (LLM) | An AI system trained on massive text datasets to predict and generate language. LLMs power chatbots, coding assistants, search, and content tools. | Core concepts |
| latency | How long it takes to get a response from an LLM — often split into time-to-first-token and total generation time. | Operations & infra |
| LiteLLM | A proxy that translates between different LLM provider APIs so your application code only needs to speak one format. | Tools & platforms |
| llama.cpp | A C++ inference engine that runs quantised LLMs efficiently on consumer hardware, including CPUs and Apple Silicon. | Model architecture |
| LLM-as-a-judge | Using one LLM to score the outputs of another. Fast and scalable but can inherit the judge model's own biases and blind spots. | Evaluation |
| LoRA (Low-Rank Adaptation) | A lightweight fine-tuning method that trains a small set of new weights rather than changing the full model, making it cheaper and faster. | Training & fine-tuning |
| MCP (Model Context Protocol) | An open protocol that standardises how LLMs connect to external tools, data sources, and services, letting one client work with many servers. | Tools & platforms |
| mixture of experts (MoE) | A model design where only a fraction of the total parameters are active for any given input, making large models cheaper to run. | Model architecture |
| MMLU (Massive Multitask Language Understanding) | A benchmark that tests model knowledge across 57 subjects from law and medicine to history and maths. | Evaluation |
| model card | A document that describes a model's intended use, limitations, training data, and evaluation results. Think of it as a nutrition label for an AI model. | Evaluation |
| observability | The practice of logging, tracing, and monitoring LLM calls so you can debug problems, track costs, and measure quality over time. | Operations & infra |
| Ollama | A desktop tool that makes it easy to download and run open-weight LLMs locally, without needing cloud infrastructure. | Tools & platforms |
| open weights | A model whose trained parameters are publicly downloadable, so you can run it yourself. Not the same as open-source — the training data and recipe may still be private. | Model architecture |
| OpenRouter | A unified API gateway that lets you access models from multiple providers through a single endpoint, with usage-based billing. | Tools & platforms |
| parameters | The internal weights of a model — think of them as the knobs the model tunes during training. More parameters usually means a more capable but more expensive model. | Model architecture |
| PII (Personally Identifiable Information) | Data that can identify a specific person — names, email addresses, phone numbers, ID numbers — that should not be sent to external LLM APIs without careful handling. | Safety & governance |
| pre-training | The initial, most expensive stage of training where a model learns general language patterns from internet-scale text before being adapted for specific tasks. | Training & fine-tuning |
| prompt | The text you send to an LLM to get a response. A prompt can be a question, an instruction, a document to summarise, or a conversation history. | Core concepts |
| prompt caching | A provider feature that stores and reuses repeated prompt prefixes so you pay less for tokens the model has already processed. | Operations & infra |
| prompt injection | An attack where someone hides instructions inside user input, or inside content the LLM retrieves, to override the system prompt or make the LLM behave unexpectedly. | Safety & governance |
| quantisation | Shrinking a model's precision (e.g. from 16-bit to 4-bit weights) so it uses less memory and runs faster, usually with a small quality trade-off. | Model architecture |
| RAG (Retrieval-Augmented Generation) | A technique where an LLM looks up relevant documents or data before answering, rather than relying only on what it memorised during training. | RAG & retrieval |
| rate limit | A cap on how many API requests or tokens you can send in a given time window, enforced by the provider. | Operations & infra |
| reasoning model | An LLM trained to spend extra computation "thinking" through a problem before answering, often producing hidden internal reasoning tokens. | Generation & output |
| red teaming | Deliberately trying to make an AI system produce harmful, biased, or policy-violating outputs to find weaknesses before real users do. | Safety & governance |
| reranker | A second-pass model that re-scores retrieved documents after the initial vector search, improving the quality of what the LLM sees. | RAG & retrieval |
| RLHF (Reinforcement Learning from Human Feedback) | A training stage where human raters score model outputs and the model is tuned to prefer higher-scoring responses. | Training & fine-tuning |
| SDK (Software Development Kit) | A library or set of tools that makes it easier to work with a specific service, such as an LLM provider's Python or JavaScript client. | Infrastructure |
| semantic search | Search that matches by meaning rather than exact keywords, usually powered by embeddings and vector similarity. | RAG & retrieval |
| SFT (Supervised Fine-Tuning) | Fine-tuning a model on example input-output pairs so it learns to follow instructions before any preference tuning. | Training & fine-tuning |
| SLA (Service Level Agreement) | A provider's formal commitment about uptime, latency, or support responsiveness — and what happens if they miss those targets. | Operations & infra |
| structured output | Forcing an LLM to respond in a specific format (usually JSON) that follows a defined schema, so another system can parse the result reliably. | Generation & output |
| system prompt | Instructions set by the developer that run before a user's message, used to set the model's tone, role, rules, or behaviour. | Core concepts |
| temperature | A setting that controls how random an LLM's output is. Low values (close to 0) are more focused and deterministic; high values produce more varied, creative results. | Generation & output |
| throughput | How many tokens or requests an LLM service can handle per second, important for production workloads. | Operations & infra |
| token | A chunk of text (often a word, part of a word, or punctuation) that an LLM reads or writes. Pricing, context limits, and rate limits are all measured in tokens. | Core concepts |
| top-k sampling | A sampling method that limits the model to picking from only the K most likely next tokens, ignoring everything below that threshold. | Generation & output |
| top-p (nucleus sampling) | A sampling method that only considers the smallest set of tokens whose cumulative probability reaches a threshold, filtering out long-tail unlikely tokens. | Generation & output |
| trace | A record of a single LLM interaction — including the prompt, response, timing, token counts, and any intermediate steps like tool calls or retrievals. | Evaluation |
| training | The compute-heavy process of teaching an LLM to predict text by showing it enormous amounts of language data. | Core concepts |
| transformer | The neural-network architecture behind nearly all modern LLMs. It uses attention mechanisms to weigh the importance of different words in a sequence. | Model architecture |
| vector database | A database designed to store and search embeddings. Used in RAG systems to find the most relevant documents for a query. | RAG & retrieval |
| vLLM | A high-throughput inference server designed for serving LLMs in production, with features like continuous batching and PagedAttention. | Model architecture |
| VRAM (Video RAM) | The memory on a GPU. Model size and quantisation determine how much VRAM you need to run an LLM locally. | Model architecture |
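
The temperature, top-k sampling, and top-p entries in the table describe three ways of reshaping the same next-token probabilities. The sketch below illustrates them on a four-token toy vocabulary. The function names, the vocabulary, and the scores are all hypothetical and chosen for illustration; real providers apply these settings inside their own inference stacks and may combine or order them differently.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities. Lower temperature sharpens the
    distribution; temperature must be > 0 in this sketch."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = {i: probs[i] for i in keep}
    total = sum(kept.values())
    return [kept.get(i, 0.0) / total for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = {}, 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept.values())
    return [kept.get(i, 0.0) / total for i in range(len(probs))]

# Hypothetical scores for four candidate next tokens.
vocab = ["cat", "dog", "car", "sky"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)  # more focused
top2 = top_k_filter(probs, k=2)       # only the two likeliest tokens survive
nucleus = top_p_filter(probs, p=0.9)  # long-tail tokens are dropped

# Draw one token from the filtered distribution (seeded for repeatability).
sample = random.Random(0).choices(vocab, weights=nucleus, k=1)[0]
```

With these scores, lowering the temperature gives the top token a larger share of the probability, top-k with `k=2` removes the two least likely tokens, and top-p with `p=0.9` removes only the least likely one.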
The glossary is organised around the concepts that appear most often across the guides.
Core concepts: tokens, prompts, inference, and context windows, the building blocks every LLM guide assumes you already know.
Model architecture: transformers, attention, parameters, quantisation, and the file formats that let you run models locally.
RAG & retrieval: embeddings, vector databases, chunking, and reranking, the stack that lets LLMs answer from your documents.
Evaluation: benchmarks, evals, golden datasets, and why LLM-as-a-judge is useful but fallible.
Safety & governance: guardrails, red teaming, prompt injection, and the regulations shaping what AI products can and cannot do.
Operations & infra: latency, rate limits, observability, and the open-source tools that run LLMs in production.
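
The chunking, embedding, and semantic search entries describe one pipeline, and a small sketch can make it concrete. It makes a loud simplifying assumption: a word-count vector (`toy_embed`, a hypothetical helper) stands in for a real embedding model, so it matches on shared words rather than on meaning. The function names, chunk sizes, and sample text are illustrative only.

```python
import math
from collections import Counter

def chunk_words(text, size=10, overlap=2):
    """Split text into overlapping windows of words (one simple chunking strategy)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector.
    A real embedding would capture meaning, not just shared words."""
    return Counter(word.strip(".,").lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_n=1):
    """Rank chunks by similarity to the query and return the best top_n."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_n]

document = (
    "Rate limits cap how many requests you can send per minute. "
    "Prompt caching reuses repeated prompt prefixes to cut cost. "
    "Quantisation shrinks model weights so they fit in less memory."
)
chunks = chunk_words(document, size=10, overlap=2)
best = retrieve("how does prompt caching cut cost", chunks)
```

In a production RAG system the embedding step would call a real model, the chunks would live in a vector database, and a reranker might re-score the top results before they reach the LLM.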