AI coding agents: what to measure before trusting them
A practical measurement framework for evaluating AI coding agents on real engineering work: review burden, test pass rate, diff quality, security issues and rollback cost.
Run
The Run lane is for doing the work: choosing models, setting up evals, pricing a feature, testing retrieval, running agents, and explaining trade-offs without turning the meeting into acronym soup.
Published now
A practical guide to adapting incident response for prompts, outputs, evals, rollbacks and customer-facing AI failures.
A practical guide to balancing observability with sensitive-data retention risk in production LLM systems.
A practical guide to testing whether cited sources actually support the generated claim, not just whether the answer looks grounded.
A practical guide to reducing personal-data exposure in AI features by minimising what you send before you try to redact it.
A practical guide to direct and indirect prompt injection, what can go wrong in retrieval-heavy apps, and what controls actually reduce risk.
A practical guide to where LLM data leaks happen, what to minimise before sending data, and what retention settings to check before launch.
A practical guide to approval gates, least privilege, dry runs and audit logs for AI agents with tools.
A practical guide to separating model-level safety from app-level permissions, tool boundaries and operational controls.
A practical guide to creating a small test set for unsupported claims, regression checking and safer LLM updates.
A plain-English guide to the difference between chat history, profile memory, stored app data and training data, with privacy checks for product teams.
A decision guide for operators who need to know when deterministic automation is enough and when real agent behaviour is worth the operational cost.
A practical explanation of why a small, stable test set is often more useful than a huge benchmark when you need confidence before shipping.
A release-process guide for teams that want automated evaluation to run before changes reach users.
A practical guide to decoupling your product from provider churn so every model update does not become a rewrite.
A plain-English guide to distinguishing sensible safety boundaries from over-refusal that breaks legitimate use cases.
A guide to spotting benchmark overfitting and test-specific behaviour before it turns into product disappointment.
A practical first-week checklist for finding failure modes in a new LLM feature — with concrete test items, sample prompts, and decision guides.
A clear guide to what Model Context Protocol is, what it is not, and why the marketing sometimes runs ahead of the wiring.
A practical guide to deciding whether you need a vector database, a search index or something much simpler.
A practical guide to breaking source material into retrievable pieces without wrecking meaning or search quality.
How rerankers improve retrieval precision — with a worked example, model/provider names, latency numbers, and a decision framework for when the extra step pays for itself.
A plain-English guide to the three phases of model work, what each one changes, and what the difference means for budget, data and risk.
A plain-English guide to why AI features feel slow, what to measure, and how to separate queueing, model time, tool calls and network delay.
Editorial rule
Briefed pipeline
Understand AI memory features and their privacy implications.
Decide whether a task needs an “agent” or deterministic automation.
Understand benchmark claims in model launches.
Build a practical evaluation set for an AI feature.
Put AI regression tests into a software pipeline.
Reduce disruption from frequent model releases and deprecations.
Reduce factual errors in LLM outputs.
Diagnose unwanted model refusals.
Understand why high benchmark scores may not translate.
Test an AI feature before launch.
Choose an adaptation strategy for an AI product.
Build reliable tool-using AI workflows.
Understand Model Context Protocol and when it helps.
Decide whether to add a vector database.
Improve retrieval quality in document AI.
Understand reranking after vector search.
Plan a simple document-QA prototype.
Evaluate coding agents for real engineering work.
Extract structured data from unstructured text.
Monitor an LLM app in production.
Design resilient AI features.
Add human review to AI workflows.
Manage changes to prompts in teams.
Use AI over internal policies and procedures.
Understand the site’s editorial trust model.