Severity Gates: The Missing Layer in AI Agent Workflows

Severity gates pipeline: Agent Output passes through Critical, Major, and Minor gates to become Trusted Output

AI agents produce output fast. Code, design documents, API schemas, migration plans — the generation part is solved. The problem is what happens next.

How do you know the output is correct?

Most teams answer this with more context. Longer prompts. More rules. The assumption: if the agent had enough instructions, it would get it right the first time. In practice, the opposite happens. More context means agents forget things. Rules get lost in long sessions. The agent follows 90% of the instructions and silently drops the rest.

But the AI is only half the problem. The human directing the agent might be making wrong decisions — and without something pushing back, those decisions get implemented faithfully.

The trust problem has two sides. AI forgets rules. Humans make wrong calls. Neither side alone is reliable. I wrote about why trust became the expensive part of AI-assisted development — this article is about the methodology to build that trust.

The Core Idea: Severity as Exit Criteria

In 1976, Michael Fagan at IBM applied manufacturing quality control to software. His insight: classify defects by severity, use severity to gate progress.

Major defect — compromises functional integrity. You cannot proceed.
Minor defect — stylistic, cosmetic. You can proceed.

That’s it. A binary gate based on severity classification. Not a judgment call. Not “looks good to me.” A deterministic decision: are there unresolved major defects? Yes → stop. No → continue.

Fagan inspections caught 80-90% of defects. But the full process — planning, group inspection, rework, follow-up — cost 16-20 hours per 1,000 lines of code. Too slow for modern development.

The insight survived. The process didn’t.

How Others Apply This Principle

The same idea — severity drives the verdict — shows up across the industry. Different implementations, same core.

System	How it gates	Key mechanism
Google Critique	Severity labels on every comment: Critical (blocks), Nit (non-blocking), Optional, FYI	LGTM depends on resolving all Critical comments. The prefix removes ambiguity
Stripe Blueprints	Deterministic nodes + agentic nodes in a pipeline	Agent cannot proceed past a failing deterministic check. 1,300+ PRs/week, all gated
Meta Coordinator	Specialized sub-agents review independently, coordinator consolidates	90-95% convergence threshold. Forces agents to trace code paths before submitting findings
CodeRabbit	Two-loop architecture: fast summarization → deep reasoning	40+ static analysis tools feed into frontier model reasoning. Sandboxed execution

Two patterns stand out:

Stripe constrains context aggressively. 400+ internal tools available, but each agent gets only ~15 relevant ones. Less context, more focus. The opposite of “give it everything and hope it remembers.”
Meta forces justification. Semi-formal reasoning requires agents to state premises, trace paths, and provide conclusions before submitting. This reduces hallucinated findings.
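The first pattern, narrowing a large tool catalog to a small relevant subset, can be sketched in a few lines. This is only an illustration of the idea, not Stripe's implementation, which the article does not describe. The tag-overlap scoring, the `select_tools` function, and the catalog entries are all hypothetical.

```python
def select_tools(task_tags: set[str], catalog: dict[str, set[str]], limit: int = 15) -> list[str]:
    """Return up to `limit` tool names whose tags overlap the task's tags the most."""
    scored = [(len(tags & task_tags), name) for name, tags in catalog.items()]
    # Tools with no overlap are dropped entirely rather than ranked last.
    relevant = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for a stable result.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in relevant[:limit]]


# Hypothetical catalog: tool name -> descriptive tags.
CATALOG = {
    "run_migration": {"database", "schema"},
    "lint_python": {"python", "style"},
    "query_logs": {"observability"},
}

tools_for_task = select_tools({"database", "python"}, CATALOG, limit=2)
```

The design choice worth noting is the hard cap: the agent never sees the rest of the catalog, so irrelevant tools cannot dilute its context.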

Three Frameworks That Complete the Picture

Severity-gated review judges the output. But a complete evaluation system also needs structure, adversarial pressure, and a technique for surfacing hidden risks.

Framework	Origin	What it adds
Stage-Gate	Robert Cooper, 1990. Product development at P&G, 3M, LEGO	Structure. Work happens in phases. Each phase produces an artifact. A gate evaluates: Go / Kill / Hold / Recycle. “Recycle” = iterate
Red Team	US Army, 1960s. Adopted in cybersecurity, chaos engineering	Adversarial posture. The evaluator tries to break the plan. Looks for unvalidated assumptions, undiscussed failure modes, dismissed edge cases
Pre-mortem	Gary Klein, 2007. Harvard Business Review	Prospective hindsight. “Assume this project failed. Why?” Klein cites research finding that prospective hindsight improves the ability to correctly identify reasons for future outcomes by 30%. Gives permission to criticize

Combined:

Stage-Gate provides the when (checkpoints between phases). Red Team provides the how (adversarial, not friendly). Pre-mortem provides the lens (“this shipped and broke — why?”).

The result is not a checklist review. It’s a structured challenge that surfaces what you missed.

The Working Framework

The working framework combines Fagan’s principle (severity as exit criteria) with the orchestration patterns from Google, Stripe, and Meta (multi-reviewer consolidation).

Severity Levels

Severity	Meaning	Examples
CRITICAL	Blocks progress	Security risk, data loss, breaking change without migration, business rule violation
MAJOR	Must fix, but not a design blocker	Convention violation, missing masking, unsafe migration, missing deprecation plan
MINOR	Address during implementation	Naming suggestion, optimization opportunity, documentation gap

Verdict

Deterministic. No judgment calls.

Verdict	Condition	Next action
NO-GO	Any CRITICAL unresolved	Fix CRITICALs, re-run full review
CONDITIONAL	No CRITICALs, MAJORs remain	Fix MAJORs, re-run affected reviewer(s)
GO	Only MINORs or clean	Proceed
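Because the verdict depends only on which severities remain unresolved, the table above fits in one small function. The following is a minimal sketch; the `Severity` and `Verdict` enums and the `verdict_for` name are illustrative choices, not part of any existing tool.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    MAJOR = "MAJOR"
    MINOR = "MINOR"


class Verdict(Enum):
    NO_GO = "NO-GO"
    CONDITIONAL = "CONDITIONAL"
    GO = "GO"


def verdict_for(unresolved: list[Severity]) -> Verdict:
    """Map the severities of the unresolved findings to a verdict."""
    if Severity.CRITICAL in unresolved:
        return Verdict.NO_GO
    if Severity.MAJOR in unresolved:
        return Verdict.CONDITIONAL
    # Only MINOR findings remain, or the list is empty.
    return Verdict.GO
```

The function takes no reviewer opinion as input. The only room for judgment is in classifying each finding, which is where review effort should go.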

Structured Findings

Every finding follows the same format:

[SEVERITY] Description
→ Location: file or section affected
→ Action: what to fix or decide
→ Verify: how to confirm the fix is correct

The Verify line is concrete. Not “check it works” but “run X and confirm Y” or “search for Z and verify no references remain.” Actionable without interpretation.
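One way to keep findings uniform is to make the format a data structure, so that a reviewer cannot emit a finding without a Location, Action, and Verify line. The sketch below assumes a hypothetical `Finding` class; the example finding, its file path, and its commands are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str      # "CRITICAL", "MAJOR", or "MINOR"
    description: str
    location: str
    action: str
    verify: str

    def render(self) -> str:
        """Render the finding in the four-line format shown above."""
        return "\n".join([
            f"[{self.severity}] {self.description}",
            f"→ Location: {self.location}",
            f"→ Action: {self.action}",
            f"→ Verify: {self.verify}",
        ])


# Hypothetical finding; the path and commands are illustrative only.
example = Finding(
    severity="MAJOR",
    description="Customer email is written to the audit log unmasked",
    location="services/audit_log.py",
    action="Mask the email field before logging",
    verify="Search the log output for '@' and confirm no full addresses remain",
)
print(example.render())
```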

Multi-Agent Orchestration

Multi-agent orchestration: specialized reviewers feed into an orchestrator that applies worst-verdict-wins

Multiple specialized reviewers (security, database, API contracts, performance) evaluate the same artifact independently. An orchestrator consolidates the results.

Consolidation rule: the final verdict is the worst across all reviewers.

Any NO-GO → consolidated NO-GO
Any CONDITIONAL (no NO-GO) → consolidated CONDITIONAL
All GO → consolidated GO
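The consolidation rule amounts to taking a maximum over an ordering of verdicts. A minimal sketch follows, assuming each reviewer reports its verdict as one of the three strings used above. The `consolidate` name and the choice to reject an empty result set (so that "nobody reviewed" never counts as GO) are assumptions of this example.

```python
# Verdicts ranked from best to worst; a higher rank is worse.
RANK = {"GO": 0, "CONDITIONAL": 1, "NO-GO": 2}


def consolidate(verdicts: dict[str, str]) -> str:
    """Return the worst verdict across all reviewers (worst-verdict-wins)."""
    if not verdicts:
        raise ValueError("no reviewer verdicts to consolidate")
    return max(verdicts.values(), key=RANK.__getitem__)


# Hypothetical reviewer results for one artifact.
results = {"security": "GO", "database": "CONDITIONAL", "api": "GO"}
final = consolidate(results)
```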

Each reviewer stays within its domain. The security reviewer doesn’t comment on naming. The database reviewer doesn’t flag API design. Scope boundaries prevent context dilution.

Humans make the final decisions. Reviewers surface concerns; they never approve or reject.

The framework defines what to evaluate and how to classify it. For how to wire automated checks into the agent’s working loop — linters after each edit, type checkers incrementally, tests before task completion — see Harness Engineering. The harness is the runtime enforcement. The severity gate is the judgment layer on top.

Gates for Planning, Not Just Code

The highest-value gate is not on the pull request. It’s on the plan.

Fixing a wrong decision in a design document costs minutes. Fixing it after implementation costs days, sometimes weeks if other features have been built on top of it. When agents can implement a feature overnight, a wrong decision in the spec becomes wrong code by morning.

The review framework applies to any artifact:

System design — module boundaries, data flows, dependency direction
API contracts — backward compatibility, schema breaking changes
Migration plans — rollback strategy, step ordering, data safety
Task specs — scope clarity, enough detail for unsupervised execution

Many of these checks can be automated as fitness functions — automated tests that protect structural properties. “No cross-module imports,” “API schema changes must not break clients,” “dependencies flow in one direction.” When a fitness function fails, it becomes a finding in the severity-gated review.
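As an illustration of the "dependencies flow in one direction" rule, the sketch below checks an import graph against an ordered list of layers. The layer names, the `direction_violations` function, and the sample graph are hypothetical. A real check would build the graph from the codebase instead of a hand-written dictionary.

```python
# Hypothetical layers, ordered from outermost to innermost.
# Allowed direction: outer layers may import inner ones, never the reverse.
LAYERS = ["api", "service", "domain"]


def direction_violations(imports: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs where an inner layer imports an outer one."""
    rank = {name: index for index, name in enumerate(LAYERS)}
    violations = []
    for importer, targets in imports.items():
        for target in sorted(targets):
            # Modules outside the known layers are ignored in this sketch.
            if importer in rank and target in rank and rank[target] < rank[importer]:
                violations.append((importer, target))
    return violations


graph = {
    "api": {"service"},
    "service": {"domain"},
    "domain": {"api"},  # inner layer reaching outward: a violation
}
violations = direction_violations(graph)
```

Each returned pair can then be reported as a finding, with the importing module as its Location.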

This is where the frameworks connect:

Stage-Gate structures the design into phases. Red Team challenges the design at each gate. Severity-gated review classifies what the challenge finds.

Review the thinking before you review the code. The expensive mistakes are never syntax errors. They are wrong decisions that got implemented correctly.

The Day-Night Shift

Day-Night Shift workflow: prepare specs during the day, agents execute in parallel at night, evaluate results in the morning

This framework enables a workflow that would be impossible without structured trust in planning output.

Day: Engineers prepare specifications. Design documents go through severity-gated review. Adversarial interrogation challenges assumptions. Iterate until the verdict is GO.

Night: Agents execute. Multiple features in parallel. Each works from a reviewed specification. No supervision needed — the planning has been stress-tested.

Morning: Engineers evaluate results. Implementation review checks that code follows the contracts in the spec. Module boundaries respected? API matches schema? Tests cover the defined behavior?

The day-night shift only works when planning output is trusted enough for unsupervised execution. The review framework creates that trust.

Running 10 features in parallel overnight instead of 1 with constant supervision — that is the unlock. But it requires a structured evaluation layer that most teams don’t have.

The Methodology Is Not New

Fagan solved this in 1976. What changed is the volume. AI agents generate more artifacts, faster, across more domains. The need for structured evaluation didn’t decrease — it increased.

More context doesn’t make agents reliable. Structured evaluation does. The answer to “how do I trust AI output” is not a better prompt. It’s a better gate.

References

Design and Code Inspections to Reduce Errors in Program Development (Michael Fagan, IBM Systems Journal, 1976). The original severity-classified inspection methodology.
Winning at New Products (Robert Cooper, 1990). Stage-Gate framework for phased development with decision gates.
Performing a Project Premortem (Gary Klein, Harvard Business Review, 2007). Prospective hindsight for risk identification.
Minions: Stripe’s One-Shot, End-to-End Coding Agents (Stripe Engineering). Blueprint architecture with deterministic and agentic nodes.
Modern Code Review: A Case Study at Google (Sadowski et al., ICSE 2018). How Google’s Critique system standardizes review severity.