Severity-Gated Review

Classify findings by severity. Use severity to gate progress. This is the methodology for evaluating AI agent output — whether code, design documents, API contracts, or migration plans. The principle comes from Fagan inspections (IBM, 1976) and is applied at Google, Stripe, and Meta with different implementations.

More context does not make agents reliable. Structured evaluation does. The answer to “how do I trust AI output” is not a better prompt — it is a better gate.

Severity Levels

Every finding is classified into one of three tiers:

  • CRITICAL — blocks progress. Must resolve before proceeding. Examples: security risk, data loss, breaking change without migration, business rule violation.
  • MAJOR — must fix before proceeding. Not a design blocker but an implementation blocker. Examples: convention violation, missing data masking, unsafe migration, missing deprecation plan.
  • MINOR — can address during implementation. Examples: naming suggestion, optimization opportunity, documentation gap.

When classifying: if the merged artifact sets a wrong baseline for future work, it is at least MAJOR. If it risks data loss, security, or breaking clients, it is CRITICAL.

Verdict Logic

Verdicts are deterministic. No judgment calls.

  • NO-GO — any CRITICAL unresolved. Next action: fix all CRITICALs, re-run the full review.
  • CONDITIONAL — no CRITICALs, MAJORs remain. Next action: fix MAJORs, re-run affected reviewer(s) only.
  • GO — only MINORs, or clean. Next action: proceed.
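Because the verdict logic is deterministic, it can be written down directly. A minimal sketch in Python — the `Severity` enum and `verdict` function are illustrative names, not taken from any particular tool:

```python
from enum import Enum

class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3

def verdict(findings: list[Severity]) -> str:
    """Deterministic gate: any CRITICAL blocks, any MAJOR conditions, else GO."""
    if Severity.CRITICAL in findings:
        return "NO-GO"
    if Severity.MAJOR in findings:
        return "CONDITIONAL"
    return "GO"
```

Note there is no judgment-call branch: the same list of findings always yields the same verdict, which is what makes the gate auditable.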

Structured Findings

Every finding must include four elements:

[SEVERITY] Description
→ Location: file or section affected
→ Action: what to fix or decide
→ Verify: how to confirm the fix is correct

The Verify line is concrete. Not “check it works” but “run X and confirm Y” or “search for Z and verify no references remain.” Findings without a verify step are incomplete.
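The four-element shape can be enforced by making each element a required field. A hypothetical Python sketch, assuming a simple in-process representation — the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    severity: str      # "CRITICAL" | "MAJOR" | "MINOR"
    description: str
    location: str      # file or section affected
    action: str        # what to fix or decide
    verify: str        # how to confirm the fix is correct

    def __post_init__(self):
        # A finding without a concrete verify step is incomplete by definition.
        if not self.verify.strip():
            raise ValueError("finding missing verify step")
```

Rejecting a finding at construction time keeps "incomplete findings" from ever entering the review record, rather than catching them later in consolidation.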

Multi-Agent Orchestration

When multiple specialized reviewers evaluate the same artifact, each produces an independent review with its own verdict. An orchestrator consolidates.

Consolidation rule: the final verdict is the worst across all reviewers. Any NO-GO makes the consolidated verdict NO-GO. Any CONDITIONAL (no NO-GO) makes it CONDITIONAL. All GO makes it GO.
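The worst-verdict-wins rule is a one-liner once verdicts are given an ordering. A minimal sketch; the ordering map is an assumption of this sketch, not a published API:

```python
def consolidate(verdicts: list[str]) -> str:
    """Worst-verdict-wins across independent reviewers."""
    order = {"GO": 0, "CONDITIONAL": 1, "NO-GO": 2}
    return max(verdicts, key=order.__getitem__)
```

Usage: `consolidate(["GO", "CONDITIONAL", "GO"])` yields `"CONDITIONAL"`; a single `"NO-GO"` dominates everything else.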

Scope boundaries prevent context dilution. The security reviewer does not comment on naming. The database reviewer does not flag API design. Each reviewer stays within its domain.

Humans make final decisions. Reviewers surface concerns; they never approve or reject.

Gates for Planning, Not Just Code

The highest-value gate is on the plan, not the pull request. Fixing a wrong decision in a design document costs minutes. Fixing it after implementation costs days.

Apply severity-gated review to any artifact:

  • System design — module boundaries, data flows, dependency direction
  • API contracts — backward compatibility, breaking schema changes
  • Migration plans — rollback strategy, step ordering, data safety
  • Task specs — scope clarity, enough detail for unsupervised execution

When a fitness function fails, it becomes a finding in the severity-gated review. The fitness function detects; the severity framework classifies and gates.

Adjacent Frameworks

Three frameworks complement severity-gated review:

  • Stage-Gate (Cooper, 1990) — provides the when: checkpoints between phases. Each phase produces an artifact, each gate evaluates it.
  • Red Team (US Army, 1960s) — provides the how: adversarial posture. The evaluator tries to break the plan.
  • Pre-mortem (Klein, 2007) — provides the lens: “assume this shipped and broke — why?” Increases risk identification by 30%.

Combined: adversarial interrogation using prospective hindsight. Not a checklist review — a structured challenge that surfaces what you missed.

The Day-Night Shift

Severity-gated review enables unsupervised parallel execution:

  • Day — prepare specs, run severity-gated review, iterate until GO
  • Night — agents execute multiple features in parallel from reviewed specs
  • Morning — evaluate results with implementation review (code follows contracts?)

This only works when planning output is trusted enough for unsupervised execution. The review framework creates that trust.

Industry Patterns

  • Google Critique — severity labels: Critical, Nit, Optional, FYI. Key insight: the prefix removes ambiguity about reviewer intent.
  • Stripe Blueprints — deterministic nodes + agentic nodes. Key insight: an agent cannot proceed past a failing deterministic check.
  • Meta Coordinator — sub-agents review independently, a coordinator consolidates. Key insight: forces agents to trace code paths before submitting, which reduces hallucinated findings.

Stripe constrains context aggressively — 400+ tools available, each agent gets ~15. Less context, more focus.

Decision Criteria

  • Use severity-gated review when the artifact will be acted on by AI agents or when mistakes compound (specs, API contracts, migrations).
  • Use simpler review for isolated, easily reversible changes with good test coverage.
  • Gate planning artifacts before implementation — wrong decisions in specs become wrong decisions in production code overnight.
  • Add multi-agent orchestration when the artifact spans multiple domains (security + database + API + performance).
  • Match review depth to reversibility — hard to reverse = stricter gates. Easy to reverse = lighter gates.

Anti-patterns

  • Review without severity classification — all findings treated equal, critical issues lost in noise.
  • Human-only review at AI speed — review capacity stays flat while generation volume multiplies. The bottleneck moves to review.
  • “Looks good to me” without structured evaluation — subjective approval without checking specific criteria.
  • Gating code but not plans — catching bugs in PRs while wrong architectural decisions ship unchallenged.
  • Context overload instead of gates — adding more rules to the prompt instead of checking the output.
  • Silencing findings without resolution — downgrading severity to avoid rework instead of fixing the issue.

Experience Notes

Previously: relied on comprehensive rules files and documentation to get AI output right on the first pass. Now: assume imperfect output and build structured evaluation loops. The shift happened because more context paradoxically makes agents less reliable — rules get lost in long sessions. The gate catches what the prompt forgot.