Severity-Gated Review
Classify findings by severity. Use severity to gate progress. This is the methodology for evaluating AI agent output — whether code, design documents, API contracts, or migration plans. The principle comes from Fagan inspections (IBM, 1976) and is applied at Google, Stripe, and Meta with different implementations.
More context does not make agents reliable. Structured evaluation does. The answer to “how do I trust AI output” is not a better prompt — it is a better gate.
Severity Levels
Every finding is classified into one of three tiers:
| Severity | Meaning | Examples |
|---|---|---|
| CRITICAL | Blocks progress. Must resolve before proceeding | Security risk, data loss, breaking change without migration, business rule violation |
| MAJOR | Must fix before implementation. Not a design blocker but an implementation blocker | Convention violation, missing data masking, unsafe migration, missing deprecation plan |
| MINOR | Can address during implementation | Naming suggestion, optimization opportunity, documentation gap |
When classifying: if the merged artifact sets a wrong baseline for future work, it is at least MAJOR. If it risks data loss, security, or breaking clients, it is CRITICAL.
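The tiers and the escalation rule above can be sketched with an ordered enum: a higher value means more severe, so the worst finding can always be picked with `max`. The names and the `at_least` helper are illustrative, not part of the methodology:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Three tiers, ordered so comparisons and max() pick the worst."""
    MINOR = 1     # can address during implementation
    MAJOR = 2     # must fix before implementation
    CRITICAL = 3  # blocks all progress

def at_least(found: Severity, floor: Severity) -> Severity:
    """Escalation guard: e.g. a wrong-baseline finding is at least MAJOR."""
    return max(found, floor)
```

A wrong-baseline finding initially classified as MINOR comes back from `at_least(Severity.MINOR, Severity.MAJOR)` as MAJOR.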
Verdict Logic
Verdicts are deterministic. No judgment calls.
| Verdict | Condition | Next action |
|---|---|---|
| NO-GO | Any CRITICAL unresolved | Fix all CRITICALs, re-run full review |
| CONDITIONAL | No CRITICALs, MAJORs remain | Fix MAJORs, re-run affected reviewer(s) only |
| GO | Only MINORs or clean | Proceed |
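One way to see that the verdicts really are deterministic: the whole table reduces to a few lines of code. A minimal sketch (the `Severity` enum and its ordering are assumptions of this illustration):

```python
from enum import IntEnum

class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3

def verdict(findings: list[Severity]) -> str:
    """Deterministic gate: derived purely from the worst unresolved finding."""
    if Severity.CRITICAL in findings:
        return "NO-GO"        # fix all CRITICALs, re-run full review
    if Severity.MAJOR in findings:
        return "CONDITIONAL"  # fix MAJORs, re-run affected reviewer(s)
    return "GO"               # only MINORs, or clean
```

No judgment call survives into this function: two reviewers given the same findings must emit the same verdict.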
Structured Findings
Every finding must include four elements:
[SEVERITY] Description
→ Location: file or section affected
→ Action: what to fix or decide
→ Verify: how to confirm the fix is correct
The Verify line is concrete. Not “check it works” but “run X and confirm Y” or “search for Z and verify no references remain.” Findings without a verify step are incomplete.
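The four required elements map directly onto fields of a record. A sketch, where `render` produces the template shown above; the class and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str     # "CRITICAL" | "MAJOR" | "MINOR"
    description: str
    location: str     # file or section affected
    action: str       # what to fix or decide
    verify: str       # concrete check: "run X and confirm Y"

    def render(self) -> str:
        return (
            f"[{self.severity}] {self.description}\n"
            f"→ Location: {self.location}\n"
            f"→ Action: {self.action}\n"
            f"→ Verify: {self.verify}"
        )
```

A reviewer that emits a `Finding` with an empty `verify` field can be rejected mechanically, which is one way to enforce the rule that findings without a verify step are incomplete.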
Multi-Agent Orchestration
When multiple specialized reviewers evaluate the same artifact, each produces an independent review with its own verdict. An orchestrator consolidates.
Consolidation rule: the final verdict is the worst across all reviewers. Any NO-GO makes the consolidated verdict NO-GO. Any CONDITIONAL (no NO-GO) makes it CONDITIONAL. All GO makes it GO.
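Worst-verdict-wins consolidation is just an ordering plus a `max`. A sketch, assuming the three verdict strings used above:

```python
# Rank verdicts from best to worst; consolidation picks the worst.
ORDER = {"GO": 0, "CONDITIONAL": 1, "NO-GO": 2}

def consolidate(verdicts: list[str]) -> str:
    """Final verdict is the worst across all independent reviewers."""
    return max(verdicts, key=ORDER.__getitem__)
```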
Scope boundaries prevent context dilution. The security reviewer does not comment on naming. The database reviewer does not flag API design. Each reviewer stays within its domain.
Humans make final decisions. Reviewers surface concerns; they never approve or reject on their own.
Gates for Planning, Not Just Code
The highest-value gate is on the plan, not the pull request. Fixing a wrong decision in a design document costs minutes. Fixing it after implementation costs days.
Apply severity-gated review to any artifact:
- System design — module boundaries, data flows, dependency direction
- API contracts — backward compatibility, schema breaking changes
- Migration plans — rollback strategy, step ordering, data safety
- Task specs — scope clarity, enough detail for unsupervised execution
When a fitness function fails, it becomes a finding in the severity-gated review. The fitness function detects; the severity framework classifies and gates.
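The hand-off can be as simple as an adapter that wraps a failed check in the finding template. Everything here (function name, default severity, dictionary keys) is a hypothetical illustration:

```python
def finding_from_fitness_failure(check_name: str, detail: str,
                                 severity: str = "MAJOR") -> dict:
    """Turn a failed fitness function into a structured finding.

    The fitness function detects; the severity gate classifies.
    """
    return {
        "severity": severity,
        "description": f"Fitness function '{check_name}' failed: {detail}",
        "location": check_name,
        "action": "Restore the invariant this fitness function checks",
        "verify": f"Re-run '{check_name}' and confirm it passes",
    }
```

Note that the verify step falls out for free: re-running the detector is exactly the "run X and confirm Y" form the template requires.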
Adjacent Frameworks
Three frameworks complement severity-gated review:
- Stage-Gate (Cooper, 1990) — provides the when: checkpoints between phases. Each phase produces an artifact, each gate evaluates it.
- Red Team (US Army, 1960s) — provides the how: adversarial posture. The evaluator tries to break the plan.
- Pre-mortem (Klein, 2007) — provides the lens: “assume this shipped and broke — why?” Klein reports that this prospective hindsight increases risk identification by roughly 30%.
Combined: adversarial interrogation using prospective hindsight. Not a checklist review — a structured challenge that surfaces what you missed.
The Day-Night Shift
Severity-gated review enables unsupervised parallel execution:
- Day — prepare specs, run severity-gated review, iterate until GO
- Night — agents execute multiple features in parallel from reviewed specs
- Morning — evaluate results with implementation review (does the code follow the reviewed contracts?)
This only works when planning output is trusted enough for unsupervised execution. The review framework creates that trust.
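The day phase is a loop around the gate. A hypothetical sketch, where `review` and `revise` stand in for whichever reviewer pipeline and editing step you use:

```python
def iterate_until_go(spec, review, revise, max_rounds: int = 5):
    """review(spec) -> (verdict, findings); revise(spec, findings) -> new spec."""
    for _ in range(max_rounds):
        outcome, findings = review(spec)
        if outcome == "GO":
            return spec  # trusted enough to hand to agents overnight
        spec = revise(spec, findings)
    # A spec that never converges is itself a signal: escalate to a human.
    raise RuntimeError(f"No GO after {max_rounds} rounds; needs human attention")
```

The bounded round count matters: a gate that loops forever quietly replaces the human decision the framework is supposed to preserve.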
Industry Patterns
| System | How it gates | Key insight |
|---|---|---|
| Google Critique | Severity labels: Critical, Nit, Optional, FYI | Prefix removes ambiguity about reviewer intent |
| Stripe Blueprints | Deterministic nodes + agentic nodes | Agent cannot proceed past failing deterministic check |
| Meta Coordinator | Sub-agents review independently, coordinator consolidates | Forces agents to trace code paths before submitting (reduces hallucinated findings) |
Stripe constrains context aggressively — 400+ tools available, each agent gets ~15. Less context, more focus.
Decision Criteria
- Use severity-gated review when the artifact will be acted on by AI agents or when mistakes compound (specs, API contracts, migrations).
- Use simpler review for isolated, easily reversible changes with good test coverage.
- Gate planning artifacts before implementation — wrong decisions in specs become wrong decisions in production code overnight.
- Add multi-agent orchestration when the artifact spans multiple domains (security + database + API + performance).
- Match review depth to reversibility — hard to reverse = stricter gates. Easy to reverse = lighter gates.
Anti-patterns
- Review without severity classification — all findings treated equal, critical issues lost in noise.
- Human-only review at AI speed — review capacity stays flat while generation volume multiplies. The bottleneck moves to review.
- “Looks good to me” without structured evaluation — subjective approval without checking specific criteria.
- Gating code but not plans — catching bugs in PRs while wrong architectural decisions ship unchallenged.
- Context overload instead of gates — adding more rules to the prompt instead of checking the output.
- Silencing findings without resolution — downgrading severity to avoid rework instead of fixing the issue.
Experience Notes
Previously: relied on comprehensive rules files and documentation to get AI output right on the first pass. Now: assume imperfect output and build structured evaluation loops. The shift happened because more context paradoxically makes agents less reliable; rules get lost in long sessions. The gate catches what the prompt forgot.