Severity-Gated Review

Classify findings by severity. Use severity to gate progress. This is the methodology for evaluating AI agent output — whether code, design documents, API contracts, or migration plans. The principle comes from Fagan inspections (IBM, 1976) and is applied at Google, Stripe, and Meta with different implementations.

More context does not make agents reliable. Structured evaluation does. The answer to “how do I trust AI output” is not a better prompt — it is a better gate.

Severity Levels

Every finding is classified into one of three tiers:

  • CRITICAL — blocks progress. Must resolve before proceeding. Examples: security risk, data loss, breaking change without migration, business rule violation.
  • MAJOR — must fix before proceeding. Not a design blocker but an implementation blocker. Examples: convention violation, missing data masking, unsafe migration, missing deprecation plan.
  • MINOR — can address during implementation. Examples: naming suggestion, optimization opportunity, documentation gap.

When classifying: if the merged artifact sets a wrong baseline for future work, it is at least MAJOR. If it risks data loss, security, or breaking clients, it is CRITICAL.

Verdict Logic

Verdicts are deterministic. No judgment calls.

  • NO-GO — any CRITICAL unresolved. Next action: fix all CRITICALs, re-run the full review.
  • CONDITIONAL — no CRITICALs, MAJORs remain. Next action: fix MAJORs, re-run affected reviewer(s) only.
  • GO — only MINORs, or clean. Next action: proceed.
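Because the verdict logic is deterministic, it can be written down directly. A minimal sketch in Python — the `Severity` enum and `verdict` function are illustrative names, not taken from any particular tool:

```python
from enum import Enum

class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3

def verdict(findings: list[Severity]) -> str:
    """Deterministic gate: any CRITICAL blocks, any MAJOR conditions, else GO."""
    if Severity.CRITICAL in findings:
        return "NO-GO"
    if Severity.MAJOR in findings:
        return "CONDITIONAL"
    return "GO"
```

Note there is no judgment-call branch: the same list of findings always yields the same verdict, which is what makes the gate auditable.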

Structured Findings

Every finding must include four elements:

[SEVERITY] Description
→ Location: file or section affected
→ Action: what to fix or decide
→ Verify: how to confirm the fix is correct

The Verify line is concrete. Not “check it works” but “run X and confirm Y” or “search for Z and verify no references remain.” Findings without a verify step are incomplete.
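The four-element shape can be enforced by making each element a required field. A hypothetical Python sketch, assuming a simple in-process representation — the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    severity: str      # "CRITICAL" | "MAJOR" | "MINOR"
    description: str
    location: str      # file or section affected
    action: str        # what to fix or decide
    verify: str        # how to confirm the fix is correct

    def __post_init__(self):
        # A finding without a concrete verify step is incomplete by definition.
        if not self.verify.strip():
            raise ValueError("finding missing verify step")
```

Rejecting a finding at construction time keeps "incomplete findings" from ever entering the review record, rather than catching them later in consolidation.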

Multi-Agent Orchestration

When multiple specialized reviewers evaluate the same artifact, each produces an independent review with its own verdict. An orchestrator consolidates.

Consolidation rule: the final verdict is the worst across all reviewers. Any NO-GO makes the consolidated verdict NO-GO. Any CONDITIONAL (no NO-GO) makes it CONDITIONAL. All GO makes it GO.
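The worst-verdict-wins rule is a one-liner once verdicts are given an ordering. A minimal sketch; the ordering map is an assumption of this sketch, not a published API:

```python
def consolidate(verdicts: list[str]) -> str:
    """Worst-verdict-wins across independent reviewers."""
    order = {"GO": 0, "CONDITIONAL": 1, "NO-GO": 2}
    return max(verdicts, key=order.__getitem__)
```

Usage: `consolidate(["GO", "CONDITIONAL", "GO"])` yields `"CONDITIONAL"`; a single `"NO-GO"` dominates everything else.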

Scope boundaries prevent context dilution. The security reviewer does not comment on naming. The database reviewer does not flag API design. Each reviewer stays within its domain.

Humans make final decisions. Reviewers surface concerns; they never approve or reject.

Gates for Planning, Not Just Code

The highest-value gate is on the plan, not the pull request. Fixing a wrong decision in a design document costs minutes. Fixing it after implementation costs days.

Apply severity-gated review to any artifact:

  • System design — module boundaries, data flows, dependency direction
  • API contracts — backward compatibility, breaking schema changes
  • Migration plans — rollback strategy, step ordering, data safety
  • Task specs — scope clarity, enough detail for unsupervised execution

When a fitness function fails, it becomes a finding in the severity-gated review. The fitness function detects; the severity framework classifies and gates.

Adjacent Frameworks

Three frameworks complement severity-gated review:

  • Stage-Gate (Cooper, 1990) — provides the when: checkpoints between phases. Each phase produces an artifact, each gate evaluates it.
  • Red Team (US Army, 1960s) — provides the how: adversarial posture. The evaluator tries to break the plan.
  • Pre-mortem (Klein, 2007) — provides the lens: “assume this shipped and broke — why?” Increases risk identification by 30%.

Combined: adversarial interrogation using prospective hindsight. Not a checklist review — a structured challenge that surfaces what you missed.

The Day-Night Shift

Severity-gated review enables unsupervised parallel execution:

  • Day — prepare specs, run severity-gated review, iterate until GO
  • Night — agents execute multiple features in parallel from reviewed specs
  • Morning — evaluate results with implementation review (code follows contracts?)

This only works when planning output is trusted enough for unsupervised execution. The review framework creates that trust.

Industry Patterns

  • Google Critique — severity labels: Critical, Nit, Optional, FYI. Key insight: the prefix removes ambiguity about reviewer intent.
  • Stripe Blueprints — deterministic nodes + agentic nodes. Key insight: an agent cannot proceed past a failing deterministic check.
  • Meta Coordinator — sub-agents review independently, a coordinator consolidates. Key insight: forces agents to trace code paths before submitting, which reduces hallucinated findings.

Stripe constrains context aggressively — 400+ tools available, each agent gets ~15. Less context, more focus.

Decision Criteria

  • Use severity-gated review when the artifact will be acted on by AI agents or when mistakes compound (specs, API contracts, migrations).
  • Use simpler review for isolated, easily reversible changes with good test coverage.
  • Gate planning artifacts before implementation — wrong decisions in specs become wrong decisions in production code overnight.
  • Add multi-agent orchestration when the artifact spans multiple domains (security + database + API + performance).
  • Match review depth to reversibility — hard to reverse = stricter gates. Easy to reverse = lighter gates.

Anti-patterns

  • Review without severity classification — all findings treated equal, critical issues lost in noise.
  • Human-only review at AI speed — review capacity stays flat while generation volume multiplies. The bottleneck moves to review.
  • “Looks good to me” without structured evaluation — subjective approval without checking specific criteria.
  • Gating code but not plans — catching bugs in PRs while wrong architectural decisions ship unchallenged.
  • Context overload instead of gates — adding more rules to the prompt instead of checking the output.
  • Silencing findings without resolution — downgrading severity to avoid rework instead of fixing the issue.

Experience Notes

Previously: relied on comprehensive rules files and documentation to get AI output right on the first pass. Now: assume imperfect output and build structured evaluation loops. The shift happened because more context paradoxically makes agents less reliable — rules get lost in long sessions. The gate catches what the prompt forgot.