Installation

Install with the CLI (recommended):

`gh skills-hub install quality-playbook`

Don't have the extension? Run `gh extension install samueltauil/skills-hub` first.

Or download the ZIP and extract it to `.github/skills/` in your repository, so the skill lives at `.github/skills/quality-playbook/`. The folder name must match `quality-playbook` for Copilot to auto-discover it.

Skill Files (9)

LICENSE.txt 1.0 KB
MIT License

Copyright (c) 2025 Andrew Stellman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SKILL.md 42.7 KB
---
name: quality-playbook
description: "Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase."
license: Complete terms in LICENSE.txt
metadata:
  version: 1.2.0
  author: Andrew Stellman
  github: https://github.com/andrewstellman/
---

# Quality Playbook Generator

**When this skill starts, display this banner before doing anything else:**

```
Quality Playbook v1.2.0 — by Andrew Stellman
https://github.com/andrewstellman/
```

Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds.

## Why This Exists

Most software projects have tests, but few have a quality *system*. Tests check whether code works. A quality system answers harder questions: what does "working correctly" mean for this specific project? What are the ways it could fail that wouldn't be caught by tests? What should every developer (human or AI) know before touching this code?

Without a quality playbook, every new contributor (and every new AI session) starts from scratch — guessing at what matters, writing tests that look good but don't catch real bugs, and rediscovering failure modes that were already found and fixed months ago. A quality playbook makes the bar explicit, persistent, and inherited.

## What This Skill Produces

Six files that together form a repeatable quality system:

| File | Purpose | Why It Matters | Executes Code? |
|------|---------|----------------|----------------|
| `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. | No |
| `quality/test_functional.*` | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. Use the project's language: `test_functional.py` (Python), `FunctionalSpec.scala` (Scala), `functional.test.ts` (TypeScript), `FunctionalTest.java` (Java), etc. | **Yes** |
| `quality/RUN_CODE_REVIEW.md` | Code review protocol with guardrails that prevent hallucinated findings | AI code reviews without guardrails produce confident but wrong findings. The guardrails (line numbers, grep before claiming, read bodies) often improve accuracy. | No |
| `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | **Yes** |
| `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No |
| `AGENTS.md` | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No |

Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`.

The critical deliverable is the functional test file (named for the project's language and test framework conventions). The Markdown protocols are documentation for humans and AI agents. The functional tests are the automated safety net.

## How to Use

Point this skill at any codebase:

```
Generate a quality playbook for this project.
```

```
Update the functional tests — the quality playbook already exists.
```

```
Run the spec audit protocol.
```

If a quality playbook already exists (`quality/QUALITY.md`, functional tests, etc.), read the existing files first, then evaluate them against the self-check benchmarks in the verification phase. Don't assume existing files are complete — treat them as a starting point.

---

## Phase 1: Explore the Codebase (Do Not Write Yet)

Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice.

**Why explore first?** The most common failure in AI-generated quality playbooks is producing generic content — coverage targets that could apply to any project, scenarios that describe theoretical failures, tests that exercise language builtins instead of project code. Exploration prevents this by forcing every output to reference something real: a specific function, a specific schema, a specific defensive code pattern. If you can't point to where something lives in the code, you're guessing — and guesses produce quality playbooks nobody trusts.

**Scaling for large codebases:** For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations). Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything.

### Step 0: Ask About Development History

Before exploring code, ask the user one question:

> "Do you have exported AI chat history from developing this project — Claude exports, Gemini takeouts, ChatGPT exports, Claude Code transcripts, or similar? If so, point me to the folder. The design discussions, incident reports, and quality decisions in those chats will make the generated quality playbook significantly better."

If the user provides a chat history folder:

1. **Scan for an index file first.** Look for files named `INDEX*`, `CONTEXT.md`, `README.md`, or similar navigation aids. If one exists, read it — it will tell you what's there and how to find things.
2. **Search for quality-relevant conversations.** Look for messages mentioning: quality, testing, coverage, bugs, failures, incidents, crashes, validation, retry, recovery, spec, fitness, audit, review. Also search for the project name.
3. **Extract design decisions and incident history.** The most valuable content is: (a) incident reports — what went wrong, how many records affected, how it was detected, (b) design discussions — why a particular approach was chosen, what alternatives were rejected, (c) quality framework discussions — coverage targets, testing philosophy, model review experiences, (d) cross-model feedback — where different AI models disagreed about the code.
4. **Don't try to read everything.** Chat histories can be enormous. Use the index to find the most relevant conversations, then search within those for quality-related content. 10 minutes of targeted searching beats 2 hours of exhaustive reading.
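The targeted search in step 2 can be sketched as a single grep over the export folder. `chat_exports/` is a hypothetical path, and the demo file exists only so this sketch runs standalone — point grep at the real export folder:

```shell
# Sketch: targeted search of an exported chat-history folder for
# quality-relevant conversations (step 2's keyword list).
mkdir -p chat_exports
printf 'Incident: we lost 1,693 records after a mid-write crash\n' > chat_exports/session1.md  # demo file only
grep -rniE 'quality|coverage|bug|incident|crash|retry|recovery|spec|audit' chat_exports/ | head -n 20
```

Skim the hits, then read only the surrounding conversations in full.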

This context is gold. A chat history where the developer discussed "why we chose this concurrency model" or "the time we lost 1,693 records in production" transforms generic scenarios into authoritative ones.

If the user doesn't have chat history, proceed normally — the skill works without it, just with less context.

### Step 1: Identify Domain, Stack, and Specifications

Read the README, existing documentation, and build config (`pyproject.toml` / `package.json` / `Cargo.toml`). Answer:

- What does this project do? (One sentence.)
- What language and key dependencies?
- What external systems does it talk to?
- What is the primary output?

**Find the specifications.** Specs are the source of truth for functional tests. Search in order: `AGENTS.md`/`CLAUDE.md` in root, `specs/`, `docs/`, `spec/`, `design/`, `architecture/`, `adr/`, then `.md` files in root. Record the paths.

**If no formal spec documents exist**, the skill still works — but you need to assemble requirements from other sources. In order of preference:

1. **Ask the user** — they often know the requirements even if they're not written down.
2. **README and inline documentation** — many projects embed requirements in their README, API docs, or code comments.
3. **Existing test suite** — tests are implicit specifications. If a test asserts `process(x) == y`, that's a requirement.
4. **Type signatures and validation rules** — schemas, type annotations, and validators define what the system accepts and rejects.
5. **Infer from code behavior** — as a last resort, read the code and infer what it's supposed to do. Mark these as *inferred requirements* in QUALITY.md and flag them for user confirmation.

When working from non-formal requirements, label each scenario and test with a **requirement tag** that includes a confidence tier and source:

- `[Req: formal — README §3]` — written by humans in a spec document. Authoritative.
- `[Req: user-confirmed — "must handle empty input"]` — stated by the user but not in a formal doc. Treat as authoritative.
- `[Req: inferred — from validate_input() behavior]` — deduced from code. Flag for user review.

Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation.
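In a functional test, the tag might appear like this — a minimal sketch where `validate_input` is a toy stand-in for a real project function:

```python
# Sketch: the requirement-tag format inside a functional test
# (validate_input is a hypothetical stand-in for project code).

def validate_input(record: dict) -> bool:
    """Toy stand-in: accepts records that carry a non-empty 'id'."""
    return bool(record.get("id"))

def test_rejects_record_without_id():
    """[Req: inferred — from validate_input() behavior] Records lacking an
    'id' field are rejected. Flagged for user review."""
    assert validate_input({"id": "r1"}) is True
    assert validate_input({}) is False
```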

### Step 2: Map the Architecture

List source directories and their purposes. Read the main entry point, trace execution flow. Identify:

- The 3–5 major subsystems
- The data flow (Input → Processing → Output)
- The most complex module
- The most fragile module

### Step 3: Read Existing Tests

Read the existing test files — all of them for small/medium projects, or a representative sample from each subsystem for large ones. Identify: test count, coverage patterns, gaps, and any coverage theater (tests that look good but don't catch real bugs).

**Critical: Record the import pattern.** How do existing tests import project modules? Every language has its own conventions (Python `sys.path` manipulation, Java/Scala package imports, TypeScript relative paths or aliases, Go package/module paths, Rust `use crate::` or `use myproject::`). You must use the exact same pattern in your functional tests — getting this wrong means every test fails with import/resolution errors. See `references/functional_tests.md` § "Import Pattern" for the full six-language matrix.
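For Python, the `sys.path` convention mentioned above often looks like the sketch below — but this is only one common variant, and `mypackage` is hypothetical; mirror whatever the project's existing tests actually do:

```python
# Sketch: one common Python import-pattern convention for tests that live
# in a subdirectory such as quality/ or tests/.
import sys
from pathlib import Path

# Make the repository root importable before any project imports run.
REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO_ROOT))

# from mypackage import pipeline   # hypothetical project import — resolves only in a real repo
```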

**Identify integration test runners.** Look for scripts or test files that exercise the system end-to-end against real external services (APIs, databases, etc.). Note their patterns — you'll need them for `RUN_INTEGRATION_TESTS.md`.

### Step 4: Read the Specifications

Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close.

If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 4.

### Step 4b: Read Function Signatures and Real Data

Before writing any test, you must know exactly how each function is called. For every module you identified in Step 2:

1. **Read the actual function signatures** — parameter names, types, defaults. Don't guess from usage context — read the function definition and any documentation (Python docstrings, Java/Scala Javadoc/ScalaDoc, TypeScript type annotations, Go godoc comments, Rust doc comments and type signatures).
2. **Read real data files** — If the project has items files, fixture files, config files, or sample data (in `pipelines/`, `fixtures/`, `test_data/`, `examples/`), read them. Your test fixtures must match the real data shape exactly.
3. **Read existing test fixtures** — How do existing tests create test data? Copy their patterns. If they build config dicts with specific keys, use those exact keys.
4. **Check library versions** — Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`/`build.gradle`, `go.mod`, `Cargo.toml`) to see what's actually available. Don't write tests that depend on library features that aren't installed. If a dependency might be missing, use the test framework's skip mechanism — see `references/functional_tests.md` § "Library version awareness" for framework-specific examples.

Record a **function call map**: for each function you plan to test, write down its name, module, parameters, and what it returns. This map prevents the most common test failure: calling functions with wrong arguments.
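One lightweight way to record the map is a plain data structure you keep alongside your notes. The module and function names below are hypothetical — fill in what exploration actually found:

```python
# Sketch: a function call map recorded as a dict (all names illustrative).
FUNCTION_CALL_MAP = {
    "load_state": {
        "module": "persistence",
        "params": ["path: str", "strict: bool = True"],
        "returns": "dict — parsed state; raises FileNotFoundError if missing",
    },
    "save_state": {
        "module": "persistence",
        "params": ["path: str", "state: dict"],
        "returns": "None — persists state to disk",
    },
}
```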

### Step 5: Find the Skeletons

This is the most important step. Search for defensive code patterns — each one is evidence of a past failure or known risk.

**Why this matters:** Developers don't write `try/except` blocks, null checks, or retry logic for fun. Every piece of defensive code exists because someone got burned. A `try/except` around a JSON parse means malformed JSON happened in production. A null check on a field means that field was missing when it shouldn't have been. These patterns are the codebase whispering its history of failures. Each one becomes a fitness-to-purpose scenario and a boundary test.

**Read `references/defensive_patterns.md`** for the systematic search approach, grep patterns, and how to convert findings into fitness-to-purpose scenarios and boundary tests.

Minimum bar: at least 2–3 defensive patterns per core source file. If you find fewer, you're skimming — read function bodies, not just signatures.
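A Python-flavored sketch of the grep step (see the reference file for the full matrix). `src/` and the demo file are placeholders so the sketch runs standalone — point it at the real source tree:

```shell
# Sketch: search for defensive-code patterns — each hit is a candidate
# fitness-to-purpose scenario or boundary test.
mkdir -p src
printf 'value = cfg.get("key")\ntry:\n    run()\nexcept TimeoutError:\n    retry()\n' > src/demo.py  # demo file only
grep -rn --include='*.py' -E 'try:|except |retry|\.get\(|is None|fallback' src/
```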

### Step 5a: Trace State Machines

If the project has any kind of state management — status fields, lifecycle phases, workflow stages, mode flags — trace the state machine completely. This catches a category of bugs that defensive pattern analysis alone misses: states that exist but aren't handled.

**How to find state machines:** Search for status/state fields in models, enums, or constants (e.g., `status`, `state`, `phase`, `mode`). Search for guards that check status before allowing actions (e.g., `if status == "running"`, `match self.state`). Search for state transitions (assignments to status fields).

**For each state machine you find:**

1. **Enumerate all possible states.** Read the enum, the constants, or grep for every value the field is assigned. List them all.
2. **For each consumer of state** (UI handlers, API endpoints, control flow guards), check: does it handle every possible state? A `switch`/`match` without a meaningful default, or an `if/elif` chain that doesn't cover all states, is a gap.
3. **For each state transition**, check: can you reach every state? Are there states you can enter but never leave? Are there states that block operations that should be available?
4. **Record gaps as findings.** A status guard that allows action X for "running" but not for "stuck" is a real bug if the user needs to perform action X on stuck processes. A process that enters a terminal state but never triggers cleanup is a real bug.

**Why this matters:** State machine gaps produce bugs that are invisible during normal operation but surface under stress or edge conditions — exactly when you need the system to work. A batch processor that can't be killed when it's in "stuck" status, a watcher that never self-terminates after all work completes, and a UI that refuses to resume a "pending" run are all symptoms of incomplete state handling. These bugs don't show up in defensive pattern analysis because the code isn't defending against them — it's simply not handling them at all.
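A minimal sketch of the kind of gap this step catches — the state names and `can_kill()` guard are hypothetical:

```python
# Sketch: a state-handling gap of the kind Step 5a hunts for.
STATES = {"pending", "running", "stuck", "done"}   # every value the field can take
HANDLED = {"pending", "running", "done"}           # states the guard below covers

def can_kill(status: str) -> bool:
    # Gap: "stuck" exists as a state, but this guard never allows killing it —
    # invisible in normal operation, painful exactly when a process is stuck.
    if status == "running":
        return True
    elif status in ("pending", "done"):
        return False
    return False  # "stuck" falls through silently; record this as a finding

# The enumeration check from point 2: which states does no consumer handle?
unhandled = STATES - HANDLED
```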

### Step 5b: Map Schema Types

If the project has a validation layer (Pydantic models in Python, JSON Schema, TypeScript interfaces/Zod schemas, Java Bean Validation annotations, Scala case class codecs), read the schema definitions now. For every field you found a defensive pattern for, record what the schema accepts vs. rejects.

**Read `references/schema_mapping.md`** for the mapping format and why this matters for writing valid boundary tests.
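A minimal sketch of the recorded mapping — field names, types, and ranges are hypothetical:

```python
# Sketch: what the schema accepts vs. rejects, for fields that had
# defensive patterns in Step 5.
SCHEMA_MAP = {
    "sentiment_score": {
        "accepts": "float in [0.0, 1.0]",
        "rejects": "None, strings, values outside the range",
    },
    "status": {
        "accepts": 'one of {"pending", "running", "done"}',
        "rejects": "any other string",
    },
}

def mutation_pool(field: str) -> str:
    """Boundary tests must draw mutations from the 'accepts' side only."""
    return SCHEMA_MAP[field]["accepts"]
```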

### Step 6: Identify Quality Risks (Code + Domain Knowledge)

Every project has a different failure profile. This step uses **two sources** — not just code exploration, but your training knowledge of what goes wrong in similar systems.

**From code exploration**, ask:
- What does "silently wrong" look like for this project?
- What external dependencies can change without warning?
- What looks simple but is actually complex?
- Where do cross-cutting concerns hide?

**From domain knowledge**, ask:
- "What goes wrong in systems like this?" — If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it's a web app, think about auth edge cases, race conditions, input validation bypasses. If it handles randomness or statistics, think about seeding, correlation, distribution bias.
- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted.
- "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure.
- "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission.
- "What information does the user need before committing to an irreversible or expensive operation?" — Pre-run cost estimates, confirmation of scope (especially when fan-out or expansion will multiply the work), resource warnings. If the system can silently commit the user to hours of processing or significant cost without showing them what they're about to do, that's a missing safeguard. Search for operations that start long-running processes, submit batch jobs, or trigger expansion/fan-out — and check whether the user sees a preview, estimate, or confirmation with real numbers before the point of no return.
- "What happens when a long-running process finishes — does it actually stop?" — Polling loops, watchers, background threads, and daemon processes that run until completion should have explicit termination conditions. If the loop checks "is there more work?" but never checks "is all work done?", it will run forever after completion. This is especially common in batch processors and queue consumers.
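The last question above — a loop that checks for more work but never for completion — can be sketched with a toy queue standing in for a real work source:

```python
# Sketch: a polling loop with an explicit "all work done?" termination check.
from collections import deque

def drain(queue: deque) -> int:
    """Process until the queue is empty — and actually stop then."""
    processed = 0
    while True:
        if not queue:  # the completion check that prevents a forever-running watcher
            break
        queue.popleft()
        processed += 1
    return processed
```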

Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **architectural vulnerability analyses** with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, `save_state()` without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern."
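The atomic-rename safeguard that scenario checks for can be sketched in Python — the `save_state` name comes from the scenario above; the path handling and JSON payload are illustrative:

```python
# Sketch: the temp file + atomic rename pattern, so a crash mid-write can
# never leave a truncated or corrupt state file.
import json
import os
import tempfile

def save_state(path: str, state: dict) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # never leave stray temp files behind
        raise
```

A reader either finds this pattern at the cited line or records the gap as a finding.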

---

## Phase 2: Generate the Quality Playbook

Now write the six files. For each one, follow the structure below and consult the relevant reference file for detailed guidance.

**Why six files instead of just tests?** Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three.

### File 1: `quality/QUALITY.md` — Quality Constitution

**Read `references/constitution.md`** for the full template and examples.

The constitution has six sections:

1. **Purpose** — What quality means for this project, grounded in Deming (built in, not inspected), Juran (fitness for use), Crosby (quality is free). Apply these specifically: what does "fitness for use" mean for *this system*? Not "tests pass" but the actual operational requirement.
2. **Coverage Targets** — Table mapping each subsystem to a target with rationale referencing real risks. Every target must have a "why" grounded in a specific scenario — without it, a future AI session will argue the target down.
3. **Coverage Theater Prevention** — Project-specific examples of fake tests, derived from what you saw during exploration. (Why: AI-generated tests often pad coverage numbers without catching real bugs — asserting that imports worked, that dicts have keys, or that mocks return what they were configured to return. Calling this out explicitly stops the pattern.)
4. **Fitness-to-Purpose Scenarios** — The heart of it. Each scenario documents a realistic failure mode with code references and verification method. Aim for 2+ scenarios per core module — typically 8–10 total for a medium project, fewer for small projects, more for complex ones. Quality matters more than count: a scenario that precisely captures a real architectural vulnerability is worth more than three generic ones. (Why: Coverage percentages tell you how much code ran, not whether it ran correctly. A system can have 95% coverage and still lose records silently. Fitness scenarios define what "working correctly" actually means in concrete terms that no one can argue down.)
5. **AI Session Quality Discipline** — Rules every AI session must follow
6. **The Human Gate** — Things requiring human judgment
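Section 3's theater pattern is easiest to show side by side — `parse_record` below is a toy stand-in for project code:

```python
# Sketch: coverage theater vs. a real spec-tied assertion.

def parse_record(raw: str) -> dict:
    key, _, value = raw.partition("=")
    return {key: value}

def test_parse_record_theater():
    # Theater: exercises project code but would pass even if parsing were wrong.
    result = parse_record("id=42")
    assert isinstance(result, dict)

def test_parse_record_value():
    # Real: asserts the value the spec requires.
    assert parse_record("id=42") == {"id": "42"}
```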

**Scenario voice is critical.** Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this.

Every scenario's "How to verify" must map to at least one test in the functional test file.

### File 2: Functional Tests

**This is the most important deliverable.** Read `references/functional_tests.md` for the complete guide.

Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses):

- **Spec requirements** — One test per testable spec section. Each test's documentation cites the spec requirement it verifies.
- **Fitness scenarios** — One test per QUALITY.md scenario. 1:1 mapping, named to match.
- **Boundaries and edge cases** — One test per defensive pattern from Step 5.

Key rules:
- **Match the existing import pattern exactly.** Read how existing tests import project modules and do the same thing. Getting this wrong means every test fails.
- **Read every function's signature before calling it.** Read the actual `def` line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures.
- **No placeholder tests.** Every test must import and call actual project code. If the body is `pass` or the assertion is trivial (`assert isinstance(x, list)`), delete it. A test that doesn't exercise project code inflates the count and creates false confidence.
- **Test count heuristic** = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns). For a medium project (5–15 source files), this typically yields 35–50 tests. Significantly fewer suggests missed requirements or shallow exploration. Significantly more is fine if every test is meaningful — don't pad to hit a number.
- **Cross-variant heuristic: ~30%** — If the project handles multiple input types, aim for roughly 30% of tests parametrized across all variants. The exact percentage matters less than ensuring every cross-cutting property is tested across all variants.
- **Test outcomes, not mechanisms** — Assert what the spec says should happen, not how the code implements it.
- **Use schema-valid mutations** — Boundary tests must use values the schema accepts (from Step 5b), not values it rejects.
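The last two rules combine naturally in one parametrized test. This is a sketch only — the variants, the `process()` function, and the 0.0–1.0 schema range are all hypothetical:

```python
# Sketch: a cross-variant test whose boundary mutations are schema-valid.
import pytest

VARIANTS = ["csv", "json", "xml"]  # hypothetical input variants

def process(variant: str, score: float) -> float:
    """Toy stand-in for project code; the schema says scores are floats in [0, 1]."""
    return min(max(score, 0.0), 1.0)

@pytest.mark.parametrize("variant", VARIANTS)
def test_score_boundaries_hold_across_variants(variant):
    # Mutations use values the schema ACCEPTS (Step 5b): the legal extremes,
    # not schema-invalid values like -5 or "high".
    assert process(variant, 0.0) == 0.0
    assert process(variant, 1.0) == 1.0
```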

### File 3: `quality/RUN_CODE_REVIEW.md`

**Read `references/review_protocols.md`** for the template.

Key sections: bootstrap files, focus areas mapped to architecture, and these mandatory guardrails:

- Line numbers are mandatory — no line number, no finding
- Read function bodies, not just signatures
- If unsure: flag as QUESTION, not BUG
- Grep before claiming missing
- Do NOT suggest style changes — only flag things that are incorrect

**Phase 2: Regression tests.** After the review produces BUG findings, write regression tests in `quality/test_regression.*` that reproduce each bug. Each test should fail on the current implementation, confirming the bug is real. Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See `references/review_protocols.md` for the full regression test protocol.

### File 4: `quality/RUN_INTEGRATION_TESTS.md`

**Read `references/review_protocols.md`** for the template.

Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries.

**All commands must use relative paths.** The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc.

**Include an Execution UX section.** When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (`✓`/`✗`/`⧗`), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful.

**This protocol must exercise real external dependencies.** If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them.

**Derive quality gates from the code, not generic checks.** Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness.

**Script parallelism, don't just describe it.** Group runs so independent executions (different providers) run concurrently. Include actual bash commands with `&` and `wait`. One run per provider at a time to avoid rate limits.
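A minimal bash sketch of the grouping. `run()` is a stand-in (`sleep` + `echo`) so this executes standalone — replace its body with the real command, e.g. a pipeline script that takes a provider argument:

```shell
# Sketch: independent provider runs grouped with & and wait.
run() { sleep 0.1; echo "done: $1"; }  # placeholder for the real run command

# One run per provider at a time; providers proceed concurrently.
run openai &
run anthropic &
run google &
wait
echo "all providers finished"
```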

**Calibrate unit counts to the project.** Read `chunk_size` or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing.

**Deep post-run verification.** Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run.

**Find and use existing verification tools.** Search for existing scripts that verify output quality (e.g., `integration_checks.py`, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., `--dump` flags) in the post-run checklist.

**Build a Field Reference Table before writing quality gates.** This is the most important step for protocol accuracy. AI models confidently write wrong field names even after reading schemas — `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The fix is procedural: **re-read each schema file IMMEDIATELY before writing each table row.** Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See `references/review_protocols.md` section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy.
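A sketch of the table shape (the schema path, `status` field, and ranges are hypothetical — every real row must be copied character-for-character from the schema file):

```
| Schema file         | Field           | Type  | Range / allowed values |
|---------------------|-----------------|-------|------------------------|
| schemas/document.py | document_id     | str   | non-empty              |
| schemas/document.py | sentiment_score | float | 0.0–1.0                |
| schemas/document.py | status          | str   | pending, running, done |
```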

### File 5: `quality/RUN_SPEC_AUDIT.md` — Council of Three

**Read `references/spec_audit.md`** for the full protocol.

Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses.

The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts).

### File 6: `AGENTS.md`

If `AGENTS.md` already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files.

If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers.

---

## Phase 3: Verify

**Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else.

### Self-Check Benchmarks

Before declaring done, check every benchmark. **Read `references/verification.md`** for the complete checklist.

The critical checks:

1. **Test count** near heuristic target (spec sections + scenarios + defensive patterns)
2. **Scenario coverage** — scenario test count matches QUALITY.md scenario count
3. **Cross-variant coverage** — ~30% of tests parametrize across all input variants
4. **Boundary test count** ≈ defensive pattern count from Step 5
5. **Assertion depth** — the majority of assertions check values, not just presence
6. **Layer correctness** — Tests assert outcomes (what spec says), not mechanisms (how code implements)
7. **Mutation validity** — Every fixture mutation uses a schema-valid value from Step 5b
8. **All tests pass — zero failures AND zero errors.** Run the test suite using the project's test runner (Python: `pytest -v`, Scala: `sbt test`, Java: `mvn test`/`gradle test`, TypeScript: `npx jest`, Go: `go test -v`, Rust: `cargo test`) and check the summary. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. If you see setup errors, you forgot to create the fixture/setup file or referenced undefined test helpers.
9. **Existing tests unbroken** — The new files didn't break anything.
10. **Integration test quality gates were written from a Field Reference Table.** Verify that you built a Field Reference Table by re-reading each schema file before writing quality gates, and that every field name in the quality gates is copied from that table — not from memory. If you skipped the table, go back and build it now.

If any benchmark fails, go back and fix it before proceeding.
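Benchmark 8 is easy to get wrong by glancing at the summary. A minimal sketch of a summary check that treats errors as failures; it parses text only and assumes pytest's wording ("N failed", "N errors"), so adapt the patterns for other runners.

```python
import re

def suite_is_green(summary_line: str) -> bool:
    """True only when the run reports zero failures AND zero errors.
    Setup errors (missing fixtures, failed imports) count as broken tests."""
    failed = re.search(r"(\d+) failed", summary_line)
    errors = re.search(r"(\d+) error", summary_line)
    return all(m is None or int(m.group(1)) == 0 for m in (failed, errors))
```

"47 passed, 16 errors" is not green, even though it reports zero failures.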

---

## Phase 4: Present, Explore, Improve (Interactive)

After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths.

**Do not skip this phase.** The autonomous output from Phases 1-3 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads.

### Part 1: The Summary Table

Present a single table the user can scan in 10 seconds:

```
Here's what I generated:

| File | What It Does | Key Metric | Confidence |
|------|-------------|------------|------------|
| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ Medium — grounded in code, but scenarios are inferred, not from real incidents |
| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant |
| RUN_CODE_REVIEW.md | Code review protocol | 8 focus areas | ████████ High — derived from architecture |
| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning |
| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included |
| AGENTS.md | AI session bootstrap | Updated | ████████ High — factual |
```

Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention.

**Confidence levels:**
- **High** — Derived directly from code, specs, or schemas. Unlikely to need revision.
- **Medium** — Reasonable inference, but could be wrong. Benefits from user input.
- **Low** — Best guess. Definitely needs user input to be useful.

After the table, add a "Quick Start" block with ready-to-copy prompts for executing each artifact:

```
To use these artifacts, start a new AI session and try one of these prompts:

• Run a code review:
  "Read quality/RUN_CODE_REVIEW.md and follow its instructions to review [module or file]."

• Run the functional tests:
  "[test runner command, e.g. pytest quality/ -v, mvn test -Dtest=FunctionalTest, etc.]"

• Run the integration tests:
  "Read quality/RUN_INTEGRATION_TESTS.md and follow its instructions."

• Start a spec audit (Council of Three):
  "Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]."
```

Adapt the test runner command and module names to the actual project. The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type.

After the Quick Start block, add one line:

> "You can ask me about any of these to see the details — for example, 'show me Scenario 3' or 'walk me through the integration test matrix.'"

### Part 2: Drill-Down on Demand

When the user asks about a specific item, give a focused summary — not the whole file, but the key decisions and what you're uncertain about. Examples:

- **"Tell me about Scenario 4"** → Show the scenario text, explain where it came from (which defensive pattern or domain knowledge), and flag what you inferred vs. what you know.
- **"Show me the integration test matrix"** → Show the run groups, explain the parallelism strategy, and note which quality gates you derived from schemas vs. guessed at.
- **"How do the functional tests work?"** → Show the three test groups, explain the mapping to specs and scenarios, and highlight any tests you're least confident about.

The user may go through several drill-downs before they're ready to improve anything. That's fine — let them explore at their own pace.

### Part 3: The Improvement Menu

After the user has seen the summary (and optionally drilled into details), present the improvement options:

> "Three ways to make this better:"
>
> **1. Review and harden individual items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases.
>
> **2. Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative.
>
> **3. Review development history** — Point me to exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) and I'll mine it for design decisions, incident reports, and quality discussions that should be in QUALITY.md. Good for: grounding scenarios in real project history instead of inference.
>
> "You can do any combination of these, in any order. Which would you like to start with?"

### Executing Each Improvement Path

**Path 1: Review and harden.** The user picks an item. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change.

**Path 2: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps:

- **Incident history for scenarios.** "I found [specific defensive code]. What failure caused this? How many records were affected?"
- **Quality gate thresholds.** "I'm checking that [field] contains [values]. What distribution is normal? What signals a problem?"
- **Integration test scale and cost.** "The protocol runs [N] tests costing roughly $[X]. Should I increase or decrease coverage?"
- **Test scope.** "I generated [N] functional tests. Your existing suite covers [other areas]. Are there gaps?"
- **Model preferences for spec audit.** "Which AI models do you use? Have you noticed specific strengths?"

After the user answers, revise the generated files and re-run tests.

**Path 3: Review development history.** If the user provides a chat history folder:

1. Scan for index files and navigate to quality-relevant conversations (same approach as Step 0, but now with specific targets — you know which scenarios need grounding, which quality gates need thresholds, which design decisions need rationale).
2. Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results.
3. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Add Council of Three empirical data if audit results exist.
4. Re-run tests after revisions.

If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations or ask you to dig deeper into a particular topic.

### Iteration

The user can cycle through these paths as many times as they want. Each pass makes the quality playbook more grounded. When they're satisfied, they'll move on naturally — there's no explicit "done" step.

---

## Fixture Strategy

The `quality/` folder is separate from the project's unit test folder. Create the appropriate test setup for the project's language:

- **Python:** `quality/conftest.py` for pytest fixtures. If the existing tests define fixtures inline (common with pytest's `tmp_path` pattern), prefer that over shared fixtures.
- **Java:** A test class with `@BeforeEach`/`@BeforeAll` setup methods, or a shared test utility class.
- **Scala:** A trait mixed into test specs (e.g., `trait FunctionalTestFixtures`), or inline data builders.
- **TypeScript/JavaScript:** A `quality/setup.ts` with `beforeAll`/`beforeEach` hooks, or inline test factories.
- **Go:** Helper functions in the same `_test.go` file or a shared `testutil_test.go`. Use `t.Helper()` for test helpers. Go convention prefers inline test setup over shared fixtures.
- **Rust:** Helper functions in a `#[cfg(test)] mod tests` block, or a shared `test_utils.rs` module. Use builder patterns for test data.

Examine existing test files to understand how they set up test data. Whatever pattern the existing tests use, copy it. Study existing fixture patterns for realistic data shapes.
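For the Python case, a minimal sketch of what `quality/conftest.py` might hold. The record shape (`document_id`, `sentiment_score`) is hypothetical; copy the data shapes the project's existing tests actually build.

```python
import pytest

def make_record(doc_id: str = "doc-0001", score: float = 0.82) -> dict:
    """Schema-valid base record; tests mutate copies to hit boundary cases."""
    return {"document_id": doc_id, "sentiment_score": score}

@pytest.fixture
def record_batch() -> list[dict]:
    """A small batch shaped like real pipeline input."""
    return [make_record(doc_id=f"doc-{i:04d}") for i in range(5)]
```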

---

## Terminology

- **Functional testing** — Does the code produce the output specs say it should? Distinct from unit testing (individual functions in isolation).
- **Integration testing** — Do components work together end-to-end, including real external services?
- **Spec audit** — AI models read code and compare against specs. No code executed. Catches where code doesn't match documentation.
- **Coverage theater** — Tests that produce high coverage numbers but don't catch real bugs. Example: asserting a function didn't throw without checking its output.
- **Fitness-to-purpose** — Does the code do what it's supposed to do under real-world conditions? A system can have 95% coverage and still lose records silently.
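The coverage-theater distinction can be made concrete with a toy example, using a hypothetical `normalize()` function:

```python
def normalize(s: str) -> str:
    return s.strip().lower()

def test_coverage_theater():
    normalize("  Hello ")  # executes the line (counts as covered), verifies nothing

def test_fitness():
    assert normalize("  Hello ") == "hello"  # fails if the behavior regresses
```

Both tests produce identical coverage numbers; only the second catches a regression.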

---

## Principles

1. Fitness-to-purpose over coverage percentages
2. Scenarios come from code exploration AND domain knowledge
3. Concrete failure modes make standards non-negotiable — abstract requirements invite rationalization
4. Guardrails transform AI review quality (line numbers, read bodies, grep before claiming)
5. Triage before fixing — many "defects" are spec bugs or design decisions

---

## Reference Files

Read these as you work through each phase:

| File | When to Read | Contains |
|------|-------------|----------|
| `references/defensive_patterns.md` | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios |
| `references/schema_mapping.md` | Step 5b (schema types) | Field mapping format, mutation validity rules |
| `references/constitution.md` | File 1 (QUALITY.md) | Full template with section-by-section guidance |
| `references/functional_tests.md` | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy |
| `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols |
| `references/spec_audit.md` | File 5 (Council of Three) | Full audit protocol, triage process, fix execution |
| `references/verification.md` | Phase 3 (verify) | Complete self-check checklist with all 13 benchmarks |
references/
constitution.md 9.4 KB
# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown
# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files
  and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass"
  but the actual real-world requirement. Example: "generates correct output that survives
  input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than
  debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures.
If you can't explain why a subsystem needs high coverage with a concrete example,
the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:
- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting mock returns what the mock was configured to return
- Calling a function and only asserting no exception was thrown

[Add project-specific examples based on what you learned during exploration.
For a data pipeline: "counting output records without checking their values."
For a web app: "checking HTTP 200 without checking the response body."
For a compiler: "checking output compiles without checking behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision.
Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure.
Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed.
Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]
- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```

## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed?
2. **Normalization functions** — Every function that cleans input exists because raw input caused problems
3. **Configuration that could be hardcoded** — If a value is read from config instead of hardcoded, someone learned the value varies
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing
5. **Comments explaining "why"** — "We use hash(id) not sequential index because..." → Scenario about correctness under that constraint

### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:

- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.

### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

### Combining Both Sources

The strongest scenarios combine a defensive pattern found in code with domain knowledge about why it matters:

1. Find the defensive code: `save_state()` writes to a temp file then renames
2. Ask what failure this prevents: mid-write crash leaves corrupted state file
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify temp file + rename pattern"
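A sketch of the pattern being verified, assuming Python file I/O; `save_state` here is a stand-in for the project's real persistence function:

```python
import json
import os
import tempfile

def save_state(state: dict, path: str) -> None:
    """Atomic write: dump to a temp file in the same directory, then rename.
    A crash mid-write leaves the old file intact, never a half-written one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic swap: readers see old state or new, never partial
```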

### The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"
Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.

## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.
defensive_patterns.md 8.6 KB
# Finding Defensive Patterns (Step 5)

Defensive code patterns are evidence of past failures or known risks. Every null guard, try/catch, normalization function, and sentinel check exists because something went wrong — or because someone anticipated it would. Your job is to find these patterns systematically and convert them into fitness-to-purpose scenarios and boundary tests.

## Systematic Search

Don't skim — grep the codebase methodically. The exact patterns depend on the project's language. Here are common defensive-code indicators grouped by what they protect against:

**Null/nil guards:**

| Language | Grep pattern |
|---|---|
| Python | `None`, `is None`, `is not None` |
| Java | `null`, `Optional`, `Objects.requireNonNull` |
| Scala | `Option`, `None`, `.getOrElse`, `.isEmpty` |
| TypeScript | `undefined`, `null`, `??`, `?.` |
| Go | `== nil`, `!= nil`, `if err != nil` |
| Rust | `Option`, `unwrap`, `.is_none()`, `?` |

**Exception/error handling:**

| Language | Grep pattern |
|---|---|
| Python | `except`, `try:`, `raise` |
| Java | `catch`, `throws`, `try {` |
| Scala | `Try`, `catch`, `recover`, `Failure` |
| TypeScript | `catch`, `throw`, `.catch(` |
| Go | `if err != nil`, `errors.New`, `fmt.Errorf` |
| Rust | `Result`, `Err(`, `unwrap_or`, `match` |

**Internal/private helpers (often defensive):**

| Language | Grep pattern |
|---|---|
| Python | `def _`, `__` |
| Java/Scala | `private`, `protected` |
| TypeScript | `private`, `#` (private fields) |
| Go | lowercase function names (unexported) |
| Rust | `pub(crate)`, non-`pub` functions |

**Sentinel values, fallbacks, boundary checks:** Search for `== 0`, `< 0`, `default`, `fallback`, `else`, `match`, `switch` — these are language-agnostic.

## What to Look For Beyond Grep

- **Bugs that were fixed** — Git history, TODO comments, workarounds, defensive code that checks for things that "shouldn't happen"
- **Design decisions** — Comments explaining "why" not just "what." Configuration that could have been hardcoded but isn't. Abstractions that exist for a reason.
- **External data quirks** — Any place the code normalizes, validates, or rejects input from an external system
- **Parsing functions** — Every parser (regex, string splitting, format detection) has failure modes. What happens with malformed input? Empty input? Unexpected types?
- **Boundary conditions** — Zero values, empty strings, maximum ranges, first/last elements, type boundaries

## Converting Findings to Scenarios

For each defensive pattern, ask: "What failure does this prevent? What input would trigger this code path?"

The answer becomes a fitness-to-purpose scenario:

```markdown
### Scenario N: [Memorable Name]

**Requirement tag:** [Req: inferred — from function_name() behavior] *(use the canonical `[Req: tier — source]` format from SKILL.md Phase 1, Step 1)*

**What happened:** [The failure mode this code prevents. Reference the actual function, file, and line. Frame as a vulnerability analysis, not a fabricated incident.]

**The requirement:** [What the code must do to prevent this failure.]

**How to verify:** [A concrete test that would fail if this regressed.]
```

## Converting Findings to Boundary Tests

Each defensive pattern also maps to a boundary test:

```python
# Python (pytest)
def test_defensive_pattern_name(fixture):
    """[Req: inferred — from function_name() guard] guards against X."""
    # Mutate fixture to trigger the defensive code path
    # Assert the system handles it gracefully
```

```java
// Java (JUnit 5)
@Test
@DisplayName("[Req: inferred — from methodName() guard] guards against X")
void testDefensivePatternName() {
    fixture.setField(null);  // Trigger defensive code path
    var result = process(fixture);
    assertNotNull(result);  // Assert graceful handling
}
```

```scala
// Scala (ScalaTest)
// [Req: inferred — from methodName() guard]
"defensive pattern: methodName()" should "guard against X" in {
  val input = fixture.copy(field = None)  // Trigger defensive code path
  val result = process(input)
  result shouldBe defined  // Assert graceful handling
}
```

```typescript
// TypeScript (Jest)
test('[Req: inferred — from functionName() guard] guards against X', () => {
    const input = { ...fixture, field: null };  // Trigger defensive code path
    const result = process(input);
    expect(result).toBeDefined();  // Assert graceful handling
});
```

```go
// Go (testing)
func TestDefensivePatternName(t *testing.T) {
    // [Req: inferred — from FunctionName() guard] guards against X
    fixture.Field = nil  // Trigger defensive code path
    result, err := Process(fixture)
    if err != nil {
        t.Fatalf("expected graceful handling, got error: %v", err)
    }
    // Assert the system handled it
}
```

```rust
// Rust (cargo test)
#[test]
fn test_defensive_pattern_name() {
    // [Req: inferred — from function_name() guard] guards against X
    let input = Fixture { field: None, ..default_fixture() };
    let result = process(&input);
    assert!(result.is_ok(), "expected graceful handling");
}
```

## State Machine Patterns

State machines are a special category of defensive pattern. When you find status fields, lifecycle phases, or mode flags, trace the full state machine — see SKILL.md Step 5a for the complete process.

**How to find state machines:**

| Language | Grep pattern |
|---|---|
| Python | `status`, `state`, `phase`, `mode`, `== "running"`, `== "pending"` |
| Java | `enum.*Status`, `enum.*State`, `.getStatus()`, `switch.*status` |
| Scala | `sealed trait.*State`, `case object`, `status match` |
| TypeScript | `status:`, `state:`, `Status =`, `switch.*status` |
| Go | `Status`, `State`, `type.*Phase`, `switch.*status` |
| Rust | `enum.*State`, `enum.*Status`, `match.*state` |

**For each state machine found:**

1. List every possible state value (read the enum or grep for assignments)
2. For each handler/consumer that checks state, verify it handles ALL states
3. Look for states you can enter but never leave (terminal state without cleanup)
4. Look for operations that should be available in a state but are blocked by an incomplete guard
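Once the states are enumerated, the per-handler check can be scripted. A sketch with hypothetical state values and handlers; read both from the actual code, and triage the output, since some gaps are deliberate:

```python
ALL_STATES = {"pending", "running", "failed", "done"}

# States each handler's guard accepts -- copied from the code, not from memory.
HANDLED_BY = {
    "retry_handler":  {"failed"},
    "export_handler": {"done"},
    "status_report":  {"pending", "running", "done"},  # misses "failed"
}

def unhandled_states(handlers: dict) -> dict:
    """Map each handler to the states it silently ignores."""
    return {name: sorted(ALL_STATES - seen)
            for name, seen in handlers.items()
            if ALL_STATES - seen}
```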

**Converting state machine gaps to scenarios:**

```markdown
### Scenario N: [Status] blocks [operation]

**Requirement tag:** [Req: inferred — from handler() status guard]

**What happened:** The [handler] only allows [operation] when status is "[allowed_states]", but the system can enter "[missing_state]" status (e.g., due to [condition]). When this happens, the user cannot [operation] and has no workaround through the interface.

**The requirement:** [operation] must be available in all states where the user would reasonably need it, including [missing_state].

**How to verify:** Set up a [entity] in "[missing_state]" status. Attempt [operation]. Assert it succeeds or provides a clear error with a workaround.
```

## Missing Safeguard Patterns

Search for operations that commit the user to expensive, irreversible, or long-running work without adequate preview or confirmation:

| Pattern | What to look for |
|---|---|
| Pre-commit information gap | Operations that start batch jobs, fan-out expansions, or API calls without showing estimated cost, scope, or duration |
| Silent expansion | Fan-out or multiplication steps where the final work count isn't known until runtime, with no warning shown |
| No termination condition | Polling loops, watchers, or daemon processes that check for new work but never check whether all work is done |
| Retry without backoff | Error handling that retries immediately or on a fixed interval without exponential backoff, risking rate limit floods |
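As an illustration of closing the pre-commit information gap, a sketch of a fan-out preview; the function name, message wording, and cost figure are all hypothetical:

```python
def preview_fanout(n_inputs: int, units_per_input: int,
                   cost_per_unit: float) -> str:
    """Show scope and estimated cost before the point of no return."""
    total = n_inputs * units_per_input
    return (f"This run expands {n_inputs} inputs into {total} work units "
            f"(~${total * cost_per_unit:.2f}). Proceed? [y/N]")
```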

**Converting missing safeguards to scenarios:**

```markdown
### Scenario N: No [safeguard] before [operation]

**Requirement tag:** [Req: inferred — from init_run()/start_watch() behavior]

**What happened:** [Operation] commits the user to [consequence] without showing [missing information]. In practice, a [example] fanned out from [small number] to [large number] units with no warning, resulting in [cost/time consequence].

**The requirement:** Before committing to [operation], display [safeguard] showing [what the user needs to see].

**How to verify:** Initiate [operation] and assert that [safeguard information] is displayed before the point of no return.
```

## Minimum Bar

You should find at least 2–3 defensive patterns per source file in the core logic modules. If you find fewer, read function bodies more carefully — not just signatures and comments.

For a medium-sized project (5–15 source files), expect to find 15–30 defensive patterns total. Each one should produce at least one boundary test. Additionally, trace at least one state machine if the project has status/state fields, and check at least one long-running operation for missing safeguards.
functional_tests.md 22.1 KB
# Writing Functional Tests

This is the most important deliverable. The Markdown files are documentation. The functional test file is the automated safety net. Name it using the project's conventions: `test_functional.py` (Python/pytest), `FunctionalSpec.scala` (Scala/ScalaTest), `FunctionalTest.java` (Java/JUnit), `functional.test.ts` (TypeScript/Jest), `functional_test.go` (Go), etc.

## Structure: Three Test Groups

Organize tests into three logical groups using whatever structure the test framework provides — classes (Python/Java), describe blocks (TypeScript/Jest), traits (Scala), or subtests (Go):

```
Spec Requirements
    — One test per testable spec section
    — Each test's documentation cites the spec requirement

Fitness Scenarios
    — One test per QUALITY.md scenario (1:1 mapping)
    — Named to match: test_scenario_N_memorable_name (or equivalent convention)

Boundaries and Edge Cases
    — One test per defensive pattern from Step 5
    — Targets null guards, try/catch, normalization, fallbacks
```

## Test Count Heuristic

**Target = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns from Step 5)**

Example: 12 spec sections + 10 scenarios + 15 defensive patterns = 37 tests as a target.

For a medium-sized project (5–15 source files), this typically yields 35–50 functional tests. Significantly fewer suggests missed requirements or shallow exploration. Don't pad to hit a number — every test should exercise real project code and verify a meaningful property.

## Import Pattern: Match the Existing Tests

Before writing any test code, read 2–3 existing test files and identify how they import project modules. This is critical — projects handle imports differently and getting it wrong means every test fails with resolution errors.

Common patterns by language:

**Python:**
- `sys.path.insert(0, "src/")` then bare imports (`from module import func`)
- Package imports (`from myproject.module import func`)
- Relative imports with conftest.py path manipulation

**Java:**
- `import com.example.project.Module;` matching the package structure
- Test source root must mirror main source root

**Scala:**
- `import com.example.project._` or `import com.example.project.{ClassA, ClassB}`
- SBT project layout: `src/test/scala/` mirrors `src/main/scala/`

**TypeScript/JavaScript:**
- `import { func } from '../src/module'` with relative paths
- Path aliases from `tsconfig.json` (e.g., `@/module`)

**Go:**
- Same package: test files in the same directory with `package mypackage`
- Black-box testing: `package mypackage_test` with explicit imports
- Internal packages may require specific import paths

**Rust:**
- `use crate::module::function;` for unit tests in the same crate
- `use myproject::module::function;` for integration tests in `tests/`

Whatever pattern the existing tests use, copy it exactly. Do not guess or invent a different pattern.

## Create Test Setup BEFORE Writing Tests

Every test framework has a mechanism for shared setup. If your tests use shared fixtures or test data, you MUST create the setup file before writing tests. Test frameworks do not auto-discover fixtures from other directories.

**By language:**

**Python (pytest):** Create `quality/conftest.py` defining every fixture. Fixtures in `tests/conftest.py` are NOT available to `quality/test_functional.py`. Preferred: write tests that create data inline using `tmp_path` to eliminate conftest dependency.

**Java (JUnit):** Use `@BeforeEach`/`@BeforeAll` methods in the test class, or create a shared `TestFixtures` utility class in the same package.

**Scala (ScalaTest):** Mix in a trait with `before`/`after` blocks, or use inline data builders. If using SBT, ensure the test file is in the correct source tree.

**TypeScript (Jest):** Use `beforeAll`/`beforeEach` in the test file, or create a `quality/testUtils.ts` with factory functions.

**Go (testing):** Helper functions in the same `_test.go` file with `t.Helper()`. Use `t.TempDir()` for temporary directories. Go convention strongly prefers inline setup — avoid shared test state.

**Rust (cargo test):** Helper functions in a `#[cfg(test)] mod tests` block or a `test_utils.rs` module. Use builder patterns for constructing test data. For integration tests, place files in `tests/`.

**Rule: Every fixture or test helper referenced must be defined.** If a test depends on shared setup that doesn't exist, the test errors during setup rather than failing during assertion, leaving a broken test that never exercises the code it claims to cover.

**Preferred approach across all languages:** Write tests that create their own data inline. This eliminates cross-file dependencies:

```python
# Python
def test_config_validation(tmp_path):
    config = {"pipeline": {"name": "Test", "steps": [...]}}
```

```java
// Java
@Test
void testConfigValidation(@TempDir Path tempDir) {
    var config = Map.of("pipeline", Map.of("name", "Test"));
}
```

```typescript
// TypeScript
test('config validation', () => {
    const config = { pipeline: { name: 'Test', steps: [] } };
});
```

```go
// Go
func TestConfigValidation(t *testing.T) {
    tmpDir := t.TempDir()
    config := Config{Pipeline: Pipeline{Name: "Test"}}
}
```

```rust
// Rust
#[test]
fn test_config_validation() {
    let config = Config { pipeline: Pipeline { name: "Test".into() } };
}
```

**After writing all tests, run the test suite and check for setup errors.** Setup errors (fixture not found, import failures) count as broken tests regardless of how the framework categorizes them.

## No Placeholder Tests

Every test must import and call actual project code. If a test body is `pass`, or its only assertion is `assert isinstance(errors, list)`, or it checks a trivial property like `assert hasattr(cls, 'validate')`, replace it with a real test or drop it entirely. A test that doesn't exercise project code is worse than no test: it inflates the count and creates false confidence.

If you genuinely cannot write a meaningful test for a defensive pattern (e.g., it requires a running server or external service), note it as untestable in a comment rather than writing a placeholder.
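
For contrast, a sketch of the difference. The `validate_config` function here is hypothetical and inlined so the example stands alone; in a real suite it would be imported from the project:

```python
# Hypothetical validate_config, inlined for the sketch; in a real suite
# this would be imported from the project under test.
def validate_config(config):
    errors = []
    if not config.get("name"):
        errors.append("name is required")
    return errors

# WRONG: a placeholder that passes without proving any behavior
def test_validate_returns_list():
    assert isinstance(validate_config({}), list)

# RIGHT: exercises real behavior and asserts a specific property
def test_missing_name_is_reported():
    assert "name is required" in validate_config({"steps": []})

def test_valid_config_passes():
    assert validate_config({"name": "Test"}) == []
```

The placeholder passes even if validation is completely broken; the real tests fail the moment the required-name rule stops working.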

## Read Before You Write: The Function Call Map

Before writing a single test, build a function call map. For every function you plan to test:

1. **Read the function/method signature** — not just the name, but every parameter, its type, and default value. In Python, read the `def` line and type hints. In Java, read the method signature and generics. In Scala, read the method definition and implicit parameters. In TypeScript, read the type annotations.
2. **Read the documentation** — docstrings, Javadoc, TSDoc, ScalaDoc. They often specify return types, exceptions, and edge case behavior.
3. **Read one existing test that calls it** — existing tests show you the exact calling convention, fixture shape, and assertion pattern.
4. **Read real data files** — if the function processes configs, schemas, or data files, read an actual file from the project. Your test fixtures must match this shape exactly.

**Common failure pattern:** The agent explores the architecture, understands conceptually what a function does, then writes a test call with guessed parameters. The test fails because the real function takes `(config, items_data, limit)` not `(items, seed, strategy)`. Reading the actual signature takes 5 seconds and prevents this entirely.

**Library version awareness:** Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`, `build.gradle`, `Cargo.toml`) to verify what's available. Use the test framework's skip mechanism for optional dependencies: Python `pytest.importorskip()`, JUnit `Assumptions.assumeTrue()`, ScalaTest `assume()`, Jest conditional `describe.skip`, Go `t.Skip()`, Rust `#[ignore]` with a comment explaining the prerequisite.
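
As a framework-neutral variant of those skip mechanisms, you can probe for an optional dependency with the standard library alone. This is a sketch, not the pytest-native form (`pytest.importorskip()` remains the idiomatic choice in pytest suites); the `yaml` module and test name in the comment are illustrative:

```python
import importlib.util

def optional_dependency_available(module_name):
    """Return True if an optional dependency can be imported.

    importlib.util.find_spec checks availability without importing the
    module, so the skip decision stays explicit and framework-agnostic.
    """
    return importlib.util.find_spec(module_name) is not None

# Hypothetical usage with pytest's skipif marker:
# @pytest.mark.skipif(not optional_dependency_available("yaml"),
#                     reason="PyYAML not installed")
# def test_yaml_config_loading(): ...
```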

## Writing Spec-Derived Tests

Walk each spec document section by section. For each section, ask: "What testable requirement does this state?" Then write a test.

Each test should:
1. **Set up** — Load a fixture, create test data, configure the system
2. **Execute** — Call the function, run the pipeline, make the request
3. **Assert specific properties** the spec requires

```python
# Python (pytest)
class TestSpecRequirements:
    def test_requirement_from_spec_section_N(self, fixture):
        """[Req: formal — Design Doc §N] X should produce Y."""
        result = process(fixture)
        assert result.property == expected_value
```

```java
// Java (JUnit 5)
class SpecRequirementsTest {
    @Test
    @DisplayName("[Req: formal — Design Doc §N] X should produce Y")
    void testRequirementFromSpecSectionN() {
        var result = process(fixture);
        assertEquals(expectedValue, result.getProperty());
    }
}
```

```scala
// Scala (ScalaTest)
class SpecRequirements extends FlatSpec with Matchers {
  // [Req: formal — Design Doc §N] X should produce Y
  "Section N requirement" should "produce Y from X" in {
    val result = process(fixture)
    result.property should equal (expectedValue)
  }
}
```

```typescript
// TypeScript (Jest)
describe('Spec Requirements', () => {
  test('[Req: formal — Design Doc §N] X should produce Y', () => {
    const result = process(fixture);
    expect(result.property).toBe(expectedValue);
  });
});
```

```go
// Go (testing)
func TestSpecRequirement_SectionN_XProducesY(t *testing.T) {
    // [Req: formal — Design Doc §N] X should produce Y
    result := Process(fixture)
    if result.Property != expectedValue {
        t.Errorf("expected %v, got %v", expectedValue, result.Property)
    }
}
```

```rust
// Rust (cargo test)
#[test]
fn test_spec_requirement_section_n_x_produces_y() {
    // [Req: formal — Design Doc §N] X should produce Y
    let result = process(&fixture);
    assert_eq!(result.property, expected_value);
}
```

## What Makes a Good Functional Test

- **Traceable** — Test name, display name, or documentation comment says which spec requirement it verifies
- **Specific** — Checks a specific property, not just "something happened"
- **Robust** — Uses real data (fixtures from the actual system), not synthetic data
- **Cross-variant** — If the project handles multiple input types, test all of them
- **Tests at the right layer** — Test the *behavior* you care about. If the requirement is "invalid data doesn't produce wrong output," test the pipeline output — don't just test that the schema validator rejects the input.

## Cross-Variant Testing Strategy

If the project handles multiple input types, cross-variant coverage is where silent bugs hide. Aim for roughly 30% of tests exercising all variants — the exact percentage matters less than ensuring every cross-cutting property is tested across all variants.

Use your framework's parametrization mechanism:

```python
# Python (pytest)
@pytest.mark.parametrize("variant", [variant_a, variant_b, variant_c])
def test_feature_works(variant):
    output = process(variant.input)
    assert output.has_expected_property
```

```java
// Java (JUnit 5)
@ParameterizedTest
@MethodSource("variantProvider")
void testFeatureWorks(Variant variant) {
    var output = process(variant.getInput());
    assertTrue(output.hasExpectedProperty());
}
```

```scala
// Scala (ScalaTest)
Seq(variantA, variantB, variantC).foreach { variant =>
  it should s"work for ${variant.name}" in {
    val output = process(variant.input)
    output should have ('expectedProperty (true))
  }
}
```

```typescript
// TypeScript (Jest)
test.each([variantA, variantB, variantC])(
  'feature works for %s', (variant) => {
    const output = process(variant.input);
    expect(output).toHaveProperty('expectedProperty');
});
```

```go
// Go (testing) — table-driven tests
func TestFeatureWorksAcrossVariants(t *testing.T) {
    variants := []Variant{variantA, variantB, variantC}
    for _, v := range variants {
        t.Run(v.Name, func(t *testing.T) {
            output := Process(v.Input)
            if !output.HasExpectedProperty() {
                t.Errorf("variant %s: missing expected property", v.Name)
            }
        })
    }
}
```

```rust
// Rust (cargo test) — iterate over cases
#[test]
fn test_feature_works_across_variants() {
    let variants = [variant_a(), variant_b(), variant_c()];
    for v in &variants {
        let output = process(&v.input);
        assert!(output.has_expected_property(),
            "variant {}: missing expected property", v.name);
    }
}
```

If parametrization doesn't fit, loop explicitly within a single test.

**Which tests should be cross-variant?** Any test verifying a property that *should* hold regardless of input type: entity identity, structural properties, required links, temporal fields, domain-specific semantics.

**After writing all tests, do a cross-variant audit.** Count cross-variant tests divided by total. If below 30%, convert more.

## Anti-Patterns to Avoid

These patterns look like tests but don't catch real bugs:

- **Existence-only checks** — Finding one correct result doesn't mean all are correct. Also check count or verify comprehensively.
- **Presence-only assertions** — Asserting a value exists only proves presence, not correctness. Assert the actual value.
- **Single-variant testing** — Testing one input type and hoping others work. Use parametrization.
- **Positive-only testing** — You must test that invalid input does NOT produce bad output.
- **Incomplete negative assertions** — When testing rejection, assert ALL consequences are absent, not just one.
- **Catching exceptions instead of checking output** — Testing that code crashes in a specific way isn't testing that it handles input correctly. Test the output.

### The Exception-Catching Anti-Pattern in Detail

```java
// Java — WRONG: tests the validation mechanism
@Test
void testBadValueRejected() {
    fixture.setField("invalid");  // Schema rejects this!
    assertThrows(ValidationException.class, () -> process(fixture));
    // Tells you nothing about output
}

// Java — RIGHT: tests the requirement
@Test
void testBadValueNotInOutput() {
    fixture.setField(null);  // Schema accepts null for Optional
    var output = process(fixture);
    assertFalse(output.contains(badProperty));  // Bad data absent
    assertTrue(output.contains(expectedType));   // Rest still works
}
```

```scala
// Scala — WRONG: tests the decoder, not the requirement
"bad value" should "be rejected" in {
  val input = fixture.copy(field = "invalid")  // Circe decoder fails!
  a [DecodingFailure] should be thrownBy process(input)
  // Tells you nothing about output
}

// Scala — RIGHT: tests the requirement
"missing optional field" should "not produce bad output" in {
  val input = fixture.copy(field = None)  // Option[String] accepts None
  val output = process(input)
  output should not contain badProperty  // Bad data absent
  output should contain (expectedType)   // Rest still works
}
```

```typescript
// TypeScript — WRONG: tests the validation mechanism
test('bad value rejected', () => {
    fixture.field = 'invalid';  // Zod schema rejects this!
    expect(() => process(fixture)).toThrow(ZodError);
    // Tells you nothing about output
});

// TypeScript — RIGHT: tests the requirement
test('bad value not in output', () => {
    fixture.field = undefined;  // Schema accepts undefined for optional
    const output = process(fixture);
    expect(output).not.toContain(badProperty);  // Bad data absent
    expect(output).toContain(expectedType);      // Rest still works
});
```

```python
# Python — WRONG: tests the validation mechanism
def test_bad_value_rejected(fixture):
    fixture.field = "invalid"  # Schema rejects this!
    with pytest.raises(ValidationError):
        process(fixture)
    # Tells you nothing about output

# Python — RIGHT: tests the requirement
def test_bad_value_not_in_output(fixture):
    fixture.field = None  # Schema accepts None for Optional
    output = process(fixture)
    assert field_property not in output  # Bad data absent
    assert expected_type in output  # Rest still works
```

```go
// Go — WRONG: tests the error, not the outcome
func TestBadValueRejected(t *testing.T) {
    fixture.Field = "invalid"  // Validator rejects this!
    _, err := Process(fixture)
    if err == nil { t.Fatal("expected error") }
    // Tells you nothing about output
}

// Go — RIGHT: tests the requirement
func TestBadValueNotInOutput(t *testing.T) {
    fixture.Field = ""  // Zero value is valid
    output, err := Process(fixture)
    if err != nil { t.Fatalf("unexpected error: %v", err) }
    if containsBadProperty(output) { t.Error("bad data should be absent") }
    if !containsExpectedType(output) { t.Error("expected data should be present") }
}
```

```rust
// Rust — WRONG: tests the error, not the outcome
#[test]
fn test_bad_value_rejected() {
    let input = Fixture { field: "invalid".into(), ..default() };
    assert!(process(&input).is_err());  // Tells you nothing about output
}

// Rust — RIGHT: tests the requirement
#[test]
fn test_bad_value_not_in_output() {
    let input = Fixture { field: None, ..default() };  // Option accepts None
    let output = process(&input).expect("should succeed");
    assert!(!output.contains(bad_property));  // Bad data absent
    assert!(output.contains(expected_type));   // Rest still works
}
```

Always check your Step 5b schema map before choosing mutation values.

## Testing at the Right Layer

Ask: "What does the *spec* say should happen?" The spec says "invalid data should not appear in output" — not "validation layer should reject it." Test the spec, not the implementation.

**Exception:** When a spec explicitly mandates a specific mechanism (e.g., "must fail-fast at the schema layer"), testing that mechanism is appropriate. But this is rare.

## Fitness-to-Purpose Scenario Tests

For each scenario in QUALITY.md, write a test. This is a 1:1 mapping:

```scala
// Scala (ScalaTest)
class FitnessScenarios extends FlatSpec with Matchers {
  // [Req: formal — QUALITY.md Scenario 1]
  "Scenario 1: [Name]" should "prevent [failure mode]" in {
    val result = process(fixture)
    result.property should equal (expectedValue)
  }
}
```

```python
# Python (pytest)
class TestFitnessScenarios:
    """Tests for fitness-to-purpose scenarios from QUALITY.md."""

    def test_scenario_1_memorable_name(self, fixture):
        """[Req: formal — QUALITY.md Scenario 1] [Name].
        Requirement: [What the code must do].
        """
        result = process(fixture)
        assert condition_that_prevents_the_failure
```

```java
// Java (JUnit 5)
class FitnessScenariosTest {
    @Test
    @DisplayName("[Req: formal — QUALITY.md Scenario 1] [Name]")
    void testScenario1MemorableName() {
        var result = process(fixture);
        assertTrue(conditionThatPreventsFailure(result));
    }
}
```

```typescript
// TypeScript (Jest)
describe('Fitness Scenarios', () => {
  test('[Req: formal — QUALITY.md Scenario 1] [Name]', () => {
    const result = process(fixture);
    expect(conditionThatPreventsFailure(result)).toBe(true);
  });
});
```

```go
// Go (testing)
func TestScenario1_MemorableName(t *testing.T) {
    // [Req: formal — QUALITY.md Scenario 1] [Name]
    // Requirement: [What the code must do]
    result := Process(fixture)
    if !conditionThatPreventsFailure(result) {
        t.Error("scenario 1 failed: [describe expected behavior]")
    }
}
```

```rust
// Rust (cargo test)
#[test]
fn test_scenario_1_memorable_name() {
    // [Req: formal — QUALITY.md Scenario 1] [Name]
    // Requirement: [What the code must do]
    let result = process(&fixture);
    assert!(condition_that_prevents_the_failure(&result));
}
```

## Boundary and Negative Tests

One test per defensive pattern from Step 5:

```typescript
// TypeScript (Jest)
describe('Boundaries and Edge Cases', () => {
  test('[Req: inferred — from functionName() guard] guards against X', () => {
    const input = { ...validFixture, field: null };
    const result = process(input);
    expect(result).not.toContainBadOutput();
  });
});
```

```python
# Python (pytest)
class TestBoundariesAndEdgeCases:
    """Tests for boundary conditions, malformed input, error handling."""

    def test_defensive_pattern_name(self, fixture):
        """[Req: inferred — from function_name() guard] guards against X."""
        # Mutate to trigger defensive code path
        # Assert graceful handling
```

```java
// Java (JUnit 5)
class BoundariesAndEdgeCasesTest {
    @Test
    @DisplayName("[Req: inferred — from methodName() guard] guards against X")
    void testDefensivePatternName() {
        fixture.setField(null);  // Trigger defensive code path
        var result = process(fixture);
        assertNotNull(result);  // Assert graceful handling
        assertFalse(result.containsBadData());
    }
}
```

```scala
// Scala (ScalaTest)
class BoundariesAndEdgeCases extends FlatSpec with Matchers {
  // [Req: inferred — from methodName() guard]
  "defensive pattern: methodName()" should "guard against X" in {
    val input = fixture.copy(field = None)  // Trigger defensive code path
    val result = process(input)
    result should be (defined)
    result.get should not contain badData
  }
}
```

```go
// Go (testing)
func TestDefensivePattern_FunctionName_GuardsAgainstX(t *testing.T) {
    // [Req: inferred — from FunctionName() guard] guards against X
    input := defaultFixture()
    input.Field = nil  // Trigger defensive code path
    result, err := Process(input)
    if err != nil {
        t.Fatalf("expected graceful handling, got: %v", err)
    }
    // Assert result is valid despite edge-case input
}
```

```rust
// Rust (cargo test)
#[test]
fn test_defensive_pattern_function_name_guards_against_x() {
    // [Req: inferred — from function_name() guard] guards against X
    let input = Fixture { field: None, ..default_fixture() };
    let result = process(&input).expect("expected graceful handling");
    // Assert result is valid despite edge-case input
}
```

Use your Step 5b schema map when choosing mutation values. Every mutation must use a value the schema accepts.

Systematic approach:
- **Missing fields** — Optional field absent? Set to null.
- **Wrong types** — Field gets different type? Use schema-valid alternative.
- **Empty values** — Empty list? Empty string? Empty dict?
- **Boundary values** — Zero, negative, maximum, first, last.
- **Cross-module boundaries** — Module A produces unusual but valid output — does B handle it?

If you found 10+ defensive patterns but wrote only 4 boundary tests, go back and write more. Target a 1:1 ratio.
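
The systematic approach above can be sketched as a mutation loop. Everything here is a hypothetical stand-in: `process` represents the project entry point, `VALID_FIXTURE` represents a real data file's shape, and the mutation values would come from your Step 5b schema map:

```python
import copy

# Hypothetical stand-ins: `process` plays the project entry point and
# VALID_FIXTURE mirrors a real data file's shape.
def process(config):
    name = config.get("name") or "unnamed"   # null guard under test
    steps = config.get("steps") or []        # fallback under test
    return {"name": name, "step_count": len(steps)}

VALID_FIXTURE = {"name": "Test", "steps": [{"op": "load"}]}

def mutations(fixture):
    """Yield (label, mutated_fixture) pairs: missing, None, and empty values.

    Only use values the schema accepts; otherwise you test the validator,
    not the defensive pattern.
    """
    for field in fixture:
        absent = copy.deepcopy(fixture)
        del absent[field]
        yield f"missing {field}", absent
        for value in (None, [], ""):
            mutated = copy.deepcopy(fixture)
            mutated[field] = value
            yield f"{field} = {value!r}", mutated

def test_mutations_handled_gracefully():
    for label, mutated in mutations(VALID_FIXTURE):
        result = process(mutated)            # must not raise
        assert "name" in result, label       # graceful output, not a crash
```

In a real suite each defensive pattern gets its own named test rather than one loop, so a failure points at the exact guard that regressed.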
review_protocols.md 20.0 KB
# Review Protocols (Files 3 and 4)

## File 3: Code Review Protocol (`RUN_CODE_REVIEW.md`)

### Template

```markdown
# Code Review Protocol: [Project Name]

## Bootstrap (Read First)

Before reviewing, read these files for context:
1. `quality/QUALITY.md` — Quality constitution and fitness-to-purpose scenarios
2. [Main architectural doc]
3. [Key design decisions doc]
4. [Any other essential context]

## What to Check

### Focus Area 1: [Subsystem/Risk Area Name]

**Where:** [Specific files and functions]
**What:** [Specific things to look for]
**Why:** [What goes wrong if this is incorrect]

### Focus Area 2: [Subsystem/Risk Area Name]

[Repeat for 4–6 focus areas, mapped to architecture and risk areas from exploration]

## Guardrails

- **Line numbers are mandatory.** If you cannot cite a specific line, do not include the finding.
- **Read function bodies, not just signatures.** Don't assume a function works correctly based on its name.
- **If unsure whether something is a bug or intentional**, flag it as a QUESTION rather than a BUG.
- **Grep before claiming missing.** If you think a feature is absent, search the codebase. If found in a different file, that's a location defect, not a missing feature.
- **Do NOT suggest style changes, refactors, or improvements.** Only flag things that are incorrect or could cause failures.

## Output Format

Save findings to `quality/code_reviews/YYYY-MM-DD-reviewer.md`

For each file reviewed:

### filename.ext
- **Line NNN:** [BUG / QUESTION / INCOMPLETE] Description. Expected vs. actual. Why it matters.

### Summary
- Total findings by severity
- Files with no findings
- Overall assessment: SHIP IT / FIX FIRST / NEEDS DISCUSSION
```

### Phase 2: Regression Tests for Confirmed Bugs

After the code review produces findings, write regression tests that reproduce each BUG finding. This transforms the review from "here are potential bugs" into "here are proven bugs with failing tests."

**Why this matters:** A code review finding without a reproducer is an opinion. A finding with a failing test is a fact. Across multiple codebases (Go, Rust, Python), regression tests written from code review findings have confirmed bugs at a high rate — including data races, cross-tenant data leaks, state machine violations, and silent context loss. The regression tests also serve as the acceptance criteria for fixing the bugs: when the test passes, the bug is fixed.

**How to generate regression tests:**

1. **For each BUG finding**, write a test that:
   - Targets the exact code path and line numbers from the finding
   - Fails on the current implementation, confirming the bug exists
   - Uses mocking/monkeypatching to isolate from external services
   - Includes the finding description in the test docstring for traceability

2. **Name the test file** `quality/test_regression.*` using the project's language:
   - Python: `quality/test_regression.py`
   - Go: `quality/regression_test.go` (or in the relevant package's test directory)
   - Rust: `quality/regression_tests.rs` or a `tests/regression_*.rs` file in the relevant crate
   - Java: `quality/RegressionTest.java`
   - TypeScript: `quality/regression.test.ts`

3. **Each test should document its origin:**
   ```
   # Python example
   def test_webhook_signature_raises_on_malformed_input():
       """[BUG from 2026-03-26-reviewer.md, line 47]
       Webhook signature verification raises instead of returning False
       on malformed signatures, risking 500 instead of clean 401."""

   // Go example
   func TestRestart_DataRace_DirectFieldAccess(t *testing.T) {
       // BUG from 2026-03-26-claude.md, line 3707
       // Restart() writes mutex-protected fields without acquiring the lock
   }
   ```

4. **Run the tests and report results** as a confirmation table:
   ```
   | Finding | Test | Result | Confirmed? |
   |---------|------|--------|------------|
   | Webhook signature raises on malformed input | test_webhook_signature_... | FAILED (expected) | YES — bug confirmed |
   | Queued messages deleted before processing | test_message_queue_... | FAILED (expected) | YES — bug confirmed |
   | Thread active check fails open | test_is_thread_active_... | PASSED (unexpected) | NO — needs investigation |
   ```

5. **If a test passes unexpectedly**, investigate — either the finding was a false positive, or the test doesn't exercise the right code path. Report as NEEDS INVESTIGATION, not as a confirmed bug.

**Language-specific tips:**

- **Go:** Use `go test -race` to confirm data race findings. The race detector is definitive — if it fires, the race is real.
- **Rust:** Use `#[should_panic]` or assert on specific error conditions. For atomicity bugs, assert on cleanup state after injected failures.
- **Python:** Use `monkeypatch` or `unittest.mock.patch` to isolate external dependencies. Use `pytest.raises` for exception-path bugs.
- **Java:** Use Mockito or similar to isolate dependencies. Use `assertThrows` for exception-path bugs.

**Save the regression test output** alongside the code review: if the review is at `quality/code_reviews/2026-03-26-reviewer.md`, the regression tests go in `quality/test_regression.*` and the confirmation results go in the review file as an addendum or in `quality/results/`.

### Why These Guardrails Matter

These four guardrails often improve AI code review quality by reducing vague and hallucinated findings:

1. **Line numbers** force the model to actually locate the issue, not just describe a general concern
2. **Reading bodies** prevents the common failure of assuming a function works based on its name
3. **QUESTION vs BUG** reduces false positives that waste human time
4. **Grep before claiming missing** prevents the most common AI review hallucination: claiming something doesn't exist when it's in a different file

The "no style changes" rule keeps reviews focused on correctness. Style suggestions dilute the signal and waste review time.

---

## File 4: Integration Test Protocol (`RUN_INTEGRATION_TESTS.md`)

### Template

```markdown
# Integration Test Protocol: [Project Name]

## Working Directory

All commands in this protocol use **relative paths from the project root.** Run everything from the project root (the parent of the directory containing this file). Do not `cd` to an absolute path or a parent directory; if a command starts with `cd /some/absolute/path`, it's wrong. Use `./scripts/`, `./pipelines/`, `./quality/`, etc.

## Safety Constraints

[If this protocol runs with elevated permissions:]
- DO NOT modify source code
- DO NOT delete files
- ONLY create files in the test results directory
- If something fails, record it and move on — DO NOT fix it

## Pre-Flight Check

Before running integration tests, verify:
- [ ] [Dependencies installed — specific command]
- [ ] [API keys / external services available — specific checks]
- [ ] [Test fixtures exist — specific paths]
- [ ] [Clean state — specific cleanup if needed]

## Test Matrix

| Check | Method | Pass Criteria |
|-------|--------|---------------|
| [Happy path flow] | [Specific command or test] | [Specific expected result] |
| [Variant A end-to-end] | [Command] | [Expected result] |
| [Variant B end-to-end] | [Command] | [Expected result] |
| [Output correctness] | [Specific assertion] | [Expected property] |
| [Component boundary A→B] | [Command] | [Expected result] |

### Design Principles for Integration Checks

- **Happy path** — Does the primary flow work from input to output?
- **Cross-variant consistency** — Does each variant produce correct output?
- **Output correctness** — Don't just check "output exists" — verify specific properties
- **Component boundaries** — Does Module A's output correctly feed Module B?

## Automated Integration Tests

Where possible, encode checks as automated tests:

```bash
[test runner] [integration test file] --verbose
```

## Manual Verification Steps

[Any checks requiring external systems, human judgment, or manual inspection]

## Execution UX (How to Present When Running This Protocol)

When an AI agent runs this protocol, it should communicate in three phases so the user can follow along without reading raw output:

### Phase 1: The Plan

Before running anything, show the user what's about to happen:

```
## Integration Test Plan

**Pre-flight:** Checking dependencies, API keys, and environment
**Tests to run:**

| # | Test | What It Checks | Est. Time |
|---|------|---------------|-----------|
| 1 | [Test name] | [One-line description] | ~30s |
| 2 | [Test name] | [One-line description] | ~2m |
| ... | | | |

**Total:** N tests, estimated M minutes
```

This gives the user a chance to say "skip test 4" or "actually, don't run the live API tests" before anything starts.

### Phase 2: Progress

As each test runs, report a one-line status update. Keep it compact — the user wants a heartbeat, not a log dump:

```
✓ Test 1: Expression evaluation — PASS (0.3s)
✓ Test 2: Schema validation — PASS (0.1s)
⧗ Test 3: Live pipeline (Gemini, realtime)... running
```

Use `✓` for pass, `✗` for fail, `⧗` for in-progress. If a test fails, show one line of context (the error message or assertion that failed), not the full stack trace. The user can ask for details if they want them.

### Phase 3: Results

After all tests complete, show a summary table and a recommendation:

```
## Results

| # | Test | Result | Time | Notes |
|---|------|--------|------|-------|
| 1 | Expression evaluation | ✓ PASS | 0.3s | |
| 2 | Schema validation | ✓ PASS | 0.1s | |
| 3 | Live pipeline (Gemini) | ✗ FAIL | 45s | Rate limited after 8 units |
| ... | | | | |

**Passed:** 7/8 | **Failed:** 1/8

**Recommendation:** FIX FIRST — Rate limit handling needs investigation.
```

Then save the detailed results to `quality/results/YYYY-MM-DD-integration.md`.

## Reporting (Saved to File)

Save results to `quality/results/YYYY-MM-DD-integration.md`

### Summary Table
| Check | Result | Notes |
|-------|--------|-------|
| ... | PASS/FAIL | ... |

### Detailed Findings
[Specific failures, unexpected behavior, performance observations]

### Recommendation
[SHIP IT / FIX FIRST / NEEDS INVESTIGATION]
```

### Tips for Writing Good Integration Checks

- Each check should exercise a real end-to-end flow, not just call a single function
- Pass criteria must be specific and verifiable — not "looks right" but "output contains exactly N records with property X"
- Include timing expectations where relevant (especially for batch/pipeline projects)
- If the project has multiple execution modes (batch vs. realtime, different providers), test each combination

### Live Execution Against External Services

Integration tests must exercise the project's actual external dependencies — APIs, databases, services, file systems. A protocol that only tests local validation and config parsing is not an integration test protocol; it's a unit test suite in disguise.

During exploration, identify:
- **External APIs the project calls** — Look for API keys in .env files, environment variable references, provider/client abstractions, HTTP client configurations
- **Execution modes** — batch vs. realtime, sync vs. async, different provider backends
- **Existing integration test runners** — Scripts or test files that already exercise end-to-end flows

Then design the test matrix as a **provider × pipeline × mode** grid. For example, if the project supports 3 API providers and 3 pipelines with batch and realtime modes, the protocol should run real executions across that matrix — not just validate configs locally.

**Structure runs for parallelism.** Group runs so that at most one run per provider executes simultaneously (to avoid rate limits). Use background processes and `wait` for concurrent execution within groups.

**Define per-pipeline quality checks.** Each pipeline produces different output with different correctness criteria. The protocol must specify what fields to check and what values are acceptable for each pipeline — not just "output exists."

**Include a post-run verification checklist.** For each run, verify: log file exists with completion message, manifest shows terminal state, validated output files exist and contain parseable data, sample records have expected fields populated, and any existing automated quality check scripts pass.

**Pre-flight must check API keys.** If keys are missing, stop and ask — don't skip the live tests silently.

The goal is that running this protocol exercises the full system under real-world conditions, catching issues that local-only testing would miss: provider-specific response format differences, timeout behavior, rate limiting, and output correctness with real LLM responses.

### Parallelism and Rate Limit Awareness

Sequential integration runs waste time. Group runs so that independent runs execute concurrently, with these constraints:

- **At most one run per external provider simultaneously** to avoid rate limits
- **Use background processes and `wait`** for concurrent execution within groups
- **Generate a shared timestamp** at the start of the session for consistent run directory naming

Example grouping for a project with 3 pipelines and 3 providers (9+ runs):

```
Group 1 (parallel): Pipeline_A × Provider_1 | Pipeline_B × Provider_2 | Pipeline_C × Provider_3
Group 2 (parallel): Pipeline_A × Provider_2 | Pipeline_B × Provider_3 | Pipeline_C × Provider_1
Group 3 (parallel): Pipeline_A × Provider_3 | Pipeline_B × Provider_1 | Pipeline_C × Provider_2
```

This pattern maximizes throughput while never hitting the same provider with concurrent requests. Adapt the grouping to the project's actual pipeline and provider count.

In the generated protocol, include the actual bash commands with `&` for background execution and `wait` between groups. Don't just describe parallelism — script it.
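As a concrete starting point, here is a minimal sketch of the grouping pattern. `run_pipeline` and the pipeline/provider names are placeholders — substitute the project's actual run command for the `sleep`:

```shell
#!/usr/bin/env bash
# Grouped parallel execution sketch. Names are placeholders, not a real project.
set -euo pipefail

STAMP="$(date +%Y%m%d-%H%M%S)"   # shared timestamp for run directory naming

run_pipeline() {
    local pipeline="$1" provider="$2"
    echo "[$STAMP] $pipeline x $provider: starting"
    sleep 0.1   # placeholder -- replace with the project's real run command
    echo "[$STAMP] $pipeline x $provider: done"
}

# Group 1: each provider sees exactly one concurrent run
run_pipeline Pipeline_A Provider_1 &
run_pipeline Pipeline_B Provider_2 &
run_pipeline Pipeline_C Provider_3 &
wait   # block until the whole group finishes before starting the next

# Group 2: rotate providers so no provider is hit by two runs at once
run_pipeline Pipeline_A Provider_2 &
run_pipeline Pipeline_B Provider_3 &
run_pipeline Pipeline_C Provider_1 &
wait
```

With `set -e`, a failing background job surfaces through `wait`, so a broken run stops the session rather than silently continuing into the next group.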

### Deriving Quality Gates from Code

Generic pass/fail criteria ("all units validated") miss domain-specific correctness issues. Derive pipeline-specific quality checks from the code itself:

1. **Read validation rules.** If the project validates output (schema validators, assertion functions, business rule checks), those rules define what "correct" looks like. Turn them into quality gates: "field X must satisfy condition Y for all output records."

2. **Read schema enums.** If schemas define enum fields (e.g., `outcome: ["fell_in_water", "reached_ship"]`), the quality gate is: "all outputs must use values from this set, and the distribution should be non-degenerate (not 100% one value)."

3. **Read generation logic.** If the project generates test data (items files, seed data, permutation strategies), understand what variants should appear. If there are 3 personality types, the quality gate is: "all 3 types must appear in output with sufficient sample size."

4. **Read existing quality checks.** Search for scripts or functions that already verify output quality (e.g., `integration_checks.py`, validation functions called after runs). Reference or call them directly from the protocol.

For each pipeline in the project, the integration protocol should have a dedicated "Quality Checks" section listing 2–4 specific checks with expected values derived from the exploration above. Do not use generic checks like "output exists" — every check must reference a specific field and acceptable value range.
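Sketched in Python, the enum gate from point 2 might be checked like this. The `outcome` field and its enum values are the illustrative ones from that example, not a real schema:

```python
from collections import Counter

# Quality gate derived from the hypothetical "outcome" schema enum above.
ALLOWED_OUTCOMES = {"fell_in_water", "reached_ship"}

def outcome_gate(records):
    """Return gate failures; an empty list means the gate passes."""
    failures = []
    values = [r.get("outcome") for r in records]
    bad = sorted({str(v) for v in values if v not in ALLOWED_OUTCOMES})
    if bad:
        failures.append(f"values outside enum: {bad}")
    counts = Counter(values)
    # Degenerate distribution: every record has the same value
    if counts and max(counts.values()) == len(values):
        failures.append("degenerate distribution: 100% one value")
    return failures
```

Note that the gate checks both membership and distribution — "all values legal" is not the same as "the generator is actually exercising the outcome space."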

### The Field Reference Table (Required Before Writing Quality Gates)

**Why this exists:** AI models confidently write wrong field names even when they've read the schemas. This happens because the model reads the schema during exploration, then writes the protocol hours (or thousands of tokens) later from memory. Memory drifts: `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The protocol looks authoritative but the field names are hallucinated. When someone runs the quality gates against real data, they fail — and the user loses trust in the entire generated playbook.

**The fix is procedural, not instructional.** Don't just tell yourself to "cross-check later" — build the reference table FIRST, then write quality gates by copying from it.

Before writing any quality gate that references output field names, build a **Field Reference Table** by re-reading each schema file:

```
## Field Reference Table (built from schemas, not memory)

### Pipeline: WeatherForecast
Schema: pipelines/WeatherForecast/schemas/analyze.json
| Field | Type | Constraints |
|-------|------|-------------|
| region_name | string | — |
| temperature | number | min: -50, max: 60 |
| condition | string | enum: ["sunny", "cloudy", "rain", "snow"] |

### Pipeline: SentimentAnalysis
Schema: pipelines/SentimentAnalysis/schemas/evaluate.json
| Field | Type | Constraints |
|-------|------|-------------|
| document_id | string | — |
| sentiment_score | number | min: -1.0, max: 1.0 |
| classification | string | enum: ["positive", "negative", "neutral"] |
...
```

**The process:**
1. **Re-read each schema file IMMEDIATELY before writing each table row.** Do not write any row from memory. The file read and the table row must be adjacent — read the file, write the row, read the next file, write the next row. If you read all schemas earlier in the conversation, that doesn't count — you must read them AGAIN here because your memory of field names drifts over thousands of tokens.
2. **Copy field names character-for-character from the file contents.** Do not retype them. `document_id` is not `doc_id`. `sentiment_score` is not `sentiment`. `classification` is not `category`. Even small differences break quality gates.
3. **Include ALL fields from the schema, not just the ones you think are important.** If the schema has 8 required fields, the table has 8 rows. If you wrote fewer rows than the schema has fields, you skipped fields.
4. Write quality gates by copying field names from the completed table.
5. After writing, count fields: if the quality gates mention a field that isn't in the table, you hallucinated it. Remove it.

This table is an intermediate artifact — include it in the protocol itself (as a reference section) so future protocol users can verify field accuracy. The point is to create it as a concrete step that produces evidence of schema reading, not skip it because you "already know" the fields.
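One way to make the copy-character-for-character rule mechanical is to generate the rows from the schema file rather than retyping them. A minimal sketch, assuming JSON Schema files with a top-level `properties` object:

```python
import json

# Generate Field Reference Table rows straight from a schema file, so field
# names are copied programmatically instead of retyped from memory.
def table_rows(schema_path):
    with open(schema_path) as f:
        schema = json.load(f)
    rows = []
    for name, spec in schema.get("properties", {}).items():
        constraints = []
        if "enum" in spec:
            constraints.append(f"enum: {spec['enum']}")
        for key in ("minimum", "maximum"):
            if key in spec:
                constraints.append(f"{key}: {spec[key]}")
        rows.append(f"| {name} | {spec.get('type', '?')} | {'; '.join(constraints) or '—'} |")
    return rows
```

If the project uses Pydantic, Zod, or another validation layer instead of raw JSON Schema, the same idea applies: extract field names from the model definition programmatically, never from memory.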

### Calibrating Scale

The number of units/records/iterations per integration test run matters:

- **Too few (1–3):** Fast and cheap, but misses concurrency bugs, distribution checks fail (can't verify "25–75% ratio" with 2 records), and fan-out/expansion logic untested at realistic scale.
- **Too many (100+):** Expensive and slow for a test protocol. Appropriate for production but not for quality verification.
- **Right range:** Enough to exercise the system meaningfully. Guidelines:
  - If the project has chunking/batching logic, use a count that spans at least 2 chunks (e.g., if chunk_size=10, use 15–30 units)
  - If the project has distribution checks, use at least 5–10× the number of categories (e.g., 3 outcome types → at least 15 units)
  - If the project has fan-out/expansion, use a count that produces a non-trivial number of children

Look for `chunk_size`, `batch_size`, or similar configuration in the project to calibrate. When in doubt, 10–30 records is usually the right range for integration testing — enough to catch real issues without burning API budget.
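These heuristics can be collapsed into a small calibration helper. The multipliers mirror the guidelines above; treat them as starting points, not hard rules:

```python
# Derive a unit count for integration runs from the calibration heuristics.
def calibrate_units(chunk_size=None, n_categories=None):
    candidates = [10]                              # floor: enough to be meaningful
    if chunk_size:
        candidates.append(int(chunk_size * 1.5))   # spans at least 2 chunks
    if n_categories:
        candidates.append(n_categories * 5)        # at least 5x per category
    return min(max(candidates), 30)                # cap: don't burn API budget
```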

### Post-Run Verification Depth

A run that completes without errors may still be wrong. For each integration test run, verify at multiple levels:

1. **Process-level:** Did the process exit cleanly? Check log files for completion messages, not just exit codes.
2. **State-level:** Is the run in a terminal state? Check the run manifest/status file for "complete" (not stuck in "running" or "submitted").
3. **Data-level:** Does output data exist and parse correctly? Read actual output files, verify they contain valid JSON/CSV/etc.
4. **Content-level:** Do output records have the expected fields populated with reasonable values? Read 2–3 sample records and check key fields.
5. **Quality-level:** Do the pipeline-specific quality gates pass? Run any existing quality check scripts.
6. **UI-level (if applicable):** If the project has a dashboard/TUI/UI, verify the run appears correctly there.

Include all applicable levels in the generated protocol's post-run checklist. The common failure is stopping at level 2 (process completed) without checking levels 3–5.
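Levels 1–4 can be encoded as a single checklist function. The file layout here (`run.log`, `manifest.json`, `output.jsonl`) and the `condition` sample field are hypothetical conventions — adapt them to the project's actual run directory:

```python
import json
import pathlib

# Post-run checklist sketch covering levels 1-4 (assumed file layout).
def verify_run(run_dir):
    d = pathlib.Path(run_dir)
    checks = {}
    # Level 1 (process): log exists and contains a completion message
    log = d / "run.log"
    checks["process"] = log.exists() and "completed" in log.read_text()
    # Level 2 (state): manifest shows a terminal state
    manifest = d / "manifest.json"
    checks["state"] = (manifest.exists()
                       and json.loads(manifest.read_text()).get("status") == "complete")
    # Level 3 (data): output exists and every line parses as JSON
    try:
        lines = (d / "output.jsonl").read_text().splitlines()
        records = [json.loads(line) for line in lines if line.strip()]
        checks["data"] = len(records) > 0
    except (OSError, json.JSONDecodeError):
        records = []
        checks["data"] = False
    # Level 4 (content): sample records have the key field populated
    checks["content"] = bool(records) and all(r.get("condition") for r in records[:3])
    return checks
```

Returning a per-level dict rather than a single boolean matters: when a run fails, the report can say exactly which level broke instead of just "verification failed."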
schema_mapping.md 6.3 KB
# Schema Type Mapping (Step 5b)

If the project has a schema validation layer, you need to understand what each field accepts before writing boundary tests. Common validation layers by language:

- **Python** — Pydantic models
- **TypeScript** — interfaces and Zod schemas
- **Java** — Bean Validation annotations
- **Scala** — case class codecs / Circe decoders
- **Go** — struct tag validators
- **Rust** — serde attributes
- **Any language** — JSON Schema

Without this mapping, you'll write mutations that the schema rejects before they reach the code you're trying to test — producing validation errors instead of meaningful boundary tests.

## Why This Matters

Consider this common mistake:

```typescript
// TypeScript — WRONG: tests the validation mechanism, not the requirement
test('bad value rejected', () => {
    fixture.field = 'invalid';  // Zod schema rejects this before processing!
    expect(() => process(fixture)).toThrow(ZodError);
    // Tells you nothing about the output
});

// TypeScript — RIGHT: tests the requirement using a schema-valid mutation
test('bad value not in output', () => {
    fixture.field = undefined;  // Schema accepts undefined for optional fields
    const output = process(fixture);
    expect(output).not.toContain(badProperty);  // Bad data absent
    expect(output).toContain(expectedType);      // Rest still works
});
```

```python
# Python — WRONG: tests the validation mechanism, not the requirement
def test_bad_value_rejected(fixture):
    fixture.field = "invalid"  # Pydantic rejects this before processing!
    with pytest.raises(ValidationError):
        process(fixture)
    # Tells you nothing about the output

# Python — RIGHT: tests the requirement using a schema-valid mutation
def test_bad_value_not_in_output(fixture):
    fixture.field = None  # Schema accepts None for Optional fields
    output = process(fixture)
    assert field_property not in output  # Bad data absent
    assert expected_type in output  # Rest still works
```

```java
// Java — WRONG: tests Bean Validation, not the requirement
@Test
void testBadValueRejected() {
    fixture.setField("invalid");  // @NotNull/@Pattern rejects this!
    assertThrows(ConstraintViolationException.class, () -> process(fixture));
}

// Java — RIGHT: tests the requirement using a schema-valid mutation
@Test
void testBadValueNotInOutput() {
    fixture.setField(null);  // nullable String field accepts null
    var output = process(fixture);
    assertFalse(output.contains(badProperty));
    assertTrue(output.contains(expectedType));
}
```

```scala
// Scala — WRONG: tests the decoder, not the requirement
"bad value" should "be rejected" in {
    val input = fixture.copy(field = "invalid")  // Circe decoder fails!
    a [DecodingFailure] should be thrownBy process(input)
}

// Scala — RIGHT: tests the requirement using a schema-valid mutation
"missing optional field" should "not produce bad output" in {
    val input = fixture.copy(field = None)  // Option[String] accepts None
    val output = process(input)
    output should not contain badProperty
}
```

```go
// Go — WRONG: tests validation, not the requirement
func TestBadValueRejected(t *testing.T) {
    fixture.Field = "invalid"  // Struct tag validator rejects this!
    _, err := Process(fixture)
    if err == nil { t.Fatal("expected validation error") }
    // Tells you nothing about the output
}

// Go — RIGHT: tests the requirement using a valid zero value
func TestBadValueNotInOutput(t *testing.T) {
    fixture.Field = ""  // Zero value is valid for optional string fields
    output, err := Process(fixture)
    if err != nil { t.Fatalf("unexpected error: %v", err) }
    // Assert bad data absent, rest still works
}
```

```rust
// Rust — WRONG: tests serde deserialization, not the requirement
#[test]
fn test_bad_value_rejected() {
    let input = Fixture { field: "invalid".into(), ..default() };
    // serde rejects before processing!
    assert!(process(&input).is_err());
}

// Rust — RIGHT: tests the requirement using a schema-valid mutation
#[test]
fn test_bad_value_not_in_output() {
    let input = Fixture { field: None, ..default() };  // Option<String> accepts None
    let output = process(&input).expect("should succeed");
    assert!(!output.contains(bad_property));
    assert!(output.contains(expected_type));
}
```

The WRONG tests fail with a validation/decoding error because the mutation value isn't schema-valid. The RIGHT tests use values the schema accepts (null, None, nil, zero values, empty Option) so the mutation reaches the actual processing logic.

## How to Build the Map

For every field you found a defensive pattern for in Step 5, record:

| Field | Schema Type | Accepts | Rejects |
|-------|-----------|---------|---------|
| `metadata` | optional object (`Optional[MetadataObject]` / `MetadataObject?` / `MetadataObject \| null`) | valid object, `null`/`undefined` | `string`, `number`, `array` |
| `count_field` | optional integer (`Optional[int]` / `number?` / `Integer`) | integer, `null` | `string`, `object` |
| `child_list` | array of objects (`List[Child]` / `Child[]` / `Seq[Child]`) | array of objects, `[]` | `[null, "invalid"]`, `null` |
| `optional_object` | optional object | `{"key": value}`, `null` | `"bad"`, `[1,2]` |

## Rules for Choosing Mutation Values

When writing boundary tests, always use values from the "Accepts" column. The idiomatic "missing/empty" value varies by language:

- **Optional/nullable fields:** Python `None`, Java `null`, Scala `None` (for `Option`), TypeScript `undefined`/`null`, Go zero value (`""`, `0`, `nil` for pointers), Rust `None` (for `Option<T>`)
- **Numeric fields:** `0`, negative values, or boundary values — language-agnostic
- **Arrays/lists:** Python `[]`, Java `List.of()`, Scala `Seq.empty`, TypeScript `[]`, Go `nil` or empty slice, Rust `Vec::new()`
- **Strings:** `""` (empty string) — language-agnostic
- **Objects/structs:** Python `{}`, Java `new Obj()` with missing fields, Scala `copy()` with `None`, TypeScript `{}`, Go zero-value struct, Rust `Default::default()` or builder with missing fields

Never use values from the "Rejects" column — they test the schema validator, not the business logic.

## When to Skip This Step

If the project has no schema validation layer (data flows directly into processing without type checking), you can skip the mapping and use any mutation values. But most modern projects have some form of validation, so check first.
spec_audit.md 7.2 KB
# Council of Three Spec Audit Protocol (File 5)

This is a static analysis protocol — AI models read the code and compare it to specifications. No code is executed. It catches a different class of problem than testing: spec-code divergence, undocumented features, phantom specs, and missing implementations.

## Why Three Models?

Different AI models have different blind spots — they're confident about different things and miss different things. Cross-referencing three independent reviews catches defects that any single model would miss.

## Template

```markdown
# Spec Audit Protocol: [Project Name]

## The Definitive Audit Prompt

Give this prompt identically to three independent AI tools (e.g., Claude, GPT, Gemini).

---

**Context files to read:**
1. [List all spec/intent documents with paths]
2. [Architecture docs]
3. [Design decision records]

**Task:** Act as the Tester. Read the actual code in [source directories] and compare it against the specifications listed above.

**Requirement confidence tiers:**
Requirements are tagged with `[Req: tier — source]`. Weight your findings by tier:
- **formal** — written by humans in a spec document. Authoritative. Divergence is a real finding.
- **user-confirmed** — stated by the user but not in a formal doc. Treat as authoritative unless contradicted by other evidence.
- **inferred** — deduced from code behavior. Lower confidence. Report divergence as NEEDS REVIEW, not as a definitive defect.

**Rules:**
- ONLY list defects. Do not summarize what matches.
- For EVERY defect, cite specific file and line number(s).
  If you cannot cite a line number, do not include the finding.
- Before claiming missing, grep the codebase.
- Before claiming exists, read the actual function body.
- Classify each finding: MISSING / DIVERGENT / UNDOCUMENTED / PHANTOM
- For findings against inferred requirements, add: NEEDS REVIEW

**Defect classifications:**
- **MISSING** — Spec requires it, code doesn't implement it
- **DIVERGENT** — Both spec and code address it, but they disagree
- **UNDOCUMENTED** — Code does it, spec doesn't mention it
- **PHANTOM** — Spec describes it, but nothing in the code corresponds to it as described; the spec entry is stale or was never implemented that way

**Project-specific scrutiny areas:**

[5–10 specific questions that force the auditor to read the most critical code. Target:]

1. [The most fragile module — force the auditor to read specific functions]
2. [External data handling — validation, normalization, error recovery]
3. [Assumptions that might not hold — field presence, value ranges, format consistency]
4. [Features that cross module boundaries]
5. [The gap between documentation and implementation]
6. [Specific edge cases from the QUALITY.md scenarios]

**Output format:**

### [filename.ext]
- **Line NNN:** [MISSING / DIVERGENT / UNDOCUMENTED / PHANTOM] [Req: tier — source] Description.
  Spec says: [quote or reference]. Code does: [what actually happens].
  *(Include the `[Req: tier — source]` tag so findings can be traced back to their requirement and confidence level.)*

---

## Running the Audit

1. Give the identical prompt to three AI tools
2. Each auditor works independently — no cross-contamination
3. Collect all three reports

## Triage Process

After all three models report, merge findings:

| Confidence | Found By | Action |
|------------|----------|--------|
| Highest | All three | Almost certainly real — fix or update spec |
| High | Two of three | Likely real — verify and fix |
| Needs verification | One only | Could be real or hallucinated — deploy verification probe |

### The Verification Probe

When models disagree on factual claims, deploy a read-only probe: give one model the disputed claim and ask it to read the code and report ground truth. Never resolve factual disputes by majority vote — the majority can be wrong about what code actually does.

### Categorize Each Confirmed Finding

- **Spec bug** — Spec is wrong, code is fine → update spec
- **Design decision** — Human judgment needed → discuss and decide
- **Real code bug** — Fix in small batches by subsystem
- **Documentation gap** — Feature exists but undocumented → update docs
- **Missing test** — Code is correct but no test verifies it → add to the functional test file
- **Inferred requirement wrong** — The inferred requirement doesn't match actual intent → remove or correct it in QUALITY.md

The **Missing test** category is the bridge between the spec audit and the test suite: every confirmed finding not already covered by a test should become one.

## Fix Execution Rules

- Group fixes by subsystem, not by defect number
- Never one mega-prompt for all fixes
- Each batch: implement, test, have all three reviewers verify the diff
- At least two auditors must confirm fixes pass before marking complete

## Output

Save audit reports to `quality/spec_audits/YYYY-MM-DD-[model].md`
Save triage summary to `quality/spec_audits/YYYY-MM-DD-triage.md`
```
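The merge-and-bucket step of triage can be sketched in Python. Keying findings on `(file, line, classification)` is a simplifying assumption — in practice the three models phrase the same defect differently, so matching may need fuzzier comparison or human judgment:

```python
from collections import defaultdict

# Merge three auditors' findings and bucket them by agreement level.
# A finding is keyed on (file, line, classification) tuples.
def triage(reports):
    found_by = defaultdict(set)
    for model, findings in reports.items():
        for finding in findings:
            found_by[finding].add(model)
    buckets = {3: [], 2: [], 1: []}   # how many models reported the finding
    for finding, models in sorted(found_by.items()):
        buckets[len(models)].append((finding, sorted(models)))
    return buckets
```

Bucket 3 maps to "almost certainly real," bucket 2 to "likely real," and bucket 1 to "deploy a verification probe" per the triage table above.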

## The Four Guardrails (Critical for All Auditors)

Some models confidently claim features are missing without checking code. These four rules embedded in the audit prompt materially improve output quality by reducing vague and hallucinated findings:

1. **Mandatory line numbers** — If you cannot cite a line number, do not include the finding. This eliminates vague claims.
2. **Grep before claiming missing** — Before saying a feature is absent, search the codebase. It may be in a different file.
3. **Read function bodies, not just signatures** — Don't assume a function works correctly based on its name.
4. **Classify defect type** — Forces structured thinking (MISSING/DIVERGENT/UNDOCUMENTED/PHANTOM) instead of vague "this looks wrong."

These guardrails are already embedded in the template above. They matter most for models that tend toward confident but unchecked claims.

## Model Selection Notes

Different models have different audit strengths. In practice:

- **Architecture-focused models** (e.g., Claude) tend to find the most issues with fewest false positives, excelling at silent data loss, cross-function data flow, and state machine bugs.
- **Edge-case focused models** (e.g., GPT-based tools) tend to catch boundary conditions other models miss (zero-length inputs, file collisions, off-by-one errors) and serve as effective verification cross-checkers.
- **Models that need structure** (e.g., some Gemini variants) may perform poorly on open-ended audit prompts but respond dramatically to the four guardrails above.

The specific models that excel will change over time. The principle holds: use multiple models with different strengths, and always include the four guardrails.

## Tips for Writing Scrutiny Areas

The scrutiny areas are the most important part of the prompt. Generic questions like "check if the code matches the spec" produce generic answers. Specific questions that name functions, files, and edge cases produce specific findings.

Good scrutiny areas:
- "Read `process_input()` in `pipeline.py` lines 45–120. The spec says it should handle missing fields by substituting defaults. Does it? Which fields have defaults and which silently produce null?"
- "The architecture doc says Module A passes validated data to Module B. Read both modules. Is there any path where unvalidated data reaches Module B?"

Bad scrutiny areas:
- "Check if the code is correct"
- "Look for bugs"
- "Verify the implementation matches the spec"
verification.md 6.9 KB
# Verification Checklist (Phase 3)

Before declaring the quality playbook complete, check every benchmark below. If any fails, go back and fix it.

## Self-Check Benchmarks

### 1. Test Count

Calculate the heuristic target: (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns from Step 5).

- **Well below target** → You likely missed spec requirements or skimmed defensive patterns. Go back and check.
- **Near target** → Review whether you tested negative cases and boundaries.
- **Above target** → Fine, as long as every test is meaningful. Don't pad to hit a number.
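A worked example with illustrative counts:

```python
# Worked example of the heuristic target; the counts are illustrative.
testable_spec_sections = 12   # spec sections that state testable behavior
quality_md_scenarios = 8      # scenarios listed in QUALITY.md
defensive_patterns = 15       # defensive patterns found in Step 5
target = testable_spec_sections + quality_md_scenarios + defensive_patterns
print(f"heuristic target: {target} tests")
```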

### 2. Scenario Coverage

Count the scenarios in QUALITY.md. Count the scenario test functions in your functional test file. The numbers must match exactly.

### 3. Cross-Variant Coverage

If the project handles N input variants, what percentage of tests exercise all N?

Count: tests that loop or parametrize over all variants / total tests.

**Heuristic: ~30%.** If well below, look for single-variant tests that should be parametrized. Common candidates: structural completeness, identity verification, required field presence, data relationships, semantic correctness. The exact percentage matters less than ensuring cross-cutting properties are tested across all variants.

### 4. Boundary and Negative Test Count

Count the defensive patterns from Step 5. Count your boundary/negative tests. The ratio should be close to 1:1. If significantly lower, write more tests targeting untested defensive patterns.

### 5. Assertion Depth

Scan your assertions. How many are presence checks vs. value checks? If more than half are presence-only (`assert x is not None`, `assert x in output`), strengthen them to check actual values.
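A rough scan can be automated. The regexes below are heuristics for pytest-style `assert` lines and will miss framework-specific assertion styles:

```python
import re

# Rough assertion-depth scan: count presence-only vs. value assertions.
PRESENCE = re.compile(r"assert\s+\w+(\.\w+)*\s+(is not None|in\s+\w+)\s*$")
VALUE = re.compile(r"assert\s+.+==")

def assertion_depth(source):
    lines = source.splitlines()
    presence = sum(1 for line in lines if PRESENCE.search(line))
    value = sum(1 for line in lines if VALUE.search(line))
    return presence, value
```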

### 6. Layer Correctness

For each test, ask: "Am I testing the *requirement* or the *mechanism*?" If any test only asserts that a specific error type is raised without also verifying pipeline output, it's testing the mechanism. Rewrite to test the outcome.

### 7. Mutation Validity

For every test that mutates a fixture, verify the mutation value is in the "Accepts" column of your Step 5b schema map. If any mutation uses a type the schema rejects, the test fails with a validation error instead of testing defensive code. Fix it.

### 8. All Tests Pass — Zero Failures AND Zero Errors

Run the test suite using the project's test runner:

- **Python:** `pytest quality/test_functional.py -v`
- **Scala:** `sbt "testOnly *FunctionalSpec"`
- **Java:** `mvn test -Dtest=FunctionalTest` or `gradle test --tests FunctionalTest`
- **TypeScript:** `npx jest functional.test.ts --verbose`
- **Go:** `go test -v` targeting the generated test file's package — use the project's existing module and package layout
- **Rust:** `cargo test` targeting the generated test — either the integration test target in `tests/` or inline `#[cfg(test)]` tests, matching the project's conventions

**Check for both failures AND errors.** Most test frameworks distinguish between test failures (assertion errors) and test errors (setup failures, missing fixtures, import/resolution errors, exceptions during initialization). Both are broken tests. A common mistake: generating tests that reference shared fixtures or helpers that don't exist. These show up as setup errors, not assertion failures — but they are just as broken.

After running, check:
- All tests passed — count must equal total test count
- Zero failures
- Zero errors/setup failures

If there are setup errors, you forgot to create the fixture/setup file or you referenced helpers that don't exist. Go back and either create them or rewrite the tests to be self-contained.

### 9. Existing Tests Unbroken

Run the project's full test suite (not just your new tests). Your new files should not break anything.

## Documentation Verification

### 10. QUALITY.md Scenarios Reference Real Code and Label Sources

Every scenario should mention actual function names, file names, or patterns that exist in the codebase. Grep for each reference to confirm it exists.

If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in Phase 4.

### 11. RUN_CODE_REVIEW.md Is Self-Contained

An AI with no prior context should be able to read it and perform a useful review. Check: does it list bootstrap files? Does it have specific focus areas? Are the guardrails present?

### 12. RUN_INTEGRATION_TESTS.md Is Executable and Field-Accurate

Every command should work. Every check should have a concrete pass/fail criterion — not "verify it looks right" but a specific expected result.

**Verify quality gates were written from a Field Reference Table, not from memory.** Check that:

1. A Field Reference Table exists in RUN_INTEGRATION_TESTS.md with a row for every field in every schema
2. **Field count check:** For each schema, count the fields in the actual schema file and count the rows in your table. If the numbers don't match, you missed fields or invented fields. The most common failure: a schema has 8 fields but the table only has 2–3 "important" ones.
3. **Character-for-character check:** Re-read each schema file now and compare every field name in your table against the file contents. `document_id` ≠ `doc_id`. `sentiment_score` ≠ `sentiment`. `classification` ≠ `category`.
4. Every type and constraint matches the schema (`float 0-1` is not `int 0-100`, `string enum` is not `integer`)

If any field name, count, or type is wrong, fix it before proceeding. The table is the foundation — if the table is wrong, every quality gate built from it is wrong.
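The field count and name checks can be scripted rather than eyeballed. A sketch, assuming JSON Schema files and markdown table rows whose first cell is the field name:

```python
import json
import re

# Verify the Field Reference Table against the schema file itself.
def table_matches_schema(schema_path, table_md):
    with open(schema_path) as f:
        schema_fields = set(json.load(f)["properties"])
    table_fields = set()
    for line in table_md.splitlines():
        m = re.match(r"\|\s*(\w+)\s*\|", line)   # first cell of each row
        if m and m.group(1) != "Field":          # skip the header row
            table_fields.add(m.group(1))
    return table_fields == schema_fields
```

Set equality catches both failure modes at once: missed fields (table is a subset) and invented fields (table contains names the schema lacks).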

### 13. RUN_SPEC_AUDIT.md Prompt Is Copy-Pasteable

The definitive audit prompt should work when pasted into Claude Code, Cursor, and Copilot without modification (except file reference syntax).

## Quick Checklist Format

Use this as a final sign-off:

- [ ] Test count near heuristic target (spec sections + scenarios + defensive patterns)
- [ ] Scenario test count matches QUALITY.md scenario count
- [ ] Cross-variant tests ~30% of total (every cross-cutting property covered)
- [ ] Boundary tests ≈ defensive pattern count
- [ ] Majority of assertions check values, not just presence
- [ ] All tests assert outcomes, not mechanisms
- [ ] All mutations use schema-valid values
- [ ] All new tests pass (zero failures AND zero errors — check for fixture errors)
- [ ] All existing tests still pass
- [ ] QUALITY.md scenarios reference real code and include `[Req: tier — source]` tags
- [ ] If using inferred requirements: all `[Req: inferred — ...]` items are flagged for user review
- [ ] Code review protocol is self-contained
- [ ] Integration test quality gates were written from a Field Reference Table (not memory)
- [ ] Integration tests have specific pass criteria
- [ ] Spec audit prompt is copy-pasteable and uses `[Req: tier — source]` tag format
