The Bypass
We gave an AI coding agent a working POC and asked it to convert the code to production quality. On its first unconstrained run, the agent wrote all 6 processing tasks — tests and implementation together — in 3 minutes 59 seconds. No specification step. No test-first cycle. No separation between defining what the code should do and writing the code.
Everything looked correct. The tests passed. But the tests were derived from the implementation, not from an independent specification. They confirmed what the code did, not what it should do. A defect in the implementation produced a matching defect in the tests. Both artefacts shared the same blind spots because both originated from the same reasoning pass.
This is the default behaviour of every AI coding agent on the market today.
Why this happens
Tests confirm the implementation rather than specify behaviour. The same blind spots appear in both artefacts because both originate from the same reasoning pass.
The agent tests the happy path. Edge cases — zero values, empty inputs, off-by-one limits — are non-obvious without a deliberate specification step that asks "where can this break?"
When implementation output differs from expectations, the agent adjusts the test to match the code, rather than questioning the code. The test becomes a mirror of the implementation, not a constraint on it.
Tests describe what the code does, not what it should do. They cannot detect a defect because they were derived from the same defective reasoning that produced the implementation.
What This Costs in Life Sciences
In a web application, these failures cause bugs and downtime. In scientific R&D software, they cause wrong conclusions.
During conversion of a cell confluency assessment tool — used in drug discovery cell-based assay workflows — a boundary condition test for zero-value parameters (border_percent=0) discovered that mask[-0:] selects the entire NumPy array instead of an empty slice. This bug causes incorrect image processing output. It was caught because the test was written before the implementation existed, by a specification step that asked "what happens at zero?" Without test-first enforcement, the agent writes the implementation first, then writes tests that exercise the happy path. This bug ships. In an assay analysis tool, it corrupts results that inform drug candidate selection.
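The class of bug is easy to reproduce. The following is a minimal sketch, not the tool's actual code, showing why a zero-width border silently selects everything:

```python
import numpy as np

def border_mask(image: np.ndarray, border_percent: float) -> np.ndarray:
    """Illustrative buggy version: select a bottom border strip of the image."""
    rows = int(image.shape[0] * border_percent / 100)
    return image[-rows:]  # BUG: when rows == 0, [-0:] is [0:], i.e. the whole array

def border_mask_fixed(image: np.ndarray, border_percent: float) -> np.ndarray:
    """Fixed version: compute the start index positively so zero yields an empty slice."""
    rows = int(image.shape[0] * border_percent / 100)
    return image[image.shape[0] - rows:]  # image[100:] when rows == 0 -> empty

img = np.ones((100, 100))
buggy = border_mask(img, border_percent=0)        # 100 rows: the entire image
fixed = border_mask_fixed(img, border_percent=0)  # 0 rows: the empty border expected
```

Because `-0` and `0` are the same integer, `image[-rows:]` flips from "last N rows" to "all rows" exactly at the boundary a happy-path test never visits.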
The pharmaceutical industry faces a ~90% failure rate for drug candidates entering clinical trials. AI agents that silently misreport results — because a database query failed without error handling, because a tool parameter was untyped and truncated a search, because boundary conditions were never tested — add risk to an already high-risk process.
A software quality bug in your domain is a scientific integrity bug. The discipline required to prevent it cannot be left to the AI's discretion.
What Changes for the Developer
This is not a tool that replaces your judgement. It is a process that preserves your judgement and constrains the AI to operate within it.
You review two things: architecture (module boundaries, exception hierarchy, dependency direction) and test quality (boundary conditions, error paths, cleanup). These are the decisions that require expertise. The AI handles implementation, linting, typing, coverage, and hygiene — under constraint. You become the architect and reviewer; the AI becomes the implementer who can't cut corners.
You spend time reviewing test quality at each specification phase. Filesystem locks (chmod a-w) and SHA-256 hash audits guarantee those tests cannot be silently modified during implementation. Your judgement holds. Across two validated projects and 34 specification-implementation cycles, zero hash violations were detected — the enforcement works.
Tests are written before implementation and locked. The AI cannot adjust assertions to match defective output. Boundary conditions are required (≥2 per task), not optional. Five independent verification checks run on every commit: functional correctness, lint, type safety, code hygiene, and coverage. A 30-check independent assessment runs after completion.
How It Works
The VP-model orchestrator converts a POC-to-production rewrite into a semi-automated, auditable workflow using three reinforcing mechanisms applied iteratively for every task in every module.
The agent writes test cases (the specification) and then, in a separate phase, writes the implementation. These are sequenced by a state machine. The specification phase forces reasoning about expected behaviour, boundaries, and error paths before any implementation code exists to influence it.
Once specification is complete, test files are locked at the OS level and cryptographically fingerprinted (SHA-256). During implementation, the agent cannot modify tests — the kernel prevents writes, and any circumvention is detected by hash audit. This closes the most dangerous failure mode: adjusting tests to match defective code.
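The lock-and-audit mechanism can be sketched with the standard library; the helper names below are assumptions for illustration, not the orchestrator's actual API:

```python
import hashlib
import stat
import tempfile
from pathlib import Path

def lock_tests(test_dir: str) -> dict[str, str]:
    """Make test files read-only (chmod a-w) and record a SHA-256 fingerprint of each."""
    manifest = {}
    for path in sorted(Path(test_dir).glob("test_*.py")):
        manifest[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
        no_write = ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
        path.chmod(path.stat().st_mode & no_write)
    return manifest

def audit_tests(test_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return the names of any test files whose content no longer matches the manifest."""
    return [
        name for name, digest in manifest.items()
        if hashlib.sha256((Path(test_dir) / name).read_bytes()).hexdigest() != digest
    ]

# Demo on a throwaway directory
tmp = Path(tempfile.mkdtemp())
(tmp / "test_example.py").write_text("def test_ok():\n    assert True\n")
manifest = lock_tests(str(tmp))
clean = audit_tests(str(tmp), manifest)           # [] -> nothing tampered with

(tmp / "test_example.py").chmod(0o644)            # circumvent the OS lock...
(tmp / "test_example.py").write_text("def test_ok():\n    assert 1 == 2\n")
tampered = audit_tests(str(tmp), manifest)        # ...but the hash audit catches it
```

Even if the write protection is bypassed, the content hash no longer matches the fingerprint recorded at lock time, so the modification is detected.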
Each implementation must pass five independent checks before commit: functional correctness (pytest), code style (ruff), type safety (mypy strict), code hygiene (no debug artefacts, secrets, or unmanaged resources), and coverage (≥80%). After all modules, integration tests with real dependencies and a user-level E2E validation run.
Full workflow diagram and stage detail in the Workflow tab. Enforcement mechanisms, VP-model layer coverage, and gap analysis in the Methodology tab.
Evidence
Validated end-to-end on two POC-to-production conversions. Both achieved PRODUCTION READY against the 30-check independent assessment.
Confluency Assessment (V0.4)
A FastAPI cell confluency assessment application — image analysis for drug discovery cell-based assay workflows — converted from a working POC (monolithic routes, no tests, global state) to a production-grade codebase.
Tacit Knowledge Capture (V0.5)
A FastAPI application for extracting tacit knowledge from meeting transcripts using Ollama LLM, with Whisper audio transcription and a NetworkX knowledge graph — converted from a 2,109 LOC POC.
Zero compaction events. Zero hash violations. 71.7% mutation kill rate. ~35 atomic git commits.
Defects caught that would likely survive code review
| Defect | Detected by | Consequence if shipped |
|---|---|---|
| mask[-0:] selects entire array (border_percent=0) | Boundary test (zero-value parameter) | Incorrect image processing output in assay analysis |
| np.uint8(array) returns scalar, not ndarray | mypy strict mode | Silent type mismatch in downstream processing |
| Pydantic v1 __construct__() deprecated | Test collection validation | Runtime failure on Pydantic v2 upgrade |
| Plan/skeleton ID format conflict | Red phase (tests vs stubs) | Inconsistent identifiers across system |
| Plan/skeleton status default conflict | Red phase (tests vs stubs) | Ambiguous state semantics |
Honest Limitations
The same AI model writes tests and implementation. Temporal separation via filesystem lock + hash audit prevents retroactive modification, but both artefacts originate from the same model and may share blind spots. A recommended dual-tool pattern (Claude App as independent reviewer) provides a second opinion but not true model separation.
The independent production readiness assessment is executed by an LLM session. Different sessions may judge the same code differently — one may pass observability "with gaps", another may fail it. This is inherent to LLM-based assessment. Mitigated by fix-and-reassess loops and by fixing borderline checks rather than relying on lenient assessors.
Validated against two FastAPI applications (1,045 and 2,109 LOC). Generalisability to CLI tools, libraries, data pipelines, non-FastAPI frameworks, and codebases above 2,500 LOC is unverified. Web-application assumptions are present in scope profiles and assessment checks.
The orchestrator uses Claude Code slash commands, filesystem locks, and skill dispatch. It does not work with Cursor, Copilot, or other AI coding tools without adaptation. The addressable user base is limited to Claude Code (Opus model) users.
Full technical detail including enforcement mechanisms, VP-model layer coverage, gap resolution matrix, and hardening progression is in the Methodology tab.
Explore Further
The VP-model extends the V-model with executable prototypes at each abstraction layer. The Methodology tab covers the four-layer framework, enforcement mechanisms, and gap analysis in full technical detail.
The Example tab walks through a complete conversion end-to-end. The Workflow tab covers stage detail, scope profiles, and the full pipeline diagram. The Assessment tab covers the 30-check production readiness specification.
How This Works — In Plain Language
This page explains AI Code Guard without assuming technical knowledge. If you want the engineering detail, switch to any of the other tabs.
What problem are we solving?
AI tools can write software very quickly. In minutes, they can produce a "proof of concept" — a rough working version of an application that demonstrates the idea works. But rough working versions are not the same as production-quality software. Production software needs to be reliable, secure, maintainable, and testable. It needs to handle errors gracefully, protect sensitive data, and be structured so that other developers can understand and extend it.
Converting a rough version into a production version is skilled, disciplined work. It requires careful planning, thorough testing, and methodical rebuilding. AI tools should be good at this — but without constraints, they take shortcuts. They skip the discipline. The result looks complete but contains hidden defects that only surface under real-world conditions.
What does the orchestrator do about this?
It forces the AI to follow a strict, step-by-step process where the "homework" is written separately from the "answer key" — and the answer key is locked before the homework is attempted. A human expert reviews the answer key for quality, and multiple automated checks verify the homework from different angles.
The process has five main stages:
The Five Stages
Stage 1: Plan

Before any code is written, the AI analyses the existing rough version and creates a plan for how to rebuild it properly. This plan breaks the application into logical modules (self-contained pieces), defines how those pieces connect to each other, and specifies how errors should be handled. The plan also establishes coding standards, logging, and configuration.
Who does what: The AI drafts the plan. A human engineer reviews the architectural decisions — module boundaries, error handling strategy, dependency structure. This is the most important human review point. If the plan is wrong, everything built from it will be wrong.
Analogy: An architect drawing blueprints before construction begins. The builder doesn't start pouring concrete until the architect and client agree on the structure.
Stage 2: Write the tests (the specification)

For each piece of work in the plan, the AI writes a set of tests before writing the actual code. These tests are derived from the plan, not from the code — they describe what the code should do, including normal behaviour, edge cases (unusual inputs, extreme values, empty data), and error conditions (what should happen when things go wrong).
At this point, all tests will fail — because the code they're testing doesn't exist yet. That's expected and correct. The tests are a specification: a precise description of the required behaviour, written as verifiable checks.
Who does what: The AI writes the tests. A human engineer reviews test quality — are the edge cases meaningful? Are error conditions specific? Are there enough boundary tests? This is the second most important human review point.
Analogy: A teacher writing an exam paper before the students sit the exam. The exam defines what success looks like. It's written independently of any particular student's answer.
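The red-to-green sequence can be sketched in a few lines; the function name and behaviour here are illustrative, not taken from either validated project:

```python
# Specification phase: tests derived from the plan, before any implementation exists.
def spec_confluency_percent_boundaries() -> None:
    # Normal case
    assert confluency_percent(covered=50, total=100) == 50.0
    # Boundary: empty image must raise, not divide by zero
    try:
        confluency_percent(covered=0, total=0)
        raise AssertionError("expected ValueError for an empty image")
    except ValueError:
        pass
    # Boundary: full coverage
    assert confluency_percent(covered=100, total=100) == 100.0

# Implementation phase: the minimum code that satisfies the locked tests.
def confluency_percent(covered: int, total: int) -> float:
    if total == 0:
        raise ValueError("cannot compute confluency of an empty image")
    return 100.0 * covered / total

spec_confluency_percent_boundaries()  # the specification now passes
```

When these assertions are run before the implementation exists, they fail, which is the expected red phase; they pass only once the code satisfies the pre-written specification.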
Stage 3: Implement against locked tests

Now the AI writes the actual production code — the minimum implementation needed to make all the locked tests pass. It cannot change the tests. If the code doesn't satisfy a test, the AI must fix the code, not the test.
Once all tests pass, the code must also pass four additional automated checks: a style checker (consistent formatting and no bad patterns), a type checker (variables and functions use the correct data types), a hygiene check (no leftover debugging code, no hardcoded passwords, no temporary files left open), and a coverage check (at least 80% of the code is exercised by the tests).
After all checks pass, the test files are unlocked and their fingerprints are verified — confirming no test was altered during the process. The work is then committed to version control as one atomic, traceable unit.
Who does what: The AI writes the code. Automated tooling verifies correctness, style, types, hygiene, coverage, and test integrity. The human can review the final commit but is not required to — the automated checks are comprehensive.
Analogy: A student sitting a locked exam under invigilated conditions. They cannot see or change the mark scheme. Their work is graded against the pre-set criteria automatically.
Stages 2 and 3 repeat for every task in the plan. The plan is done once upfront for all modules; the build cycle then repeats for every module in the application. When all modules are complete, the process moves to verification of the whole system.
Stage 4: Integration testing

Stages 2 and 3 test each piece in isolation. But pieces that work individually can fail when connected — like components that fit perfectly in a lab but don't assemble correctly on site. Integration testing verifies that the modules communicate correctly across their boundaries: data passes between them in the right format, errors propagate and are handled at each handoff point, and the dependency structure matches the architectural plan.
Who does what: The AI writes integration tests. Automated tooling runs them.
Stage 5: Independent assessment

A completely separate evaluation — run by a fresh AI session that has no knowledge of how the code was built. It cannot see the plan, the build history, or any notes from the conversion process. It examines only the finished code and runs 33 formal checks across four levels: Does the code itself meet standards? Do the tests adequately specify behaviour? Do the modules work together? Does the application function as a whole?
The assessment produces a formal report with one of three outcomes: Production Ready (all checks pass), Conditionally Ready (minor gaps with documented remediation), or Not Ready (fundamental issues that must be resolved).
Who does what: An independent AI session with no build context. No human involvement required — the checks are objective and evidence-based.
Analogy: An independent building inspector examining a completed construction. They weren't involved in the build. They have their own checklist. They issue a compliance certificate — or a list of defects.
Where do humans fit in?
The system is designed so humans review where it matters most, and automation handles the rest.
Humans review:
- Architecture — does the plan decompose the application correctly? Are the module boundaries sensible? Is the error strategy right?
- Test quality — are the tests asking the right questions? Are edge cases covered? Are error conditions specific?
These are expert judgement calls that cannot be fully automated. They happen at Stage 1 and Stage 2.
Automation enforces:
- Test locking — OS-level file permissions prevent tests from being changed.
- Cryptographic audit — fingerprints verify no test was altered.
- Style, types, hygiene, coverage — four independent automated checks on every commit.
- Independent assessment — 33 objective checks with no subjective judgement.
What does the output look like?
At the end of the process, you have:
Before (the typical POC): one large file or a few tangled files. No tests. No type safety. Hardcoded configuration. No error handling. No logging. Works on the developer's machine; may not work anywhere else.
After (the production result): a modular codebase with clean separation of concerns. Comprehensive test suite (424 tests in the latest validation project, 169 in the first). Strict type checking. Structured error handling with a defined exception hierarchy. Structured logging. Externalised configuration. Atomic git history where every commit is traceable to a specific task.
What does the current version cover?
The current version validates both the individual pieces and the assembled whole. It covers all four levels of the VP-model — implementation, design, architecture, and user — with the User level's content determined by the project's scope profile. The pre-clinical baseline (demonstrated on the confluency and tacit knowledge apps) provides full lifecycle coverage for commercial software. Other scope profiles (scientific R&D, clinical trials, regulated medical) tighten thresholds and add domain-specific validation. See the Workflow tab.
Architecture validation
After all modules are built individually, the orchestrator verifies the assembled system:
Test that modules communicate correctly across their boundaries — data passes in the right format, errors are handled at each handoff, and the dependency structure matches the plan. No simulation or faking; real components talking to each other.
When something goes wrong deep inside the system, does the error travel correctly through each layer and arrive at the user as a clear, safe message? Or does it leak technical details, get lost, or produce the wrong error code? This is tested explicitly at every module boundary.
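The pattern under test can be sketched as follows, with an assumed exception hierarchy rather than either project's real one:

```python
# Internal errors carry a status code and a safe user-facing message.
class AppError(Exception):
    status_code = 500
    user_message = "Internal error"

class TranscriptionError(AppError):
    status_code = 502
    user_message = "Transcription service unavailable"

def transcribe(path: str) -> str:
    # Deep-layer failure with technical detail that must not reach the user
    raise TranscriptionError(f"whisper failed on {path}")

def handle_upload(path: str) -> tuple[int, str]:
    """Top-level boundary: convert internal errors into safe user-facing output."""
    try:
        return 200, transcribe(path)
    except AppError as exc:
        # In a real service the technical detail would be logged here;
        # only the safe message and the mapped status code are returned.
        return exc.status_code, exc.user_message

status, body = handle_upload("/tmp/meeting.wav")
# The error travelled up two layers and arrived as a clear, safe message
```

An error-propagation test asserts exactly this at every boundary: the right status code arrives, the safe message arrives, and no internal detail leaks.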
User validation
After integration passes, the orchestrator validates the application from the user's perspective:
Does the fully assembled application actually start? Does it have a health check endpoint that monitoring tools can use to verify it's alive? These sound basic, but components that work individually can fail to assemble — a missing configuration value, a circular dependency, a registration error.
Exercise the primary user journey through the real, fully assembled application — submit input, process it, retrieve the result. No faking, no shortcuts. This is the user-level acceptance test: does the system do what it's supposed to do?
Can an operator diagnose a failure without reading the source code? Does the application log startup events, errors with context, and provide a way to trace a user's error report back to the relevant log entry? Is there a README that explains how to install, configure, and run the application?
Automatically check that the actual code structure matches the planned architecture. If the plan says module A should never depend on module B, verify that no such dependency crept in during construction.
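A dependency-direction check of this kind can be sketched with the standard library's ast module; the module names and the rule itself are illustrative:

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Collect the top-level module names imported by a piece of source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

# Planned rule (hypothetical): 'extraction' may depend on 'storage', never the reverse.
forbidden = {"storage": {"extraction"}}

storage_src = "import json\nfrom pathlib import Path\n"
violations = imported_modules(storage_src) & forbidden["storage"]
# violations is empty, so the planned dependency direction holds for this module
```

Running this over every source file turns the architecture plan's dependency rules into an automated check rather than a reviewer's memory.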
After the full process, the independent assessment (Stage 5) should produce a Production Ready determination against the project's scope profile. For the pre-clinical baseline, all 33 checks should pass. Other scope profiles add domain-specific checks.
Future candidates (not committed)
These are improvements identified from known weaknesses and assessment limitations. Whether they are built depends on what breaks when the orchestrator is applied to more projects.
Independent test authorship. Currently the same AI writes both the tests and the code (in separate phases). A stronger approach would use a second AI — or a human — to write the tests, providing genuine independence. This is architecturally significant and would require redesigning how the orchestrator coordinates work.
Hard-gated sequencing. Currently some rules rely on the AI reading and respecting documents ("soft" enforcement). A future candidate would block the AI at the operating system level from writing production code unless the test-writing phase has been completed — removing the last gap where the AI could skip a step.
Mutation testing in the build loop. A technique that makes small deliberate changes to the code and checks whether the tests catch the change. Currently this is only run during the independent assessment. A future version could integrate it into the build process itself.
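The idea can be illustrated in a few lines; the clamp function and operator mutants below are toy examples, not the orchestrator's mutation tooling:

```python
def make_clamp(op: str):
    """Build a clamp() function with a chosen lower-bound comparison ('<' is correct)."""
    src = f"def clamp(x, lo, hi):\n    return lo if x {op} lo else (hi if x > hi else x)\n"
    ns: dict = {}
    exec(src, ns)
    return ns["clamp"]

def suite_passes(clamp) -> bool:
    """A small test suite: returns False if any assertion fails (mutant killed)."""
    try:
        assert clamp(5, 0, 10) == 5    # normal case
        assert clamp(-1, 0, 10) == 0   # below lower bound
        assert clamp(0, 0, 10) == 0    # boundary: exactly at the lower bound
        return True
    except AssertionError:
        return False

ok = suite_passes(make_clamp("<"))                            # correct code passes
mutants = [">", ">=", "=="]                                   # deliberate operator swaps
killed = sum(not suite_passes(make_clamp(op)) for op in mutants)
# killed == 3: every mutant trips an assertion, a 100% kill rate for this tiny suite
```

A mutation kill rate (such as the 71.7% reported in the evidence section) is simply killed divided by total mutants; surviving mutants point at behaviour the tests never pin down.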
Security and performance validation. Vulnerability scanning, static security analysis, and performance benchmarks. These are standard production requirements not currently covered by the orchestrator.
The overall trajectory: the current version validates the parts and the whole (each module well-built, assembled system works correctly). Future versions would harden the process (close remaining enforcement gaps and add deeper quality checks).
What's the catch?
Honest constraints and limitations:
Same-author tests. The tests and implementation are authored by the same AI in separate phases. Temporal separation helps, but the AI may share blind spots across both phases. A truly independent test author (second AI or human) would be stronger.
Soft process enforcement. The state machine and coding conventions rely on the AI reading and respecting documents. The file locks and hash audits are "hard" (OS-level, cannot be bypassed), but the process sequencing has been bypassed once during validation. It was caught and corrected, but the risk exists.
No security or performance testing. The process does not include penetration testing, vulnerability scanning, or performance benchmarks. These would need to be added separately for a production deployment under load.
Narrow validation base. The orchestrator has been validated against two applications (both FastAPI web services). Generalisability to CLI tools, libraries, and non-FastAPI web apps is unverified.
Worked Example — Tacit Knowledge Capture
This is a real conversion completed using the orchestrator. A rough prototype for extracting unspoken knowledge from meeting recordings was rebuilt into production-quality software — automatically, with human oversight at key decision points.
What the application does
In scientific organisations, valuable knowledge is shared in meetings but rarely written down — the shortcuts people use, the lessons they've learned, the assumptions behind decisions. This application captures that knowledge automatically.
Input: a recording or transcript of a meeting. The user uploads an audio file (MP3, WAV) or a text transcript. If audio, the system transcribes it first using Whisper (a speech-to-text tool that runs on the user's own machine).
Output: a structured list of knowledge items — each classified by type (process, best practice, lesson learned, expertise, assumption), with a confidence score and supporting quotes from the original conversation. An interactive knowledge graph shows how items relate across multiple meetings.
What the prototype looked like
The prototype was a working application — you could upload a transcript and get results back. But it had the typical problems of AI-generated code built for speed rather than quality:
966 lines of code in a single file. Every feature — file upload, transcription, knowledge extraction, graph building, web pages — tangled together. Impossible to test or maintain independently.
Zero automated tests. No way to know if a change broke something without manually clicking through the entire application. No edge case handling — what happens with an empty file? A corrupt upload? A missing AI model?
Configuration values buried in the code. No structured error handling — failures produced raw technical errors instead of helpful messages. No logging to diagnose problems.
2,109 lines of Python across 6 files. Functional but fragile.
How the conversion worked
The orchestrator guided the rebuild through seven stages over several working sessions. At each stage, the AI did the heavy lifting under strict constraints, and a human engineer reviewed the critical decisions.
What changed
Before: 2,109 lines of Python across 6 files. One file alone was 966 lines — upload handling, AI calls, graph building, web pages, error handling all in one place. Zero tests. Configuration values scattered through the code. Errors produced raw stack traces. No logging.
After: 16 well-structured source files across 6 independent modules. 424 automated tests covering normal operation, edge cases, and error conditions. Strict type checking. Centralised configuration. Structured error handling — every error produces a clear, safe message. Structured logging for diagnosis. Every commit traceable to a specific task.
The monolith problem — and how it was solved
The biggest challenge was the 966-line file that contained everything. The orchestrator's plan stage decomposed it into four focused files — one for file uploads, one for knowledge extraction operations, one for graph operations, and one for web pages. Each could then be tested independently, and the integration stage confirmed they all worked together correctly.
This is a common pattern in AI-generated code: the prototype works but everything is wired together in ways that make it impossible to test, maintain, or extend. The orchestrator's structured decomposition addresses this directly.
What the human did vs. what the AI did
The human:
- Reviewed the architecture plan — confirmed the 6-module structure made sense for this application's domain.
- Reviewed every test set — checked that edge cases were meaningful (empty files, missing AI service, corrupt uploads) and error conditions were specific.
- Approved integration tests — confirmed they used real dependencies (no faking) and covered all module boundaries.
- Answered two domain questions — which failure modes span multiple stages, and what correctness standard to test against.
The AI:
- Analysed all 2,109 lines of prototype code and proposed the module decomposition.
- Wrote 424 tests before writing any production code — each set locked and cryptographically verified.
- Implemented all 6 modules under constraint — every implementation passed 5 independent automated checks before being accepted.
- Wrote 40 integration tests and verified the assembled system works end-to-end.
Key finding: zero context loss
The first project converted using this orchestrator (a smaller, 1,045-line application) experienced one "compaction event" — a point where the AI's working memory filled up and had to be summarised, risking loss of important context. For this larger project (2,109 lines), the orchestrator's V0.5 optimisations — particularly starting a fresh AI session at each module boundary — eliminated compaction entirely. The AI maintained full context throughout the entire conversion.
Independent assessment
After the conversion, a completely separate AI session — with no knowledge of how the code was built — examined the finished codebase against 30 formal quality criteria. It checked everything from test coverage to error handling to documentation quality.
29 of 30 checks passed outright. One partial pass (observability — missing request tracking IDs, a minor gap). All four quality layers passed: code standards, test quality, module architecture, and user experience.
99% test coverage — nearly every line of code is exercised by at least one test. 71.7% mutation kill rate — when small deliberate errors are introduced into the code, the tests catch them 72% of the time (threshold: 60%). 424 tests pass in random order across 3 runs — no test depends on another.
For comparison, the first project needed three assessment rounds before passing. The improvement reflects tighter quality gates in V0.5 — the orchestrator now produces code that meets the independent assessor's standards on the first attempt.
The VP-Model: V-Model with Prototyping
The VP-model extends the V-model by inserting a working prototype at each abstraction level. Where the V-model defines development and validation branches, the VP-model adds a feedback mechanism: a prototype exists at each level before that level is fully built, enabling defects to be caught at their level of origin rather than discovered later at a lower level. This pattern was established in systems engineering practice (Burst et al., 1998; Forsberg & Mooz, 1991; German Federal Ministry of Defence V-Modell, 1997; IEEE 1012-2016) and is applied here to AI-constrained software development.
In an agile context, the VP-model is applied iteratively: each increment passes through the same levels, with prototypes providing feedback before each level is committed. The VP-model does not prescribe sequence — it prescribes completeness: every development decision at every abstraction level must have both a corresponding validation activity and a prototype that makes the decision executable before full implementation.
Key References
| Reference | Contribution |
|---|---|
| Forsberg & Mooz (1991) | "The Relationship of System Engineering to the Project Cycle." Established the dual-branch decomposition/integration structure. Introduced the principle that validation artefacts are defined alongside (not after) development artefacts. |
| Burst et al. (1998) | "On Code Generation for Rapid Prototyping Using CDIF." Formalised the VP-model with three prototype insertion points along the V-model's development branch (concept, architecture, implementation levels). Established that prototypes validate at their abstraction level before that level is fully built — the distinguishing principle of VP over V. |
| German V-Modell (1997) | Formalised the V-model as a mandatory process standard for German Federal government IT projects. Demonstrated the V-model could be tailored to different project types and domains. |
| IEEE 1012-2016 | Standard for System, Software, and Hardware Verification and Validation. Defines V&V activities at each lifecycle phase. Establishes that verification (are we building it right?) and validation (are we building the right thing?) are distinct, concurrent activities. |
| Boehm (1979) | "Guidelines for Verifying and Validating Software Requirements and Design Specifications." Established the empirical finding that defects introduced at higher abstraction levels are exponentially more expensive to detect and fix at lower levels — the foundational economic argument for early-level prototyping. |
Three Governing Principles
The development branch (decomposition from requirements to implementation) is distinct from the verification branch (internal correctness) and the validation branch (does the built system satisfy the level above it).
Verification and validation are not afterthoughts. They are defined at each level before or alongside the development activity at that level.
Each level addresses a different scope of concern with distinct artefacts, test types, and failure modes. A missing requirement is not detectable by a unit test. An incorrect module interface is not detectable by testing either module in isolation.
Defects introduced at a higher abstraction level are more expensive to fix at a lower level (Boehm, 1979).
Before each level is fully built, a working prototype exists that validates the decisions at that level. The prototype is executable — it can be run, type-checked, or tested — and feeds defects back to the level above before implementation commits to them. The three prototype types in this workflow:
- System prototype (requirements) — a mock API returning hardcoded responses, built before the architecture plan. Validates that the response schema is fit for purpose and acceptance criteria are testable before any module design begins. The requirements stage is where POC-level schema flaws are caught at zero cost.
- Architecture prototype (skeleton) — type-annotated stubs for all modules. Validates interface contracts are correct and consistent before implementation begins. mypy strict must pass on the skeleton.
- Design prototype (test suite) — failing tests at the red phase. Validates module behaviour specifications before implementation is written. The red phase IS the design prototype.
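A minimal sketch of what an architecture-prototype stub might look like; the names are illustrative, not taken from the validated projects:

```python
from typing import Protocol

class TranscriptStore(Protocol):
    """Interface contract from the plan, expressed as an executable type."""
    def save(self, meeting_id: str, text: str) -> None: ...
    def load(self, meeting_id: str) -> str: ...

class FileTranscriptStore:
    """Skeleton stub: importable and mypy-strict clean, but deliberately unimplemented."""

    def save(self, meeting_id: str, text: str) -> None:
        raise NotImplementedError  # red-phase tests against this stub must fail

    def load(self, meeting_id: str) -> str:
        raise NotImplementedError

# The contract is checkable before any implementation exists: assigning the stub
# to the protocol type is exactly what mypy strict verifies on the skeleton.
store: TranscriptStore = FileTranscriptStore()
```

Because the stub is executable, an interface mismatch (a wrong parameter name, a missing method, an inconsistent return type) surfaces at the skeleton stage rather than during integration.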
Abstraction Layers and Artefacts
| Level | Prototype | Development Artefact | Validation Artefact |
|---|---|---|---|
| User | System prototype — mock API + acceptance criteria (requirements stage, pre-plan) | docs/0-requirements.md: acceptance criteria, agreed response schema | Acceptance tests, E2E workflows against AC-xxx criteria, POC parity |
| Architecture | Skeleton — type-checked stubs, all modules importable | Module boundaries, interface contracts, DAG | Integration tests, error propagation, DAG verification |
| Design | Test suite — failing tests specifying behaviour | Task specs: schemas, edge cases, error paths | Unit tests (locked before implementation) |
| Implementation | — | Source files (green phase, replacing stubs) | pytest, ruff, mypy strict, hygiene, coverage |
VP-Model Coverage
The orchestrator covers all four VP-model levels with both development and validation branches active. V0.5 adds token optimisation quick wins (QW-1 through QW-8) on top of V0.4's full VP-model lifecycle, validated on two projects. All four levels have active prototypes on the left branch and validation artefacts on the right.
| Level | Prototype | Dev Artefact | Validation Artefact | Status |
|---|---|---|---|---|
| User | System prototype: mock API + acceptance criteria (V0.4 ✓) | docs/0-requirements.md | E2E against acceptance criteria, POC parity | FULL ✓ |
| Architecture | Skeleton (V0.4 ✓) | Boundaries, contracts, DAG | Integration + system tests | FULL ✓ |
| Design | Red phase test suite | Task specs, edge cases, errors | Locked unit tests, quality-gated | FULL ✓ |
| Implementation | — | Green phase source | pytest + ruff + mypy + hygiene + coverage | FULL ✓ |
How It Is Applied Iteratively
The requirements stage runs first: acceptance criteria are defined and a mock API prototype validates the response schema before any architecture work. The plan is then written once, upfront, for all modules. The skeleton follows immediately, converting all interface contracts into type-checked stubs before any task implementation begins. Each module then passes through the build cycle — red (design prototype), green (implementation) — with verification at each step. After all modules are built, the composed system is validated at the architecture and user levels.
Requirements (once, pre-plan): define acceptance criteria, build mock API prototype, validate response schema with stakeholders. Fixes POC-level schema flaws before any architecture investment.
Plan (once, all modules): boundaries, exception hierarchy, validation strategy, dependency direction, interface contracts as text.
Skeleton (once): convert contracts to executable stubs. mypy strict + ruff + circular import check. Defects here cost nothing compared to finding them at integrate.
For each module, for each task: Red phase writes specification-derived failing tests (design prototype) → locked → Green phase implements (replacing stubs) → pytest/ruff/mypy/hygiene/coverage verify → hash audit → commit. Repeat for all tasks.
Integrate: integration tests exercise real dependencies across module boundaries, error propagation verified, DAG checked against declared architecture. Validate: scope-configured user-level checks — app startup, health, E2E workflow, error response quality, POC parity (scope-dependent), observability, documentation.
The orchestrator is a state machine. Every action — writing tests, implementing code, running integration checks, validating the application — is a transition between defined states. The state is persisted in conversion_state.yaml, so the process survives context window compaction, session restarts, and agent crashes. The status skill reads this file and determines what is permitted next: you cannot implement without first specifying tests, you cannot integrate without first building all modules, and you cannot validate without first passing integration.
Task Lifecycle
Each task within a module follows a strict red→green cycle. The agent writes failing test specifications (red), then implements code to pass them (green). No shortcuts: the filesystem lock and SHA-256 hash audit enforce temporal separation between specification and implementation.
Full Lifecycle
The complete conversion progresses through seven stages. Requirements defines acceptance criteria and validates the response schema via a mock API prototype — catching user-level flaws before any architecture work. Setup scaffolds the project. Plan decomposes into modules. Skeleton converts contracts to executable stubs. Build executes the red→green cycle per task. Integrate validates cross-module composition. Validate confirms the application satisfies the acceptance criteria end-to-end. Each stage gates the next.
Requirements Phase
The system-level prototype. Before the architecture plan is written, the agent analyses the POC, identifies the primary user workflow and current response schemas, and produces docs/0-requirements.md — acceptance criteria in testable form and an agreed response schema for each primary endpoint. A minimal mock API (hardcoded responses) is then built and run to verify the HTTP interface shape before any domain code exists. This is where POC-level schema flaws are caught: if an endpoint returns metadata instead of results, that is visible here at zero cost.
Precondition: setup complete. Output: docs/0-requirements.md committed. Human approves acceptance criteria and response schema before plan begins.
Skeleton Phase
The architecture-level prototype. The agent reads all interface contracts from docs/2-plan.md and generates complete stubs for every exported symbol: real type-annotated signatures, docstrings, __all__ declarations, and raise NotImplementedError bodies. The skeleton must pass mypy --strict, ruff, and circular import checks before build begins. Type errors in stubs are interface contract defects caught before any implementation is written.
Precondition: plan complete. Includes: all stubs generated, mypy strict clean, ruff clean, all modules import without circular errors. Human reviews interface contracts as executable artefacts.
Integrate Phase
Architecture-level validation. The agent reads the entire codebase, derives module boundaries and interface contracts, asks the user targeted multiple-choice questions about domain-specific failure modes, then writes integration tests that exercise real (non-mocked) cross-module interactions. A human reviews the tests before execution. After tests pass, automated checks verify the dependency DAG and interface contracts.
Precondition: all modules complete. Blocked by status skill. Includes: integration tests, error propagation, degradation, DAG check, interface contracts.
Validate Phase
User-level validation. The agent exercises the assembled application as a whole: can it start, does the health endpoint respond, does the core workflow complete end-to-end, do invalid inputs produce structured error responses, does documentation exist, and is operational observability in place. These checks catch wiring defects that are invisible when modules are tested in isolation.
Precondition: integrate_phase: complete. Sub-checks: startup, health, E2E, error responses, documentation, observability, POC parity.
Red Phase
The specification phase. The agent reads the design plan for the current task and writes test cases that define expected behaviour — including boundary conditions, error paths, and resource cleanup under failure. All new tests must fail (no implementation exists yet) while all existing tests continue to pass. The agent reports test quality counts so the human can assess coverage of edge cases before proceeding. SHA-256 hashes of all test files are captured and stored.
Red phase gate: pytest --collect-only exits 0.
Green Phase
The implementation phase. Test files are locked at the OS level (chmod a-w) — the agent physically cannot modify them. The agent writes minimal code to pass all tests, then runs five verification checks: functional correctness, static analysis, strict type checking, code hygiene, and coverage. After verification, tests are unlocked and every file is re-hashed. If any hash differs from the snapshot taken during the red phase, the entire green phase fails. This is the core enforcement: the specification cannot be retroactively weakened to match the implementation.
Green phase enforcement: chmod -R a-w tests/ (lock), --cov-fail-under=80 (coverage gate).
Bug Fix Paths
Not all failures require a full state reset. Implementation bugs discovered during the green phase are fixed by the agent iterating on the source code — the tests are correct, the code just doesn't satisfy them yet. Test bugs are different: the specification itself was wrong. This requires resetting state, fixing the test, re-running the red phase to generate new SHA-256 hashes, and re-entering the green phase. The distinction matters because test bug fixes invalidate the hash snapshot, so the integrity audit must restart from scratch.
- Implementation bug: agent iterates autonomously. No state reset.
- Test bug: reset state, fix the test, re-run /project:red for new hashes, re-enter green.
- Fixes to completed tasks: currently undefined. Workaround: a new task through the standard cycle. A /project:patch command is a candidate.
V0.5 Operational Lessons
Findings from the tacit-knowledge validation run (2,109 LOC, 6 modules, 23 tasks). These supplement the V0.4 enforcement findings above.
Pure Pydantic data models are fully implemented in the skeleton stage — other modules' stubs import these types, so the skeleton necessarily includes complete model definitions. The correct approach: write tests that lock behaviour, then mark the module complete without a red→green cycle. Not a TDD failure — a recognised edge case for declarative definitions.
Three spec conflicts surfaced between the plan and the skeleton (ID format, default status value, type constraint). The skeleton preserved POC behaviour; the plan specified different behaviour. Resolution: align the tests with the skeleton. The skeleton is the ground truth for model contracts.
/project:green --auto-advance is not recognised by Claude Code's skill dispatch. Workaround: run /project:green manually, then commit and run /project:red separately. Known limitation.
Bare python3 -m pytest fails because dependencies aren't on the system path. Both .githooks/pre-commit and .claude/hooks/pre_commit_test.sh need uv run pytest. Fixed in V0.5 orchestrator template.
Zero compaction events across 6 modules / 23 tasks (vs 1 in V0.4 confluency). Starting a new Claude Code session per module is the single most effective context management technique.
The POC's monolithic main.py split into 4 sub-router files (upload, extraction, graph, pages) plus an app factory. Integration tests confirmed the wiring. Validates the orchestrator for monolith-to-module conversions at scale.
Design Decision: Lifecycle Extension
The lifecycle includes the requirements stage (system prototype, pre-plan), the skeleton stage (architecture prototype, pre-build), and two post-build stages. Integration tests exercise cross-module boundaries and cannot run until both sides of each boundary exist.
Stage 1: Integrate (/project:integrate)
Architecture-level validation. Writes and executes integration tests, verifies dependency DAG, checks interface contracts. Precondition: all modules have status complete.
Integration Test Authoring — 4-Step Process
Question A — Cross-stage failure modes: Agent identifies processing stages, proposes plausible failure modes spanning stages. User selects real risks.
Question B — Correctness criteria: Agent proposes concrete acceptance thresholds (exact structural, coarse ±15%, moderate ±5%, strict ±1%). User selects.
Selections are recorded in docs/integration-context.md for traceability. Tests are written to tests/integration/, tagged @pytest.mark.integration.
Human reviews before execution.
Error Propagation Verification
Per module boundary: at least one test that triggers an error in the lower-level module, verifies exception type at the crossing, verifies HTTP status code translation at API layer, verifies no exception is silently swallowed. Agent traces at least two error paths domain → service → API and documents the translation chain.
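One such boundary test might be sketched like this (the exception names and the three layers are illustrative, not the project's actual hierarchy; a real API layer would be a FastAPI handler rather than a plain function):

```python
# Sketch of one error-propagation path: domain -> service -> API.


class DomainError(Exception):
    """Base class for domain-level failures."""


class SampleNotFound(DomainError):
    """Raised by the domain layer when a sample ID does not exist."""


def service_get_sample(sample_id: str) -> dict[str, str]:
    """Service layer: lets domain exceptions cross the boundary untouched."""
    raise SampleNotFound(sample_id)


def api_get_sample(sample_id: str) -> tuple[int, dict[str, str]]:
    """API layer: translates the domain exception to an HTTP status."""
    try:
        return 200, service_get_sample(sample_id)
    except SampleNotFound:
        # Translated, not swallowed: the client sees a structured 404.
        return 404, {"error": "sample_not_found", "sample_id": sample_id}


def test_not_found_translates_to_404() -> None:
    status, body = api_get_sample("missing")
    assert status == 404
    assert body["error"] == "sample_not_found"
```

The documented translation chain for this path would read: SampleNotFound (domain) → uncaught (service) → HTTP 404 with structured body (API).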
Graceful Degradation Testing
If batch/bulk operations exist: at least one integration test with mixed valid and invalid items, asserting per-item status reporting (or documenting all-or-nothing strategy). If no batch operations: recorded as N/A.
Dependency Direction Verification
After integration tests pass: extract all inter-module import statements, construct observed dependency graph, compare against declared DAG in docs/2-plan.md. Any upward dependency or circular import fails the integrate stage. This is an automated check, not a test.
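A sketch of such a check using the standard library's ast module (the module names and declared DAG here are hypothetical):

```python
# Imports are read statically with ast, so the check never executes
# project code.
import ast


def imported_modules(source: str) -> set[str]:
    """Return the module names imported by one source file."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def upward_edges(
    observed: dict[str, set[str]], declared: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """Return import edges present in the code but absent from the DAG.

    Third-party and stdlib imports are ignored: only edges between the
    project's own modules (keys of `declared`) are checked.
    """
    return sorted(
        (module, dep)
        for module, deps in observed.items()
        for dep in deps
        if dep in declared and dep not in declared.get(module, set())
    )
```

Any non-empty result from upward_edges fails the integrate stage: the code depends on something the plan says it must not.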
Interface Contract Verification
After integration tests pass: every __init__.py declares __all__, every symbol in __all__ is importable without error, every exported symbol has a non-empty docstring, all packages import without circular dependency errors. Automated check.
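The check can be sketched as below; it is applied here to stdlib modules so the example is runnable, whereas the real check iterates the project's own packages:

```python
import importlib


def contract_violations(module_name: str) -> list[str]:
    """Return interface-contract violations for one importable module."""
    problems: list[str] = []
    module = importlib.import_module(module_name)
    exported = getattr(module, "__all__", None)
    if exported is None:
        # No __all__ means the module's public surface is undeclared.
        return [f"{module_name}: no __all__ declared"]
    for name in exported:
        symbol = getattr(module, name, None)
        if symbol is None:
            problems.append(f"{module_name}.{name}: in __all__ but not importable")
        elif not (getattr(symbol, "__doc__", None) or "").strip():
            problems.append(f"{module_name}.{name}: empty docstring")
    return problems
```

A module with no __all__ fails immediately; every exported symbol must exist and carry a non-empty docstring.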
Integrate State Machine
State tracked in conversion_state.yaml: integrate_phase, integration_tests_written, dependency_dag_verified, interface_contracts_verified. Status skill blocks /project:validate unless integrate_phase: complete.
Stage 2: Validate (/project:validate)
User-level validation. Verifies the assembled application works as a whole. Precondition: integrate_phase: complete.
Startup: the factory function runs without exception, returns a valid ASGI/WSGI instance, and completes in under 10s. If startup fails, nothing else can be tested.
Health: check common paths (/health, /healthz, /ready); verify HTTP 200 with valid JSON. Decision: the endpoint is required during the build stage via the plan template, not created during validate.
E2E workflow: the primary user workflow via TestClient without mocks: submit input → process → retrieve result. Tests live in tests/e2e/, tagged @pytest.mark.e2e, and must complete without unhandled exceptions.
Error responses: send invalid requests to each endpoint (missing fields, wrong types, nonexistent IDs, malformed params); verify structured JSON, 4xx (not 500), and no stack traces or internal paths exposed.
Documentation: README.md with description, installation, and usage; the API docs endpoint (/docs) returns HTTP 200. Decision: required during build via the plan template.
Observability: startup produces log output; errors produce log output with context; error responses include a correlation mechanism (request ID, timestamp); a version identifier is accessible. Verifies structured logging works end-to-end.
Validate State Machine
State tracked: validate_phase, plus individual booleans for each sub-check (startup, health, e2e, errors, docs, observability, poc_parity). Status skill reports overall production readiness after validate completes.
Plan Template Additions
| Addition | Rationale |
|---|---|
| Interface contracts per boundary | Exported signatures, exception types, data types exchanged. Architecture-level development artefact validated during integrate. |
| Health endpoint task (app module) | Goes through normal red/green cycle. Validate stage only verifies it works. |
| README task (final module) | Description, installation, usage, API reference. Validate stage only verifies it exists. |
Design Decisions
Integration tests are not hash-locked: they verify existing behaviour (all modules already built and unit-tested) rather than specify new behaviour. A human reviews them before execution, but there is no hash-lock or filesystem enforcement. The consistency argument was considered but rejected.
Health endpoint and README are build-stage tasks: if created during validate, they bypass the red/green cycle. They are added to the plan template as mandatory tasks; the validate stage only verifies they exist and work.
Validate assumes a web application: FastAPI, HTTP endpoints, TestClient. Non-web POCs (CLI tools, libraries, data pipelines) would need different user-level checks. Parameterised validate profiles are out of scope.
Mutation testing stays assessment-only: it is too slow for per-task enforcement (minutes per module) and more valuable as a post-build quality signal. The 60% threshold is pragmatic, not rigorous.
Scope Profiles — Adapting the Workflow
The orchestrator's current configuration targets pre-clinical commercial software: production-quality code without regulatory certification. Every other scope profile is described as a delta from this baseline — what to add, what to tighten, what new stages are required.
Each profile gives step-by-step instructions for modifying the orchestrator files. You are creating a new V0.5 workflow instance for each scope — not modifying the baseline. Copy the orchestrator, apply the delta, and use the modified copy for that project.
These profiles describe what should change per scope. The orchestrator does not currently enforce scope-specific behaviour automatically — you are manually configuring it. Some profiles require stages (security scanning, formal UAT, traceability) that do not yet exist as orchestrator commands. Where this is the case, it is flagged as a manual step you must perform outside the orchestrator.
How Scope Profiles Map to VP-Model Layers
The orchestrator enforces the VP-model across four abstraction layers (see the VP-Model section above). Scope profiles parameterise all four layers, not just the User level — though the User level changes most dramatically. The table below shows which layers each profile modifies and the nature of the change.
| VP-Model Layer | Internal | Pre-clinical | Scientific R&D | Clinical Trials | Regulated Medical | Safety-Critical |
|---|---|---|---|---|---|---|
| User | Startup + health only | E2E, error quality, docs, observability | + reference comparison, reproducibility docs | + data export round-trip, e-signature flow | + formal UAT, traceability matrix, SRS/SDD | + witnessed UAT, safety case, certification |
| Architecture | Optional integration | Integration tests, DAG, error propagation | + numerical pipeline regression, provenance | + audit trail verification, edit checks | + hazard-scenario tests, risk control verification | + formal proof, MC/DC, independent verification |
| Design | Relaxed thresholds | 80% coverage, ≥3 boundary, quality gates | + determinism tests, regression fixtures | + domain edit-check tests, soft-delete tests | 90–100% coverage, mutation (Class C), sign-off | 100% MC/DC, formal specification |
| Implementation | TODOs allowed if tracked | Zero hygiene issues, structured logging | + Python version pinned, seed management | + audit log emission, no hard-delete | + security scanning, reviewer in commit | Replace toolchain (MISRA, Polyspace) |
The User level is the primary configuration surface — it determines what "production ready" means for each project. But the lower layers must tighten in step: there is no value in formal UAT (User) if the coverage threshold is 60% (Design) or integration tests are skipped (Architecture). Each profile is internally consistent across all four layers.
The four VP-model layers and the seven orchestrator stages (setup → requirements → plan → skeleton → build → integrate → validate) are fixed structure. The requirements stage content (what acceptance criteria to define, what schema decisions to validate) varies per project scope. What changes per project is the content within each layer: what the plan template requires, what thresholds the build enforces, what the integrate stage tests for, and what "validate" means. The pre-clinical baseline is one configuration — demonstrated on the confluency app. Every new development project can define its own scope profile, parameterising the same workflow to meet its specific regulatory, scientific, or commercial requirements.
Profile 0: Pre-Clinical Commercial BASELINE
This is what exists today. All other profiles reference it. No changes needed — use the orchestrator as-is.
| Dimension | Baseline Setting |
|---|---|
| Coverage threshold | ≥ 80% overall, ≥ 60% per file |
| Mutation testing | Assessment-only, ≥ 60% kill rate (sampled) |
| Boundary tests | ≥ 3 per module |
| Integration tests | ≥ 1 per module boundary, no mocks |
| Security | Path traversal in unit tests. No independent audit. |
| Traceability | None. POC is implicit requirement. |
| Acceptance testing | Lightweight E2E via TestClient. No formal UAT. |
| Documentation | README + auto-generated API docs |
| Observability | Structured logging. Correlation mechanism optional. |
| Assessment checks | 33 checks, 4 layers. PRODUCTION READY target. |
Profile 1: Clinical / Regulated Medical
When to use: Software that will be submitted to or reviewed by a regulatory body (FDA, EMA, MHRA, PMDA). Includes medical device software (IEC 62304), clinical trial data systems (21 CFR Part 11), GxP laboratory software, and diagnostic tools intended for clinical decision-making.
VP-Model layers modified: User ●●● — formal UAT, traceability, SRS/SDD · Architecture ●● — hazard-scenario integration tests · Design ●●● — 90–100% coverage, mutation build-time · Implementation ●● — security scanning, audit log, reviewer sign-off
The orchestrator was not designed for regulated medical software. These modifications bring it closer, but do not constitute a validated software development lifecycle. You still need a quality management system (QMS), risk management per ISO 14971, and regulatory expertise. This profile adds engineering rigour — it does not replace regulatory process.
Applicable standards: IEC 62304 (medical device software lifecycle), IEC 62366 (usability engineering), ISO 14971 (risk management), 21 CFR Part 11 (electronic records), EU MDR 2017/745.
Step-by-step modifications
1. Plan template — add requirements traceability
In docs/2-plan.md, add a Requirements section before module decomposition. Each requirement gets a unique ID (e.g., REQ-001), acceptance criteria, risk classification (IEC 62304 Class A/B/C), and traceability forward to the integration/E2E test that validates it. The plan template should enforce: every requirement has at least one acceptance test ID. Every acceptance test ID traces back to at least one requirement.
2. Plan template — add risk classification per module
IEC 62304 classifies software by safety class (A = no injury, B = non-serious injury, C = death/serious injury). Each module in the plan must declare its class. Class C modules require: 100% statement coverage, mandatory mutation testing during build (not assessment-only), and formal code review sign-off.
3. Build stage — tighten thresholds
Modify .claude/skills/green/SKILL.md:
| Dimension | Pre-clinical | Clinical |
|---|---|---|
| Coverage (overall) | ≥ 80% | ≥ 90% (Class B/C modules: 100% statement) |
| Coverage (per file) | ≥ 60% | ≥ 80% |
| Mutation testing | Assessment-only | Build-time for Class C modules, ≥ 80% kill rate |
| Boundary tests | ≥ 3 per module | ≥ 5 per module, covering all identified risks |
| Error path tests | One per public function that raises | One per public function that raises + one per risk control |
4. Build stage — add human review gate
After each green phase, add a mandatory hold: the agent outputs its diff and waits for explicit human approval before committing. Modify the green skill to require the user to type APPROVED before the commit step. Record the reviewer identity and timestamp in the commit message.
5. Integrate stage — add system-level hazard tests
Modify .claude/skills/integrate/SKILL.md. In addition to the standard integration tests, require hazard-scenario tests derived from the ISO 14971 risk analysis. Each identified hazard must have a corresponding integration test that demonstrates the risk control is effective. These are separate from functional integration tests — they test safety, not features.
6. Validate stage — formalise acceptance testing
Modify .claude/skills/validate/SKILL.md. Replace lightweight E2E with formal UAT:
| Dimension | Pre-clinical | Clinical |
|---|---|---|
| Acceptance tests | Agent-authored E2E via TestClient | Derived from REQ-IDs, traceable, human-witnessed |
| Error responses | No stack traces, structured JSON | + no internal identifiers (DB IDs, file paths, UUIDs) |
| Observability | Correlation optional | Mandatory request ID, audit log for all state-changing operations |
| Documentation | README + /docs | + Software Requirements Specification, Software Design Description, Test Report |
7. New stage — add /project:audit (manual)
After validate, produce a traceability matrix: Requirement → Design Task → Unit Test(s) → Integration Test → Acceptance Test. This does not exist as an orchestrator command. Create it manually or write a script that parses docs/2-plan.md, test file names, and pytest markers to generate the mapping. Save to docs/traceability-matrix.md.
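A minimal sketch of such a script (the REQ-ID pattern and the convention that test names embed the requirement ID are assumptions, not an existing project convention):

```python
import re


def traceability(plan_text: str, test_names: list[str]) -> dict[str, list[str]]:
    """Map each REQ-ID found in the plan to the tests referencing it."""
    req_ids = sorted(set(re.findall(r"REQ-\d{3}", plan_text)))
    return {
        req: [
            name
            for name in test_names
            if req.lower().replace("-", "_") in name.lower()
        ]
        for req in req_ids
    }


def orphans(matrix: dict[str, list[str]]) -> list[str]:
    """REQ-IDs with no covering test: each one fails the audit."""
    return [req for req, tests in matrix.items() if not tests]
```

orphans() returning a non-empty list is the audit failure condition: a requirement exists that no test validates.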
8. New stage — add security scanning (manual)
Run pip-audit for dependency vulnerabilities. Run bandit for static security analysis. Record results in docs/security-scan-report.md. Neither is orchestrated — run manually after validate.
9. Assessment — extend checks
Add to production-readiness-assessment-spec.md:
| New Check | Layer | Pass Condition |
|---|---|---|
| Requirements traceability | User | Every REQ-ID has ≥ 1 acceptance test. No orphan tests. |
| Risk control verification | Architecture | Every identified hazard has a corresponding integration test. |
| Audit log presence | Implementation | State-changing operations produce audit log entries. |
| Security scan | Implementation | Zero critical/high vulnerabilities in dependencies. |
| Reviewer sign-off | Design | Every commit message includes reviewer identity. |
Out of scope for this profile: usability engineering (IEC 62366), clinical evaluation, post-market surveillance, labelling, and manufacturing records. These are QMS-level concerns outside the scope of a code quality orchestrator. You need a regulatory affairs specialist and a QMS — this tool produces better-quality software artefacts for that QMS to reference.
Profile 2: Scientific R&D / Computational Tools
When to use: Research software that produces results cited in publications, grant deliverables, or internal scientific decisions. Includes image analysis pipelines, statistical analysis tools, simulation code, bioinformatics workflows, and computational notebooks converted to production tools. The priority is reproducibility and numerical correctness, not regulatory compliance.
VP-Model layers modified: User ●● — reference comparison, reproducibility docs, methodology · Architecture ●● — numerical pipeline regression, provenance tracking · Design ●● — determinism tests, regression fixtures with provenance · Implementation ● — Python version pinned, seed management
Relevant frameworks: FAIR principles (Findable, Accessible, Interoperable, Reusable), Software Sustainability Institute guidelines, NIH/UKRI software sharing policies, journal reproducibility requirements.
Step-by-step modifications
1. Plan template — add reference dataset specification
In docs/2-plan.md, add a Validation Data section. For each module that produces numerical output, specify: a reference input (real or synthetic), expected output (with provenance — how was this "expected" value determined?), and acceptable tolerance. The tolerance must be justified (floating point accumulation, stochastic algorithm, approximation method) — not arbitrary.
2. Plan template — add reproducibility contract
Document: given identical input + identical configuration + identical dependencies (lock file), does the output match exactly or within tolerance? If stochastic, document the seed strategy. If hardware-dependent (GPU, SIMD), document known sources of non-determinism.
3. Build stage — add numerical regression tests
Modify .claude/skills/red/SKILL.md. For every module that produces a numerical result, the red phase must include at least one regression test: a known input → expected output pair that is checked on every build. Store reference values in tests/fixtures/ with provenance metadata (source, date, method). The green phase must not change reference values without human approval.
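A regression test of this shape might look like the following sketch (the routine, the reference pair, and the tolerance are all illustrative; real reference values would live in tests/fixtures/ with provenance metadata):

```python
import math


def mean_intensity(pixels: list[float]) -> float:
    """Toy numerical routine standing in for a real pipeline stage."""
    return sum(pixels) / len(pixels)


# Reference pair: provenance metadata would record source, date, and method.
REFERENCE_INPUT = [0.1, 0.2, 0.3, 0.4]
REFERENCE_OUTPUT = 0.25  # illustrative reference, determined by hand calculation


def test_regression_against_reference() -> None:
    # Tolerance justified by floating point accumulation, not arbitrary.
    assert math.isclose(
        mean_intensity(REFERENCE_INPUT), REFERENCE_OUTPUT, rel_tol=1e-9
    )
```

Any change to the reference value is a change to the scientific contract, which is why the green phase may not touch it without human approval.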
4. Build stage — add determinism tests
Modify the red skill. For each stochastic operation, require a test that: runs the operation twice with the same seed, asserts identical output. For non-stochastic operations: run twice, assert bitwise equality. This catches hidden state or non-deterministic dependencies.
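Both checks can be sketched as follows, assuming a hypothetical jitter operation with an explicit seed parameter:

```python
import random


def jitter(values: list[float], seed: int) -> list[float]:
    """Stochastic operation with an explicit, documented seed strategy."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, 0.01) for v in values]


def test_same_seed_same_output() -> None:
    # Two runs with the same seed must be identical. A failure here
    # reveals hidden state or a non-deterministic dependency.
    assert jitter([1.0, 2.0], seed=42) == jitter([1.0, 2.0], seed=42)


def test_different_seed_differs() -> None:
    # Sanity check that the seed is actually wired through.
    assert jitter([1.0, 2.0], seed=42) != jitter([1.0, 2.0], seed=43)
```

For non-stochastic operations the same pattern applies without a seed: run twice, assert bitwise equality.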
5. Integrate stage — add end-to-end numerical validation
Modify .claude/skills/integrate/SKILL.md. Integration tests must include a full-pipeline regression: known input through the entire processing chain, output compared to a reference with documented tolerance. This catches accumulation of small numerical errors across module boundaries that are invisible at the unit test level.
6. Integrate stage — add data provenance tracking
If the application processes input data and produces output, the integration test must verify that the output includes or is accompanied by provenance metadata: what input was used, what version of the code processed it, what parameters were applied, and a timestamp. This is not a software quality concern — it is a scientific reproducibility concern.
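The metadata can be sketched as a wrapper around the pipeline result (the field names here are assumptions; the requirement is traceability, not this exact schema):

```python
import hashlib
from datetime import datetime, timezone


def with_provenance(
    result: dict, raw_input: bytes, code_version: str, params: dict
) -> dict:
    """Attach the metadata needed to reproduce a pipeline result."""
    return {
        "result": result,
        "provenance": {
            "input_sha256": hashlib.sha256(raw_input).hexdigest(),
            "code_version": code_version,
            "parameters": params,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
```

Given the output alone, a reader can identify the exact input (by hash), the code version, and the parameters used, which is the reproducibility guarantee the integration test verifies.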
7. Validate stage — add comparison against external reference
Modify .claude/skills/validate/SKILL.md. If an external reference implementation or published benchmark exists, the validate stage must include a comparison. Document: what the reference is, how the comparison was performed, what level of agreement was achieved, and any known reasons for disagreement. Save to docs/validation-against-reference.md.
8. Assessment — extend checks
| New Check | Layer | Pass Condition |
|---|---|---|
| Numerical regression tests | Design | Every numerical-output module has ≥ 1 regression test with documented reference. |
| Determinism verification | Design | Stochastic operations produce identical output with same seed. |
| Dependency reproducibility | Implementation | Lock file present AND Python version pinned. |
| Provenance metadata | Architecture | Output includes code version, input hash, parameters, timestamp. |
| External validation | User | Documented comparison against reference implementation or benchmark, if one exists. |
9. Documentation — extend README
Add sections to the README task: Methodology (what algorithms, what assumptions), Validation (how numerical accuracy was verified), Reproducing Results (exact commands to regenerate published outputs from raw input), Known Limitations (where the tool is known to be inaccurate or unreliable).
Out of scope for this profile: formal verification of algorithms, statistical validation of entire analysis pipelines (e.g., type I/II error rates), peer review of the scientific methodology, and performance benchmarking under production data volumes. These require domain expertise and are outside a code quality orchestrator's scope.
Profile 3: Clinical Trials Data Systems
When to use: Software that captures, stores, processes, or reports clinical trial data. Includes electronic data capture (EDC) tools, randomisation systems, adverse event reporting, CDISC data transformation pipelines, and analysis tools that feed into regulatory submissions. The regulatory concern is data integrity — 21 CFR Part 11, EU Annex 11, ICH E6(R2) GCP.
VP-Model layers modified: User ●●● — data export round-trip, e-signature flow, ALCOA+ mapping · Architecture ●● — audit trail verification, e-signature integration · Design ●● — domain edit-check tests, soft-delete verification · Implementation ●● — audit log emission, no hard-delete pattern
Key difference from Profile 1: Profile 1 (regulated medical) is about the software being a medical device. This profile is about the software handling clinical trial data — different regulatory requirements, different risk profile.
Step-by-step modifications
1. Plan template — add data integrity requirements
In docs/2-plan.md, add an ALCOA+ Compliance section. For every module that writes, modifies, or deletes data, document how the ALCOA+ principles are maintained: Attributable (who changed it), Legible (can be read/interpreted), Contemporaneous (recorded at the time), Original (preserved original or certified copy), Accurate (correct and complete). Plus: Complete, Consistent, Enduring, Available.
2. Build stage — add audit trail tests
Modify the red skill. For every state-changing operation (create, update, delete), require a test that verifies an audit trail entry is produced containing: user identity, timestamp, what changed, old value, new value. Deletion operations must verify soft-delete (record marked as deleted, not physically removed) unless hard-delete is explicitly justified.
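A sketch of such a test against a deliberately minimal store (the store design is illustrative; a real system would back this with a database and authenticated user identity):

```python
from datetime import datetime, timezone


class RecordStore:
    """Minimal store demonstrating soft delete plus an audit trail."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}
        self.audit_log: list[dict] = []

    def delete(self, record_id: str, user: str) -> None:
        old = dict(self.records[record_id])
        # Soft delete: the record is marked, never physically removed.
        self.records[record_id]["deleted"] = True
        self.audit_log.append({
            "user": user,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": "delete",
            "record_id": record_id,
            "old_value": old,
            "new_value": dict(self.records[record_id]),
        })


def test_delete_is_soft_and_audited() -> None:
    store = RecordStore()
    store.records["r1"] = {"value": 7, "deleted": False}
    store.delete("r1", user="jdoe")
    assert store.records["r1"]["deleted"] is True  # record still present
    entry = store.audit_log[-1]
    assert entry["user"] == "jdoe"
    assert entry["old_value"]["deleted"] is False
```

The test asserts every ALCOA+ ingredient the audit entry must carry: who, when, what changed, old value, new value.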
3. Build stage — add data validation tests
For any module that accepts clinical data input: require edit-check tests. These verify that out-of-range values, logically inconsistent entries (e.g., death date before birth date), and format violations are caught at entry, not downstream. This is distinct from general input validation — it is domain-specific clinical data validation.
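Two such rules can be sketched as follows (the rules are illustrative examples of logically inconsistent clinical data, not a complete rule set):

```python
from datetime import date


def edit_check(entry: dict) -> list[str]:
    """Return the edit-check violations for one subject record."""
    findings: list[str] = []
    birth, death = entry.get("birth_date"), entry.get("death_date")
    if birth and death and death < birth:
        # Logical inconsistency, not a format error.
        findings.append("death_date precedes birth_date")
    weight = entry.get("weight_kg")
    if weight is not None and not (0.2 <= weight <= 650):
        # Out-of-range value caught at entry, not downstream.
        findings.append("weight_kg out of plausible range")
    return findings


def test_inconsistent_dates_caught_at_entry() -> None:
    bad = {"birth_date": date(1990, 5, 1), "death_date": date(1980, 1, 1)}
    assert "death_date precedes birth_date" in edit_check(bad)
```

The point of the red-phase requirement is that each such rule has a test proving the violation is reported at the entry point.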
4. Integrate stage — add electronic signature verification
If the application supports electronic signatures (21 CFR Part 11 requirement): integration tests must verify the signature workflow end-to-end — authenticate, present data for review, capture signature, lock the record, verify the locked record cannot be modified without a new signature event.
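The lock semantics the integration test must verify can be illustrated with a minimal model; `SignedRecord` and `RecordLockedError` are assumed names, and a real integration test would exercise the HTTP workflow end-to-end rather than an in-process object.

```python
# Hypothetical signature workflow showing the lock semantics under test.
# SignedRecord and RecordLockedError are illustrative names.

class RecordLockedError(Exception):
    pass

class SignedRecord:
    def __init__(self, data: dict[str, str]) -> None:
        self.data = data
        self.signatures: list[str] = []
        self.locked = False

    def sign(self, user: str) -> None:
        self.signatures.append(user)
        self.locked = True  # signing locks the record

    def update(self, key: str, value: str) -> None:
        if self.locked:
            raise RecordLockedError("record is signed; new signature event required")
        self.data[key] = value

def test_signed_record_rejects_modification() -> None:
    record = SignedRecord({"dose_mg": "10"})
    record.sign("jdoe")
    try:
        record.update("dose_mg", "20")
        raise AssertionError("locked record accepted a modification")
    except RecordLockedError:
        pass
```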
5. Validate stage — add data export verification
If the application exports data (CDISC, CSV, SAS transport): the validate stage must include a round-trip test — export data, re-import it, verify equality. For CDISC formats: validate against the published CDISC validation rules (OpenCDISC/Pinnacle 21).
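For a CSV export, the round-trip test reduces to the sketch below; `export_rows` and `import_rows` stand in for the application's real export path, and a CDISC round-trip would additionally run the Pinnacle 21 validation rules.

```python
import csv
import io

# Minimal round-trip sketch for a CSV export. export_rows/import_rows are
# illustrative stand-ins for the application's real export functions.

def export_rows(rows: list[dict[str, str]]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def import_rows(payload: str) -> list[dict[str, str]]:
    return list(csv.DictReader(io.StringIO(payload)))

def test_export_reimport_round_trip() -> None:
    original = [
        {"subject_id": "SUBJ-001", "visit": "V1", "weight_kg": "72.5"},
        {"subject_id": "SUBJ-002", "visit": "V1", "weight_kg": ""},  # missing value must survive
    ]
    assert import_rows(export_rows(original)) == original
```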
6. Assessment — extend checks
| New Check | Layer | Pass Condition |
|---|---|---|
| Audit trail completeness | Architecture | Every state-changing operation produces an audit record. |
| Soft delete verification | Design | Delete operations preserve original record. |
| Edit checks | Design | Domain-specific validation rules tested at entry point. |
| Electronic signature flow | User | Sign → lock → modify-attempt-rejected workflow tested E2E. |
| Export round-trip | User | Export + re-import produces identical data set. |
Out of scope for this profile: system validation protocols (IQ/OQ/PQ), computerised system validation master plans, role-based access control design, data archival and retention policies, and the 21 CFR Part 11 compliance assessment itself. These are IT infrastructure and GCP concerns beyond code quality.
Profile 4: Safety-Critical Systems
When to use: Software where failure can cause physical harm, environmental damage, or loss of life. Includes automotive control systems (ISO 26262), industrial automation (IEC 61508), aerospace (DO-178C), and nuclear (IEC 60880). These domains have formal certification requirements that are fundamentally different from medical device or clinical data regulations.
VP-Model layers modified: User ●●● — witnessed UAT, safety case, certification body submission · Architecture ●●● — formal proof, independent verification team · Design ●●● — 100% MC/DC, formal specification, independent test author · Implementation ●●● — replace toolchain (MISRA, Polyspace, Astrée)
This orchestrator was designed for Python web applications. Safety-critical systems are typically written in C, C++, Ada, or Rust with specialised compilers, static analysers (Polyspace, LDRA, Astrée), and formal methods tools. The orchestrator's enforcement mechanisms (ruff, mypy, pytest) are Python-specific. Applying this profile requires replacing the toolchain, not just adjusting thresholds. This profile describes what to add — implementing it requires significant engineering beyond configuration changes.
Conceptual modifications (not directly configurable)
1. Replace static analysis toolchain. Ruff and mypy are insufficient for safety-critical code. Replace with: MISRA C/C++ checker (for C/C++), Polyspace or Astrée (for formal absence of runtime errors), or equivalent for the target language. The green skill would need to invoke these instead of ruff/mypy.
2. Coverage: MC/DC required. Line coverage and branch coverage are insufficient. Modified Condition/Decision Coverage (MC/DC) is required by DO-178C Level A and ISO 26262 ASIL D. This requires specialised coverage tools (e.g., VectorCAST, LDRA TBrun). The 80% line coverage threshold is replaced by 100% MC/DC for the highest safety integrity levels.
3. Formal requirements with bidirectional traceability. Every requirement traces forward to design, code, and test. Every test traces backward to requirement. Every line of code traces to a design element. Orphan code (code with no traceability) must be justified or removed. This is Profile 1's traceability, but enforced bidirectionally and at a finer granularity.
4. Independence between test author and implementer. The orchestrator's known weakness (same agent writes tests and code) becomes a certification blocker. Safety-critical standards require independence between verification and development. At minimum, a different human must review and approve all test cases. At higher integrity levels, a different team must author them.
5. Formal methods for critical modules. For the highest integrity levels (ASIL D, SIL 4, DAL A), formal proof of correctness may be required for critical algorithms. This is outside the orchestrator's capabilities entirely — it requires tools like SPARK/Ada, Frama-C, or TLA+.
6. Configuration management with baseline control. Every artefact (source, test, document, tool configuration) must be under configuration management with formal baselines. The orchestrator's git-based approach is a starting point, but needs formal baseline tagging, change control records, and impact analysis for each change.
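The bidirectional traceability check in point 3 can be approximated with a small script; the REQ-nnn docstring-tag convention is an assumption for illustration — certified projects use dedicated requirements-management tools, not ad-hoc scripts.

```python
import re

# Toy bidirectional traceability check: every requirement must have at least
# one test, and every test must reference a known requirement. The REQ-nnn
# docstring-tag convention is an assumed convention, not orchestrator behaviour.

REQ_TAG = re.compile(r"REQ-\d{3}")

def check_traceability(requirements: set[str], test_docstrings: list[str]) -> list[str]:
    covered: set[str] = set()
    problems: list[str] = []
    for doc in test_docstrings:
        tags = set(REQ_TAG.findall(doc))
        if not tags:
            problems.append(f"orphan test (no requirement tag): {doc!r}")
        unknown = tags - requirements
        problems.extend(f"unknown requirement {t}" for t in sorted(unknown))
        covered |= tags & requirements
    problems.extend(f"uncovered requirement {r}" for r in sorted(requirements - covered))
    return problems

problems = check_traceability(
    {"REQ-001", "REQ-002"},
    ["REQ-001: brake signal latency under 10 ms", "smoke test only"],
)
# Flags the untagged test and the uncovered REQ-002.
```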
The orchestrator provides value for safety-critical projects at the process discipline level: enforced test-first, hash-audited integrity, atomic commits. But it cannot replace the specialised tooling, formal methods, and independent verification required by IEC 61508 / ISO 26262 / DO-178C. Treat this as a development discipline supplement, not a certification pathway.
Profile 5: Internal Tooling / Early-Stage Prototypes
When to use: Internal tools, dashboards, data pipelines, or prototypes where the audience is your own team and the cost of failure is rework, not harm. You want code quality discipline without the overhead of full production hardening. The priority is speed with guardrails.
VP-Model layers modified: User ▽ — startup + health only, skip E2E/error quality/observability · Architecture ▽ — integration optional for small projects · Design ▽ — 60% coverage, ≥1 boundary test, TODOs allowed · Implementation ○ — unchanged (hard constraints cost nothing)
Step-by-step modifications
1. Build stage — relax thresholds
| Dimension | Pre-clinical | Internal |
|---|---|---|
| Coverage (overall) | ≥ 80% | ≥ 60% |
| Coverage (per file) | ≥ 60% | No per-file minimum |
| Boundary tests | ≥ 3 per module | ≥ 1 per module |
| Error path tests | One per public function | One per module (most critical path only) |
| Code hygiene | Zero TODOs | TODOs allowed if tracked (issue reference) |
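Assuming pytest-cov and coverage.py were configured during setup, the relaxed overall threshold might be expressed as a pyproject.toml fragment like this — the exact table names depend on your actual configuration:

```toml
# Illustrative pyproject.toml fragment for the internal-tool profile.
# Assumes pytest-cov and coverage.py; adjust table names to your setup.

[tool.pytest.ini_options]
addopts = "--cov=src --cov-fail-under=60"   # was 80 for pre-clinical

[tool.coverage.report]
# No per-file minimum for internal tools; overall threshold only.
fail_under = 60
```

Only the thresholds move; the hard constraints (ruff, mypy strict, test-first cycle) stay untouched.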
2. Integrate stage — optional
For small projects (1–2 modules), skip the integrate stage entirely. The unit tests provide sufficient coverage. For larger projects (3+ modules), keep integration tests but reduce to one test for the primary workflow path only — skip error propagation and graceful degradation checks.
3. Validate stage — simplify
Keep: application startup verification and health endpoint check. Drop: E2E workflow test (covered by integration), error response quality (accept framework defaults), observability checks (add logging when you need it). Documentation: README with installation and usage only — no API reference section required.
4. Assessment — reduce scope
Run only Layer 4 (Implementation) and Layer 3 (Design) checks. Skip Layer 2 (Architecture) and Layer 1 (User). Target: CONDITIONALLY READY is acceptable. The purpose is code quality discipline, not production certification.
5. Keep the hard constraints
Even for internal tools, keep: filesystem lock, SHA-256 hash audit, test-first cycle, ruff zero warnings, mypy strict. These cost nothing once configured and prevent the most common quality regressions. The overhead is in thresholds and documentation, not in the core enforcement mechanism.
Relaxed thresholds make it easier to accumulate technical debt that becomes expensive when the internal tool is later promoted to production or external use. If there is any chance the tool will be externally facing, use the pre-clinical baseline from the start. Retrofitting quality is significantly harder than building it in.
Comparison Matrix
| Dimension | Internal | Pre-clinical | Scientific R&D | Clinical Trials | Regulated Medical | Safety-Critical |
|---|---|---|---|---|---|---|
| Coverage threshold | 60% | 80% | 80% | 80% | 90–100% | 100% MC/DC |
| Mutation testing | Skip | Assessment | Assessment | Assessment | Build (Class C) | Build (all) |
| Integration tests | Optional | Required | Required + numerical | Required + audit | Required + hazard | Required + formal |
| Requirements traceability | None | None | Provenance only | Partial (data flows) | Full (bidirectional) | Full + MC/DC mapping |
| Human review gates | Tests only | Tests + plan | Tests + plan + references | All phases | All phases + sign-off | Independent team |
| Security scanning | None | None | Dependency audit | OWASP + dependency | OWASP + dependency | Formal analysis |
| Acceptance testing | Skip | Lightweight E2E | Reference comparison | Data round-trip | Formal UAT | Formal UAT + witness |
| Documentation | README | README + API | + methodology + validation | + ALCOA+ mapping | + SRS + SDD + test report | + safety case |
| Assessment target | CONDITIONAL | PROD READY | PROD READY + repro | PROD READY + audit | PROD READY + trace | Certification body |
| Orchestrator feasibility | ✓ Full | ✓ Full | ✓ Full | ◐ Mostly | ◐ Partial | ✗ Supplement only |
Creating a New Workflow Instance
For every new project, follow this sequence:
1. Copy the orchestrator into your project as described in the Usage Guide (Step 1). This gives you the pre-clinical baseline.
2. Choose the profile that matches your project's regulatory and quality context. If in doubt, use pre-clinical — you can tighten later, but relaxing after the fact loses the value of early discipline.
3. Follow the step-by-step modifications for your selected profile. Edit the skill files, plan template, and assessment spec as described. Each modification is a specific file edit — not a conceptual suggestion.
4. Add a docs/scope-profile.md file to the project recording: which profile was selected, which modifications were applied, any deviations from the profile (with justification), and any additional scope-specific requirements not covered by the profile.
5. Run the workflow as normal. The profile modifications will take effect through the modified skill files and plan template. The assessment at the end will use the extended checks if you modified the assessment spec.
A scope: field in conversion_state.yaml could automate profile selection — the status skill would enforce the correct thresholds and require the correct stages based on the declared scope. This is a candidate for a future version, not current functionality. Today, you are manually configuring the orchestrator for each scope.
Runbook
Lifecycle: init → setup → requirements → plan → skeleton → build (per module: red → green → commit) → integrate → validate → assess → done
Your review points: requirements (response schema), skeleton (interface contracts), plan (architecture), red phase (test quality), integrate (approve integration tests), two domain questions. Everything else is automated.
Prompts below are copy-paste into Claude Code unless marked TERMINAL.
Dual-tool pattern: Claude Code does the work. Claude app (this Project) acts as an independent advisor at key checkpoints — marked with CLAUDE APP. This prevents Claude Code from grading its own homework at the moments where judgement matters most. Ad-hoc prompts for deviation detection, recovery, and session continuity are in the reference section at the bottom.
Claude Code must be configured correctly before starting. The wrong model or thinking level produces shortcuts, weak tests, and protocol bypasses.
Model: Use the latest Opus (the most capable model). In Claude Code, type /model to open the model selector and choose Opus.
Thinking level: Set to high. The skill files are complex and the agent needs extended reasoning to follow them precisely — especially during red/green cycles where it is tempted to write tests and implementation together. You can check and change thinking level with Shift+Tab to cycle through levels, or /model to see the full selector.
Why Opus on high? The documented bypass incident (Phase B, first attempt) happened because the agent took shortcuts — it wrote all tasks in under 4 minutes, tests and implementation together, without invoking any slash commands. Higher-capability models with extended thinking reduce this risk. You can drop to medium for simple steps (commits, status checks), but keep it on high for all build, integrate, and validate stages.
Agent eagerness: The agent will often say "Shall I proceed?" or "Let me start building" and immediately begin working without waiting for your answer. This is the single most common protocol deviation. The prompt "Show me your analysis before you start making changes" is your primary defence. Always insist on seeing its plan before it acts.
Commit discipline: The agent frequently commits source files but forgets state file changes (conversion_state.yaml). After every stage transition, run git status and commit any uncommitted changes before proceeding. Build this habit from setup onward — it prevents state drift that confuses the agent after compaction.
Context compaction: Long sessions trigger compaction (summarisation). This is normal. The state machine is designed for it — the agent recovers from conversion_state.yaml and docs/2-plan.md. If the agent seems confused after compaction: Read conversion_state.yaml and docs/2-plan.md. What module, what task, what tdd_phase? Resume.
Coverage scoping: During the build stage, the green phase coverage check (≥ 80%) may fail because stub files in other modules have 0% coverage. The agent should scope coverage to the current module only (e.g., --cov=confluency.analysis not --cov=src/). If the agent gets stuck on coverage, tell it: Scope coverage to the current module only. Stubs in other modules are not yet implemented.
Hook compatibility: Git hooks may fail if they call pytest, ruff, or mypy directly instead of via uv run. If you see "command not found" from hooks, update the hook scripts to prefix each command with uv run. This is a one-time fix during setup.
Some tests pass during red — that's sometimes OK: When the skeleton has already implemented a data class (e.g., ColorStats.__init__), tests that exercise the class structure will pass immediately. What matters is that tests exercising the function logic fail with NotImplementedError. Similarly, FastAPI routing tests (e.g., "POST to a GET-only endpoint returns 405") pass because FastAPI handles routing, not the stub. These are testing framework behaviour, not implementation.
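The distinction can be seen in a minimal red-phase example; `ColorStats` mirrors the data class named above, but its fields and the `compute_color_stats` function are assumptions for illustration.

```python
from dataclasses import dataclass

# Illustrative red-phase stubs: structure tests may pass against a skeleton,
# but logic tests must fail with NotImplementedError. Field names and the
# compute_color_stats function are assumptions, not the project's real API.

@dataclass
class ColorStats:
    mean_r: float
    mean_g: float
    mean_b: float

def compute_color_stats(pixels: list[tuple[int, int, int]]) -> ColorStats:
    raise NotImplementedError  # skeleton stub — build stage implements this

def test_colorstats_structure() -> None:
    # Passes during red: the dataclass already exists in the skeleton.
    stats = ColorStats(mean_r=0.0, mean_g=0.0, mean_b=0.0)
    assert stats.mean_r == 0.0

def test_compute_fails_during_red() -> None:
    # Must fail with NotImplementedError until the green phase implements it.
    try:
        compute_color_stats([(0, 0, 0)])
        raise AssertionError("stub unexpectedly implemented")
    except NotImplementedError:
        pass
```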
0 — Initialise TERMINAL
Save a snapshot of your POC as it is today, then install the orchestrator alongside it. This creates a clean starting point so every change from here forward is tracked and reversible.
Take a snapshot of your POC code as it exists today. If anything goes wrong, you can always return to this point.
cd <YOUR_POC_PATH>
git init
git add -A
git commit -m "POC baseline"
git tag poc-baseline
Already a git repo? Skip git init. Dirty tree? Commit or stash first.
Install the orchestrator's rules, checklists, and enforcement tools into your project directory.
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/.claude .
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/docs .
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/tasks .
cp ~/Desktop/Desktop/Claude/test-orchestrator/CLAUDE.md .
cp ~/Desktop/Desktop/Claude/test-orchestrator/conversion_state.yaml .
Personalise the orchestrator for your project and save the setup to version history.
sed -i '' 's/new-project/<YOUR_PROJECT_NAME>/' conversion_state.yaml
git add -A
git commit -m "Add V0.5 TDD orchestrator"
Confirm the orchestrator installed correctly and configure the AI for maximum reliability. The right model and thinking level prevent the most common failure mode: the agent skipping TDD discipline.
cd <YOUR_POC_PATH>
claude
Once Claude Code opens: type /model and select Opus with high thinking. Then paste: /project:status
Expect: 7-stage lifecycle displayed, stage: setup, REQUIREMENTS PHASE and SKELETON PHASE sections visible in the dashboard. If slash command not recognised → check .claude/skills/ has 5 entries (status, red, green, integrate, validate).
1 — Setup
The AI reorganises your rough POC into a clean, professional project structure with automated quality checks. This is the foundation everything else builds on — if the scaffold is wrong, every subsequent step inherits the problem.
The AI analyses your POC, then rebuilds it into a proper project layout with type checking, linting, error handling, configuration management, and structured logging.
Read docs/1-setup.md. This is the setup checklist. Analyse the existing POC codebase — look at all source files, understand the structure, identify the main packages and entry points. Then work through every item in the checklist.
For each item:
1. Do the work (create files, configure tools, etc.)
2. Check it off in docs/1-setup.md
Key requirements:
- Production layout must use src/<project_name>/ (not the POC's current layout)
- pyproject.toml with uv, ruff, mypy strict mode
- Git hooks: pre-commit (runs pytest), test guard (blocks test edits during green phase)
- Import POC code into src/ layout (reorganise, don't just copy)
- Structured logging: configure Python logging or structlog (log level, format, handler)
- Centralised config: Pydantic BaseSettings class (all env-dependent values from env vars with sensible defaults)
- Base exception hierarchy: AppException base class with at least ValidationError, NotFoundError subclasses
- ruff + mypy must be clean before setup is complete
Show me your analysis of the POC structure and your plan before you start making changes.
Verify the AI correctly understood your code before it starts restructuring. Mistakes in comprehension propagate into every later step.
Agent shows its analysis before proceeding. Check: correctly identifies functionality? Sensible src/ layout? Nothing lost from POC? Correct it before it starts if wrong.
⚠ Watch for: The agent frequently says "Shall I proceed?" and immediately starts building without waiting. If it begins creating files before you have reviewed the analysis, the work may be based on a wrong understanding of your POC. You can always let it finish and then check — the setup stage is relatively low-risk — but establishing the pattern of "analysis first, approval second" here prevents bigger problems at the plan and build stages.
If the AI gets interrupted or loses context mid-setup, this brings it back on track.
/project:status
Then: Read docs/1-setup.md. In your response, write out every checkbox line verbatim — I need to see the exact state of each item. Then complete any remaining unchecked items.
Confirm every setup item is done and all automated quality checks pass before moving on. Nothing proceeds until this is clean.
I need to verify setup is complete. For each item below, write the results directly in your response — do not just run commands, I need to read the output in your reply:
1. Read docs/1-setup.md — write out every checkbox line verbatim so I can see which are checked
2. List the directory structure of src/ (3 levels deep)
3. Run: uv run pytest tests/ -v --tb=short — report the full test results including pass/fail counts
4. Run: uv run ruff check src/ — report the full output
5. Run: uv run mypy src/ — report the full output
6. Read conversion_state.yaml — write out the complete file contents
Do not modify anything. Write all results in your response text.
All items [x]. Ruff zero. Mypy zero. Tests pass.
Save the clean project scaffold and advance to the requirements stage — not directly to plan. The requirements stage validates your response schema before you invest in architecture.
Commit all setup work with message "Setup stage complete" and update conversion_state.yaml stage to "requirements".
Tip: After the agent commits, check git status. The agent often commits source files but forgets the state file change. If conversion_state.yaml is uncommitted, tell it to commit the remaining changes. This happens at every stage transition — build the habit of checking now.
Recovery: setup errors
mypy errors: Fix all mypy strict errors. Every function needs full type annotations including return types.
ruff warnings: Fix all ruff warnings in src/. Then run uv run ruff check src/ and write the full output in your response so I can verify it's clean.
tests fail: These base tests are failing: [paste failures]. Fix them.
Hook errors ("command not found" for pytest/ruff/mypy): The git hooks call tools directly instead of via uv run. Fix each hook script in .claude/hooks/ or .git/hooks/: replace pytest with uv run pytest, ruff with uv run ruff, mypy with uv run mypy. One-time fix — the agent usually handles this automatically when it encounters the error, but verify it did so.
Unused POC dependencies: The agent may import all POC dependencies into pyproject.toml, including ones that are no longer needed after restructuring (e.g., pandas when only numpy is used). Run uv run python -c "import <package>" for each suspicious dependency and remove unused ones. Cleaning dependencies now prevents confusion during the assessment stage.
1.5 — Requirements SYSTEM PROTOTYPE · BEFORE PLAN
The AI analyses your POC, identifies what it actually returns versus what callers need, and produces testable acceptance criteria. A minimal mock API is built and run to verify the response schema is correct before any architecture work begins. POC-level schema flaws — the most expensive to fix late — are caught here at zero cost.
The AI reads your POC, maps its primary workflows, identifies every response schema, and surfaces any mismatch between what the POC returns and what is actually useful to the caller.
Read docs/0-requirements.md — this is the requirements checklist. Analyse the POC codebase — look at all source files, identify:
1. The primary user workflow(s) — what does a user actually do with this application?
2. Every HTTP endpoint and its current response schema — exact field names, types, and what each field contains
3. Any mismatch between what the POC returns and what a caller would actually need (e.g. metadata fields where result data is expected)
Then write docs/0-requirements.md containing:
ACCEPTANCE CRITERIA — numbered (AC-001, AC-002, ...), one per primary user workflow. Each must be testable: "Given X input, the system returns Y" — not vague goals.
AGREED RESPONSE SCHEMA — for each primary endpoint, the exact response schema you intend to implement. If the POC schema is wrong or incomplete, define the correct one here. Every field: name, type, and what it contains.
After writing docs/0-requirements.md, build a minimal mock API:
- A FastAPI app with hardcoded responses matching the agreed schema (placeholder values, no real logic)
- Run it: uv run uvicorn mock_api:app --port 8001
- Confirm it starts without error
Read docs/0-requirements.md and write the complete contents in your response. Then show the mock API startup output.
This is your only chance to catch a wrong schema before it propagates through every module. Once the plan is written against it, fixing it requires reworking module interfaces, service contracts, and API response models.
☐ Acceptance criteria are testable — each has a specific input and expected output, not a vague goal
☐ Response schemas contain actual results, not just metadata or confirmation messages
☐ Every field a caller would need is present (confidence scores? bounding boxes? raw metrics?)
☐ No fields that exist only because the POC happened to produce them
☐ Mock API starts and returns the agreed schema shape
If schema is wrong: The response schema needs changes: [describe what fields are missing or incorrect]. Update docs/0-requirements.md and rebuild the mock API. Read the revised file and write its complete contents in your response.
Lock in the agreed acceptance criteria and response schema before architecture work begins. The mock API is disposable — delete it after verification.
Check off the final checkbox in docs/0-requirements.md ("This file committed as the requirements artefact"). Delete mock_api.py if it still exists. Commit everything with message "Requirements stage complete — acceptance criteria and response schema defined". Update conversion_state.yaml stage to "plan".
What just happened: You now have a locked-in contract for what the API must return. Every module, test, and endpoint from this point forward is built against docs/0-requirements.md. If you later discover the schema was wrong, you will need to rework interfaces — that is why this step exists before the plan, not after. The mock API served its purpose (verifying the schema shape) and is deleted.
2 — Plan ALL MODULES BEFORE BUILDING ANY
The AI designs the blueprint: breaking the project into modules with clear boundaries, deciding how they connect, and defining what each piece must do. Architecture mistakes here are the most expensive to fix later.
The AI decomposes your project into modules with documented boundaries, connection points, error handling strategy, and an ordered task list for each. All modules are planned together because their interfaces depend on each other.
Analyse the POC code that was imported during setup. Based on the code's structure and responsibilities, decompose it into modules. For a web application, typical modules are:
- A domain/processing module (core logic, algorithms, data transforms)
- A service module (orchestrates domain logic, manages resources, file I/O)
- An API module (HTTP routes, request/response schemas, dependency injection)
- An app module (application factory, middleware, exception handlers, startup)
You may need more or fewer modules depending on the project's complexity.
For EACH module, write a plan in docs/2-plan.md containing:
1. MODULE DESCRIPTION — what it does, what POC code maps to it
2. FILE STRUCTURE — source files to create under src/
3. TASK BREAKDOWN — ordered list of implementation tasks (each becomes a red/green cycle)
4. DESIGN DECISIONS (all mandatory):
a. Exception hierarchy — what domain exceptions this module raises, semantic meaning, HTTP mapping if applicable
b. Input validation strategy — where validation happens (API boundary, not domain logic)
c. Dependency direction — what this module imports, what imports it. Must be a DAG.
d. Configuration management — what config values this module needs, sourced from settings class
5. INTERFACE CONTRACTS (per module boundary):
- Exported function signatures (name, parameter types, return type)
- Exception types that may cross the boundary
- Data types exchanged (Pydantic models, dataclasses, primitives)
MANDATORY TASKS to include in the app module:
- Health endpoint: GET /health → {"status": "ok"} (HTTP 200)
- README.md: description, installation, usage, API reference
Order modules by dependency: lowest-level (no internal dependencies) first, highest-level (app factory) last.
Read docs/2-plan.md. Write out the complete file contents in your response — every module, every task, every interface contract, every design decision. I need to review the full plan. Do not summarise or omit anything.
You verify the architecture makes sense for your domain. Everything that follows is built on these decisions — a wrong boundary or missing error case here becomes a structural problem later.
Read the entire plan. Check:
☐ Module boundaries make sense
☐ Dependency direction is a DAG (no circular, lower doesn't import upper)
☐ Exception hierarchy covers real error cases (not generic Exception)
☐ Input validation at API boundaries, not inside domain logic
☐ Interface contracts documented per module boundary
☐ Health endpoint + README in app module tasks
☐ Task ordering sensible (foundational first)
☐ Reasonable number of tasks per module
If wrong: The plan needs changes: [describe]. Update docs/2-plan.md, then read the file and write out the complete revised contents in your response.
Get a second opinion on the architecture before committing to it. Claude app has the full spec context and can spot structural issues you might miss — circular dependencies, incomplete contracts, exception gaps.
Paste into this Claude Project conversation:
Claude Code just produced this module plan for my POC-to-production conversion. Review it independently against the V0.5 orchestrator spec. Be blunt. Check specifically: 1. Are module boundaries sensible for this domain? Any that should be split or merged? 2. Is the dependency direction a clean DAG? Any circular or upward risks? 3. Is the exception hierarchy complete — are there real error cases this domain needs that are missing? 4. Are interface contracts specific enough (types, return values, exceptions) or vague? 5. Is input validation at the API boundary, not buried in domain logic? 6. Any tasks that are too large (should be split) or too small (overhead)? 7. Anything that will cause problems at the integrate stage? Here is the plan: [PASTE CLAUDE CODE'S FULL PLAN OUTPUT]
If issues found → feed the feedback back to Claude Code: The plan needs changes: [paste Claude app's feedback]. Update docs/2-plan.md, then read the file and write out the complete revised contents in your response.
Lock in the approved blueprint and move to skeleton generation. The modules list in the state file must also be populated — this is what the status skill uses for progress tracking and gating.
Commit docs/2-plan.md with message "Plan stage complete — [N] modules, [M] total tasks". Then update conversion_state.yaml: set stage to "skeleton" and populate the modules list as specified in the plan (each module with status: pending and its task list). Commit the state change with message "Advance to skeleton stage".
2.5 — Skeleton ARCHITECTURE PROTOTYPE · BEFORE BUILD
The AI converts every interface contract in the plan into executable Python stubs — real type-annotated signatures, docstrings, and raise NotImplementedError bodies. Type errors in stubs are interface contract defects caught before any implementation is committed. The skeleton must pass mypy --strict, ruff, and circular import checks before build begins.
Important: If setup already created real implementations (which it typically does — importing and reorganising POC logic), the skeleton replaces those implementations with NotImplementedError stubs. This feels counterproductive but is intentional: the build stage will re-implement everything through proper red/green TDD cycles, replacing each stub with tested code. The skeleton validates the architecture contract (types, imports, exports) before any implementation is committed to it. Existing setup-stage tests should still pass because they test base patterns (config, exceptions, models), not the domain logic being stubbed out.
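A skeleton stub file might look like the following — `analyse_image` and `AnalysisResult` are illustrative names, not part of any real module plan:

```python
from dataclasses import dataclass

# Hypothetical skeleton stub for a service module: type-annotated signature,
# docstring, NotImplementedError body — no real logic. Names are illustrative.

@dataclass
class AnalysisResult:
    confluency_percent: float
    cell_count: int

def analyse_image(image_path: str, border_percent: int = 0) -> AnalysisResult:
    """Compute confluency statistics for one image.

    Raises:
        NotImplementedError: until the build stage implements this task.
    """
    raise NotImplementedError
```

The stub compiles, type-checks, and imports cleanly, so mypy and ruff can validate the contract before any implementation exists.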
The AI reads the plan's interface contracts and generates all source file stubs. No real logic — only signatures, types, and NotImplementedError bodies.
Read docs/2-plan.md. You have a complete module plan with interface contracts for all modules. Generate the skeleton: for EACH module, create all source files under src/ with:
- Type-annotated signatures for every exported function and class (matching the interface contracts exactly)
- Docstrings on every exported symbol
- __all__ declarations in every __init__.py listing every exported symbol
- raise NotImplementedError bodies (no real logic)
- Correct imports — each module imports only from modules it is permitted to depend on per the declared DAG
After generating all stubs, run these three checks and write the results in your response:
1. uv run mypy src/ — must report zero errors
2. uv run ruff check src/ — must report zero warnings
3. For each module package, verify it imports without circular errors: python -c "import <project>.<module>" for each module
Do not proceed until all three pass. If mypy reports type errors, fix the stubs — those errors are interface contract defects. Show the full output of each check.
The stubs are the interface contracts made executable. A type error here is a contract defect — fix it now, before any implementation is written against it.
☐ mypy strict: zero errors
☐ ruff: zero warnings
☐ All modules import without circular dependency errors
☐ Function signatures match the interface contracts in the plan (parameter names, types, return types)
☐ Dependency direction correct — no module imports from a module above it in the DAG
☐ Exception types used in signatures exist in the exception hierarchy
☐ Response types match the agreed schema from the requirements stage
If contracts are wrong: The skeleton has interface issues: [describe]. Fix the stubs — do not write any real implementation. Re-run mypy, ruff, and circular import checks, and write the full output in your response.
Verify existing tests still pass with the stubs in place, then lock in the architecture before implementation begins.
Before committing, verify existing tests still pass with the skeleton stubs:
uv run pytest tests/ -v --tb=short
If they pass, commit with message "Skeleton stage complete — architecture prototype, N stub functions across M modules". Then update conversion_state.yaml: set skeleton_phase to "complete" and stage to "build". Commit the state change with message "Advance to build stage".
What just happened: You now have a complete, type-checked, importable architecture — but with no real logic. Every function raises NotImplementedError. The build stage will replace each stub with tested implementation, one task at a time, through locked red/green cycles. The stubs are guardrails: if an implementation later deviates from the declared interface, mypy will catch it.
Recovery: skeleton errors
mypy errors: Type errors in stubs are interface contract defects — fix the stub signatures. Mypy reports [N] errors in the skeleton. Fix the stubs — do not write any implementation logic. Show me mypy output after fixes.
Circular import: A module imports from a module it should not depend on. Circular import detected between [module A] and [module B]. Fix the import structure to match the DAG in the plan. Show me the corrected imports.
Signature mismatch: A stub's signature doesn't match the plan. The stub for [function] has signature [X] but the plan says [Y]. Update the stub to match the plan. Do not change the plan.
3 — Build PER MODULE, IN DEPENDENCY ORDER
Each task is built in two locked phases: first write tests that define what the code should do (red), then write the code to pass those tests (green). The AI cannot cheat — test files are physically locked during coding and cryptographically verified afterward.
Point the AI at the first module to build (lowest-level, fewest dependencies first). Set both the module and its first task in a single prompt, then immediately trigger the red phase. This keeps the agent focused on one specific deliverable.
Set current_module to "<first_module_name>" and current_task to "<first_task_name>" in conversion_state.yaml. Then run /project:red
Recommended commit pattern: After each green phase, give a specific commit message: Commit with message "analysis-1: core stats calculation — 13 tests, 100% module coverage". Include the task name, test count, and what was done. Then advance: Set current_task to "<next_task_name>" and run /project:red. Combining commit + advance + red in one prompt keeps the rhythm tight and prevents the agent from drifting between steps.
The AI writes tests that define what the code should do — before any code exists. These tests must fail because the code hasn't been written yet. This is the specification: it defines "done" before work begins.
/project:red
Weak tests produce weak code. The AI only writes enough code to pass its tests, so if the tests miss edge cases, the code will too.
☐ Tests specify behaviour, not implementation details
☐ Boundary condition tests: zero, empty, limits, off-by-one (≥ 2 per task)
☐ Error path tests assert specific exception types (not bare Exception)
☐ Failure-path cleanup tests if task manages resources
☐ Descriptive names (test_empty_input_returns_error, not test_1)
If weak: The tests need improvement: [specific feedback]. Rewrite the tests and re-run /project:red.
The AI wrote both the plan and the tests — it may share blind spots with itself. A second opinion catches edge cases and weak assertions that look reasonable at first glance.
Paste into this Claude Project conversation:
Claude Code just completed /project:red for module [MODULE], task [TASK]. Review these tests independently. Be blunt — weak tests here mean weak code later. Check specifically:
1. Do tests specify behaviour or just mirror likely implementation?
2. Are boundary conditions meaningful for this domain, or trivial/obvious?
3. Do error path tests assert specific exception types from the hierarchy?
4. What edge cases should be tested but aren't? What could go wrong that these tests wouldn't catch?
5. Are any tests testing implementation details that would break on a valid refactor?
Here are the tests: [PASTE THE TEST CODE FROM CLAUDE CODE'S OUTPUT]
If issues found → feed back to Claude Code: The tests need improvement: [paste Claude app's feedback]. Rewrite the tests and re-run /project:red.
The AI writes the minimum code to pass the locked tests. Test files are locked at the OS level and cryptographically hash-verified afterward — the AI cannot silently weaken its own tests to make implementation easier.
/project:green
Locks tests → implements → pytest + ruff + mypy + coverage ≥ 80% + hygiene → unlocks → SHA-256 audit.
Shortcut: /project:green --auto-advance auto-commits + runs red for next task.
Known issue (V0.5): Claude Code's skill dispatch may not parse the --auto-advance flag. Workaround: run /project:green manually, commit, then run /project:red for the next task.
Tip — coverage scoping: Overall coverage may fall below 80% because stub files in unbuilt modules have 0% coverage. The agent should scope coverage to the current module: --cov=<project>.<module> not --cov=src/. If the agent gets stuck retrying the coverage check, tell it: Scope coverage to the current module only. Stubs in other modules are not yet implemented.
Tip — testing routes before the app factory exists: When building route modules before the app factory is implemented, the red-phase tests create a minimal test-local FastAPI app that mounts just the router. This is the correct pattern — it tests the route handler in isolation. The app factory (exception handlers, middleware) is tested separately in its own module's tasks.
Save this unit of work as one clean, traceable change in the project history.
Commit this task with a descriptive message.
→ Repeat 3.1–3.3 for each task in the module.
Mark this module as done and advance to the next one in a single prompt. Combining mark-complete + set-next-module + run-red keeps the workflow tight.
Commit with message "<task-name>: <description> — N tests". Mark <current_module> as complete in conversion_state.yaml. Set current_module to "<next_module>" and current_task to "<first_task>". Run /project:red
Health check before starting the next module — verify tests are growing, coverage is holding, and nothing is falling behind.
/project:status
Previous module complete. Tests growing. Coverage ≥ 80%. → Start next module from 3.0.
Recovery: build phase errors
pytest fails: Tests [list] are still failing. The error is [paste]. Fix the implementation — do not modify any test files.
ruff fails: Ruff has warnings. Run ruff check --fix src/ and then re-verify. (Common: EN DASH characters in docstrings copied from the plan — replace with hyphens.)
mypy fails: Mypy reports [N] errors. Fix all type errors. Every function needs full type annotations. Then run uv run mypy src/ and write the full output in your response so I can verify it's clean.
Coverage < 80%: First check if the agent is measuring coverage across the whole src/ (which includes stubs at 0%). Tell it: Scope coverage to the current module only: --cov=<project>.<module_name>. If coverage is genuinely low for the current module, add tests for uncovered paths.
Hygiene fails: Code hygiene check found: [paste]. Remove all bare print() (use logger), remove TODO/FIXME, remove debug code. Re-run /project:green.
SHA-256 audit fails: SHA-256 audit failed — test files were modified during green phase. Reset: set tdd_phase to null and test_lock to false in conversion_state.yaml. Re-run /project:red to re-hash.
Tests don't fail (pass during red): Some tests may legitimately pass — e.g., data class structure tests pass against the skeleton, or FastAPI routing tests (405 for wrong method) pass via framework behaviour. The key question: do the tests exercising function logic fail with NotImplementedError? If yes, proceed. If all tests pass, the task scope may overlap with existing work.
Existing tests break: Existing tests are now failing. Fix the new test code without changing existing tests.
Agent skips red/green: STOP. You must follow the red/green cycle. Reset: set tdd_phase to null in conversion_state.yaml. Now run /project:red for the current task.
Documentation-only tasks (e.g., README): The plan may mark a task as "documentation only". These do not go through the red/green TDD cycle — there is no meaningful test for README content. Write the document directly, commit, and advance. The validate stage will verify it exists with required sections.
Context compaction: Read conversion_state.yaml and docs/2-plan.md. You are on module "<name>", task "<name>". The tdd_phase is "<phase>". Resume from this point.
Crash mid-green (files locked): In the terminal, run chmod -R u+w tests/. Then in Claude Code: Set test_lock to false and tdd_phase to null in conversion_state.yaml. Re-run /project:green.
4 — Integrate ONCE, AFTER ALL MODULES COMPLETE
Individual modules were tested in isolation. This phase tests how they work together using real dependencies — no fakes or simulations. This is where interface mismatches and wiring bugs surface.
Confirm every module is built and marked complete before testing their connections.
/project:status
Must show all modules complete.
The AI reads the entire codebase and maps how all modules connect — data flow, error paths, and interface contracts.
/project:integrate
Step 1: Automated codebase analysis. No action from you.
You provide domain knowledge the AI cannot infer from code alone — which failure modes actually matter in your field, and how precise the testing needs to be.
Question A — Cross-stage failure modes. Agent proposes 4–5 options. Select the real risks in your domain. If unsure, select all plausible.
Question B — Correctness criteria. Choose tolerance: scientific/medical → moderate ±5% or strict ±1% · data processing → coarse ±15% or moderate ±5% · CRUD/web → structural correctness
Verify the tests use real components, not mocked substitutes. The whole point of this phase is testing genuine connections between modules.
Before I approve, I need to review the integration tests. Write all of this in your response — do not just run commands:
1. Read every test file in tests/integration/ and write out the complete contents of each
2. For each test function: state which module boundary it exercises (one line each)
3. Run: grep -rn "mock\|patch\|MagicMock\|override\|Mock\|unittest.mock" tests/integration/ — report the results (should be zero matches)
☐ Zero mock/patch/override references — reject immediately if any found
☐ ≥ 1 test per module boundary
☐ Real or programmatically generated test inputs
☐ ≥ 1 error propagation test (lower module error → HTTP status at API)
☐ Batch/degradation test if applicable
☐ All tagged @pytest.mark.integration
If mocks found: Integration tests must use REAL dependencies, no mocks. Remove all mock/patch/override/MagicMock usage. Rewrite the tests, then read each revised test file and write out the complete contents in your response.
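What "error propagation with real dependencies" means in practice can be sketched in a few lines. The names here (`InvalidImageError`, `analyse`, `handle_request`) are illustrative, not the real project's API: a genuine domain exception from a lower module crosses the boundary and is translated into a structured 4xx response, with no mock standing in for either side.

```python
# Sketch of an error-propagation integration test using real components.

class InvalidImageError(Exception):
    """Hypothetical domain exception raised by a lower module."""


def analyse(data: bytes) -> dict[str, float]:
    """Real lower-module function — no mock stands in for it."""
    if not data:
        raise InvalidImageError("empty image payload")
    return {"confluency": 0.42}


def handle_request(data: bytes) -> tuple[int, dict]:
    """API boundary: translate the domain error into a structured response."""
    try:
        return 200, {"result": analyse(data)}
    except InvalidImageError as exc:
        return 422, {"error": str(exc)}


def test_domain_error_becomes_422() -> None:
    status, body = handle_request(b"")
    assert status == 422
    assert body == {"error": "empty image payload"}
```

If either side were mocked, this test would only confirm the mock's behaviour — the real translation logic, the thing most likely to be miswired, would go unexercised.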
Run the approved integration tests, then automatically verify the project's dependency structure and module interface contracts.
I approve the integration tests. Proceed with execution, DAG verification, and interface contract checks.
Save the integration evidence — tests, dependency verification, and contract checks.
Commit with message "Integrate stage complete — integration tests, DAG verified, contracts verified"
Recovery: integrate failures
Integration tests fail: This is the stage working as designed — it found real cross-module issues. Common: interface mismatch, missing config value, exception type not caught at boundary. Agent diagnoses; may need source fixes via red/green cycle, then re-run /project:integrate.
DAG fails: Circular or upward imports. Fix the import structure.
Interface contracts fail: Missing __all__, missing docstrings, broken imports. Fix and re-run.
5 — Validate
Test the finished application as a real user would. Does it start? Does it respond to requests? Does it handle bad input gracefully? Is it documented and observable when something goes wrong?
Seven checks from the end user's perspective: can it start, respond, complete a real workflow, handle mistakes, explain itself, be diagnosed when things go wrong, and reproduce the POC's outputs?
/project:validate
Runs 7 checks in order:
| Check | Pass condition |
|---|---|
| Startup | App factory runs, returns ASGI instance, <10s |
| Health | /health → 200 + JSON |
| E2E | Primary workflow via TestClient, no mocks, in tests/e2e/ |
| Errors | Invalid inputs → structured JSON 4xx, no stack traces |
| Docs | README sections + /docs returns 200 |
| Observability | Startup + error logging, correlation mechanism |
| POC Parity | Production outputs match POC outputs on known inputs (acceptance criteria from docs/0-requirements.md) |
Hard-gate checks block progression if they fail; soft checks record gaps without blocking.
Why POC parity? The production system must do at least what the POC did. If the POC processed an image and returned brightness values, the production system should return equivalent values for the same input. This catches subtle regressions introduced during restructuring — different image mode handling, rounding changes, dropped fields.
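A parity check reduces to running both computations on the same known input and comparing within the agreed tolerance. This sketch uses illustrative stand-ins for the POC and production functions — the real check compares the actual implementations against the acceptance criteria in docs/0-requirements.md.

```python
# Minimal sketch of a POC-parity test (functions and tolerance illustrative).
import math


def poc_brightness(pixels: list[int]) -> float:
    """Stand-in for the POC's original computation."""
    return sum(pixels) / len(pixels)


def production_brightness(pixels: list[int]) -> float:
    """Stand-in for the restructured production computation."""
    return sum(p / len(pixels) for p in pixels)


def test_poc_parity_on_known_input() -> None:
    known_input = [10, 20, 30, 40]
    # Moderate tolerance (±5% relative), per the correctness criteria stage.
    assert math.isclose(
        production_brightness(known_input),
        poc_brightness(known_input),
        rel_tol=0.05,
    )
```

Even when the restructured code is algebraically equivalent, parity tests catch accumulated rounding differences, dropped fields, and changed image-mode handling that unit tests on each side would miss.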
Read the verdict on whether the application meets production standards.
Read docs/validate-report.md. Write out the complete file contents in your response — every check, every verdict, the final determination. Do not summarise.
PRODUCTION READY: all checks pass · CONDITIONALLY READY: hard gates pass, soft gaps recorded · NOT READY: a hard gate fails → fix and re-run
Save the validation evidence and determination.
Commit with message "Validate stage complete — [RESULT]"
Recovery: validate failures
Startup fails: Startup failed with: [paste error]. This is usually a wiring issue that only surfaces when the full app assembles. Diagnose the root cause and fix it. Then re-run /project:validate.
E2E fails: The E2E test failed. Read every test file in tests/e2e/ and write out the complete contents in your response. Then write out the actual route handler signature it's trying to hit and the error message. The request format probably doesn't match the route's expected input.
Stack traces in errors: Error response check found stack traces in: [paste]. The exception handler chain is incomplete. Fix the exception handling and re-run /project:validate.
POC parity fails: The production system returns different values than the POC for the same input. Common causes: different image mode handling (RGB vs RGBA vs palette), different rounding, changed field names. POC parity check shows a mismatch: [describe difference]. Compare the POC logic in app/main.py to the production implementation and fix the discrepancy. The production system must reproduce the POC's behaviour for known inputs.
6 — Production Readiness Assessment OPTIONAL · SEPARATE SESSION
A completely independent review. A fresh AI session — with no memory of the build process — examines the finished codebase against 33 formal quality criteria across four VP-model layers. This is the final quality gate: an auditor that evaluates only what exists in the repository.
The V0.4 confluency project needed 3 assessment runs to reach PRODUCTION READY. The V0.5 tacit knowledge project achieved it on the first run — higher test coverage (99%), mutation testing (71.7% kill rate), and better observability contributed. As the orchestrator's quality gates tighten, the gap between what it produces and what the assessor expects narrows.
Assessor variance: Different assessor sessions may judge the same code differently — one may pass observability "with gaps", another may fail it. This is inherent to LLM-based assessment. If a check is borderline, fix it rather than relying on a lenient assessor.
Start a clean AI session so the assessment has no bias or memory from the build process.
cd <YOUR_POC_PATH>
claude
Must be a separate session — no orchestrator context, no build history. Set model to Opus / high.
The independent AI runs 33 formal checks across four VP-model layers: user experience, architecture, test quality, and code quality. It evaluates only what exists in the repository.
Read the production readiness assessment spec at ~/Desktop/Desktop/Claude/test-orchestrator/production-readiness-assessment-spec.md
This project uses uv for dependency management. Use "uv run" to execute all commands (e.g. "uv run pytest", "uv run ruff check", "uv run mypy"). Install assessor tools with "uv pip install" (e.g. "uv pip install pytest-randomly mutmut").
Execute every check in order, from User through Implementation level. Collect all evidence. Do not skip any check. Produce the final report in the format specified at the end of the document. Save the report to docs/production-readiness-report.md.
33 checks across 4 VP-model layers. Expect: User PASS or CONDITIONAL, Architecture PASS, Design PASS (3.5/3.6 may surface gaps), Implementation PASS.
Fix the specific gaps identified in the report, then re-run the assessment in a fresh session. Each fix-and-reassess loop is typically small and targeted.
Resume the build session (not the assessment session):
The independent production readiness assessment found these gaps: [PASTE THE FAIL VERDICTS AND REMEDIATION STEPS FROM THE REPORT] Fix each gap. Run all tests to verify nothing breaks. Commit with message "Fix [gap description]".
Then close the build session, open a new assessment session, and re-run step 6.2. Repeat until PRODUCTION READY.
Common gaps on first run: Mutation kill rate below 60% (add targeted tests for surviving mutants), observability depth (add logger.error() in exception handlers, add startup log event), error message assertion tightness (use match= parameter in pytest.raises).
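Tightening error assertions is the cheapest of these fixes, so here is a sketch of what it looks like. The `validate_threshold` function is illustrative: the loose assertion survives any mutant that garbles the message, while the `match=` version (a regex searched against the exception text) pins both the type and the message.

```python
# Loose vs tight error assertions with pytest.raises (names illustrative).
import pytest


def validate_threshold(value: float) -> float:
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in [0, 1], got {value}")
    return value


def test_rejects_out_of_range_loose() -> None:
    with pytest.raises(ValueError):  # survives message-mutating mutants
        validate_threshold(1.5)


def test_rejects_out_of_range_tight() -> None:
    # match= is a regex, so the brackets and dot must be escaped.
    with pytest.raises(ValueError, match=r"must be in \[0, 1\], got 1\.5"):
        validate_threshold(1.5)
```

A mutant that swaps the message, inverts the comparison, or raises a different branch now fails the tight test, directly raising the kill rate.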
Mark the project as complete in version history.
git tag poc-to-production-complete -m "Full conversion complete — all VP-model levels validated"
Recovery: assessment issues
Assessor can't find dependencies ("ModuleNotFoundError"): The assessor may try bare pytest instead of uv run pytest. If you see import errors, tell it: This project uses uv. Run all commands with "uv run" prefix (e.g. "uv run pytest tests/ -v"). Install additional tools with "uv pip install".
Mutmut v3 segfaults: Mutmut v3 can segfault when mutation-testing code that uses numpy/PIL native extensions. If this happens, the assessor should use prior mutation testing evidence from git history (commit messages document kill rates). Alternatively, install mutmut v2: uv pip install "mutmut<3".
Mutmut v3 dropped --paths-to-mutate: The assessment spec says to run mutation testing on the highest-test-count module only, but mutmut v3 removed per-module scoping. The assessor will run against the full codebase, which includes logging/config code that drags the kill rate down. If the kill rate is marginal (55–65%), tightening error message assertions and adding logging tests is the fastest fix.
pytest-randomly not installed: This is an assessor tool, not a project dependency. The assessor should install it: uv pip install pytest-randomly.
Assessor variance: Two independent sessions may produce different verdicts on the same code. This is inherent to LLM-based assessment. If a borderline check passes in one run but fails in another, fix the underlying gap rather than relying on the lenient interpretation.
CLAUDE APP Ad-hoc Prompts
Use these anytime during the workflow — not at fixed checkpoints but when something seems off, breaks, or needs a handoff.
When Claude Code's output looks suspicious — tests and code appearing together, steps being skipped, unexpected file changes. Faster than re-reading the spec yourself.
Claude Code just produced this output. I'm on module [MODULE], task [TASK], phase [red/green/etc]. Does this follow the TDD orchestrator protocol? Specifically:
- Was the red/green sequence respected, or did it write tests and implementation together?
- Were any steps skipped?
- Is there anything here I should reject or push back on?
If it deviated, give me the exact correction prompt to paste into Claude Code. [PASTE CLAUDE CODE'S OUTPUT]
When something breaks and you don't know the right fix. Claude app can diagnose the error and produce the exact prompt to paste into Claude Code.
Claude Code hit an error during the [PHASE] phase for module [MODULE], task [TASK]. Diagnose the issue and give me:
1. What went wrong and why
2. The exact prompt I should paste into Claude Code to fix it
3. Whether I need to reset any state (tdd_phase, test_lock) before the fix
Here is the error output: [PASTE THE ERROR]
After Claude Code compaction, a crash, or starting a new day. Claude app remembers the project history and can draft the optimal re-entry prompt so Claude Code picks up exactly where it left off.
I need to resume the TDD orchestrator workflow in Claude Code. [Context compaction happened / I'm starting a new session / Claude Code crashed]. I was on module [MODULE], task [TASK], phase [PHASE]. [Optional: here's what happened in the last session — PASTE ANY RELEVANT CONTEXT]
Draft the optimal continuation prompt I should paste into the new Claude Code session, including:
- What to read first (state file, plan, lessons)
- Where to resume
- Any warnings about common issues at this point in the workflow
At the integrate stage, before you approve. Claude app can independently verify the integration tests use real dependencies and cover the right module boundaries.
Claude Code produced these integration tests at the /project:integrate stage. Review them independently. The critical constraint is: NO mocks, patches, MagicMock, or dependency overrides anywhere. Check specifically:
1. Are there ANY mock/patch/override references? (reject immediately if so)
2. Does every module boundary have at least one integration test?
3. Is there an error propagation test (error in lower module → correct HTTP status at API)?
4. Are test inputs real or programmatically generated (not empty stubs)?
5. Any module boundaries that are untested?
Here are the integration tests: [PASTE INTEGRATION TEST CODE]
Commands
| Command | When |
|---|---|
/project:status | Anytime — health, progress, gating, next step |
/project:red | Start of each task — write failing test specs |
/project:green | After reviewing tests — implement + verify |
/project:green --auto-advance | Green + auto-commit + auto-red for next task |
/project:integrate | After all modules complete — integration tests, DAG, contracts |
/project:validate | After integrate — startup, health, E2E, errors, docs, observability |
Production Readiness Assessment
A separate, independent verification tool that evaluates whether code produced by the orchestrator meets production standards. It is not part of the orchestrator — it runs after the orchestrator completes, in a separate Claude Code session with no access to orchestrator state, lessons, or history. It evaluates repository artefacts only.
Relationship to the Orchestrator
The orchestrator enforces process during conversion: test-first cycle, locks, hashes, hygiene, coverage. It has access to state, plan, and lessons, and operates per-task.
The assessment evaluates output after conversion. It has no orchestrator context, runs all 33 checks against the repository as-built, and produces a formal report.
Assessment Principles
The assessor has no knowledge of the orchestrator's internal state, lessons, or session history. It evaluates only what exists in the repository.
Every check has a defined pass condition. No subjective judgements. Where a threshold is required, it is stated explicitly.
Checks grouped by VP-model layer. A failure at a higher layer is not compensated by strength at a lower layer.
Every verdict must cite the specific command output, file, or metric that supports it. No assertions without evidence.
How to Run
Open a new Claude Code session in the target repository (not the session used for conversion). Provide the assessment spec and instruct:
"Read the production readiness assessment spec. Execute every check in order, from User level through Implementation level. Collect all evidence. Do not skip any check. Produce the final report in the format specified. Save to docs/production-readiness-report.md."
Checks — Layer 1: User (8)
Does the assembled application work from the consumer's perspective?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 1.0 | Requirements artefact | docs/0-requirements.md exists with testable AC-xxx criteria | Requirements stage |
| 1.1 | App startup | Factory runs, valid app instance, <10s | Validate |
| 1.2 | Health endpoint | HTTP 200 with JSON | Build + Validate |
| 1.3 | Core E2E | Primary workflow, full stack, no mocks | Validate |
| 1.4 | Error responses | Structured JSON, no internals, 4xx not 500 | Validate |
| 1.5 | Documentation | README (install/usage/API) + /docs loads | Build + Validate |
| 1.6 | Observability | Startup + error logging, correlation, version | Validate |
| 1.7 | POC parity | Production reproduces POC outputs (scope-dependent) | Validate (soft) |
Checks — Layer 2: Architecture (8)
Do the modules compose correctly? Are boundaries respected?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 2.1 | Module structure | Separated packages, __all__ declared | Skeleton + Integrate |
| 2.2 | Dependency DAG | No upward coupling | Skeleton + Integrate (auto-verified) |
| 2.3 | Interface contracts | Exports importable, typed, documented | Skeleton + Integrate (auto-verified) |
| 2.4 | Integration tests | ≥1 per boundary, no mocks | Integrate |
| 2.5 | Config management | Centralised, env-driven, defaults | Plan + Setup |
| 2.6 | Error propagation | Explicit translation per boundary | Integrate |
| 2.7 | Graceful degradation | Per-item reporting or documented all-or-nothing | Integrate |
| 2.8 | Skeleton artefact | Type-checked stubs pre-date build commits | Skeleton stage |
Checks — Layer 3: Design (9)
Do the tests adequately and independently specify expected behaviour?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 3.1 | Coverage | ≥80% overall, ≥60% per file | Build (green) |
| 3.2 | Test mapping | Every src file → test file | Build (red) |
| 3.3 | Boundary tests | ≥3 per module | Build (quality gate) |
| 3.4 | Error path tests | Per raising function, specific type | Build (quality gate) |
| 3.5 | Test independence | Random order ×3 | Assessment only |
| 3.6 | Mutation testing | ≥60% kill rate (sampled) | Assessment only → Candidate |
| 3.7 | Exception hierarchy | Base class, subclasses, tested, consistent | Plan template |
| 3.8 | Boundary validation | Typed models at entry, not in domain | Plan template |
| 3.9 | Failure-path cleanup | Tested per resource type | Build (quality gate) |
Checks — Layer 4: Implementation (8)
Does the code compile, execute, and conform to technical standards?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 4.1 | Tests pass | 100%, zero errors, zero unexplained skips | Build (green) |
| 4.2 | Lint (ruff) | Zero warnings, config present | Build (green) |
| 4.3 | Types (mypy strict) | Zero errors, strict mode | Build (green) |
| 4.4 | No debug code | Zero breakpoint/pdb/print/TODO/FIXME | Build (hygiene gate) |
| 4.5 | No secrets/paths | Zero credential literals, zero absolute paths | Build (hygiene gate) |
| 4.6 | Deps locked | Lock file exists, all pinned | Setup |
| 4.7 | Resource cleanup | All resources in context managers | Build (conventions) |
| 4.8 | Structured logging | Configured, used for errors + operations | Setup + Validate |
Report Format & Determination
The assessment produces a structured markdown report saved to docs/production-readiness-report.md with per-check pass/fail verdicts, cited evidence, and an overall determination.
PRODUCTION READY: All layers pass. Every check has a PASS verdict with cited evidence.
CONDITIONALLY READY: Implementation and Design pass. Architecture or User has non-critical failures with documented remediation.
NOT READY: Implementation or Design layer has failures. These are release blockers regardless of higher-layer results.
Expected Assessment Outcomes
CONDITIONALLY READY. L4 Implementation + L3 Design pass. L2 Architecture gaps at 2.4 (integration tests), 2.6 (error propagation), 2.7 (degradation). L1 User POC-dependent.
PRODUCTION READY against the project's scope profile. Pre-clinical baseline: 33 checks pass. Demonstrated on both confluency assessment and tacit knowledge capture. Other profiles add domain-specific checks (traceability, audit, numerical regression).
The assessment itself would need expanding per scope profile: security scanning, performance baselines, mutation thresholds. See Scope Profiles in the Workflow tab for per-profile assessment extensions.
Empirical Assessment Results
Both validated projects have completed the independent production readiness assessment. Results below are from actual assessment runs — not projections.
| Project | POC LOC | Assessment Runs | Result | Checks | Coverage | Mutation Kill Rate |
|---|---|---|---|---|---|---|
| Confluency assessment (V0.4) | 1,045 | 3 (NOT READY → CONDITIONAL → PRODUCTION READY) | PRODUCTION READY | 30/30 | ~80% | Not measured |
| Tacit knowledge capture (V0.5) | 2,109 | 1 (PRODUCTION READY on first run) | PRODUCTION READY | 29/30 (1 partial) | 99% | 71.7% |
V0.5 improvement: The tacit knowledge project achieved PRODUCTION READY on the first assessment run — a significant improvement over V0.4's confluency which needed 3 runs. Contributing factors: higher test coverage (99% vs ~80%), better test quality (71.7% mutation kill rate), and cleaner observability setup. The one partial pass (4.6 Observability) was for missing request correlation IDs and version exposure — known non-blocking gaps.
Assessment Limitations
| Limitation | Impact |
|---|---|
| Assessor variance | Different independent sessions may judge the same code differently — one may pass observability "with gaps", another may fail it. This is inherent to LLM-based assessment. Multiple runs surface different interpretations of conformance criteria. Fix borderline checks rather than relying on lenient assessors. |
| Single-session execution | Complex repos may exceed context limits, requiring sampling rather than exhaustive verification. |
| Mutation testing tool compatibility | Mutmut v3 dropped --paths-to-mutate (per-module scoping) and can segfault with numpy/PIL native extensions. The spec's per-module instruction may not be executable. Full-codebase runs include logging/config mutants that drag kill rates down. Assessors may need to use prior mutation evidence from git history or install mutmut v2. |
| Mutation testing is sampled | Only the highest-test-count module is intended to be tested. Full-codebase runs (forced by mutmut v3) include modules with no dedicated tests, depressing the kill rate. |
| Assessor tool dependencies | pytest-randomly and mutmut are assessor-installed tools, not project dependencies. The assessor must install them separately. Projects using uv require uv pip install rather than pip install. |
| Integration check is presence-based | Verifies integration tests exist, not that they are comprehensive against interface contracts. |
| User-level checks are lightweight | Appropriate for pre-clinical baseline. Scope profiles for regulated, clinical, or safety-critical contexts add domain-specific checks. See Scope Profiles in the Workflow tab. |
| No security audit | Path traversal may be unit-tested, but no independent OWASP review or dependency vulnerability scan. |
| Workflow exercise is schema-dependent | Core E2E test (1.3) requires discovering the API schema. Non-standard or undocumented APIs may be incompletely tested. |