The Problem
AI coding agents can generate a working proof-of-concept in minutes. But POCs are not production code: they lack test coverage, error handling, type safety, and modular architecture. Converting a POC to a production-grade codebase traditionally requires a senior developer to decompose the work, write specifications, define test cases, and methodically rebuild each module.
This is disciplined, structured work that AI agents should excel at, but left unconstrained, they bypass the discipline. They produce code that passes its own tests but has not been verified against an independent specification.
What goes wrong without constraints
Tests confirm the implementation rather than specify behaviour. The same blind spots appear in both artefacts because both originate from the same reasoning pass.
The agent tests the happy path. Edge cases (zero values, empty inputs, off-by-one limits) are non-obvious without a deliberate specification step that asks "where can this break?"
When implementation output differs from expectations, the agent adjusts the test to match the code, rather than questioning the code. The test becomes a mirror of the implementation, not a constraint on it.
Tests describe what the code does, not what it should do. They cannot detect a defect because they were derived from the same defective reasoning that produced the implementation.
What the Orchestrator Is
A process enforcement system that converts POC-to-production rewrites into a semi-automated, auditable workflow. It uses Claude Code custom slash commands, filesystem locks, SHA-256 hash audits, and git hooks to force an AI agent through a strict specification-then-implementation cycle for every task. A human engineer reviews architectural decisions and test case quality; the AI handles implementation under constraint.
It is not a framework, library, or CI pipeline. It is a set of development constraints that enforce the discipline expected of a developer following strict test-first practices during a codebase rewrite.
How It Produces Production-Ready Code
The method achieves production quality through four reinforcing mechanisms, applied iteratively for every task in every module:
1. Temporal separation of specification and implementation
The agent writes test cases (the specification) and then, in a separate phase, writes the implementation. These are not concurrent activities: they are sequenced by a state machine that blocks advancement until each phase is complete. The specification phase forces the agent to reason about expected behaviour, boundaries, and error paths before it has implementation code to be influenced by. This is the core discipline: the test defines what the code must do, not what the code happens to do.
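The sequencing described above can be pictured as a small gate function. This is a minimal sketch, not the orchestrator's actual state machine: the names `Phase`, `PhaseError`, and `advance`, and the two boolean preconditions, are assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    RED = "red"      # specification: write failing tests
    GREEN = "green"  # implementation: make the locked tests pass
    DONE = "done"    # task committed


class PhaseError(RuntimeError):
    """Raised when a transition is attempted before its precondition holds."""


def advance(phase: Phase, *, tests_locked: bool, checks_passed: bool) -> Phase:
    """Return the next phase, or raise if the current phase is incomplete."""
    if phase is Phase.RED:
        if not tests_locked:
            raise PhaseError("cannot start implementation: tests are not locked")
        return Phase.GREEN
    if phase is Phase.GREEN:
        if not checks_passed:
            raise PhaseError("cannot commit: verification checks have not passed")
        return Phase.DONE
    raise PhaseError("task is already complete")
```

The point of the sketch is that implementation cannot begin while the specification is still editable, and a commit cannot happen while any check is failing.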
2. Hard enforcement that specification cannot be retroactively changed
Once the specification phase is complete, test files are locked at the OS level (chmod a-w) and their contents are cryptographically fingerprinted (SHA-256). During implementation, the agent cannot modify test files: the kernel prevents writes, and any circumvention attempt is detected by the hash audit. This closes the most dangerous failure mode: the agent adjusting tests to match a defective implementation.
3. Multi-layer verification at every commit
Each implementation must pass five independent checks before it can be committed: functional correctness (pytest), code style (ruff), type safety (mypy strict), code hygiene (no debug artefacts, secrets, or unmanaged resources), and coverage (≥80%). A failure on any check blocks the commit. These checks operate at the implementation level of the VP-model, catching mechanical defects that tests alone cannot detect: type mismatches, shadowed variables, leaked resources, hardcoded credentials.
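One way to picture the commit gate is a table of commands where any non-zero exit blocks the commit. This is a sketch under assumptions: the exact command lines are not given in the source, the coverage flags assume the pytest-cov plugin is installed, and the project-specific hygiene check is omitted.

```python
import subprocess
from collections.abc import Callable, Sequence

# Assumed command lines; tests and coverage are combined into one pytest run.
CHECKS: dict[str, list[str]] = {
    "tests+coverage": ["pytest", "--cov", "--cov-fail-under=80"],
    "style": ["ruff", "check", "."],
    "types": ["mypy", "--strict", "."],
}


def run_command(cmd: Sequence[str]) -> bool:
    """Run one check; it passes only if the process exits with status 0."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def failed_checks(
    checks: dict[str, list[str]],
    runner: Callable[[Sequence[str]], bool] = run_command,
) -> list[str]:
    """Names of the checks that failed; an empty list means the commit may proceed."""
    return [name for name, cmd in checks.items() if not runner(cmd)]
```

Passing the runner as a parameter keeps the gate logic testable without invoking the real tools.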
4. Structured decomposition through the VP-model
The conversion is not a monolithic rewrite. It follows the VP-model lifecycle: the requirements stage produces a system-level prototype (mock API + acceptance criteria) before architecture work begins, the skeleton stage produces an architecture-level prototype before implementation begins, and the test suite at each red phase is the design-level prototype. Each level has distinct artefacts, distinct test types, and distinct failure modes. A defect at one level cannot be compensated by rigour at another. See the VP-Model tab for the full framework.
End-to-End Workflow
The diagram below shows the complete pipeline from POC input to assessed production code. Each step is tagged by who performs it: the human engineer, the AI agent, automated tooling, or an independent assessor. The plan (step 2) is done once upfront for all modules. The build phase (steps 3–4) repeats per task within each module.
Validation Evidence (Phase B)
Validated on a real project: a FastAPI cell confluency assessment application converted from a working POC (monolithic routes, no tests, global state) to a production-grade codebase (4 modules, 18 source files, 169 unit tests, strict type checking, zero lint warnings).
14 atomic git commits, each scoped to a single task. 3 real defects detected that would likely survive conventional code review (see Risks & Gaps tab).
VP-Model Coverage
How This Works in Plain Language
This page explains the test-first orchestrator without assuming technical knowledge. If you want the engineering detail, switch to any of the other tabs.
What problem are we solving?
AI tools can write software very quickly. In minutes, they can produce a "proof of concept": a rough working version of an application that demonstrates the idea works. But rough working versions are not the same as production-quality software. Production software needs to be reliable, secure, maintainable, and testable. It needs to handle errors gracefully, protect sensitive data, and be structured so that other developers can understand and extend it.
Converting a rough version into a production version is skilled, disciplined work. It requires careful planning, thorough testing, and methodical rebuilding. AI tools should be good at this, but without constraints, they take shortcuts. They skip the discipline. The result looks complete but contains hidden defects that only surface under real-world conditions.
What does the orchestrator do about this?
It forces the AI to follow a strict, step-by-step process where the "homework" is written separately from the "answer key", and the answer key is locked before the homework is attempted. A human expert reviews the answer key for quality, and multiple automated checks verify the homework from different angles.
The process has five main stages:
The Five Stages
Before any code is written, the AI analyses the existing rough version and creates a plan for how to rebuild it properly. This plan breaks the application into logical modules (self-contained pieces), defines how those pieces connect to each other, and specifies how errors should be handled. The plan also establishes coding standards, logging, and configuration.
Who does what: The AI drafts the plan. A human engineer reviews the architectural decisions: module boundaries, error handling strategy, dependency structure. This is the most important human review point. If the plan is wrong, everything built from it will be wrong.
Analogy: An architect drawing blueprints before construction begins. The builder doesn't start pouring concrete until the architect and client agree on the structure.
For each piece of work in the plan, the AI writes a set of tests before writing the actual code. These tests are derived from the plan, not from the code: they describe what the code should do, including normal behaviour, edge cases (unusual inputs, extreme values, empty data), and error conditions (what should happen when things go wrong).
At this point, all tests will fail, because the code they're testing doesn't exist yet. That's expected and correct. The tests are a specification: a precise description of the required behaviour, written as verifiable checks.
Who does what: The AI writes the tests. A human engineer reviews test quality: are the edge cases meaningful? Are error conditions specific? Are there enough boundary tests? This is the second most important human review point.
Analogy: A teacher writing an exam paper before the students sit the exam. The exam defines what success looks like. It's written independently of any particular student's answer.
Now the AI writes the actual production code: the minimum implementation needed to make all the locked tests pass. It cannot change the tests. If the code doesn't satisfy a test, the AI must fix the code, not the test.
Once all tests pass, the code must also pass four additional automated checks: a style checker (consistent formatting and no bad patterns), a type checker (variables and functions use the correct data types), a hygiene check (no leftover debugging code, no hardcoded passwords, no temporary files left open), and a coverage check (at least 80% of the code is exercised by the tests).
After all checks pass, the test files are unlocked and their fingerprints are verified, confirming no test was altered during the process. The work is then committed to version control as one atomic, traceable unit.
Who does what: The AI writes the code. Automated tooling verifies correctness, style, types, hygiene, coverage, and test integrity. The human can review the final commit but is not required to, because the automated checks are comprehensive.
Analogy: A student sitting a locked exam under invigilated conditions. They cannot see or change the mark scheme. Their work is graded against the pre-set criteria automatically.
Stages 2 and 3 repeat for every task in the plan. The plan is done once upfront for all modules; the build cycle then repeats for every module in the application. When all modules are complete, the process moves to verification of the whole system.
Stages 2 and 3 test each piece in isolation. But pieces that work individually can fail when connected, like components that fit perfectly in a lab but don't assemble correctly on site. Integration testing verifies that the modules communicate correctly across their boundaries: data passes between them in the right format, errors propagate and are handled at each handoff point, and the dependency structure matches the architectural plan.
Who does what: The AI writes integration tests. Automated tooling runs them.
A completely separate evaluation, run by a fresh AI session that has no knowledge of how the code was built. It cannot see the plan, the build history, or any notes from the conversion process. It examines only the finished code and runs 33 formal checks across four levels: Does the code itself meet standards? Do the tests adequately specify behaviour? Do the modules work together? Does the application function as a whole?
The assessment produces a formal report with one of three outcomes: Production Ready (all checks pass), Conditionally Ready (minor gaps with documented remediation), or Not Ready (fundamental issues that must be resolved).
Who does what: An independent AI session with no build context. No human involvement required: the checks are objective and evidence-based.
Analogy: An independent building inspector examining a completed construction. They weren't involved in the build. They have their own checklist. They issue a compliance certificate, or a list of defects.
Where do humans fit in?
The system is designed so humans review where it matters most, and automation handles the rest.
Architecture: does the plan decompose the application correctly? Are the module boundaries sensible? Is the error strategy right?
Test quality: are the tests asking the right questions? Are edge cases covered? Are error conditions specific?
These are expert judgement calls that cannot be fully automated. They happen at Stage 1 and Stage 2.
Test locking: OS-level file permissions prevent tests from being changed.
Cryptographic audit: fingerprints verify no test was altered.
Style, types, hygiene, coverage: four independent automated checks on every commit.
Independent assessment: 33 objective checks with no subjective judgement.
What does the output look like?
At the end of the process, you have:
One large file or a few tangled files. No tests. No type safety. Hardcoded configuration. No error handling. No logging. Works on the developer's machine; may not work anywhere else.
Modular codebase with clean separation of concerns. Comprehensive test suite (169 tests in the validation project). Strict type checking. Structured error handling with a defined exception hierarchy. Structured logging. Externalised configuration. Atomic git history where every commit is traceable to a specific task.
What does the current version cover?
The current version validates both the individual pieces and the assembled whole. It covers all four levels of the VP-model (implementation, design, architecture, and user), with the User level's content determined by the project's scope profile. The pre-clinical baseline (demonstrated on the confluency app) provides full lifecycle coverage for commercial software. Other scope profiles (scientific R&D, clinical trials, regulated medical) tighten thresholds and add domain-specific validation. See the Scope Profiles tab.
Architecture validation
After all modules are built individually, the orchestrator verifies the assembled system:
Test that modules communicate correctly across their boundaries: data passes in the right format, errors are handled at each handoff, and the dependency structure matches the plan. No simulation or faking; real components talking to each other.
When something goes wrong deep inside the system, does the error travel correctly through each layer and arrive at the user as a clear, safe message? Or does it leak technical details, get lost, or produce the wrong error code? This is tested explicitly at every module boundary.
User validation
After integration passes, the orchestrator validates the application from the user's perspective:
Does the fully assembled application actually start? Does it have a health check endpoint that monitoring tools can use to verify it's alive? These sound basic, but components that work individually can fail to assemble: a missing configuration value, a circular dependency, a registration error.
Exercise the primary user journey through the real, fully assembled application: submit input, process it, retrieve the result. No faking, no shortcuts. This is the user-level acceptance test: does the system do what it's supposed to do?
Can an operator diagnose a failure without reading the source code? Does the application log startup events, errors with context, and provide a way to trace a user's error report back to the relevant log entry? Is there a README that explains how to install, configure, and run the application?
Automatically check that the actual code structure matches the planned architecture. If the plan says module A should never depend on module B, verify that no such dependency crept in during construction.
After the full process, the independent assessment (Stage 5) should produce a Production Ready determination against the project's scope profile. For the pre-clinical baseline, all 33 checks should pass. Other scope profiles add domain-specific checks.
Future candidates (not committed)
These are improvements identified from known weaknesses and assessment limitations. Whether they are built depends on what breaks when the orchestrator is applied to more projects.
Currently the same AI writes both the tests and the code (in separate phases). A stronger approach would use a second AI, or a human, to write the tests, providing genuine independence. This is architecturally significant and would require redesigning how the orchestrator coordinates work.
Currently some rules rely on the AI reading and respecting documents ("soft" enforcement). A future candidate would block the AI at the operating system level from writing production code unless the test-writing phase has been completed, removing the last gap where the AI could skip a step.
A technique that makes small deliberate changes to the code and checks whether the tests catch the change. Currently this is only run during the independent assessment. A future version could integrate it into the build process itself.
Vulnerability scanning, static security analysis, and performance benchmarks. These are standard production requirements not currently covered by the orchestrator.
The overall trajectory: the current version validates the parts and the whole (each module well-built, assembled system works correctly). Future versions would harden the process (close remaining enforcement gaps and add deeper quality checks).
What's the catch?
Honest constraints and limitations:
The tests and implementation are authored by the same AI in separate phases. Temporal separation helps, but the AI may share blind spots across both phases. A truly independent test author (second AI or human) would be stronger.
The state machine and coding conventions rely on the AI reading and respecting documents. The file locks and hash audits are "hard" (OS-level, cannot be bypassed), but the process sequencing has been bypassed once during validation. It was caught and corrected, but the risk exists.
The process does not include penetration testing, vulnerability scanning, or performance benchmarks. These would need to be added separately for a production deployment under load.
The orchestrator has been validated against one application (a FastAPI image processing service). Generalisability to CLI tools, libraries, and non-FastAPI web apps is unverified.
Version History
Each version advances the orchestrator up the VP-model. V0.4 is the current implementation, completing the VP-model by adding the architecture-level prototype (skeleton stage). Future candidates are not formally defined; they are extrapolated from documented gaps and assessment limitations.
Timeline Detail
Test-first enforcement in status skill. Test collection validation. Code hygiene (debug, secrets, paths). Test quality (boundary ≥2, error paths, cleanup). Plan-stage design patterns (exception hierarchy, validation, DAG, config). Coverage ≥80%. Self-improvement loop. Auto-advance. Crash recovery.
CONDITIONALLY READY. L4 Implementation + L3 Design pass. L2 Architecture gaps at 2.4 (integration tests), 2.6 (error propagation). L1 User depends on POC.
/project:integrate and /project:validate close the remaining VP-model layers. 18 change items: 7 Critical, 6 High, 5 Medium. Integration tests (real dependencies, no mocks), error propagation verification, dependency DAG check, interface contract verification. User validate: startup, health, E2E workflow, error response quality, documentation, observability. Scope-configurable user level.
Architecture prototype absent: interface contracts text-only until integrate stage discovers mismatches post-build. No POC parity check at validate stage. User level had no development branch prototype.
Skeleton stage: All module stubs generated from interface contracts before any task implementation. mypy strict + ruff + circular import check required to pass. Human reviews contracts as executable code, not text. Interface defects found here, not at integrate.
POC parity check: Validate stage checks that production system reproduces POC outputs on known inputs (scope-dependent: required for scientific/numerical apps, N/A for CRUD).
Assessment spec: 33 checks across 4 VP-model layers. Includes check 2.8 (skeleton artefact) and check 1.7 (POC parity).
With V0.4, all three VP-model prototype levels are covered:
• Concept: POC, parity check at validate
• User: requirements, mock API + acceptance criteria pre-plan
• Architecture: skeleton, type-checked pre-build
• Design: test suite, locked before implementation
Known remaining gaps: no mutation testing in build phase, single-agent architecture (both prototype and implementation from same model family).
Architecture Validation
| Item | What It Does | Assessment Checks Closed |
|---|---|---|
| Integration test stage | Real dependencies across module boundaries, no mocks | 2.4 |
| System test stage | Full app from HTTP to filesystem output | 2.4 |
| Error propagation verification | Exception translation at each boundary | 2.6 |
| Graceful degradation | Mixed valid/invalid batch input tested | 2.7 |
| Interface contract checks | Plan documents signatures + types, skeleton verifies | 2.3 |
| Dependency DAG verification | Automated import check against declared graph | 2.2 (auto) |
| Skeleton artefact | Type-checked stubs pre-date build commits | 2.8 |
User Validation (Pre-Clinical)
| Item | What It Does | Assessment Checks Closed |
|---|---|---|
| Requirements artefact | Acceptance criteria + response schema pre-plan | 1.0 |
| App startup verification | Factory assembles + starts without error | 1.1 |
| Health endpoint | Readiness endpoint returns HTTP 200 | 1.2 |
| Core workflow E2E | Primary workflow through full app (no mocks) | 1.3 |
| Error response quality | Structured JSON, no internals exposed | 1.4 |
| Documentation | README + API docs accessible | 1.5 |
| Observability | Startup + error logging, correlation | 1.6 |
| POC parity | Production reproduces POC outputs (scope-dependent) | 1.7 |
PRODUCTION READY against the pre-clinical baseline (33 checks). Scope profiles add domain-specific checks for other contexts.
Also Scoped (Regulated Only)
Acceptance criteria before decomposition. Requirements traceability matrix. Formal UAT stage. Hazard-scenario tests. Audit trails. These are required by specific scope profiles (regulated medical, clinical trials, safety-critical), not by the pre-clinical baseline.
Blocks src/ writes when tdd_phase: null. Needs stage-awareness to not block setup. Would close the last soft-constraint gap. Add only if bypass recurs.
Currently assessment-only, sampled on one module. If integrated into the orchestrator as a post-green check, it would provide hard evidence that tests detect code changes. Threshold: ≥60% kill rate (assessment), ≥80% target (mature).
Same model writes tests + implementation = shared blind spots. A dual-agent architecture (one for specs, one for implementation) would provide genuine independence. Architecturally non-trivial β requires orchestrator redesign.
No fix mechanism exists for bugs found after task/module completion. Workaround: new task through standard cycle. A formal patch command with rollback would address real-world usage patterns.
Assessment states "No security audit" as a limitation. No OWASP, no pip-audit/safety. A /project:security stage could integrate dependency + static analysis.
"Performance tests: None" in technical review. No load tests, latency benchmarks, throughput. For batch-processing apps this matters. Unclear if this belongs in the orchestrator or a separate CI pipeline.
Currently a manual step in regulated scope profiles. Could be productised as a /project:audit command: requirement → acceptance criteria → integration test → unit test mapping, auto-generated from plan and test markers.
Feature Matrix
| Capability | Current | Future? |
|---|---|---|
| Locked + hash-audited unit tests | ✓ | - |
| Static analysis + strict types | ✓ | - |
| Code hygiene + coverage ≥80% | ✓ | - |
| Test quality requirements (soft) | ✓ | - |
| Plan-stage design patterns | ✓ | - |
| Self-improvement loop | ✓ | - |
| Integration tests (no mocks) | ✓ | - |
| System tests (full app E2E) | ✓ | - |
| Error propagation + degradation | ✓ | - |
| DAG automated verification | ✓ | - |
| User-level validation (startup, health, E2E) | ✓ | - |
| Observability + documentation | ✓ | - |
| Hard test-first enforcement (PreToolUse) | if needed | CANDIDATE |
| Mutation testing in build | - | CANDIDATE |
| Multi-agent test authoring | - | CANDIDATE |
| Patch workflow | - | CANDIDATE |
| Security scanning | - | CANDIDATE |
| Performance testing | - | CANDIDATE |
| Requirements traceability | scope-dependent | SCOPE PROFILE |
Current Specification Detail
The orchestrator covers all four VP-model levels: Implementation, Design, Architecture (including skeleton prototype), and User. Validated against the confluency assessment application. Target: all 33 production readiness assessment checks pass.
Design Decision: Lifecycle Extension
The lifecycle includes the requirements stage (system prototype, pre-plan), the skeleton stage (architecture prototype, pre-build), and two post-build stages. Integration tests exercise cross-module boundaries and cannot run until both sides of each boundary exist.
Stage 1: Integrate (/project:integrate)
Architecture-level validation. Writes and executes integration tests, verifies dependency DAG, checks interface contracts. Precondition: all modules have status complete.
Integration Test Authoring: 4-Step Process
Question A (cross-stage failure modes): Agent identifies processing stages, proposes plausible failure modes spanning stages. User selects real risks.
Question B (correctness criteria): Agent proposes concrete acceptance thresholds (exact structural, coarse ±15%, moderate ±5%, strict ±1%). User selects.
Selections recorded in docs/integration-context.md for traceability.
Tests written to tests/integration/, tagged @pytest.mark.integration.
Human reviews before execution.
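The acceptance thresholds offered in Question B can be expressed as a small comparison helper. This is a sketch only: it assumes each percentage is a relative tolerance measured against the expected value, and that "exact" means plain equality. The names `TOLERANCES` and `meets_criterion` are hypothetical.

```python
# Relative tolerances for the named levels (assumed to be relative to `expected`).
TOLERANCES: dict[str, float] = {"coarse": 0.15, "moderate": 0.05, "strict": 0.01}


def meets_criterion(expected: float, actual: float, level: str) -> bool:
    """True if `actual` is acceptable for `expected` at the chosen level."""
    if level == "exact":
        return actual == expected
    tolerance = TOLERANCES[level]  # KeyError for an unknown level is intentional
    return abs(actual - expected) <= tolerance * abs(expected)
```

For example, with an expected value of 100, a result of 104 satisfies the moderate level but not the strict one, and a result of 106 satisfies only the coarse level.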
Error Propagation Verification
Per module boundary: at least one test that triggers an error in the lower-level module, verifies exception type at the crossing, verifies HTTP status code translation at API layer, verifies no exception is silently swallowed. Agent traces at least two error paths domain → service → API and documents the translation chain.
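A framework-free sketch of one such translation chain. All class and function names here are hypothetical, and the 422 status is an illustrative choice; in the real application the API layer would be a FastAPI handler rather than a plain function.

```python
class DomainError(Exception):
    """Base class for errors raised by the domain layer."""


class ImageDecodeError(DomainError):
    """Hypothetical domain error: the input bytes are not a usable image."""


class ServiceError(Exception):
    """Service-layer error carrying the HTTP status the API should return."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


def decode(data: bytes) -> bytes:
    # Domain layer: reject empty input.
    if not data:
        raise ImageDecodeError("empty payload")
    return data


def assess(data: bytes) -> int:
    # Service layer: translate domain errors, keeping the original cause chained.
    try:
        return len(decode(data))
    except DomainError as exc:
        raise ServiceError("invalid image", 422) from exc


def handle_request(data: bytes) -> tuple[int, dict[str, object]]:
    # API layer: turn service errors into a status code and a safe body.
    try:
        return 200, {"result": assess(data)}
    except ServiceError as exc:
        return exc.status_code, {"error": str(exc)}
```

An integration test for this boundary would trigger the error in `decode`, assert that `assess` raises `ServiceError` with `ImageDecodeError` as its cause, and assert that `handle_request` returns the translated status without exposing the internal message.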
Graceful Degradation Testing
If batch/bulk operations exist: at least one integration test with mixed valid and invalid items, asserting per-item status reporting (or documenting all-or-nothing strategy). If no batch operations: recorded as N/A.
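Per-item status reporting can be sketched as below. The result shape (`index`, `status`, `value`, `detail`) and the choice of `ValueError` as the recoverable error are assumptions for illustration, not the validated application's schema.

```python
from collections.abc import Callable, Sequence
from typing import Any


def process_batch(
    items: Sequence[Any], process_one: Callable[[Any], Any]
) -> list[dict[str, Any]]:
    """Process every item; one bad item is reported without aborting the rest."""
    results: list[dict[str, Any]] = []
    for index, item in enumerate(items):
        try:
            value = process_one(item)
        except ValueError as exc:
            results.append({"index": index, "status": "error", "detail": str(exc)})
        else:
            results.append({"index": index, "status": "ok", "value": value})
    return results
```

A degradation test then submits a mixed batch and asserts on each item's status, rather than only on whether the call as a whole succeeded.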
Dependency Direction Verification
After integration tests pass: extract all inter-module import statements, construct observed dependency graph, compare against declared DAG in docs/2-plan.md. Any upward dependency or circular import fails the integrate stage. This is an automated check, not a test.
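The import comparison can be sketched with the standard `ast` module. Assumptions: one source string per module, absolute imports only, and a declared graph given as a mapping from each module to the modules it may import. How the graph is parsed out of docs/2-plan.md is not shown.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Top-level package names imported by a source file (absolute imports only)."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def undeclared_dependencies(
    sources: dict[str, str], declared: dict[str, set[str]]
) -> set[tuple[str, str]]:
    """Edges (module, dependency) observed in code but absent from the declared DAG."""
    project_modules = set(declared)
    violations: set[tuple[str, str]] = set()
    for module, source in sources.items():
        observed = imported_modules(source) & (project_modules - {module})
        for dependency in observed - declared.get(module, set()):
            violations.add((module, dependency))
    return violations
```

If the declared graph is acyclic and every observed edge is declared, the observed graph is acyclic too, so an empty result also rules out circular imports between the listed modules.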
Interface Contract Verification
After integration tests pass: every __init__.py declares __all__, every symbol in __all__ is importable without error, every exported symbol has a non-empty docstring, all packages import without circular dependency errors. Automated check.
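A sketch of the export checks for a single, already imported package. The function name and message format are hypothetical, and detecting circular import errors (which would surface when the package is imported) is left out.

```python
import types


def contract_violations(package: types.ModuleType) -> list[str]:
    """List export problems: missing __all__, unresolvable names, empty docstrings."""
    exported = getattr(package, "__all__", None)
    if exported is None:
        return [f"{package.__name__}: __all__ is not declared"]
    problems: list[str] = []
    for name in exported:
        if not hasattr(package, name):
            problems.append(f"{package.__name__}.{name}: not importable")
            continue
        doc = getattr(getattr(package, name), "__doc__", None) or ""
        if not doc.strip():
            problems.append(f"{package.__name__}.{name}: empty docstring")
    return problems
```

In practice each package would be loaded with `importlib.import_module` first, and an empty list across all packages would mark the check as passed.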
Integrate State Machine
State tracked in conversion_state.yaml: integrate_phase, integration_tests_written, dependency_dag_verified, interface_contracts_verified. Status skill blocks /project:validate unless integrate_phase: complete.
Stage 2: Validate (/project:validate)
User-level validation. Verifies the assembled application works as a whole. Precondition: integrate_phase: complete.
Factory function runs without exception. Returns valid ASGI/WSGI instance. Completes in <10s. If startup fails, nothing else can be tested.
Check common paths (/health, /healthz, /ready). Verify HTTP 200 with valid JSON. Decision: required during build stage via plan template, not created during validate.
Primary user workflow via TestClient without mocks. Submit input → process → retrieve result. Tests in tests/e2e/, tagged @pytest.mark.e2e. Must complete without unhandled exceptions.
Send invalid requests to each endpoint: missing fields, wrong types, nonexistent IDs, malformed params. Verify structured JSON, 4xx (not 500), no stack traces or internal paths exposed.
README.md with description, installation, usage. API docs endpoint (/docs) returns HTTP 200. Decision: required during build via plan template.
Startup produces log output. Errors produce log output with context. Error responses include correlation mechanism (request ID, timestamp). Version identifier accessible. Verifies structured logging works end-to-end.
Validate State Machine
State tracked: validate_phase, plus individual booleans for each sub-check (startup, health, e2e, errors, docs, observability, poc_parity). Status skill reports overall production readiness after validate completes.
Plan Template Additions
| Addition | Rationale |
|---|---|
| Interface contracts per boundary | Exported signatures, exception types, data types exchanged. Architecture-level development artefact validated during integrate. |
| Health endpoint task (app module) | Goes through normal red/green cycle. Validate stage only verifies it works. |
| README task (final module) | Description, installation, usage, API reference. Validate stage only verifies it exists. |
Design Decisions
They verify existing behaviour (all modules already built and unit-tested), not specify new behaviour. Human reviews before execution, but no hash-lock or filesystem enforcement. Consistency argument was considered but rejected.
If created during validate, they bypass red/green cycle. Added to plan template as mandatory tasks. Validate stage only verifies they exist and work.
Assumes FastAPI / HTTP endpoints / TestClient. Non-web POCs (CLI, libraries, data pipelines) would need different user-level checks. Parameterised validate profiles are out of scope.
Too slow for per-task enforcement (minutes per module). More valuable as a post-build quality signal. 60% threshold is pragmatic, not rigorous.
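The kill-rate thresholds quoted for mutation testing reduce to simple arithmetic: killed mutants divided by total mutants. The sketch below assumes that definition; the function names and verdict strings are hypothetical.

```python
ASSESSMENT_THRESHOLD = 0.60  # minimum kill rate at assessment, per the text
MATURE_TARGET = 0.80         # target kill rate for a mature process, per the text


def kill_rate(killed: int, total: int) -> float:
    """Fraction of generated mutants that the test suite detected."""
    if total <= 0 or not 0 <= killed <= total:
        raise ValueError("need total > 0 and 0 <= killed <= total")
    return killed / total


def classify(rate: float) -> str:
    """Map a kill rate onto the two thresholds."""
    if rate >= MATURE_TARGET:
        return "meets mature target"
    if rate >= ASSESSMENT_THRESHOLD:
        return "meets assessment threshold"
    return "below threshold"
```

For instance, 13 killed out of 20 mutants gives a rate of 0.65, which clears the assessment threshold but not the mature target.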
Assessment Coverage
| Check | Layer | Enforcement |
|---|---|---|
| 4.1–4.8 | Implementation | 8/8 Build stage (green phase) |
| 3.1–3.4, 3.7–3.9 | Design | 7/9 Build stage (red + green phases) |
| 3.5 Test independence | Design | Assessment-only (not enforced) |
| 3.6 Mutation testing | Design | Assessment-only (not enforced) |
| 2.1 Module structure | Architecture | Skeleton + Integrate: contract check |
| 2.2 Dependency DAG | Architecture | Skeleton + Integrate: DAG auto-verification |
| 2.3 Interface contracts | Architecture | Skeleton + Integrate: contract check |
| 2.4 Integration tests | Architecture | Integrate: test authoring |
| 2.5 Config management | Architecture | Plan template + setup |
| 2.6 Error propagation | Architecture | Integrate: error propagation |
| 2.7 Graceful degradation | Architecture | Integrate: degradation tests |
| 2.8 Skeleton artefact | Architecture | Skeleton stage |
| 1.0 Requirements artefact | User | Requirements stage |
| 1.1 App startup | User | Validate: startup |
| 1.2 Health endpoint | User | Validate: health |
| 1.3 Core E2E | User | Validate: E2E workflow |
| 1.4 Error responses | User | Validate: error quality |
| 1.5 Documentation | User | Validate: docs |
| 1.6 Observability | User | Validate: observability |
| 1.7 POC parity | User | Validate: POC parity (scope-dependent) |
Change Items (18 total)
| ID | Change | Priority | Risk |
|---|---|---|---|
| 1a | Integration test authoring | CRITICAL | MEDIUM |
| 1b | Error propagation verification | CRITICAL | LOW |
| 1c | Graceful degradation testing | HIGH | LOW |
| 1d | Dependency DAG auto-verification | HIGH | LOW |
| 1e | Interface contract verification | HIGH | LOW |
| 1f | Integrate stage state machine | CRITICAL | LOW |
| 2a | Application startup verification | CRITICAL | LOW |
| 2b | Health endpoint verification | HIGH | LOW |
| 2c | Core workflow E2E test | CRITICAL | MEDIUM |
| 2d | Error response quality check | HIGH | LOW |
| 2e | Documentation verification | MEDIUM | LOW |
| 2f | Observability verification | MEDIUM | LOW |
| 2g | Validate stage state machine | CRITICAL | LOW |
| 3a | Plan template: interface contracts | HIGH | LOW |
| 3b | Plan template: health endpoint task | MEDIUM | LOW |
| 3c | Plan template: README task | MEDIUM | LOW |
| 4 | Status skill: integrate/validate phases | CRITICAL | LOW |
| 5 | Usage guide updates | MEDIUM | LOW |
Critical path: Items 1a, 1f, 2a, 2c, 2g, and 4, covering the two new commands, their state machines, the status skill updates, and the two hardest validation checks (startup and E2E workflow).
Risks and Limitations
Mitigated by the two-step approach: the agent derives the architecture from code (high confidence), then asks the user multiple-choice questions for domain knowledge (failure modes, correctness criteria). A human review gate precedes execution. The risk is lower than for unit test authorship because the agent has full codebase context.
Integration tests run only after all modules are built. A fundamental interface mismatch therefore requires rework across completed modules, with no incremental feedback during build. This is inherent to the mocked-unit-test approach.
Core workflow E2E (2c) requires discovering the API schema and constructing valid multi-step requests. Complex workflows (upload → poll → retrieve) may need the agent to infer sequencing that is not explicit in any single file.
Validated against one application (confluency). Generalisability to CLI tools, libraries, and non-FastAPI web applications is unverified; web-application assumptions are baked in.
The orchestrator validates functional correctness at architecture and user levels. Performance under load is a separate concern not addressed.
If the health endpoint and README are created during validate, they bypass the red/green cycle. Decision: add them to the plan template as build-stage tasks; validate only verifies that they exist.
The VP-Model: V-Model with Prototyping
The VP-model extends the V-model by inserting a working prototype at each abstraction level. Where the V-model defines development and validation branches, the VP-model adds a feedback mechanism: a prototype exists at each level before that level is fully built, enabling defects to be caught at their level of origin rather than discovered later at a lower level. This pattern was established in systems engineering practice (Burst et al., 1998; Forsberg & Mooz, 1991; German Federal Ministry of Defence V-Modell, 1997; IEEE 1012-2016) and is applied here to AI-constrained software development.
In an agile context, the VP-model is applied iteratively: each increment passes through the same levels, with prototypes providing feedback before each level is committed. The VP-model does not prescribe sequence; it prescribes completeness: every development decision at every abstraction level must have both a corresponding validation activity and a prototype that makes the decision executable before full implementation.
Key References
| Reference | Contribution |
|---|---|
| Forsberg & Mooz (1991) | "The Relationship of System Engineering to the Project Cycle." Established the dual-branch decomposition/integration structure. Introduced the principle that validation artefacts are defined alongside (not after) development artefacts. |
| Burst et al. (1998) | "On Code Generation for Rapid Prototyping Using CDIF." Formalised the VP-model with three prototype insertion points along the V-model's development branch (concept, architecture, implementation levels). Established that prototypes validate at their abstraction level before that level is fully built, which is the distinguishing principle of VP over V. |
| German V-Modell (1997) | Formalised the V-model as a mandatory process standard for German Federal government IT projects. Demonstrated the V-model could be tailored to different project types and domains. |
| IEEE 1012-2016 | Standard for System, Software, and Hardware Verification and Validation. Defines V&V activities at each lifecycle phase. Establishes that verification (are we building it right?) and validation (are we building the right thing?) are distinct, concurrent activities. |
| Boehm (1979) | "Guidelines for Verifying and Validating Software Requirements and Design Specifications." Established the empirical finding that defects introduced at higher abstraction levels are exponentially more expensive to detect and fix at lower levels, the foundational economic argument for early-level prototyping. |
Three Governing Principles
The development branch (decomposition from requirements to implementation) is distinct from the verification branch (internal correctness) and the validation branch (does the built system satisfy the level above it).
Verification and validation are not afterthoughts. They are defined at each level before or alongside the development activity at that level.
Each level addresses a different scope of concern with distinct artefacts, test types, and failure modes. A missing requirement is not detectable by a unit test. An incorrect module interface is not detectable by testing either module in isolation.
Defects introduced at a higher abstraction level are more expensive to fix at a lower level (Boehm, 1979).
Before each level is fully built, a working prototype exists that validates the decisions at that level. The prototype is executable (it can be run, type-checked, or tested) and feeds defects back to the level above before implementation commits to them. The three prototype types in this workflow:
- System prototype (requirements): a mock API returning hardcoded responses, built before the architecture plan. Validates that the response schema is fit for purpose and acceptance criteria are testable before any module design begins. The requirements stage is where POC-level schema flaws are caught at zero cost.
- Architecture prototype (skeleton): type-annotated stubs for all modules. Validates that interface contracts are correct and consistent before implementation begins. mypy strict must pass on the skeleton.
- Design prototype (test suite): failing tests at the red phase. Validates module behaviour specifications before implementation is written. The red phase IS the design prototype.
Abstraction Layers and Artefacts
| Level | Prototype | Development Artefact | Validation Artefact |
|---|---|---|---|
| User | System prototype: mock API + acceptance criteria (requirements stage, pre-plan) | docs/0-requirements.md: acceptance criteria, agreed response schema | Acceptance tests, E2E workflows against AC-xxx criteria, POC parity |
| Architecture | Skeleton: type-checked stubs, all modules importable | Module boundaries, interface contracts, DAG | Integration tests, error propagation, DAG verification |
| Design | Test suite: failing tests specifying behaviour | Task specs: schemas, edge cases, error paths | Unit tests (locked before implementation) |
| Implementation | n/a | Source files (green phase, replacing stubs) | pytest, ruff, mypy strict, hygiene, coverage |
VP-Model Coverage
The orchestrator covers all four VP-model levels with both development and validation branches active. V0.4 completes this by adding the requirements stage (User level development branch: mock API + acceptance criteria before the plan) alongside the skeleton stage (Architecture level development branch). All four levels now have active prototypes on the left branch and validation artefacts on the right.
| Level | Prototype | Dev Artefact | Validation Artefact | Status |
|---|---|---|---|---|
| User | System prototype: mock API + acceptance criteria (V0.4) | docs/0-requirements.md | E2E against acceptance criteria, POC parity | FULL |
| Architecture | Skeleton (V0.4) | Boundaries, contracts, DAG | Integration + system tests | FULL |
| Design | Red phase test suite | Task specs, edge cases, errors | Locked unit tests, quality-gated | FULL |
| Implementation | n/a | Green phase source | pytest + ruff + mypy + hygiene + coverage | FULL |
How It Is Applied Iteratively
The requirements stage runs first: acceptance criteria are defined and a mock API prototype validates the response schema before any architecture work. The plan is then done once upfront for all modules. The skeleton immediately follows, converting all interface contracts into type-checked stubs before any task implementation begins. Each module then passes through the build cycle, red (design prototype) then green (implementation), with verification at each step. After all modules are built, the composed system is validated at the architecture and user levels.
Requirements (once, pre-plan): define acceptance criteria, build mock API prototype, validate response schema with stakeholders. Fixes POC-level schema flaws before any architecture investment.
Plan (once, all modules): boundaries, exception hierarchy, validation strategy, dependency direction, interface contracts as text.
Skeleton (once): convert contracts to executable stubs. mypy strict + ruff + circular import check. Defects here cost nothing compared to finding them at integrate.
For each module, for each task: Red phase writes specification-derived failing tests (design prototype) → locked → Green phase implements (replacing stubs) → pytest/ruff/mypy/hygiene/coverage verify → hash audit → commit. Repeat for all tasks.
Integrate: integration tests exercise real dependencies across module boundaries, error propagation verified, DAG checked against declared architecture. Validate: scope-configured user-level checks β app startup, health, E2E workflow, error response quality, POC parity (scope-dependent), observability, documentation.
The orchestrator is a state machine. Every action (writing tests, implementing code, running integration checks, validating the application) is a transition between defined states. The state is persisted in conversion_state.yaml, so the process survives context window compaction, session restarts, and agent crashes. The status skill reads this file and determines what is permitted next: you cannot implement without first specifying tests, you cannot integrate without first building all modules, and you cannot validate without first passing integration.
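The gating rules above can be sketched as a small pure function. This is a minimal illustration, not the status skill itself: the keys `tdd_phase` and `integrate_phase` appear elsewhere in this document, but the `modules` mapping, the value `"red_complete"`, and the function name `blocked_reason` are assumptions made for the example.

```python
from typing import Any, Mapping, Optional


def blocked_reason(command: str, state: Mapping[str, Any]) -> Optional[str]:
    """Return why `command` is blocked, or None if it may run.

    `state` stands in for the parsed contents of conversion_state.yaml.
    The key layout used here is hypothetical.
    """
    if command == "green":
        # Implementation requires a locked test specification from the red phase.
        if state.get("tdd_phase") != "red_complete":
            return "red phase has not locked a test specification"
    elif command == "integrate":
        # Integration requires every module to be built.
        pending = [m for m, s in state.get("modules", {}).items() if s != "complete"]
        if pending:
            return "modules not complete: " + ", ".join(sorted(pending))
    elif command == "validate":
        # Validation requires integration to have passed.
        if state.get("integrate_phase") != "complete":
            return "integrate_phase is not complete"
    return None
```

Keeping the rule set in one function of the persisted state is what lets the process resume after compaction: nothing about what is permitted next lives only in the agent's context.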
Task Lifecycle
Each task within a module follows a strict red/green cycle. The agent writes failing test specifications (red), then implements code to pass them (green). No shortcuts: the filesystem lock and SHA-256 hash audit enforce temporal separation between specification and implementation.
Full Lifecycle
The complete conversion progresses through seven stages. Setup scaffolds the project. Requirements defines acceptance criteria and validates the response schema via a mock API prototype, catching user-level flaws before any architecture work. Plan decomposes into modules. Skeleton converts contracts to executable stubs. Build executes the red/green cycle per task. Integrate validates cross-module composition. Validate confirms the application satisfies the acceptance criteria end-to-end. Each stage gates the next.
Requirements Phase
The system-level prototype. Before the architecture plan is written, the agent analyses the POC, identifies the primary user workflow and current response schemas, and produces docs/0-requirements.md, which holds acceptance criteria in testable form and an agreed response schema for each primary endpoint. A minimal mock API (hardcoded responses) is then built and run to verify the HTTP interface shape before any domain code exists. This is where POC-level schema flaws are caught: if an endpoint returns metadata instead of results, that is visible here at zero cost.
Precondition: setup complete. Output: docs/0-requirements.md committed. Human approves acceptance criteria and response schema before plan begins.
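As an illustration of how small the system prototype can be, the sketch below serves hardcoded JSON using only the standard library. The routes, the payload fields (`job_id`, `confluency_percent`), and the names `MOCK_RESPONSES`, `mock_handle`, and `MockAPI` are all hypothetical; a real project would copy the agreed schema from docs/0-requirements.md and might prefer the framework the application already uses.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical agreed schema: (method, path) -> (status, payload).
MOCK_RESPONSES = {
    ("GET", "/health"): (200, {"status": "ok"}),
    ("POST", "/analyse"): (200, {"job_id": "mock-1", "result": {"confluency_percent": 42.0}}),
}


def mock_handle(method: str, path: str) -> tuple[int, str]:
    """Return (status, JSON body); unknown routes get a structured 404."""
    status, payload = MOCK_RESPONSES.get(
        (method.upper(), path), (404, {"error": "not_found", "path": path})
    )
    return status, json.dumps(payload)


class MockAPI(BaseHTTPRequestHandler):
    """HTTP wrapper around mock_handle; contains no domain logic."""

    def _reply(self) -> None:
        status, body = mock_handle(self.command, self.path)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    do_GET = _reply
    do_POST = _reply
```

To run it, pass the handler to `http.server.HTTPServer(("127.0.0.1", 8000), MockAPI)` and call `serve_forever()`. Reviewing the returned payloads with stakeholders is the point of the exercise; the server itself is disposable.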
Skeleton Phase
The architecture-level prototype. The agent reads all interface contracts from docs/2-plan.md and generates complete stubs for every exported symbol: real type-annotated signatures, docstrings, __all__ declarations, and raise NotImplementedError bodies. The skeleton must pass mypy --strict, ruff, and circular import checks before build begins. Type errors in stubs are interface contract defects caught before any implementation is written.
Precondition: plan complete. Includes: all stubs generated, mypy strict clean, ruff clean, all modules import without circular errors. Human reviews interface contracts as executable artefacts.
Integrate Phase
Architecture-level validation. The agent reads the entire codebase, derives module boundaries and interface contracts, asks the user targeted multiple-choice questions about domain-specific failure modes, then writes integration tests that exercise real (non-mocked) cross-module interactions. A human reviews the tests before execution. After tests pass, automated checks verify the dependency DAG and interface contracts.
Precondition: all modules complete. Blocked by status skill. Includes: integration tests, error propagation, degradation, DAG check, interface contracts.
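One way the DAG check could work is to parse each module's imports and compare them against a declared layer order. The sketch below is an assumption about the approach, not the orchestrator's actual check: the layer names in `LAYERS`, the function `upward_imports`, and the input shape are hypothetical, and relative imports are ignored for brevity.

```python
import ast

# Hypothetical layering, lowest first: a module may import only from layers below it.
LAYERS = ["config", "domain", "services", "api"]


def upward_imports(package: str, sources: dict[str, str]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs that point up the layer order.

    ``sources`` maps a subpackage name from LAYERS to its source text.
    Only absolute imports of the form ``package.layer...`` are inspected.
    """
    rank = {name: i for i, name in enumerate(LAYERS)}
    violations = []
    for importer, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                targets = [node.module]
            elif isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            else:
                continue
            for target in targets:
                parts = target.split(".")
                if parts[0] != package or len(parts) < 2:
                    continue  # stdlib or third-party import
                imported = parts[1]
                if imported in rank and rank[imported] > rank[importer]:
                    violations.append((importer, imported))
    return violations
```

A strict layer order also rules out cycles between layers, since any cycle must contain at least one upward edge.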
Validate Phase
User-level validation. The agent exercises the assembled application as a whole: can it start, does the health endpoint respond, does the core workflow complete end-to-end, do invalid inputs produce structured error responses, does documentation exist, and is operational observability in place. These checks catch wiring defects that are invisible when modules are tested in isolation.
Precondition: integrate_phase: complete. Sub-checks: startup, health, E2E, error responses, documentation, observability, POC parity.
Red Phase
The specification phase. The agent reads the design plan for the current task and writes test cases that define expected behaviour, including boundary conditions, error paths, and resource cleanup under failure. All new tests must fail (no implementation exists yet) while all existing tests continue to pass. The agent reports test quality counts so the human can assess coverage of edge cases before proceeding. SHA-256 hashes of all test files are captured and stored.
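The red-phase condition (new tests fail, existing tests still pass) can be expressed as a check over test outcomes. This is a sketch under assumptions: the function name `red_phase_problems` and the outcome strings `"passed"` and `"failed"` are illustrative, and how outcomes are collected from pytest is left out.

```python
def red_phase_problems(outcomes: dict[str, str], new_tests: set[str]) -> list[str]:
    """Return problems with a red-phase run; an empty list means the gate passes.

    ``outcomes`` maps a test id to "passed" or "failed".
    ``new_tests`` holds the ids written for the current task.
    """
    problems = []
    for test_id, outcome in sorted(outcomes.items()):
        if test_id in new_tests and outcome == "passed":
            # A new test that passes before implementation specifies nothing yet.
            problems.append(f"new test already passes: {test_id}")
        if test_id not in new_tests and outcome != "passed":
            # Writing tests must not break behaviour that was already verified.
            problems.append(f"existing test regressed: {test_id}")
    return problems
```

A reported problem is a prompt for human review rather than an automatic failure: tests that exercise framework behaviour or fully defined data classes can legitimately pass during red.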
pytest --collect-only exit 0
Green Phase
The implementation phase. Test files are locked at the OS level (chmod a-w), so the agent physically cannot modify them. The agent writes minimal code to pass all tests, then runs five verification checks: functional correctness, static analysis, strict type checking, code hygiene, and coverage. After verification, tests are unlocked and every file is re-hashed. If any hash differs from the snapshot taken during the red phase, the entire green phase fails. This is the core enforcement: the specification cannot be retroactively weakened to match the implementation.
chmod -R a-w tests/
--cov-fail-under=80
Bug Fix Paths
Not all failures require a full state reset. Implementation bugs discovered during the green phase are fixed by the agent iterating on the source code: the tests are correct, the code just doesn't satisfy them yet. Test bugs are different: the specification itself was wrong. This requires resetting state, fixing the test, re-running the red phase to generate new SHA-256 hashes, and re-entering the green phase. The distinction matters because test bug fixes invalidate the hash snapshot, so the integrity audit must restart from scratch.
Agent iterates autonomously. No state reset.
Reset state. Fix test. Re-run /project:red for new hashes. Re-enter green.
Currently undefined. Workaround: new task through standard cycle. Candidate /project:patch command.
Enforcement Mechanisms
| Mechanism | Type | Prevents | Stage |
|---|---|---|---|
| chmod a-w tests/ | Hard | Test modification during green | Build |
| SHA-256 hash audit | Hard | Any test change between spec/impl | Build |
| Test guard hook | Hard | Accidental test edits | Build |
| Pre-commit hook | Hard | Commit with failing tests | Build |
| State machine + status skill | Soft | Skipping red/green | All |
| CLAUDE.md constraints | Soft | Agent drift | All |
| DAG import verification | Hard | Upward/circular deps | Integrate |
| Integration test gate | Hard | Interface mismatches | Integrate |
| User validate gate | Hard | Broken assembly, bad errors | Validate |
| PreToolUse hook | Hard | Writing impl without red | Candidate |
| Mutation gate | Hard | Weak tests | Candidate |
Bypass Incident
The agent wrote all 6 tasks (tests and code) in 3m 59s. No commands were invoked because the rules were advisory. Resolution: mandatory language plus status blocks. Candidate: a PreToolUse hook would make the bypass impossible.
Hardening Progression
Filesystem lock + hash audit. Pre-commit blocks failing tests. Test guard rejects writes.
Integration tests gate release. DAG auto-verified. Interface contracts checked. User validate gate: startup, health, E2E, error quality, observability.
PreToolUse blocks src/ writes outside green. Mutation gate in build. Security scan.
Production Readiness Assessment
A separate, independent verification tool that evaluates whether code produced by the orchestrator meets production standards. It is not part of the orchestrator; it runs after the orchestrator completes, in a separate Claude Code session with no access to orchestrator state, lessons, or history. It evaluates repository artefacts only.
Relationship to the Orchestrator
Enforces process during conversion: test-first cycle, locks, hashes, hygiene, coverage. Has access to state, plan, lessons. Operates per-task.
Evaluates output after conversion. No orchestrator context. Runs all 33 checks against the repository as-built. Produces a formal report.
Assessment Principles
The assessor has no knowledge of the orchestrator's internal state, lessons, or session history. It evaluates only what exists in the repository.
Every check has a defined pass condition. No subjective judgements. Where a threshold is required, it is stated explicitly.
Checks grouped by VP-model layer. A failure at a higher layer is not compensated by strength at a lower layer.
Every verdict must cite the specific command output, file, or metric that supports it. No assertions without evidence.
How to Run
Open a new Claude Code session in the target repository (not the session used for conversion). Provide the assessment spec and instruct:
"Read the production readiness assessment spec. Execute every check in order, from User level through Implementation level. Collect all evidence. Do not skip any check. Produce the final report in the format specified. Save to docs/production-readiness-report.md."
Checks - Layer 1: User (8)
Does the assembled application work from the consumer's perspective?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 1.0 | Requirements artefact | docs/0-requirements.md exists with testable AC-xxx criteria | Requirements stage |
| 1.1 | App startup | Factory runs, valid app instance, <10s | Validate |
| 1.2 | Health endpoint | HTTP 200 with JSON | Build + Validate |
| 1.3 | Core E2E | Primary workflow, full stack, no mocks | Validate |
| 1.4 | Error responses | Structured JSON, no internals, 4xx not 500 | Validate |
| 1.5 | Documentation | README (install/usage/API) + /docs loads | Build + Validate |
| 1.6 | Observability | Startup + error logging, correlation, version | Validate |
| 1.7 | POC parity | Production reproduces POC outputs (scope-dependent) | Validate (soft) |
Checks - Layer 2: Architecture (8)
Do the modules compose correctly? Are boundaries respected?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 2.1 | Module structure | Separated packages, __all__ declared | Skeleton + Integrate |
| 2.2 | Dependency DAG | No upward coupling | Skeleton + Integrate (auto-verified) |
| 2.3 | Interface contracts | Exports importable, typed, documented | Skeleton + Integrate (auto-verified) |
| 2.4 | Integration tests | ≥1 per boundary, no mocks | Integrate |
| 2.5 | Config management | Centralised, env-driven, defaults | Plan + Setup |
| 2.6 | Error propagation | Explicit translation per boundary | Integrate |
| 2.7 | Graceful degradation | Per-item reporting or documented all-or-nothing | Integrate |
| 2.8 | Skeleton artefact | Type-checked stubs pre-date build commits | Skeleton stage |
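Check 2.6 asks for explicit error translation at each module boundary. The sketch below shows one common way to do that in Python with exception chaining. `AppException` is the base class named in the setup checklist; `StorageError`, `AnalysisError`, `load_image`, and `analyse` are hypothetical names used only for illustration.

```python
class AppException(Exception):
    """Base class for application errors."""


class StorageError(AppException):
    """Hypothetical error raised by a storage module."""


class AnalysisError(AppException):
    """Hypothetical error raised by an analysis module."""


def load_image(path: str) -> bytes:
    """Stand-in for a storage-layer call that fails."""
    raise StorageError(f"cannot read {path}")


def analyse(path: str) -> float:
    """Translate storage failures into the analysis module's own error type."""
    try:
        data = load_image(path)
    except StorageError as exc:
        # Chain with `from` so the original cause stays in the traceback.
        raise AnalysisError(f"analysis failed for {path}") from exc
    return float(len(data))
```

An integration test for this boundary would call `analyse` against the real storage module and assert both the raised type and that `__cause__` carries the lower-level error.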
Checks - Layer 3: Design (9)
Do the tests adequately and independently specify expected behaviour?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 3.1 | Coverage | ≥80% overall, ≥60% per file | Build (green) |
| 3.2 | Test mapping | Every src file → test file | Build (red) |
| 3.3 | Boundary tests | ≥3 per module | Build (quality gate) |
| 3.4 | Error path tests | Per raising function, specific type | Build (quality gate) |
| 3.5 | Test independence | Random order ×3 | Assessment only |
| 3.6 | Mutation testing | ≥60% kill rate (sampled) | Assessment only; Candidate |
| 3.7 | Exception hierarchy | Base class, subclasses, tested, consistent | Plan template |
| 3.8 | Boundary validation | Typed models at entry, not in domain | Plan template |
| 3.9 | Failure-path cleanup | Tested per resource type | Build (quality gate) |
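The two coverage thresholds in check 3.1 can be evaluated from a parsed coverage report. The sketch assumes the layout of coverage.py's JSON report (a `totals` object and a `files` mapping, each with `percent_covered`); the function name `coverage_failures` is illustrative and the assessment spec may collect the numbers differently.

```python
from typing import Any


def coverage_failures(
    report: dict[str, Any], overall_min: float = 80.0, per_file_min: float = 60.0
) -> list[str]:
    """Return threshold violations; an empty list means check 3.1 passes."""
    failures = []
    total = report["totals"]["percent_covered"]
    if total < overall_min:
        failures.append(f"overall {total:.1f}% < {overall_min:.0f}%")
    for path, data in sorted(report["files"].items()):
        pct = data["summary"]["percent_covered"]
        if pct < per_file_min:
            failures.append(f"{path} {pct:.1f}% < {per_file_min:.0f}%")
    return failures
```

The per-file floor matters because a high overall figure can hide one nearly untested module.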
Checks - Layer 4: Implementation (8)
Does the code compile, execute, and conform to technical standards?
| # | Check | Threshold | Enforced by |
|---|---|---|---|
| 4.1 | Tests pass | 100%, zero errors, zero unexplained skips | Build (green) |
| 4.2 | Lint (ruff) | Zero warnings, config present | Build (green) |
| 4.3 | Types (mypy strict) | Zero errors, strict mode | Build (green) |
| 4.4 | No debug code | Zero breakpoint/pdb/print/TODO/FIXME | Build (hygiene gate) |
| 4.5 | No secrets/paths | Zero credential literals, zero absolute paths | Build (hygiene gate) |
| 4.6 | Deps locked | Lock file exists, all pinned | Setup |
| 4.7 | Resource cleanup | All resources in context managers | Build (conventions) |
| 4.8 | Structured logging | Configured, used for errors + operations | Setup + Validate |
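Check 4.4 (no debug code) lends itself to a simple line scan. The patterns below are an assumption about what the hygiene gate looks for, based only on the list in the table; `DEBUG_PATTERNS` and `hygiene_findings` are illustrative names. The scan is text-based, so it will also flag matches inside string literals.

```python
import re

# Illustrative patterns for the items listed in check 4.4.
DEBUG_PATTERNS = {
    "breakpoint": re.compile(r"\bbreakpoint\s*\("),
    "pdb": re.compile(r"\bimport\s+pdb\b|\bpdb\.set_trace\s*\("),
    "print": re.compile(r"(?<![\w.])print\s*\("),
    "TODO/FIXME": re.compile(r"\b(?:TODO|FIXME)\b"),
}


def hygiene_findings(source: str) -> list[tuple[int, str]]:
    """Return (line number, label) for each debug leftover found in ``source``."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in DEBUG_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

The lookbehind on the `print` pattern keeps calls such as `pprint(...)` or `logger.print(...)` from being reported as bare `print` calls.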
Report Format & Determination
The assessment produces a structured markdown report saved to docs/production-readiness-report.md with per-check pass/fail verdicts, cited evidence, and an overall determination.
All layers pass. Every check has a PASS verdict with cited evidence.
Implementation and Design pass. Architecture or User has non-critical failures with documented remediation.
Implementation or Design layer has failures. These are release blockers regardless of higher-layer results.
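The three determination rules above reduce to a short decision function over per-layer results. In this sketch the verdict strings "PRODUCTION READY" and "CONDITIONALLY READY" come from this document, while "NOT READY" and the function name `determination` are assumptions. The sketch also does not model the requirement that higher-layer failures be non-critical with documented remediation.

```python
def determination(layer_passed: dict[str, bool]) -> str:
    """Map per-layer pass/fail results to an overall verdict.

    Expects the keys "user", "architecture", "design", and "implementation".
    """
    # Lower-layer failures are release blockers regardless of higher layers.
    if not (layer_passed["implementation"] and layer_passed["design"]):
        return "NOT READY"  # label assumed; not stated in the source
    if layer_passed["architecture"] and layer_passed["user"]:
        return "PRODUCTION READY"
    return "CONDITIONALLY READY"
```

The ordering of the branches encodes the principle that strength at a higher layer never compensates for a failure at a lower one.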
Expected Assessment Outcomes
CONDITIONALLY READY. L4 Implementation + L3 Design pass. L2 Architecture gaps at 2.4 (integration tests), 2.6 (error propagation), 2.7 (degradation). L1 User POC-dependent.
PRODUCTION READY against the project's scope profile. Pre-clinical baseline: 33 checks pass. Other profiles add domain-specific checks (traceability, audit, numerical regression).
Assessment itself would need expanding per scope profile: security scanning, performance baselines, mutation thresholds. See Scope Profiles tab for per-profile assessment extensions.
Assessment Limitations
| Limitation | Impact |
|---|---|
| Assessor variance | Different independent sessions may judge the same code differently: one may pass observability "with gaps", another may fail it. This is inherent to LLM-based assessment. Multiple runs surface different interpretations of conformance criteria. Fix borderline checks rather than relying on lenient assessors. |
| Single-session execution | Complex repos may exceed context limits, requiring sampling rather than exhaustive verification. |
| Mutation testing tool compatibility | Mutmut v3 dropped --paths-to-mutate (per-module scoping) and can segfault with numpy/PIL native extensions. The spec's per-module instruction may not be executable. Full-codebase runs include logging/config mutants that drag kill rates down. Assessors may need to use prior mutation evidence from git history or install mutmut v2. |
| Mutation testing is sampled | Only the highest-test-count module is intended to be tested. Full-codebase runs (forced by mutmut v3) include modules with no dedicated tests, depressing the kill rate. |
| Assessor tool dependencies | pytest-randomly and mutmut are assessor-installed tools, not project dependencies. The assessor must install them separately. Projects using uv require uv pip install rather than pip install. |
| Integration check is presence-based | Verifies integration tests exist, not that they are comprehensive against interface contracts. |
| User-level checks are lightweight | Appropriate for pre-clinical baseline. Scope profiles for regulated, clinical, or safety-critical contexts add domain-specific checks. See Scope Profiles tab. |
| No security audit | Path traversal may be unit-tested, but no independent OWASP review or dependency vulnerability scan. |
| Workflow exercise is schema-dependent | Core E2E test (1.3) requires discovering the API schema. Non-standard or undocumented APIs may be incompletely tested. |
Runbook
Lifecycle: init → setup → requirements → plan → skeleton → build (per module: red → green → commit) → integrate → validate → assess → done
Your review points: requirements (response schema), skeleton (interface contracts), plan (architecture), red phase (test quality), integrate (approve integration tests), two domain questions. Everything else is automated.
Prompts below are copy-paste into Claude Code unless marked TERMINAL.
Dual-tool pattern: Claude Code does the work. Claude app (this Project) acts as an independent advisor at key checkpoints, marked with CLAUDE APP. This prevents Claude Code from grading its own homework at the moments where judgement matters most. Ad-hoc prompts for deviation detection, recovery, and session continuity are in the reference section at the bottom.
Claude Code must be configured correctly before starting. The wrong model or thinking level produces shortcuts, weak tests, and protocol bypasses.
Model: Use the latest Opus (the most capable model). In Claude Code, type /model to open the model selector and choose Opus.
Thinking level: Set to high. The skill files are complex and the agent needs extended reasoning to follow them precisely, especially during red/green cycles where it is tempted to write tests and implementation together. You can check and change thinking level with Shift+Tab to cycle through levels, or /model to see the full selector.
Why Opus on high? The documented bypass incident (Phase B, first attempt) happened because the agent took shortcuts: it wrote all tasks in under 4 minutes, tests and implementation together, without invoking any slash commands. Higher-capability models with extended thinking reduce this risk. You can drop to medium for simple steps (commits, status checks), but keep it on high for all build, integrate, and validate stages.
Agent eagerness: The agent will often say "Shall I proceed?" or "Let me start building" and immediately begin working without waiting for your answer. This is the single most common protocol deviation. The prompt "Show me your analysis before you start making changes" is your primary defence. Always insist on seeing its plan before it acts.
Commit discipline: The agent frequently commits source files but forgets state file changes (conversion_state.yaml). After every stage transition, run git status and commit any uncommitted changes before proceeding. Build this habit from setup onward; it prevents state drift that confuses the agent after compaction.
Context compaction: Long sessions trigger compaction (summarisation). This is normal. The state machine is designed for it: the agent recovers from conversion_state.yaml and docs/2-plan.md. If the agent seems confused after compaction: Read conversion_state.yaml and docs/2-plan.md. What module, what task, what tdd_phase? Resume.
Coverage scoping: During the build stage, the green phase coverage check (≥80%) may fail because stub files in other modules have 0% coverage. The agent should scope coverage to the current module only (e.g., --cov=confluency.analysis not --cov=src/). If the agent gets stuck on coverage, tell it: Scope coverage to the current module only. Stubs in other modules are not yet implemented.
Hook compatibility: Git hooks may fail if they call pytest, ruff, or mypy directly instead of via uv run. If you see "command not found" from hooks, update the hook scripts to prefix each command with uv run. This is a one-time fix during setup.
Some tests pass during red, and that is sometimes OK: When the skeleton has already implemented a data class (e.g., ColorStats.__init__), tests that exercise the class structure will pass immediately. What matters is that tests exercising the function logic fail with NotImplementedError. Similarly, FastAPI routing tests (e.g., "POST to a GET-only endpoint returns 405") pass because FastAPI handles routing, not the stub. These are testing framework behaviour, not implementation.
0 - Initialise TERMINAL
Save a snapshot of your POC as it is today, then install the orchestrator alongside it. This creates a clean starting point so every change from here forward is tracked and reversible.
Take a snapshot of your POC code as it exists today. If anything goes wrong, you can always return to this point.
cd <YOUR_POC_PATH>
git init
git add -A
git commit -m "POC baseline"
git tag poc-baseline
Already a git repo? Skip git init. Dirty tree? Commit or stash first.
Install the orchestrator's rules, checklists, and enforcement tools into your project directory.
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/.claude .
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/docs .
cp -r ~/Desktop/Desktop/Claude/test-orchestrator/tasks .
cp ~/Desktop/Desktop/Claude/test-orchestrator/CLAUDE.md .
cp ~/Desktop/Desktop/Claude/test-orchestrator/conversion_state.yaml .
Personalise the orchestrator for your project and save the setup to version history.
sed -i '' 's/new-project/<YOUR_PROJECT_NAME>/' conversion_state.yaml
git add -A
git commit -m "Add V0.4 TDD orchestrator"
Confirm the orchestrator installed correctly and configure the AI for maximum reliability. The right model and thinking level prevents the most common failure mode: the agent skipping TDD discipline.
cd <YOUR_POC_PATH>
claude
Once Claude Code opens: type /model and select Opus with high thinking. Then paste: /project:status
Expect: 7-stage lifecycle displayed, stage: setup, REQUIREMENTS PHASE and SKELETON PHASE sections visible in the dashboard. If the slash command is not recognised, check that .claude/skills/ has 5 entries (status, red, green, integrate, validate).
1 - Setup
The AI reorganises your rough POC into a clean, professional project structure with automated quality checks. This is the foundation everything else builds on: if the scaffold is wrong, every subsequent step inherits the problem.
The AI analyses your POC, then rebuilds it into a proper project layout with type checking, linting, error handling, configuration management, and structured logging.
Read docs/1-setup.md. This is the setup checklist.
Analyse the existing POC codebase: look at all source files, understand the structure, identify the main packages and entry points.
Then work through every item in the checklist. For each item:
1. Do the work (create files, configure tools, etc.)
2. Check it off in docs/1-setup.md
Key requirements:
- Production layout must use src/<project_name>/ (not the POC's current layout)
- pyproject.toml with uv, ruff, mypy strict mode
- Git hooks: pre-commit (runs pytest), test guard (blocks test edits during green phase)
- Import POC code into src/ layout (reorganise, don't just copy)
- Structured logging: configure Python logging or structlog (log level, format, handler)
- Centralised config: Pydantic BaseSettings class (all env-dependent values from env vars with sensible defaults)
- Base exception hierarchy: AppException base class with at least ValidationError, NotFoundError subclasses
- ruff + mypy must be clean before setup is complete
Show me your analysis of the POC structure and your plan before you start making changes.
Verify the AI correctly understood your code before it starts restructuring. Mistakes in comprehension propagate into every later step.
Agent shows its analysis before proceeding. Check: correctly identifies functionality? Sensible src/ layout? Nothing lost from POC? Correct it before it starts if wrong.
Watch for: The agent frequently says "Shall I proceed?" and immediately starts building without waiting. If it begins creating files before you have reviewed the analysis, the work may be based on a wrong understanding of your POC. You can always let it finish and then check (the setup stage is relatively low-risk), but establishing the pattern of "analysis first, approval second" here prevents bigger problems at the plan and build stages.
If the AI gets interrupted or loses context mid-setup, this brings it back on track.
/project:status
Then: Read docs/1-setup.md. In your response, write out every checkbox line verbatim, because I need to see the exact state of each item. Then complete any remaining unchecked items.
Confirm every setup item is done and all automated quality checks pass before moving on. Nothing proceeds until this is clean.
I need to verify setup is complete. For each item below, write the results directly in your response. Do not just run commands; I need to read the output in your reply:
1. Read docs/1-setup.md, then write out every checkbox line verbatim so I can see which are checked
2. List the directory structure of src/ (3 levels deep)
3. Run: uv run pytest tests/ -v --tb=short, then report the full test results including pass/fail counts
4. Run: uv run ruff check src/, then report the full output
5. Run: uv run mypy src/, then report the full output
6. Read conversion_state.yaml, then write out the complete file contents
Do not modify anything. Write all results in your response text.
All items [x]. Ruff zero. Mypy zero. Tests pass.
Save the clean project scaffold and advance to the requirements stage, not directly to plan. The requirements stage validates your response schema before you invest in architecture.
Commit all setup work with message "Setup stage complete" and update conversion_state.yaml stage to "requirements".
Tip: After the agent commits, check git status. The agent often commits source files but forgets the state file change. If conversion_state.yaml is uncommitted, tell it to commit the remaining changes. This happens at every stage transition, so build the habit of checking now.
Recovery: setup errors
mypy errors: Fix all mypy strict errors. Every function needs full type annotations including return types.
ruff warnings: Fix all ruff warnings in src/. Then run uv run ruff check src/ and write the full output in your response so I can verify it's clean.
tests fail: These base tests are failing: [paste failures]. Fix them.
Hook errors ("command not found" for pytest/ruff/mypy): The git hooks call tools directly instead of via uv run. Fix each hook script in .claude/hooks/ or .git/hooks/: replace pytest with uv run pytest, ruff with uv run ruff, mypy with uv run mypy. This is a one-time fix. The agent usually handles it automatically when it encounters the error, but verify it did so.
Unused POC dependencies: The agent may import all POC dependencies into pyproject.toml including ones that are no longer needed after restructuring (e.g., pandas when only numpy is used). Run uv run python -c "import <package>" for each suspicious dependency and remove unused ones. Cleaning up dependencies now prevents confusion during the assessment stage.
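One way to see which packages the restructured code actually imports is to scan the source tree with the standard-library ast module. This is a minimal sketch, not part of the orchestrator; the function names are hypothetical, and note that an import name can differ from the distribution name in pyproject.toml (for example, PIL is provided by Pillow).

```python
"""List the top-level packages a source tree actually imports (sketch)."""
from __future__ import annotations

import ast
from pathlib import Path


def imported_packages(source: str) -> set[str]:
    """Return top-level package names imported by one Python source string."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            found.add(node.module.split(".")[0])
    return found


def packages_used_in(src_dir: Path) -> set[str]:
    """Union of imported top-level packages across every .py file in src_dir."""
    used: set[str] = set()
    for path in src_dir.rglob("*.py"):
        used |= imported_packages(path.read_text(encoding="utf-8"))
    return used
```

Calling `packages_used_in(Path("src"))` and comparing the result with the dependency list gives a starting set of candidates for removal; the final decision still needs a human check.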
1.5 - Requirements SYSTEM PROTOTYPE · BEFORE PLAN
The AI analyses your POC, identifies what it actually returns versus what callers need, and produces testable acceptance criteria. A minimal mock API is built and run to verify the response schema is correct before any architecture work begins. POC-level schema flaws, the most expensive to fix late, are caught here, before any architecture is built on them.
The AI reads your POC, maps its primary workflows, identifies every response schema, and surfaces any mismatch between what the POC returns and what is actually useful to the caller.
Read docs/0-requirements.md. This is the requirements checklist. Analyse the POC codebase. Look at all source files and identify:
1. The primary user workflow(s): what does a user actually do with this application?
2. Every HTTP endpoint and its current response schema: exact field names, types, and what each field contains
3. Any mismatch between what the POC returns and what a caller would actually need (e.g. metadata fields where result data is expected)
Then write docs/0-requirements.md containing:
ACCEPTANCE CRITERIA: numbered (AC-001, AC-002, ...), one per primary user workflow. Each must be testable: "Given X input, the system returns Y", not vague goals.
AGREED RESPONSE SCHEMA: for each primary endpoint, the exact response schema you intend to implement. If the POC schema is wrong or incomplete, define the correct one here. Every field: name, type, and what it contains.
After writing docs/0-requirements.md, build a minimal mock API:
- A FastAPI app with hardcoded responses matching the agreed schema (placeholder values, no real logic)
- Run it: uv run uvicorn mock_api:app --port 8001
- Confirm it starts without error
Read docs/0-requirements.md and write the complete contents in your response. Then show the mock API startup output.
This is your only chance to catch a wrong schema before it propagates through every module. Once the plan is written against it, fixing it requires reworking module interfaces, service contracts, and API response models.
- Acceptance criteria are testable: each has a specific input and expected output, not a vague goal
- Response schemas contain actual results, not just metadata or confirmation messages
- Every field a caller would need is present (confidence scores? bounding boxes? raw metrics?)
- No fields that exist only because the POC happened to produce them
- Mock API starts and returns the agreed schema shape
If schema is wrong: The response schema needs changes: [describe what fields are missing or incorrect]. Update docs/0-requirements.md and rebuild the mock API. Read the revised file and write its complete contents in your response.
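Checking that a response has "the agreed schema shape" can be made mechanical. The sketch below compares a payload against a field-to-type mapping; the schema contents and the function name are hypothetical placeholders, since the real schema comes from docs/0-requirements.md.

```python
"""Check that a response payload has exactly the agreed fields and types (sketch)."""
from __future__ import annotations

# Hypothetical agreed schema for one endpoint: field name -> expected Python type.
AGREED_SCHEMA: dict[str, type] = {
    "image_id": str,
    "mean_brightness": float,
    "width": int,
}


def shape_errors(payload: dict[str, object], schema: dict[str, type]) -> list[str]:
    """Return human-readable mismatches; an empty list means the shape matches."""
    errors = [f"missing field: {name}" for name in schema if name not in payload]
    errors += [f"unexpected field: {name}" for name in payload if name not in schema]
    for name, expected in schema.items():
        if name in payload and not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    return errors
```

Running this against the decoded JSON body of each mock endpoint flags missing fields, leftover POC-only fields, and wrong types in one pass.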
Lock in the agreed acceptance criteria and response schema before architecture work begins. The mock API is disposable; delete it after verification.
Check off the final checkbox in docs/0-requirements.md ("This file committed as the requirements artefact"). Delete mock_api.py if it still exists. Commit everything with message "Requirements stage complete - acceptance criteria and response schema defined". Update conversion_state.yaml stage to "plan".
What just happened: You now have a locked-in contract for what the API must return. Every module, test, and endpoint from this point forward is built against docs/0-requirements.md. If you later discover the schema was wrong, you will need to rework interfaces, which is why this step exists before the plan, not after. The mock API served its purpose (verifying the schema shape) and is deleted.
2 - Plan ALL MODULES BEFORE BUILDING ANY
The AI designs the blueprint: breaking the project into modules with clear boundaries, deciding how they connect, and defining what each piece must do. Architecture mistakes here are the most expensive to fix later.
The AI decomposes your project into modules with documented boundaries, connection points, error handling strategy, and an ordered task list for each. All modules are planned together because their interfaces depend on each other.
Analyse the POC code that was imported during setup. Based on the code's structure and responsibilities, decompose it into modules. For a web application, typical modules are:
- A domain/processing module (core logic, algorithms, data transforms)
- A service module (orchestrates domain logic, manages resources, file I/O)
- An API module (HTTP routes, request/response schemas, dependency injection)
- An app module (application factory, middleware, exception handlers, startup)
You may need more or fewer modules depending on the project's complexity.
For EACH module, write a plan in docs/2-plan.md containing:
1. MODULE DESCRIPTION: what it does, what POC code maps to it
2. FILE STRUCTURE: source files to create under src/
3. TASK BREAKDOWN: ordered list of implementation tasks (each becomes a red/green cycle)
4. DESIGN DECISIONS (all mandatory):
a. Exception hierarchy: what domain exceptions this module raises, semantic meaning, HTTP mapping if applicable
b. Input validation strategy: where validation happens (API boundary, not domain logic)
c. Dependency direction: what this module imports, what imports it. Must be a DAG.
d. Configuration management: what config values this module needs, sourced from settings class
5. INTERFACE CONTRACTS (per module boundary):
- Exported function signatures (name, parameter types, return type)
- Exception types that may cross the boundary
- Data types exchanged (Pydantic models, dataclasses, primitives)
MANDATORY TASKS to include in the app module:
- Health endpoint: GET /health → {"status": "ok"} (HTTP 200)
- README.md: description, installation, usage, API reference
Order modules by dependency: lowest-level (no internal dependencies) first, highest-level (app factory) last.
Read docs/2-plan.md. Write out the complete file contents in your response: every module, every task, every interface contract, every design decision. I need to review the full plan. Do not summarise or omit anything.
You verify the architecture makes sense for your domain. Everything that follows is built on these decisions: a wrong boundary or missing error case here becomes a structural problem later.
Read the entire plan. Check:
- Module boundaries make sense
- Dependency direction is a DAG (no circular, lower doesn't import upper)
- Exception hierarchy covers real error cases (not generic Exception)
- Input validation at API boundaries, not inside domain logic
- Interface contracts documented per module boundary
- Health endpoint + README in app module tasks
- Task ordering sensible (foundational first)
- Reasonable number of tasks per module
If wrong: The plan needs changes: [describe]. Update docs/2-plan.md, then read the file and write out the complete revised contents in your response.
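The "dependency direction is a DAG" check and the "lowest-level first" ordering can both be derived from the declared imports. The following sketch uses the standard-library graphlib module (Python 3.9+); the module names and the DEPENDENCIES mapping are hypothetical stand-ins for whatever the plan declares.

```python
"""Verify module dependencies form a DAG and derive a build order (sketch)."""
from __future__ import annotations

from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: each module maps to the internal modules it imports.
DEPENDENCIES: dict[str, set[str]] = {
    "domain": set(),
    "service": {"domain"},
    "api": {"service", "domain"},
    "app": {"api"},
}


def build_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return modules lowest-level first; raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

For the mapping above, `build_order(DEPENDENCIES)` yields domain, service, api, app, which is also the order in which the build stage would take the modules.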
Get a second opinion on the architecture before committing to it. The Claude app has the full spec context and can spot structural issues you might miss: circular dependencies, incomplete contracts, exception gaps.
Paste into this Claude Project conversation:
Claude Code just produced this module plan for my POC-to-production conversion. Review it independently against the V0.4 orchestrator spec. Be blunt. Check specifically:
1. Are module boundaries sensible for this domain? Any that should be split or merged?
2. Is the dependency direction a clean DAG? Any circular or upward risks?
3. Is the exception hierarchy complete, or are there real error cases this domain needs that are missing?
4. Are interface contracts specific enough (types, return values, exceptions) or vague?
5. Is input validation at the API boundary, not buried in domain logic?
6. Any tasks that are too large (should be split) or too small (overhead)?
7. Anything that will cause problems at the integrate stage?
Here is the plan: [PASTE CLAUDE CODE'S FULL PLAN OUTPUT]
If issues found, feed the feedback back to Claude Code: The plan needs changes: [paste Claude app's feedback]. Update docs/2-plan.md, then read the file and write out the complete revised contents in your response.
Lock in the approved blueprint and move to skeleton generation. The modules list in the state file must also be populated; this is what the status skill uses for progress tracking and gating.
Commit docs/2-plan.md with message "Plan stage complete - [N] modules, [M] total tasks". Then update conversion_state.yaml: set stage to "skeleton" and populate the modules list as specified in the plan (each module with status: pending and its task list). Commit the state change with message "Advance to skeleton stage".
2.5 - Skeleton ARCHITECTURE PROTOTYPE · BEFORE BUILD
The AI converts every interface contract in the plan into executable Python stubs: real type-annotated signatures, docstrings, and raise NotImplementedError bodies. Type errors in stubs are interface contract defects caught before any implementation is committed. The skeleton must pass mypy --strict, ruff, and circular import checks before build begins.
Important: If setup already created real implementations (which it typically does when importing and reorganising POC logic), the skeleton replaces those implementations with NotImplementedError stubs. This feels counterproductive but is intentional: the build stage will re-implement everything through proper red/green TDD cycles, replacing each stub with tested code. The skeleton validates the architecture contract (types, imports, exports) before any implementation is committed to it. Existing setup-stage tests should still pass because they test base patterns (config, exceptions, models), not the domain logic being stubbed out.
The AI reads the plan's interface contracts and generates all source file stubs. No real logic: only signatures, types, and NotImplementedError bodies.
Read docs/2-plan.md. You have a complete module plan with interface contracts for all modules. Generate the skeleton: for EACH module, create all source files under src/ with:
- Type-annotated signatures for every exported function and class (matching the interface contracts exactly)
- Docstrings on every exported symbol
- __all__ declarations in every __init__.py listing every exported symbol
- raise NotImplementedError bodies (no real logic)
- Correct imports: each module imports only from modules it is permitted to depend on per the declared DAG
After generating all stubs, run these three checks and write the results in your response:
1. uv run mypy src/ (must report zero errors)
2. uv run ruff check src/ (must report zero warnings)
3. For each module package, verify it imports without circular errors: uv run python -c "import <project>.<module>" for each module
Do not proceed until all three pass. If mypy reports type errors, fix the stubs; those errors are interface contract defects. Show the full output of each check.
The stubs are the interface contracts made executable. A type error here is a contract defect; fix it now, before any implementation is written against it.
- mypy strict: zero errors
- ruff: zero warnings
- All modules import without circular dependency errors
- Function signatures match the interface contracts in the plan (parameter names, types, return types)
- Dependency direction correct: no module imports from a module above it in the DAG
- Exception types used in signatures exist in the exception hierarchy
- Response types match the agreed schema from the requirements stage
If contracts are wrong: The skeleton has interface issues: [describe]. Fix the stubs; do not write any real implementation. Re-run mypy, ruff, and circular import checks, and write the full output in your response.
Verify existing tests still pass with the stubs in place, then lock in the architecture before implementation begins.
Before committing, verify existing tests still pass with the skeleton stubs: uv run pytest tests/ -v --tb=short. If they pass, commit with message "Skeleton stage complete - architecture prototype, N stub functions across M modules". Then update conversion_state.yaml: set skeleton_phase to "complete" and stage to "build". Commit the state change with message "Advance to build stage".
What just happened: You now have a complete, type-checked, importable architecture, but with no real logic. Every function raises NotImplementedError. The build stage will replace each stub with tested implementation, one task at a time, through locked red/green cycles. The stubs are guardrails: if an implementation later deviates from the declared interface, mypy will catch it.
Recovery: skeleton errors
mypy errors: Type errors in stubs are interface contract defects; fix the stub signatures. Mypy reports [N] errors in the skeleton. Fix the stubs; do not write any implementation logic. Show me mypy output after fixes.
Circular import: A module imports from a module it should not depend on. Circular import detected between [module A] and [module B]. Fix the import structure to match the DAG in the plan. Show me the corrected imports.
Signature mismatch: A stub's signature doesn't match the plan. The stub for [function] has signature [X] but the plan says [Y]. Update the stub to match the plan. Do not change the plan.
3 - Build PER MODULE, IN DEPENDENCY ORDER
Each task is built in two locked phases: first write tests that define what the code should do (red), then write the code to pass those tests (green). The AI cannot cheat: test files are physically locked during coding and cryptographically verified afterward.
Point the AI at the first module to build (lowest-level, fewest dependencies first). Set both the module and its first task in a single prompt, then immediately trigger the red phase. This keeps the agent focused on one specific deliverable.
Set current_module to "<first_module_name>" and current_task to "<first_task_name>" in conversion_state.yaml. Then run /project:red
Recommended commit pattern: After each green phase, give a specific commit message: Commit with message "analysis-1: core stats calculation - 13 tests, 100% module coverage". Include the task name, test count, and what was done. Then advance: Set current_task to "<next_task_name>" and run /project:red. Combining commit + advance + red in one prompt keeps the rhythm tight and prevents the agent from drifting between steps.
The AI writes tests that define what the code should do, before any code exists. These tests must fail because the code hasn't been written yet. This is the specification: it defines "done" before work begins.
/project:red
Weak tests produce weak code. The AI only writes enough code to pass its tests, so if the tests miss edge cases, the code will too.
- Tests specify behaviour, not implementation details
- Boundary condition tests: zero, empty, limits, off-by-one (≥ 2 per task)
- Error path tests assert specific exception types (not bare Exception)
- Failure-path cleanup tests if task manages resources
- Descriptive names (test_empty_input_returns_error, not test_1)
If weak: The tests need improvement: [specific feedback]. Rewrite the tests and re-run /project:red.
The AI wrote both the plan and the tests, so it may share blind spots with itself. A second opinion catches edge cases and weak assertions that look reasonable at first glance.
Paste into this Claude Project conversation:
Claude Code just completed /project:red for module [MODULE], task [TASK]. Review these tests independently. Be blunt: weak tests here mean weak code later. Check specifically:
1. Do tests specify behaviour or just mirror likely implementation?
2. Are boundary conditions meaningful for this domain, or trivial/obvious?
3. Do error path tests assert specific exception types from the hierarchy?
4. What edge cases should be tested but aren't? What could go wrong that these tests wouldn't catch?
5. Are any tests testing implementation details that would break on a valid refactor?
Here are the tests: [PASTE THE TEST CODE FROM CLAUDE CODE'S OUTPUT]
If issues found, feed back to Claude Code: The tests need improvement: [paste Claude app's feedback]. Rewrite the tests and re-run /project:red.
The AI writes the minimum code to pass the locked tests. Test files are locked at the OS level and cryptographically hash-verified afterward, so the AI cannot silently weaken its own tests to make implementation easier.
/project:green
Locks tests → implements → pytest + ruff + mypy + coverage ≥ 80% + hygiene → unlocks → SHA-256 audit.
Shortcut: /project:green --auto-advance auto-commits + runs red for next task.
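The lock-then-audit idea behind the green phase can be illustrated with the standard library alone. This is a sketch of the technique under simple assumptions (tests are `.py` files under one directory, permissions are POSIX-style); it is not the orchestrator's actual implementation, and all function names are hypothetical.

```python
"""Lock test files and audit them by SHA-256 (sketch, not the orchestrator's code)."""
from __future__ import annotations

import hashlib
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def hash_tests(test_dir: Path) -> dict[str, str]:
    """Map each test file's relative path to the SHA-256 of its contents."""
    return {
        p.relative_to(test_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*.py"))
    }


def set_locked(test_dir: Path, locked: bool) -> None:
    """Remove all write bits when locking; restore the owner's when unlocking."""
    for p in test_dir.rglob("*.py"):
        mode = p.stat().st_mode
        p.chmod(mode & ~WRITE_BITS if locked else mode | stat.S_IWUSR)


def audit(test_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or changed since hashing."""
    actual = hash_tests(test_dir)
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

The two mechanisms are complementary: the permission lock makes accidental edits fail loudly, while the hash audit still detects a change if something restored write access in the meantime.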
Tip (coverage scoping): Overall coverage may fall below 80% because stub files in unbuilt modules have 0% coverage. The agent should scope coverage to the current module: --cov=<project>.<module> not --cov=src/. If the agent gets stuck retrying the coverage check, tell it: Scope coverage to the current module only. Stubs in other modules are not yet implemented.
Tip (testing routes before the app factory exists): When building route modules before the app factory is implemented, the red-phase tests create a minimal test-local FastAPI app that mounts just the router. This is the correct pattern: it tests the route handler in isolation. The app factory (exception handlers, middleware) is tested separately in its own module's tasks.
Save this unit of work as one clean, traceable change in the project history.
Commit this task with a descriptive message.
Repeat 3.1 to 3.3 for each task in the module.
Mark this module as done and advance to the next one in a single prompt. Combining mark-complete + set-next-module + run-red keeps the workflow tight.
Commit with message "<task-name>: <description> - N tests". Mark <current_module> as complete in conversion_state.yaml. Set current_module to "<next_module>" and current_task to "<first_task>". Run /project:red
Health check before starting the next module: verify tests are growing, coverage is holding, and nothing is falling behind.
/project:status
Previous module complete. Tests growing. Coverage ≥ 80%. → Start next module from 3.0.
Recovery: build phase errors
pytest fails: Tests [list] are still failing. The error is [paste]. Fix the implementation; do not modify any test files.
ruff fails: Ruff has warnings. Run ruff check --fix src/ and then re-verify. (Common: EN DASH characters in docstrings copied from the plan; replace with hyphens.)
mypy fails: Mypy reports [N] errors. Fix all type errors. Every function needs full type annotations. Then run uv run mypy src/ and write the full output in your response so I can verify it's clean.
Coverage < 80%: First check if the agent is measuring coverage across the whole src/ (which includes stubs at 0%). Tell it: Scope coverage to the current module only: --cov=<project>.<module_name>. If coverage is genuinely low for the current module, add tests for uncovered paths.
Hygiene fails: Code hygiene check found: [paste]. Remove all bare print() (use logger), remove TODO/FIXME, remove debug code. Re-run /project:green.
SHA-256 audit fails: SHA-256 audit failed because test files were modified during green phase. Reset: set tdd_phase to null and test_lock to false in conversion_state.yaml. Re-run /project:red to re-hash.
Tests don't fail (pass during red): Some tests may legitimately pass, e.g., data class structure tests pass against the skeleton, or FastAPI routing tests (405 for wrong method) pass via framework behaviour. The key question: do the tests exercising function logic fail with NotImplementedError? If yes, proceed. If all tests pass, the task scope may overlap with existing work.
Existing tests break: Existing tests are now failing. Fix the new test code without changing existing tests.
Agent skips red/green: STOP. You must follow the red/green cycle. Reset: set tdd_phase to null in conversion_state.yaml. Now run /project:red for the current task.
Documentation-only tasks (e.g., README): The plan may mark a task as "documentation only". These do not go through the red/green TDD cycle because there is no meaningful test for README content. Write the document directly, commit, and advance. The validate stage will verify it exists with required sections.
Context compaction: Read conversion_state.yaml and docs/2-plan.md. You are on module "<name>", task "<name>". The tdd_phase is "<phase>". Resume from this point.
Crash mid-green (files locked): Terminal: chmod -R u+w tests/ then Claude Code: Set test_lock to false and tdd_phase to null in conversion_state.yaml. Re-run /project:green.
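The hygiene rules named in the recovery list (no bare print(), no TODO/FIXME) are simple enough to approximate with a line scan. The sketch below is an assumption about what such a check might look like, not the orchestrator's own check; it is deliberately naive and will miss print calls that are not at the start of a statement.

```python
"""Flag hygiene violations in source text (sketch; the real check may differ)."""
from __future__ import annotations

import re

# Assumed rules, taken from the recovery prompt: bare print() and TODO/FIXME markers.
PATTERNS: dict[str, re.Pattern[str]] = {
    "bare print()": re.compile(r"^\s*print\("),
    "TODO/FIXME marker": re.compile(r"\b(TODO|FIXME)\b"),
}


def hygiene_violations(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every line that breaks a rule."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, rule))
    return hits
```

A scan like this is useful as a quick local check before re-running /project:green, so the agent does not burn a full cycle on a stray debug statement.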
4 - Integrate ONCE, AFTER ALL MODULES COMPLETE
Individual modules were tested in isolation. This phase tests how they work together using real dependencies, with no fakes or simulations. This is where interface mismatches and wiring bugs surface.
Confirm every module is built and marked complete before testing their connections.
/project:status
Must show all modules complete.
The AI reads the entire codebase and maps how all modules connect: data flow, error paths, and interface contracts.
/project:integrate
Step 1: Automated codebase analysis. No action from you.
You provide domain knowledge the AI cannot infer from code alone: which failure modes actually matter in your field, and how precise the testing needs to be.
Question A: Cross-stage failure modes. Agent proposes 4 to 5 options. Select the real risks in your domain. If unsure, select all plausible.
Question B: Correctness criteria. Choose tolerance: scientific/medical → moderate ±5% or strict ±1% · data processing → coarse ±15% or moderate ±5% · CRUD/web → structural correctness
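In a numeric integration test, the tolerance chosen in Question B translates into a relative comparison. The sketch below shows one way to express the three levels; the helper name and the mapping are hypothetical, and the comparison is relative to the expected value, so an expected value of zero only matches exactly.

```python
"""Relative-tolerance comparison for integration assertions (sketch)."""

# Tolerance levels named in Question B, expressed as fractions.
TOLERANCES = {"coarse": 0.15, "moderate": 0.05, "strict": 0.01}


def within_tolerance(actual: float, expected: float, level: str) -> bool:
    """True if actual is within the named relative tolerance of expected."""
    return abs(actual - expected) <= TOLERANCES[level] * abs(expected)
```

An integration test would then assert, for example, `within_tolerance(result, reference, "moderate")` rather than exact equality.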
Verify the tests use real components, not mocked substitutes. The whole point of this phase is testing genuine connections between modules.
Before I approve, I need to review the integration tests. Write all of this in your response; do not just run commands:
1. Read every test file in tests/integration/ and write out the complete contents of each
2. For each test function: state which module boundary it exercises (one line each)
3. Run: grep -rn "mock\|patch\|MagicMock\|override\|Mock\|unittest.mock" tests/integration/ and report the results (should be zero matches)
- Zero mock/patch/override references (reject immediately if any found)
- ≥ 1 test per module boundary
- Real or programmatically generated test inputs
- ≥ 1 error propagation test (lower module error → HTTP status at API)
- Batch/degradation test if applicable
- All tagged @pytest.mark.integration
If mocks found: Integration tests must use REAL dependencies, no mocks. Remove all mock/patch/override/MagicMock usage. Rewrite the tests, then read each revised test file and write out the complete contents in your response.
Run the approved integration tests, then automatically verify the project's dependency structure and module interface contracts.
I approve the integration tests. Proceed with execution, DAG verification, and interface contract checks.
Save the integration evidence: tests, dependency verification, and contract checks.
Commit with message "Integrate stage complete - integration tests, DAG verified, contracts verified"
Recovery: integrate failures
Integration tests fail: This is the stage working as designed; it found real cross-module issues. Common: interface mismatch, missing config value, exception type not caught at boundary. Agent diagnoses; may need source fixes via red/green cycle, then re-run /project:integrate.
DAG fails: Circular or upward imports. Fix the import structure.
Interface contracts fail: Missing __all__, missing docstrings, broken imports. Fix and re-run.
5 - Validate
Test the finished application as a real user would. Does it start? Does it respond to requests? Does it handle bad input gracefully? Is it documented and observable when something goes wrong?
Seven checks from the end user's perspective: can it start, respond, complete a real workflow, handle mistakes, explain itself, be diagnosed when things go wrong, and reproduce the POC's outputs?
/project:validate
Runs 7 checks in order:
| Check | What it verifies |
| --- | --- |
| Startup | App factory runs, returns ASGI instance, <10s |
| Health | /health → 200 + JSON |
| E2E | Primary workflow via TestClient, no mocks, in tests/e2e/ |
| Errors | Invalid inputs → structured JSON 4xx, no stack traces |
| Docs | README sections + /docs returns 200 |
| Observability | Startup + error logging, correlation mechanism |
| POC Parity | Production outputs match POC outputs on known inputs (acceptance criteria from docs/0-requirements.md) |
Each check is either a hard gate (blocks if it fails) or a soft check (gaps are recorded).
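The Errors check expects structured JSON with a 4xx status and no stack traces in the body. A framework-independent sketch of that mapping is shown below; the class names, the `status_code` attribute, and the body fields are assumptions for illustration, and in the real application this logic would live in the app factory's exception handlers.

```python
"""Turn exceptions into structured JSON error bodies without stack traces (sketch)."""
from __future__ import annotations

import json
import logging

logger = logging.getLogger("app.errors")


class AppException(Exception):
    """Base application error; status_code is an assumed attribute."""

    status_code = 500


class NotFoundError(AppException):
    """A requested resource does not exist."""

    status_code = 404


def to_error_response(exc: Exception) -> tuple[int, str]:
    """Return (HTTP status, JSON body); details of unknown errors stay in the log."""
    if isinstance(exc, AppException):
        body = {"error": type(exc).__name__, "detail": str(exc)}
        return exc.status_code, json.dumps(body)
    # Unexpected error: log the traceback server-side, return a generic body.
    logger.error("unhandled error", exc_info=exc)
    return 500, json.dumps({"error": "InternalServerError", "detail": "internal error"})
```

Known domain errors keep their message because it was written for callers; anything unexpected is logged in full but reduced to a generic body, which also covers the observability check's requirement for error logging.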
Why POC parity? The production system must do at least what the POC did. If the POC processed an image and returned brightness values, the production system should return equivalent values for the same input. This catches subtle regressions introduced during restructuring: different image mode handling, rounding changes, dropped fields.
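A parity comparison of this kind can be sketched as a field-by-field diff between a recorded POC output and the production output for the same input. The function name, the sample fields, and the default tolerance are hypothetical; the sketch assumes production may add fields but must not drop or change any that the POC returned.

```python
"""Compare POC and production outputs field by field (sketch)."""
from __future__ import annotations

import math


def parity_mismatches(
    poc: dict[str, object], prod: dict[str, object], rel_tol: float = 1e-9
) -> list[str]:
    """Return differences; an empty list means production reproduces the POC."""
    problems: list[str] = []
    for field, expected in poc.items():
        if field not in prod:
            problems.append(f"{field}: dropped")
            continue
        actual = prod[field]
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (expected, actual)
        )
        if both_numeric:
            # Numbers are compared with a relative tolerance to absorb float noise.
            same = math.isclose(actual, expected, rel_tol=rel_tol)
        else:
            same = actual == expected
        if not same:
            problems.append(f"{field}: POC={expected!r} production={actual!r}")
    return problems
```

Loosening `rel_tol` is how a deliberate rounding change would be accepted; a dropped or renamed field always shows up as a mismatch.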
Read the verdict on whether the application meets production standards.
Read docs/validate-report.md. Write out the complete file contents in your response: every check, every verdict, the final determination. Do not summarise.
PRODUCTION READY: all pass · CONDITIONALLY READY: hard gates pass, soft gaps · NOT READY: hard gate fails, so fix and re-run
Save the validation evidence and determination.
Commit with message "Validate stage complete - [RESULT]"
Recovery: validate failures
Startup fails: Startup failed with: [paste error]. This is usually a wiring issue that only surfaces when the full app assembles. Diagnose the root cause and fix it. Then re-run /project:validate.
E2E fails: The E2E test failed. Read every test file in tests/e2e/ and write out the complete contents in your response. Then write out the actual route handler signature it's trying to hit and the error message. The request format probably doesn't match the route's expected input.
Stack traces in errors: Error response check found stack traces in: [paste]. The exception handler chain is incomplete. Fix the exception handling and re-run /project:validate.
POC parity fails: The production system returns different values than the POC for the same input. Common causes: different image mode handling (RGB vs RGBA vs palette), different rounding, changed field names. Compare the POC's computation logic to the production implementation and fix the discrepancy. POC parity check shows a mismatch: [describe difference]. Compare the POC logic in app/main.py to the production implementation and fix the discrepancy. The production system must reproduce the POC's behaviour for known inputs.
6 - Production Readiness Assessment OPTIONAL · SEPARATE SESSION
A completely independent review. A fresh AI session, with no memory of the build process, examines the finished codebase against 33 formal quality criteria across four VP-model layers. This is the final quality gate: an auditor that evaluates only what exists in the repository.
The first assessment run rarely returns PRODUCTION READY. This is the system working correctly: the independent assessor catches gaps that the orchestrator's own checks miss (observability depth, mutation testing coverage, error message assertion tightness). The typical pattern is 2 to 3 runs: assess → fix targeted gaps in the build session → reassess in a fresh session. Each loop takes 15 to 20 minutes. Do not be discouraged by NOT READY or CONDITIONALLY READY on the first pass.
Assessor variance: Different assessor sessions may judge the same code differently: one may pass observability "with gaps", another may fail it. This is inherent to LLM-based assessment and is actually a strength: multiple runs surface different interpretations of the conformance criteria. If a check is borderline, fix it rather than relying on a lenient assessor.
Start a clean AI session so the assessment has no bias or memory from the build process.
cd <YOUR_POC_PATH>
claude
Must be a separate session: no orchestrator context, no build history. Set model to Opus / high.
The independent AI runs 33 formal checks across four VP-model layers: user experience, architecture, test quality, and code quality. It evaluates only what exists in the repository.
Read the production readiness assessment spec at ~/Desktop/Desktop/Claude/test-orchestrator/production-readiness-assessment-spec.md This project uses uv for dependency management. Use "uv run" to execute all commands (e.g. "uv run pytest", "uv run ruff check", "uv run mypy"). Install assessor tools with "uv pip install" (e.g. "uv pip install pytest-randomly mutmut"). Execute every check in order, from User through Implementation level. Collect all evidence. Do not skip any check. Produce the final report in the format specified at the end of the document. Save the report to docs/production-readiness-report.md.
33 checks across 4 VP-model layers. Expect: User PASS or CONDITIONAL, Architecture PASS, Design PASS (3.5/3.6 may surface gaps), Implementation PASS.
Fix the specific gaps identified in the report, then re-run the assessment in a fresh session. Each fix-and-reassess loop is typically small and targeted.
Resume the build session (not the assessment session):
The independent production readiness assessment found these gaps: [PASTE THE FAIL VERDICTS AND REMEDIATION STEPS FROM THE REPORT] Fix each gap. Run all tests to verify nothing breaks. Commit with message "Fix [gap description]".
Then close the build session, open a new assessment session, and re-run step 6.2. Repeat until PRODUCTION READY.
Common gaps on first run: Mutation kill rate below 60% (add targeted tests for surviving mutants), observability depth (add logger.error() in exception handlers, add startup log event), error message assertion tightness (use match= parameter in pytest.raises).
Mark the project as complete in version history.
git tag poc-to-production-complete -m "Full conversion complete - all VP-model levels validated"
Recovery: assessment issues
Assessor can't find dependencies ("ModuleNotFoundError"): The assessor may try bare pytest instead of uv run pytest. If you see import errors, tell it: This project uses uv. Run all commands with "uv run" prefix (e.g. "uv run pytest tests/ -v"). Install additional tools with "uv pip install".
Mutmut v3 segfaults: Mutmut v3 can segfault when mutation-testing code that uses numpy/PIL native extensions. If this happens, the assessor should use prior mutation testing evidence from git history (commit messages document kill rates). Alternatively, install mutmut v2: uv pip install "mutmut<3".
Mutmut v3 dropped --paths-to-mutate: The assessment spec says to run mutation testing on the highest-test-count module only, but mutmut v3 removed per-module scoping. The assessor will run against the full codebase, which includes logging/config code that drags the kill rate down. If the kill rate is marginal (55-65%), tightening error message assertions and adding logging tests is the fastest fix.
pytest-randomly not installed: This is an assessor tool, not a project dependency. The assessor should install it: uv pip install pytest-randomly.
Assessor variance: Two independent sessions may produce different verdicts on the same code. This is inherent to LLM-based assessment. If a borderline check passes in one run but fails in another, fix the underlying gap rather than relying on the lenient interpretation.
CLAUDE APP Ad-hoc Prompts
Use these anytime during the workflow: not at fixed checkpoints, but when something seems off, breaks, or needs a handoff.
When Claude Code's output looks suspicious: tests and code appearing together, steps being skipped, unexpected file changes. Faster than re-reading the spec yourself.
Claude Code just produced this output. I'm on module [MODULE], task [TASK], phase [red/green/etc]. Does this follow the TDD orchestrator protocol? Specifically: - Was the red/green sequence respected, or did it write tests and implementation together? - Were any steps skipped? - Is there anything here I should reject or push back on? If it deviated, give me the exact correction prompt to paste into Claude Code. [PASTE CLAUDE CODE'S OUTPUT]
When something breaks and you don't know the right fix. Claude app can diagnose the error and produce the exact prompt to paste into Claude Code.
Claude Code hit an error during the [PHASE] phase for module [MODULE], task [TASK]. Diagnose the issue and give me: 1. What went wrong and why 2. The exact prompt I should paste into Claude Code to fix it 3. Whether I need to reset any state (tdd_phase, test_lock) before the fix Here is the error output: [PASTE THE ERROR]
After Claude Code compaction, a crash, or starting a new day. Claude app remembers the project history and can draft the optimal re-entry prompt so Claude Code picks up exactly where it left off.
I need to resume the TDD orchestrator workflow in Claude Code. [Context compaction happened / I'm starting a new session / Claude Code crashed]. I was on module [MODULE], task [TASK], phase [PHASE]. [Optional: here's what happened in the last session - PASTE ANY RELEVANT CONTEXT] Draft the optimal continuation prompt I should paste into the new Claude Code session, including: - What to read first (state file, plan, lessons) - Where to resume - Any warnings about common issues at this point in the workflow
At the integrate stage, before you approve. Claude app can independently verify the integration tests use real dependencies and cover the right module boundaries.
Claude Code produced these integration tests at the /project:integrate stage. Review them independently. The critical constraint is: NO mocks, patches, MagicMock, or dependency overrides anywhere. Check specifically: 1. Are there ANY mock/patch/override references? (reject immediately if so) 2. Does every module boundary have at least one integration test? 3. Is there an error propagation test (error in lower module → correct HTTP status at API)? 4. Are test inputs real or programmatically generated (not empty stubs)? 5. Any module boundaries that are untested? Here are the integration tests: [PASTE INTEGRATION TEST CODE]
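Check 1 in the prompt above can also be pre-screened mechanically before the tests are pasted for review. The following is a minimal sketch, not part of the orchestrator: the function name `find_mock_references` and the pattern list are assumptions, and `dependency_overrides` is assumed to be the FastAPI-style override attribute. A plain-text scan produces false positives (for example an HTTP `client.patch(...)` call), so hits still need a human look.

```python
import re

# Names that suggest test doubles. This pattern list is an assumption; extend it for your stack.
FORBIDDEN = re.compile(
    r"\b(mock|MagicMock|patch|monkeypatch|dependency_overrides)\b",
    re.IGNORECASE,
)


def find_mock_references(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line that mentions a forbidden name."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if FORBIDDEN.search(line)
    ]


# Hypothetical test sources used only to exercise the scanner.
clean_tests = "def test_health(client):\n    assert client.get('/health').status_code == 200\n"
dirty_tests = (
    "from unittest.mock import patch\n"
    "\n"
    "def test_upload():\n"
    "    with patch('app.storage.save'):\n"
    "        pass\n"
)

print(find_mock_references(clean_tests))                             # []
print([lineno for lineno, _ in find_mock_references(dirty_tests)])   # [1, 4]
```

An empty result does not prove the tests use real dependencies; it only rules out the most obvious violations.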
Commands
| Command | When |
|---|---|
| /project:status | Anytime: health, progress, gating, next step |
| /project:red | Start of each task: write failing test specs |
| /project:green | After reviewing tests: implement + verify |
| /project:green --auto-advance | Green + auto-commit + auto-red for next task |
| /project:integrate | After all modules complete: integration tests, DAG, contracts |
| /project:validate | After integrate: startup, health, E2E, errors, docs, observability |
Known Weaknesses
Same model writes tests + impl. Temporal separation via lock+hash, but shared blind spots. Multi-agent is a candidate but requires orchestrator redesign.
Quality counts are soft-enforced. Mutation testing would add hard evidence but is currently assessment-only.
Bypassed once. Structural checks via status skill mitigate. PreToolUse hook (candidate) would close entirely.
No fix mechanism post-completion. No rollback. Integration tests catch some issues. Formal patch command is a candidate.
Compaction = info loss. Lessons file mitigates. Not resolvable by orchestrator design.
Independent assessment sessions may judge the same code differently: one passes observability "with gaps", another fails it. Validated across 3 runs on the confluency app: NOT READY → CONDITIONALLY READY → PRODUCTION READY (after targeted fixes). Mitigated by fix-and-reassess loops and by fixing borderline checks rather than relying on lenient assessors.
Gap Resolution Matrix
| Gap | Status | Resolved by |
|---|---|---|
| Integration tests | CLOSED | Integrate stage |
| System/E2E tests | CLOSED | Validate stage |
| Error propagation | CLOSED | Integrate stage |
| User-level validation | CLOSED | Validate stage |
| Mutation in build | CANDIDATE | - |
| Hard test-first enforcement | CANDIDATE | PreToolUse hook |
| Multi-agent | CANDIDATE | - |
| Patch workflow | CANDIDATE | /project:patch |
| Security scanning | CANDIDATE | - |
| Performance testing | CANDIDATE | - |
| Requirements traceability | SCOPE PROFILE | Configured per project scope (see Scope Profiles tab) |
| Context window | Inherent LLM constraint | |
Defects Found (Phase B)
| Defect | Detected By | Code Review? |
|---|---|---|
| mask[-0:] selects entire array | Boundary test (zero-value) | Unlikely |
| np.uint8() returns scalar | mypy strict | Possibly |
| Pydantic v1 __construct__ | Collection validation | If reviewer knows v2 |
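The first defect in the table is easy to miss in review because `-0` is simply `0` in Python, so a slice meant to take "the last n rows" takes everything when n is zero. The sketch below reproduces the pattern with plain lists; NumPy basic slicing follows the same rule. The function names are hypothetical and the actual POC code is not shown in this document.

```python
def last_rows_buggy(rows: list[int], border: int) -> list[int]:
    """Intended: the last `border` rows. Wrong when border == 0."""
    return rows[-border:]  # -0 == 0, so rows[-0:] is rows[0:], i.e. every row


def last_rows_fixed(rows: list[int], border: int) -> list[int]:
    """Return the last `border` rows, and an empty list when border == 0."""
    if not 0 <= border <= len(rows):
        raise ValueError("border must be between 0 and len(rows)")
    return rows[len(rows) - border:]


rows = [10, 11, 12, 13]

print(last_rows_buggy(rows, 1))  # [13]
print(last_rows_buggy(rows, 0))  # [10, 11, 12, 13]  <- the defect
print(last_rows_fixed(rows, 1))  # [13]
print(last_rows_fixed(rows, 0))  # []
```

A zero-value boundary test is the only input that distinguishes the two versions, which is why the defect was caught by the boundary requirement rather than by happy-path tests.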
Open Questions
1. Would you accept this git history? Atomic commits, passing tests, clean statics, strict types.
2. Is design-level testing sufficient, or do you require architecture validation + mutation?
3. Is enforced test-first sufficient, or do you require an additional human review gate?
4. Does border_percent=0 detection justify the overhead?
Scope Profiles: Adapting the Workflow
The orchestrator's current configuration targets pre-clinical commercial software: production-quality code without regulatory certification. Every other scope profile is described as a delta from this baseline: what to add, what to tighten, what new stages are required.
Each profile gives step-by-step instructions for modifying the orchestrator files. You are creating a new V0.4 workflow instance for each scope, not modifying the baseline. Copy the orchestrator, apply the delta, and use the modified copy for that project.
These profiles describe what should change per scope. The orchestrator does not currently enforce scope-specific behaviour automatically; you are manually configuring it. Some profiles require stages (security scanning, formal UAT, traceability) that do not yet exist as orchestrator commands. Where this is the case, it is flagged as a manual step you must perform outside the orchestrator.
How Scope Profiles Map to VP-Model Layers
The orchestrator enforces the VP-model across four abstraction layers (see VP-Model tab). Scope profiles parameterise all four layers, not just the User level, though the User level changes most dramatically. The table below shows which layers each profile modifies and the nature of the change.
| VP-Model Layer | Internal | Pre-clinical | Scientific R&D | Clinical Trials | Regulated Medical | Safety-Critical |
|---|---|---|---|---|---|---|
| User | Startup + health only | E2E, error quality, docs, observability | + reference comparison, reproducibility docs | + data export round-trip, e-signature flow | + formal UAT, traceability matrix, SRS/SDD | + witnessed UAT, safety case, certification |
| Architecture | Optional integration | Integration tests, DAG, error propagation | + numerical pipeline regression, provenance | + audit trail verification, edit checks | + hazard-scenario tests, risk control verification | + formal proof, MC/DC, independent verification |
| Design | Relaxed thresholds | 80% coverage, ≥3 boundary, quality gates | + determinism tests, regression fixtures | + domain edit-check tests, soft-delete tests | 90-100% coverage, mutation (Class C), sign-off | 100% MC/DC, formal specification |
| Implementation | TODOs allowed if tracked | Zero hygiene issues, structured logging | + Python version pinned, seed management | + audit log emission, no hard-delete | + security scanning, reviewer in commit | Replace toolchain (MISRA, Polyspace) |
The User level is the primary configuration surface: it determines what "production ready" means for each project. But the lower layers must tighten in step: there is no value in formal UAT (User) if the coverage threshold is 60% (Design) or integration tests are skipped (Architecture). Each profile is internally consistent across all four layers.
The four VP-model layers and the seven orchestrator stages (setup → requirements → plan → skeleton → build → integrate → validate) are fixed structure. The requirements stage content (what acceptance criteria to define, what schema decisions to validate) varies per project scope. What changes per project is the content within each layer: what the plan template requires, what thresholds the build enforces, what the integrate stage tests for, and what "validate" means. The pre-clinical baseline is one configuration, demonstrated on the confluency app. Every new development project can define its own scope profile, parameterising the same workflow to meet its specific regulatory, scientific, or commercial requirements.
Profile 0: Pre-Clinical Commercial BASELINE
This is what exists today. All other profiles reference it. No changes needed: use the orchestrator as-is.
| Dimension | Baseline Setting |
|---|---|
| Coverage threshold | ≥ 80% overall, ≥ 60% per file |
| Mutation testing | Assessment-only, ≥ 60% kill rate (sampled) |
| Boundary tests | ≥ 3 per module |
| Integration tests | ≥ 1 per module boundary, no mocks |
| Security | Path traversal in unit tests. No independent audit. |
| Traceability | None. POC is implicit requirement. |
| Acceptance testing | Lightweight E2E via TestClient. No formal UAT. |
| Documentation | README + auto-generated API docs |
| Observability | Structured logging. Correlation mechanism optional. |
| Assessment checks | 33 checks, 4 layers. PRODUCTION READY target. |
Profile 1: Clinical / Regulated Medical
When to use: Software that will be submitted to or reviewed by a regulatory body (FDA, EMA, MHRA, PMDA). Includes medical device software (IEC 62304), clinical trial data systems (21 CFR Part 11), GxP laboratory software, and diagnostic tools intended for clinical decision-making.
VP-Model layers modified: User ●●●: formal UAT, traceability, SRS/SDD · Architecture ●●: hazard-scenario integration tests · Design ●●●: 90-100% coverage, mutation build-time · Implementation ●●: security scanning, audit log, reviewer sign-off
The orchestrator was not designed for regulated medical software. These modifications bring it closer, but do not constitute a validated software development lifecycle. You still need a quality management system (QMS), risk management per ISO 14971, and regulatory expertise. This profile adds engineering rigour; it does not replace regulatory process.
Applicable standards: IEC 62304 (medical device software lifecycle), IEC 62366 (usability engineering), ISO 14971 (risk management), 21 CFR Part 11 (electronic records), EU MDR 2017/745.
Step-by-step modifications
1. Plan template: add requirements traceability
In docs/2-plan.md, add a Requirements section before module decomposition. Each requirement gets a unique ID (e.g., REQ-001), acceptance criteria, risk classification (IEC 62304 Class A/B/C), and traceability forward to the integration/E2E test that validates it. The plan template should enforce: every requirement has at least one acceptance test ID. Every acceptance test ID traces back to at least one requirement.
2. Plan template: add risk classification per module
IEC 62304 classifies software by safety class (A = no injury, B = non-serious injury, C = death/serious injury). Each module in the plan must declare its class. Class C modules require: 100% statement coverage, mandatory mutation testing during build (not assessment-only), and formal code review sign-off.
3. Build stage: tighten thresholds
Modify .claude/skills/green/SKILL.md:
| Dimension | Pre-clinical | Clinical |
|---|---|---|
| Coverage (overall) | ≥ 80% | ≥ 90% (Class B/C modules: 100% statement) |
| Coverage (per file) | ≥ 60% | ≥ 80% |
| Mutation testing | Assessment-only | Build-time for Class C modules, ≥ 80% kill rate |
| Boundary tests | ≥ 3 per module | ≥ 5 per module, covering all identified risks |
| Error path tests | One per public function that raises | One per public function that raises + one per risk control |
4. Build stage: add human review gate
After each green phase, add a mandatory hold: the agent outputs its diff and waits for explicit human approval before committing. Modify the green skill to require the user to type APPROVED before the commit step. Record the reviewer identity and timestamp in the commit message.
5. Integrate stage: add system-level hazard tests
Modify .claude/skills/integrate/SKILL.md. In addition to the standard integration tests, require hazard-scenario tests derived from the ISO 14971 risk analysis. Each identified hazard must have a corresponding integration test that demonstrates the risk control is effective. These are separate from functional integration tests: they test safety, not features.
6. Validate stage: formalise acceptance testing
Modify .claude/skills/validate/SKILL.md. Replace lightweight E2E with formal UAT:
| Dimension | Pre-clinical | Clinical |
|---|---|---|
| Acceptance tests | Agent-authored E2E via TestClient | Derived from REQ-IDs, traceable, human-witnessed |
| Error responses | No stack traces, structured JSON | + no internal identifiers (DB IDs, file paths, UUIDs) |
| Observability | Correlation optional | Mandatory request ID, audit log for all state-changing operations |
| Documentation | README + /docs | + Software Requirements Specification, Software Design Description, Test Report |
7. New stage: add /project:audit (manual)
After validate, produce a traceability matrix: Requirement → Design Task → Unit Test(s) → Integration Test → Acceptance Test. This does not exist as an orchestrator command. Create it manually or write a script that parses docs/2-plan.md, test file names, and pytest markers to generate the mapping. Save to docs/traceability-matrix.md.
8. New stage: add security scanning (manual)
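The script mentioned above could start from something like the following sketch. It covers only the Requirement-to-test link, not the full chain, and it rests on two assumptions that are not orchestrator conventions: requirement IDs look like `REQ-001`, and tests declare their requirement with a hypothetical `@pytest.mark.req("REQ-001")` marker placed directly above the test function. It reads source text with regular expressions, so pytest itself is not needed to run it.

```python
import re

REQ_ID = re.compile(r"\bREQ-\d{3}\b")
# Hypothetical marker convention: @pytest.mark.req("REQ-001") directly above the test.
MARKED_TEST = re.compile(r'@pytest\.mark\.req\("(REQ-\d{3})"\)\s*def (test_\w+)')


def build_matrix(plan_text: str, test_source: str) -> tuple[dict[str, list[str]], list[str]]:
    """Map each requirement in the plan to its tests; collect tests citing unknown IDs."""
    matrix: dict[str, list[str]] = {req: [] for req in sorted(set(REQ_ID.findall(plan_text)))}
    orphans: list[str] = []
    for req, test_name in MARKED_TEST.findall(test_source):
        if req in matrix:
            matrix[req].append(test_name)
        else:
            orphans.append(test_name)
    return matrix, orphans


def to_markdown(matrix: dict[str, list[str]]) -> str:
    """Render the matrix in the pipe-table style used for docs/traceability-matrix.md."""
    lines = ["| Requirement | Tests |", "|---|---|"]
    for req, tests in matrix.items():
        lines.append(f"| {req} | {', '.join(tests) or 'MISSING'} |")
    return "\n".join(lines)


# Hypothetical plan and test content, for illustration only.
plan = "REQ-001: Upload accepts PNG.\nREQ-002: Reject oversized files.\n"
tests = (
    '@pytest.mark.req("REQ-001")\n'
    "def test_upload_png():\n    ...\n\n"
    '@pytest.mark.req("REQ-009")\n'
    "def test_unknown_requirement():\n    ...\n"
)

matrix, orphans = build_matrix(plan, tests)
print(to_markdown(matrix))
print("Orphan tests:", orphans)
```

Tests that carry no marker at all are invisible to this scan, so the "no orphan tests" condition would still need a separate check against the full list of collected tests.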
Run pip-audit for dependency vulnerabilities. Run bandit for static security analysis. Record results in docs/security-scan-report.md. Neither is orchestrated β run manually after validate.
9. Assessment: extend checks
Add to production-readiness-assessment-spec.md:
| New Check | Layer | Pass Condition |
|---|---|---|
| Requirements traceability | User | Every REQ-ID has ≥ 1 acceptance test. No orphan tests. |
| Risk control verification | Architecture | Every identified hazard has a corresponding integration test. |
| Audit log presence | Implementation | State-changing operations produce audit log entries. |
| Security scan | Implementation | Zero critical/high vulnerabilities in dependencies. |
| Reviewer sign-off | Design | Every commit message includes reviewer identity. |
Usability engineering (IEC 62366). Clinical evaluation. Post-market surveillance. Labelling. Manufacturing records. These are QMS-level concerns outside the scope of a code quality orchestrator. You need a regulatory affairs specialist and a QMS; this tool produces better-quality software artefacts for that QMS to reference.
Profile 2: Scientific R&D / Computational Tools
When to use: Research software that produces results cited in publications, grant deliverables, or internal scientific decisions. Includes image analysis pipelines, statistical analysis tools, simulation code, bioinformatics workflows, and computational notebooks converted to production tools. The priority is reproducibility and numerical correctness, not regulatory compliance.
VP-Model layers modified: User ●●: reference comparison, reproducibility docs, methodology · Architecture ●●: numerical pipeline regression, provenance tracking · Design ●●: determinism tests, regression fixtures with provenance · Implementation ●: Python version pinned, seed management
Relevant frameworks: FAIR principles (Findable, Accessible, Interoperable, Reusable), Software Sustainability Institute guidelines, NIH/UKRI software sharing policies, journal reproducibility requirements.
Step-by-step modifications
1. Plan template: add reference dataset specification
In docs/2-plan.md, add a Validation Data section. For each module that produces numerical output, specify: a reference input (real or synthetic), expected output (with provenance: how was this "expected" value determined?), and acceptable tolerance. The tolerance must be justified (floating point accumulation, stochastic algorithm, approximation method), not arbitrary.
2. Plan template: add reproducibility contract
Document: given identical input + identical configuration + identical dependencies (lock file), does the output match exactly or within tolerance? If stochastic, document the seed strategy. If hardware-dependent (GPU, SIMD), document known sources of non-determinism.
3. Build stage: add numerical regression tests
Modify .claude/skills/red/SKILL.md. For every module that produces a numerical result, the red phase must include at least one regression test: a known input → expected output pair that is checked on every build. Store reference values in tests/fixtures/ with provenance metadata (source, date, method). The green phase must not change reference values without human approval.
4. Build stage: add determinism tests
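A regression fixture of the kind described above might look like the following sketch. The JSON layout, the `check_regression` helper, and the `mean` function under test are all hypothetical; the fixture is inlined as a string so the example runs on its own, whereas in a project it would be a file under tests/fixtures/. The provenance values are placeholders.

```python
import json
import math

# Hypothetical fixture; in a project this would be a file under tests/fixtures/.
FIXTURE_JSON = """
{
  "input": [0.1, 0.2, 0.3],
  "expected": 0.2,
  "rel_tol": 1e-9,
  "provenance": {"source": "hand calculation", "date": "YYYY-MM-DD", "method": "arithmetic mean"}
}
"""


def mean(values: list[float]) -> float:
    """Function under test (a stand-in for a real numerical module)."""
    return math.fsum(values) / len(values)


def check_regression(fixture: dict, compute) -> None:
    """Fail if provenance is incomplete or the result drifts outside the documented tolerance."""
    missing = {"source", "date", "method"} - fixture["provenance"].keys()
    if missing:
        raise ValueError(f"fixture lacks provenance fields: {sorted(missing)}")
    actual = compute(fixture["input"])
    if not math.isclose(actual, fixture["expected"], rel_tol=fixture["rel_tol"]):
        raise AssertionError(f"expected {fixture['expected']}, got {actual}")


fixture = json.loads(FIXTURE_JSON)
check_regression(fixture, mean)  # passes silently
```

Keeping the tolerance inside the fixture, next to its provenance, means a change to either shows up in the diff that a human approves.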
Modify the red skill. For each stochastic operation, require a test that: runs the operation twice with the same seed, asserts identical output. For non-stochastic operations: run twice, assert bitwise equality. This catches hidden state or non-deterministic dependencies.
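Both kinds of determinism test can be sketched as below. The functions `jitter` and `normalise` are hypothetical stand-ins for a stochastic and a non-stochastic operation. Comparing `float.hex()` strings is one way to assert bitwise equality for ordinary floats, since it distinguishes values such as 0.0 and -0.0 that compare equal with `==`.

```python
import random


def jitter(values: list[float], seed: int) -> list[float]:
    """Stochastic operation: uses a local generator so no global state leaks in."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, 0.01) for v in values]


def normalise(values: list[float]) -> list[float]:
    """Non-stochastic operation: scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def test_same_seed_gives_identical_output():
    data = [1.0, 2.0, 3.0]
    assert jitter(data, seed=42) == jitter(data, seed=42)


def test_repeated_run_is_bitwise_equal():
    data = [0.1, 0.2, 0.7]
    first, second = normalise(data), normalise(data)
    # float.hex() is an exact text form of the stored value.
    assert [x.hex() for x in first] == [x.hex() for x in second]


test_same_seed_gives_identical_output()
test_repeated_run_is_bitwise_equal()
```

If `jitter` drew from the module-level `random` functions instead of a seeded local generator, the first test would fail whenever another call consumed random numbers in between, which is exactly the hidden state these tests are meant to expose.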
5. Integrate stage: add end-to-end numerical validation
Modify .claude/skills/integrate/SKILL.md. Integration tests must include a full-pipeline regression: known input through the entire processing chain, output compared to a reference with documented tolerance. This catches accumulation of small numerical errors across module boundaries that are invisible at the unit test level.
6. Integrate stage: add data provenance tracking
If the application processes input data and produces output, the integration test must verify that the output includes or is accompanied by provenance metadata: what input was used, what version of the code processed it, what parameters were applied, and a timestamp. This is not a software quality concern; it is a scientific reproducibility concern.
7. Validate stage: add comparison against external reference
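As a rough illustration of the four provenance fields named above, the sketch below builds a metadata record and checks it the way an integration test might. The field names, the `build_provenance` helper, and the `CODE_VERSION` constant are assumptions, not an orchestrator schema.

```python
import hashlib
from datetime import datetime, timezone

CODE_VERSION = "0.0.0-example"  # hypothetical; a project might read this from package metadata


def build_provenance(input_bytes: bytes, parameters: dict, code_version: str = CODE_VERSION) -> dict:
    """Describe which input, code version and parameters produced an output, and when."""
    return {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "code_version": code_version,
        "parameters": dict(parameters),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def test_output_carries_provenance():
    raw = b"example input"
    params = {"threshold": 0.5}
    result = {"value": 0.42, "provenance": build_provenance(raw, params)}

    prov = result["provenance"]
    assert prov["input_sha256"] == hashlib.sha256(raw).hexdigest()
    assert prov["code_version"] == CODE_VERSION
    assert prov["parameters"] == params
    # The timestamp must parse and carry a timezone.
    assert datetime.fromisoformat(prov["timestamp"]).tzinfo is not None


test_output_carries_provenance()
```

Hashing the input rather than storing its path keeps the record valid if files are moved or renamed.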
Modify .claude/skills/validate/SKILL.md. If an external reference implementation or published benchmark exists, the validate stage must include a comparison. Document: what the reference is, how the comparison was performed, what level of agreement was achieved, and any known reasons for disagreement. Save to docs/validation-against-reference.md.
8. Assessment: extend checks
| New Check | Layer | Pass Condition |
|---|---|---|
| Numerical regression tests | Design | Every numerical-output module has ≥ 1 regression test with documented reference. |
| Determinism verification | Design | Stochastic operations produce identical output with same seed. |
| Dependency reproducibility | Implementation | Lock file present AND Python version pinned. |
| Provenance metadata | Architecture | Output includes code version, input hash, parameters, timestamp. |
| External validation | User | Documented comparison against reference implementation or benchmark, if one exists. |
9. Documentation: extend README
Add sections to the README task: Methodology (what algorithms, what assumptions), Validation (how numerical accuracy was verified), Reproducing Results (exact commands to regenerate published outputs from raw input), Known Limitations (where the tool is known to be inaccurate or unreliable).
Formal verification of algorithms. Statistical validation of entire analysis pipelines (e.g., type I/II error rates). Peer review of the scientific methodology. Performance benchmarking under production data volumes. These require domain expertise and are outside a code quality orchestrator's scope.
Profile 3: Clinical Trials Data Systems
When to use: Software that captures, stores, processes, or reports clinical trial data. Includes electronic data capture (EDC) tools, randomisation systems, adverse event reporting, CDISC data transformation pipelines, and analysis tools that feed into regulatory submissions. The regulatory concern is data integrity: 21 CFR Part 11, EU Annex 11, ICH E6(R2) GCP.
VP-Model layers modified: User ●●●: data export round-trip, e-signature flow, ALCOA+ mapping · Architecture ●●: audit trail verification, e-signature integration · Design ●●: domain edit-check tests, soft-delete verification · Implementation ●●: audit log emission, no hard-delete pattern
Key difference from Profile 1: Profile 1 (regulated medical) is about the software being a medical device. This profile is about the software handling clinical trial data: different regulatory requirements, different risk profile.
Step-by-step modifications
1. Plan template: add data integrity requirements
In docs/2-plan.md, add an ALCOA+ Compliance section. For every module that writes, modifies, or deletes data, document how the ALCOA+ principles are maintained: Attributable (who changed it), Legible (can be read/interpreted), Contemporaneous (recorded at the time), Original (preserved original or certified copy), Accurate (correct and complete). Plus: Complete, Consistent, Enduring, Available.
2. Build stage: add audit trail tests
Modify the red skill. For every state-changing operation (create, update, delete), require a test that verifies an audit trail entry is produced containing: user identity, timestamp, what changed, old value, new value. Deletion operations must verify soft-delete (record marked as deleted, not physically removed) unless hard-delete is explicitly justified.
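The two requirements above (an audit entry per state change, and soft delete) can be expressed as tests against a small in-memory store. `AuditedStore` and `AuditEntry` are hypothetical names used only to give the tests something to run against; a real project would apply the same assertions to its own data layer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditEntry:
    user: str
    timestamp: datetime
    action: str
    key: str
    old_value: Optional[str]
    new_value: Optional[str]


class AuditedStore:
    """In-memory stand-in for a data layer that audits every state change."""

    def __init__(self) -> None:
        self.values: dict[str, str] = {}
        self.deleted: set[str] = set()
        self.audit: list[AuditEntry] = []

    def _record(self, user: str, action: str, key: str,
                old: Optional[str], new: Optional[str]) -> None:
        self.audit.append(AuditEntry(user, datetime.now(timezone.utc), action, key, old, new))

    def create(self, user: str, key: str, value: str) -> None:
        self.values[key] = value
        self._record(user, "create", key, None, value)

    def update(self, user: str, key: str, value: str) -> None:
        old = self.values[key]
        self.values[key] = value
        self._record(user, "update", key, old, value)

    def delete(self, user: str, key: str) -> None:
        # Soft delete: mark the record as deleted but keep the original value.
        self.deleted.add(key)
        self._record(user, "delete", key, self.values[key], None)


def test_update_writes_audit_entry():
    store = AuditedStore()
    store.create("alice", "weight_kg", "70")
    store.update("bob", "weight_kg", "72")
    entry = store.audit[-1]
    assert (entry.user, entry.action, entry.old_value, entry.new_value) == ("bob", "update", "70", "72")
    assert entry.timestamp.tzinfo is not None


def test_delete_is_soft():
    store = AuditedStore()
    store.create("alice", "weight_kg", "70")
    store.delete("alice", "weight_kg")
    assert "weight_kg" in store.deleted
    assert store.values["weight_kg"] == "70"  # original preserved
    assert store.audit[-1].action == "delete"


test_update_writes_audit_entry()
test_delete_is_soft()
```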
3. Build stage: add data validation tests
For any module that accepts clinical data input: require edit-check tests. These verify that out-of-range values, logically inconsistent entries (e.g., death date before birth date), and format violations are caught at entry, not downstream. This is distinct from general input validation: it is domain-specific clinical data validation.
4. Integrate stage: add electronic signature verification
If the application supports electronic signatures (21 CFR Part 11 requirement): integration tests must verify the signature workflow end-to-end: authenticate, present data for review, capture signature, lock the record, verify the locked record cannot be modified without a new signature event.
5. Validate stage: add data export verification
If the application exports data (CDISC, CSV, SAS transport): the validate stage must include a round-trip test: export data, re-import it, verify equality. For CDISC formats: validate against the published CDISC validation rules (OpenCDISC/Pinnacle 21).
6. Assessment: extend checks
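An edit check differs from a type check in that it encodes domain rules. The sketch below shows the shape of such a check and its tests; the function name `edit_check` is hypothetical and the numeric limits are illustrative placeholders, not clinical guidance.

```python
from datetime import date
from typing import Optional


def edit_check(birth: date, death: Optional[date], systolic_mmhg: int) -> list[str]:
    """Return a list of findings; an empty list means the entry passes."""
    findings = []
    if death is not None and death < birth:
        findings.append("death date before birth date")
    # Illustrative range only; real limits come from the study's data management plan.
    if not 40 <= systolic_mmhg <= 300:
        findings.append("systolic pressure out of range")
    return findings


def test_consistent_entry_passes():
    assert edit_check(date(1950, 1, 1), None, 120) == []


def test_death_before_birth_is_flagged():
    assert "death date before birth date" in edit_check(date(1950, 1, 1), date(1949, 12, 31), 120)


def test_out_of_range_value_is_flagged():
    assert edit_check(date(1950, 1, 1), None, 0) == ["systolic pressure out of range"]


test_consistent_entry_passes()
test_death_before_birth_is_flagged()
test_out_of_range_value_is_flagged()
```

Returning all findings rather than raising on the first one lets the entry screen report every problem at once, which matches the "caught at entry" intent.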
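For the CSV case, a round-trip test can be sketched with the standard library alone. The helper names `export_csv` and `import_csv` are hypothetical. The rows deliberately include commas, quotes, and an embedded newline, since those are the values most likely to be damaged by a naive exporter. All values are strings here because CSV does not preserve types; typed data needs an explicit conversion step on both sides.

```python
import csv
import io


def export_csv(rows: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialise rows to CSV text."""
    buffer = io.StringIO(newline="")  # newline="" leaves line endings to the csv module
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def import_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text back into a list of row dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


def test_export_round_trip():
    rows = [
        {"subject": "S-001", "note": 'said "ok", then left'},
        {"subject": "S-002", "note": "line one\nline two"},
        {"subject": "S-003", "note": ""},
    ]
    assert import_csv(export_csv(rows, ["subject", "note"])) == rows


test_export_round_trip()
```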
| New Check | Layer | Pass Condition |
|---|---|---|
| Audit trail completeness | Architecture | Every state-changing operation produces an audit record. |
| Soft delete verification | Design | Delete operations preserve original record. |
| Edit checks | Design | Domain-specific validation rules tested at entry point. |
| Electronic signature flow | User | Sign → lock → modify-attempt-rejected workflow tested E2E. |
| Export round-trip | User | Export + re-import produces identical data set. |
System validation protocols (IQ/OQ/PQ). Computerised system validation master plans. Role-based access control design. Data archival and retention policies. 21 CFR Part 11 compliance assessment. These are IT infrastructure and GCP concerns beyond code quality.
Profile 4: Safety-Critical Systems
When to use: Software where failure can cause physical harm, environmental damage, or loss of life. Includes automotive control systems (ISO 26262), industrial automation (IEC 61508), aerospace (DO-178C), and nuclear (IEC 60880). These domains have formal certification requirements that are fundamentally different from medical device or clinical data regulations.
VP-Model layers modified: User ●●●: witnessed UAT, safety case, certification body submission · Architecture ●●●: formal proof, independent verification team · Design ●●●: 100% MC/DC, formal specification, independent test author · Implementation ●●●: replace toolchain (MISRA, Polyspace, Astrée)
This orchestrator was designed for Python web applications. Safety-critical systems are typically written in C, C++, Ada, or Rust with specialised compilers, static analysers (Polyspace, LDRA, Astrée), and formal methods tools. The orchestrator's enforcement mechanisms (ruff, mypy, pytest) are Python-specific. Applying this profile requires replacing the toolchain, not just adjusting thresholds. This profile describes what to add; implementing it requires significant engineering beyond configuration changes.
Conceptual modifications (not directly configurable)
1. Replace static analysis toolchain. Ruff and mypy are insufficient for safety-critical code. Replace with: MISRA C/C++ checker (for C/C++), Polyspace or Astrée (for formal absence of runtime errors), or equivalent for the target language. The green skill would need to invoke these instead of ruff/mypy.
2. Coverage: MC/DC required. Line coverage and branch coverage are insufficient. Modified Condition/Decision Coverage (MC/DC) is required by DO-178C Level A and ISO 26262 ASIL D. This requires specialised coverage tools (e.g., VectorCAST, LDRA TBrun). The 80% line coverage threshold is replaced by 100% MC/DC for the highest safety integrity levels.
3. Formal requirements with bidirectional traceability. Every requirement traces forward to design, code, and test. Every test traces backward to requirement. Every line of code traces to a design element. Orphan code (code with no traceability) must be justified or removed. This is Profile 1's traceability, but enforced bidirectionally and at a finer granularity.
4. Independence between test author and implementer. The orchestrator's known weakness (same agent writes tests and code) becomes a certification blocker. Safety-critical standards require independence between verification and development. At minimum, a different human must review and approve all test cases. At higher integrity levels, a different team must author them.
5. Formal methods for critical modules. For the highest integrity levels (ASIL D, SIL 4, DAL A), formal proof of correctness may be required for critical algorithms. This is outside the orchestrator's capabilities entirely; it requires tools like SPARK/Ada, Frama-C, or TLA+.
6. Configuration management with baseline control. Every artefact (source, test, document, tool configuration) must be under configuration management with formal baselines. The orchestrator's git-based approach is a starting point, but needs formal baseline tagging, change control records, and impact analysis for each change.
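To make item 2 concrete: MC/DC asks that each condition in a decision be shown to independently change the outcome while the other conditions are held fixed. The sketch below enumerates such "independence pairs" for a small, made-up decision, `a and (b or c)`. It is written in Python purely for illustration and is not a substitute for a qualified coverage tool.

```python
from itertools import product


def decision(a: bool, b: bool, c: bool) -> bool:
    """A made-up guard condition with three inputs."""
    return a and (b or c)


def independence_pairs(func, n_conditions: int) -> dict[int, list[tuple[tuple, tuple]]]:
    """For each condition, list input pairs that differ only in it and flip the outcome."""
    pairs: dict[int, list[tuple[tuple, tuple]]] = {i: [] for i in range(n_conditions)}
    for vector in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vector[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vector[:i] + (True,) + vector[i + 1:]
            if func(*vector) != func(*flipped):
                pairs[i].append((vector, flipped))
    return pairs


pairs = independence_pairs(decision, 3)
for index, name in enumerate("abc"):
    print(name, pairs[index])
```

Here `b` and `c` each have exactly one independence pair, both starting from (True, False, False), so any MC/DC test set must contain that vector plus (True, True, False) and (True, False, True). Adding one vector with `a` False, for example (False, True, False), completes the set with four tests. Branch coverage of the same decision needs only two, which is why the item above calls line and branch coverage insufficient.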
The orchestrator provides value for safety-critical projects at the process discipline level: enforced test-first, hash-audited integrity, atomic commits. But it cannot replace the specialised tooling, formal methods, and independent verification required by IEC 61508 / ISO 26262 / DO-178C. Treat this as a development discipline supplement, not a certification pathway.
Profile 5: Internal Tooling / Early-Stage Prototypes
When to use: Internal tools, dashboards, data pipelines, or prototypes where the audience is your own team and the cost of failure is rework, not harm. You want code quality discipline without the overhead of full production hardening. The priority is speed with guardrails.
VP-Model layers modified: User ▽: startup + health only, skip E2E/error quality/observability · Architecture ▽: integration optional for small projects · Design ▽: 60% coverage, ≥1 boundary test, TODOs allowed · Implementation: unchanged (hard constraints cost nothing)
Step-by-step modifications
1. Build stage: relax thresholds
| Dimension | Pre-clinical | Internal |
|---|---|---|
| Coverage (overall) | ≥ 80% | ≥ 60% |
| Coverage (per file) | ≥ 60% | No per-file minimum |
| Boundary tests | ≥ 3 per module | ≥ 1 per module |
| Error path tests | One per public function | One per module (most critical path only) |
| Code hygiene | Zero TODOs | TODOs allowed if tracked (issue reference) |
2. Integrate stage: optional
For small projects (1-2 modules), skip the integrate stage entirely. The unit tests provide sufficient coverage. For larger projects (3+ modules), keep integration tests but reduce to one test for the primary workflow path only; skip error propagation and graceful degradation checks.
3. Validate stage: simplify
Keep: application startup verification and health endpoint check. Drop: E2E workflow test (covered by integration), error response quality (accept framework defaults), observability checks (add logging when you need it). Documentation: README with installation and usage only; no API reference section required.
4. Assessment: reduce scope
Run only Layer 4 (Implementation) and Layer 3 (Design) checks. Skip Layer 2 (Architecture) and Layer 1 (User). Target: CONDITIONALLY READY is acceptable. The purpose is code quality discipline, not production certification.
5. Keep the hard constraints
Even for internal tools, keep: filesystem lock, SHA-256 hash audit, test-first cycle, ruff zero warnings, mypy strict. These cost nothing once configured and prevent the most common quality regressions. The overhead is in thresholds and documentation, not in the core enforcement mechanism.
Relaxed thresholds make it easier to accumulate technical debt that becomes expensive when the internal tool is later promoted to production or external use. If there is any chance the tool will be externally facing, use the pre-clinical baseline from the start. Retrofitting quality is significantly harder than building it in.
Comparison Matrix
| Dimension | Internal | Pre-clinical | Scientific R&D | Clinical Trials | Regulated Medical | Safety-Critical |
|---|---|---|---|---|---|---|
| Coverage threshold | 60% | 80% | 80% | 80% | 90-100% | 100% MC/DC |
| Mutation testing | Skip | Assessment | Assessment | Assessment | Build (Class C) | Build (all) |
| Integration tests | Optional | Required | Required + numerical | Required + audit | Required + hazard | Required + formal |
| Requirements traceability | None | None | Provenance only | Partial (data flows) | Full (bidirectional) | Full + MC/DC mapping |
| Human review gates | Tests only | Tests + plan | Tests + plan + references | All phases | All phases + sign-off | Independent team |
| Security scanning | None | None | Dependency audit | OWASP + dependency | OWASP + dependency | Formal analysis |
| Acceptance testing | Skip | Lightweight E2E | Reference comparison | Data round-trip | Formal UAT | Formal UAT + witness |
| Documentation | README | README + API | + methodology + validation | + ALCOA+ mapping | + SRS + SDD + test report | + safety case |
| Assessment target | CONDITIONAL | PROD READY | PROD READY + repro | PROD READY + audit | PROD READY + trace | Certification body |
| Orchestrator feasibility | Full | Full | Full | Mostly | Partial | Supplement only |
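Part of the matrix can be encoded as data so that thresholds are looked up rather than remembered. The sketch below is illustrative only: the `Profile` dataclass, the dictionary keys, and `coverage_gate` are hypothetical names, and only three profiles and four rows of the table are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    coverage_threshold: int  # minimum coverage, in percent
    mutation_testing: str
    integration_tests: str
    assessment_target: str


# Values copied from the comparison matrix; the key names are assumptions.
PROFILES: dict[str, Profile] = {
    "internal": Profile(60, "Skip", "Optional", "CONDITIONAL"),
    "pre-clinical": Profile(80, "Assessment", "Required", "PROD READY"),
    "scientific-rd": Profile(
        80, "Assessment", "Required + numerical", "PROD READY + repro"
    ),
}


def coverage_gate(profile_name: str, measured_percent: float) -> bool:
    """Return True when measured coverage meets the profile's threshold."""
    return measured_percent >= PROFILES[profile_name].coverage_threshold
```

For example, `coverage_gate("internal", 65.0)` passes while `coverage_gate("pre-clinical", 65.0)` does not, which is the practical difference between the first two columns.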
Creating a New Workflow Instance
For every new project, follow this sequence:
Copy the orchestrator into your project as described in the Usage Guide (Step 1). This gives you the pre-clinical baseline.
Choose the profile that matches your project's regulatory and quality context. If in doubt, use pre-clinical: you can tighten later, but relaxing after the fact loses the value of early discipline.
Follow the step-by-step modifications for your selected profile. Edit the skill files, plan template, and assessment spec as described. Each modification is a specific file edit, not a conceptual suggestion.
Add a docs/scope-profile.md file to the project recording: which profile was selected, which modifications were applied, any deviations from the profile (with justification), and any additional scope-specific requirements not covered by the profile.
Run the workflow as normal. The profile modifications will take effect through the modified skill files and plan template. The assessment at the end will use the extended checks if you modified the assessment spec.
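The scope record from the fourth step can be written by hand or generated. As a sketch, assuming a hypothetical helper `render_scope_profile` and an invented heading layout (the text above specifies only the four items the file must record, not its format):

```python
def render_scope_profile(
    profile: str,
    modifications: list[str],
    deviations: dict[str, str],
    extra_requirements: list[str],
) -> str:
    """Render the contents of docs/scope-profile.md as Markdown text."""

    def bullets(items: list[str]) -> list[str]:
        # An empty section is recorded explicitly rather than left blank.
        return [f"- {item}" for item in items] or ["- None"]

    deviation_items = [
        f"{what} (justification: {why})" for what, why in deviations.items()
    ]
    lines = ["# Scope profile", "", f"Selected profile: {profile}", ""]
    lines += ["## Modifications applied", *bullets(modifications), ""]
    lines += ["## Deviations from the profile", *bullets(deviation_items), ""]
    lines += ["## Additional scope-specific requirements", *bullets(extra_requirements)]
    return "\n".join(lines) + "\n"
```

The returned string can be written to docs/scope-profile.md with `pathlib.Path.write_text`. Requiring a justification alongside every deviation keeps the record auditable.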
A scope: field in conversion_state.yaml could automate profile selection, with the status skill enforcing the correct thresholds and requiring the correct stages based on the declared scope. This is a candidate for a future version, not current functionality. Today, you are manually configuring the orchestrator for each scope.
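To make the proposal concrete, here is one way such a lookup might work. It is speculative, like the proposal itself: the `scope:` key does not exist today, and `read_scope`, `INTEGRATE_REQUIRED`, and the profile names are hypothetical. The sketch extracts the single key with a regular expression to stay dependency-free; a real implementation would more likely use a YAML parser. Falling back to pre-clinical when no scope is declared is also an assumption, based on the "if in doubt" advice above.

```python
import re

DEFAULT_SCOPE = "pre-clinical"  # assumed fallback: the baseline profile

# Hypothetical mapping, following the internal-tool relaxations described earlier.
INTEGRATE_REQUIRED: dict[str, bool] = {
    "internal": False,  # integrate stage is optional for internal tools
    "pre-clinical": True,
}


def read_scope(state_yaml: str) -> str:
    """Extract a top-level `scope:` value, falling back to the baseline."""
    match = re.search(r"^scope:[ \t]*([^\s#]+)", state_yaml, flags=re.MULTILINE)
    return match.group(1).strip("\"'") if match else DEFAULT_SCOPE


def integrate_stage_required(state_yaml: str) -> bool:
    """Decide whether the integrate stage must run for the declared scope."""
    scope = read_scope(state_yaml)
    if scope not in INTEGRATE_REQUIRED:
        raise ValueError(f"unknown scope: {scope!r}")
    return INTEGRATE_REQUIRED[scope]
```

Raising on an unknown scope, rather than silently defaulting, means a typo in the state file cannot quietly relax the thresholds.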