The Series Concludes Here

Harness Engineering in Practice

A concrete reference implementation — greenfield and brownfield

v1.0 · June 2026 · avons.github.io

The Journey So Far

Over the past several months, we have published seven guides that build the theoretical foundation for agentic engineering: DDD, Event Storming, Flowcharts, Domain Context Engineering, BMAD, Attractor, and the Synthesis. Together, they define the four layers of agentic development, the dependency sequence between them, and the discipline that ties them together.

This article is the reference implementation of those theories — specifically Layer 3, the structured execution harness. It is not a new framework. It is a concrete way to apply what the earlier guides described. If the previous guides answered "why," this one answers "how."

The harness-eng repository is the working proof: a file-based convention you clone, adapt, and use. It has been tested with open agents such as MiMo Code and OpenCode, and it covers the path from big picture to shipped feature.

The Rationale

The problem is not that AI agents are incompetent. They are remarkably capable. The problem is structural: an agent that tends toward agreement, working from input that may be incomplete, in a conversation with no external correction mechanism, will produce confident wrong answers at speed.

The AI Manifesto names this precisely: "The greatest risk of AI is not that it will think for us. It is that it will agree with us — and we will not notice."

A harness addresses this by building the correction mechanism structurally. Not by asking the AI to argue against you — the manifesto already tells us that is simulation, not challenge. By stopping the conversation entirely and putting the work in front of a human who is genuinely outside the agreement spiral.

Every human gate — design review, task approval, verification sign-off — is a structured moment where the human steps outside the spiral and makes a decision. The ceremony is not overhead. It is the external accountability that prevents the Dilbert swing.

The Principle: The agent is a fast, tireless collaborator who needs a human hand on the wheel — not because it is incompetent, but because it cannot see its own blind spots. The human stays responsible for quality. The harness makes that responsibility practical.

Scope: Long-Running, Multi-Phase Projects

The harness is designed for projects that span weeks or months — multi-phase efforts where context drifts, requirements evolve, and the cost of building the wrong thing compounds over time. It is a reference implementation. Adapt it to your use case. Keep it minimal and simple. The goal is not to follow every step perfectly. The goal is to ship the right thing.

Will It Scale to Multi-User Teams?

That depends on the team, not the harness. The harness provides structure — shared documents, consistent conventions, clear gates. But structure without discipline is just paperwork. Culture eats process for breakfast! A team that values quality will use the harness effectively. A team that doesn't will find ways around it. The harness makes the right thing easy. It cannot make the team care.

Design Philosophy — Who Is in the Driving Seat

The design philosophy behind the harness is simple: the human is in control — making decisions, not just approving them.

During define and design, the agent is a collaborator. It asks clarifying questions, identifies gaps, suggests story breakdowns, challenges assumptions. It is thinking alongside you — not generating for you. The active phase folder accumulates detailed context: specs, designs, research, interfaces. You can see everything the agent has considered.

Then comes the approve gate. The agent stops. The work is in front of you. The design document, the interfaces, the file layout, the verification criteria. You review it. You make changes. You approve or send it back. This is not a formality — it is the single most important moment in the feature lifecycle. The design you approve is the contract the agent builds against.

After approval, the agent shifts. During build and verify, it is a junior programmer. It follows the tasks in order. It writes tests first. It commits after each task. It does not improvise. It does not add features. It executes the approved design with discipline. If something goes wrong after three attempts, it stops and escalates.

This shift is deliberate. A collaborator during design produces better design. A disciplined executor during build produces reliable code. The same agent playing both roles — but at different times, with different instructions — gives you the best of both without the risk of either bleeding into the other.

The human's responsibility is therefore clear. You own the what and the why. The agent owns the how — within the boundaries you approved. If the output is wrong, the first question is not "did the agent fail?" but "did I review the design carefully enough?" The harness makes accountability practical, not theoretical.

This is what separates harness engineering from vibe coding. In vibe coding, the human is not in control. The agent generates, the human accepts or rejects the output after the fact. The decisions happened upstream, in the prompt, invisible and unaudited. In the harness, every decision is captured in a document. Every document is reviewed. Every review is recorded. The human is in the driving seat — not because the harness constrains the agent, but because the harness makes the human's role explicit and unavoidable.

How It Works — The Mechanics

A harness has five components. Together they form a state machine that controls the agent's how loop:

The constitution — permanent rules the agent reads every session. Technology decisions, naming conventions, security rules. This is the foundational law that everything else conforms to.

The domain context — vocabulary, boundaries, events the business cares about. Without this, the agent builds on an unexamined model. With it, the agent uses the same terms the business uses.

The feature loop — propose, plan, implement, verify, done. Each step produces a document. Each document is reviewed by a human before the next step begins. The loop is the state machine that prevents the agent from running ahead.

The quality gates — checkpoints where the agent stops and waits. Hard stops. The work is presented to a human. The human approves or sends it back. No approval, no next step.

The narrative log — append-only record of what was built, why, and what was learned. The "why" layer that helps future sessions understand how the project emerged.

The architecture is deliberately simple: the workflow is the tool, the folder structure is the state machine, and any agent that can read and write files can follow it. No runtime. No server. No API keys.

graph TD
    INIT["/h:init<br/>Constitution · BRD · Architecture"]
    DEFINE["/h:define<br/>Features from BRD"]
    DESIGN["/h:design<br/>Architecture · Interfaces"]
    APPROVE{{"/h:approve<br/>Human Gate ✋"}}
    TASKS["/h:tasks<br/>Granular Tasks"]
    BUILD["/h:build<br/>TDD · One Commit per Task"]
    VERIFY["/h:verify<br/>Tests · Acceptance Criteria"]
    RELEASE["/h:release<br/>PR · Merge · Archive"]
    NEXT{"More features<br/>in this phase?"}
    DONE["Phase Complete ✅"]
    PHASE_LOOP["Next Phase"]

    INIT --> DEFINE
    DEFINE --> DESIGN
    DESIGN --> APPROVE
    APPROVE -->|"Approved"| TASKS
    APPROVE -->|"Changes requested"| DESIGN
    TASKS --> BUILD
    BUILD --> VERIFY
    VERIFY -->|"All passing"| RELEASE
    VERIFY -->|"Fix needed"| BUILD
    RELEASE --> NEXT
    NEXT -->|"Yes"| DEFINE
    NEXT -->|"No"| DONE
    DONE --> PHASE_LOOP
    PHASE_LOOP --> DEFINE

    style APPROVE fill:#ffd700,stroke:#b8860b,color:#000
    style DONE fill:#90ee90,stroke:#228b22
    style INIT fill:#87ceeb,stroke:#4682b4

Each box is a command. Each arrow is a state transition. The human gate at /h:approve is the only mandatory stop — the gold box. Everything else is the agent working, the human reviewing at natural pause points.
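The transitions in the diagram can be sketched as a small lookup table. This is an illustrative sketch only, not code from the harness-eng repository: the `Step` enum, the `TRANSITIONS` table, and the `advance` function are hypothetical names, and the real harness expresses the same idea through files and folders rather than a program.

```python
from enum import Enum


class Step(Enum):
    INIT = "/h:init"
    DEFINE = "/h:define"
    DESIGN = "/h:design"
    APPROVE = "/h:approve"
    TASKS = "/h:tasks"
    BUILD = "/h:build"
    VERIFY = "/h:verify"
    RELEASE = "/h:release"


# Allowed transitions, copied from the arrows in the diagram.
# The "more features?" decision and the phase loop both lead back to DEFINE.
TRANSITIONS = {
    Step.INIT: {Step.DEFINE},
    Step.DEFINE: {Step.DESIGN},
    Step.DESIGN: {Step.APPROVE},
    Step.APPROVE: {Step.TASKS, Step.DESIGN},  # approved / changes requested
    Step.TASKS: {Step.BUILD},
    Step.BUILD: {Step.VERIFY},
    Step.VERIFY: {Step.RELEASE, Step.BUILD},  # all passing / fix needed
    Step.RELEASE: {Step.DEFINE},
}


def advance(current: Step, target: Step, human_approved: bool = False) -> Step:
    """Return the next step, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {target.value} is not a valid transition")
    # The human gate: leaving APPROVE for TASKS needs an explicit human decision.
    if current is Step.APPROVE and target is Step.TASKS and not human_approved:
        raise PermissionError("design not approved by a human; cannot create tasks")
    return target
```

Calling `advance(Step.APPROVE, Step.TASKS)` without `human_approved=True` raises, which is the gate expressed in code: there is no path from design to build that does not pass through a human decision.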

Worked Example — Greenfield Project

Starting from an empty directory. You have an idea for a service that processes financial transactions. No existing code. No existing docs. You are not even sure which language to use yet.

Step 1: Explore the Problem First

Before any harness, before any tooling decisions, you need to understand the problem. Open your AI agent and have a conversation. No harness. No structure. Just exploration.

I want to build a service that processes financial transactions. Help me think through the problem — what are the main components, what are the hard parts, what should I consider before choosing a technology stack?

The agent explores with you. You discuss throughput requirements, compliance constraints, integration points, team size, deployment model. By the end of the conversation, you have a rough outline of the problem and a sense of which technology fits. Maybe Go for the performance and deployment model. Maybe Python for the ecosystem and speed of development. Maybe Node for the team's existing skills.

This is not wasted time. This is the thinking that prevents the Dilbert swing. The harness cannot fix a problem you have not thought through.

Step 2: Initialise the Harness

Now that you have direction, open your agent in the project directory and paste:

Read the harness rules from https://github.com/AvonS/harness-eng and initialise this project using the harness.

The agent scans your project (empty in this case), asks clarifying questions, and creates the .harness-eng/ scaffold with constitution, BRD, architecture, and technology decisions. Two minutes. The foundation is ready.

Step 3: Define Features

Say "define the features from the BRD". The agent reads your BRD, counts requirements, and creates feature specs with Given/When/Then stories. For large BRDs (5+ requirements), it automatically organises them into phases: Phase 1 (Foundation), Phase 2 (MVP), Phase 3 (Enhancements).
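To make the shape of a feature spec concrete, here is a minimal sketch of a Given/When/Then story and the five-requirement threshold mentioned above. The `Story` class, `needs_phases`, and `PHASE_THRESHOLD` are hypothetical names chosen for illustration; the harness itself stores stories as text in spec documents, and how it allocates requirements to phases is not shown here.

```python
from dataclasses import dataclass

PHASE_THRESHOLD = 5  # from the text: BRDs with 5+ requirements are split into phases


@dataclass
class Story:
    given: str
    when: str
    then: str

    def render(self) -> str:
        """Format the story as three lines of plain text."""
        return f"Given {self.given}\nWhen {self.when}\nThen {self.then}"


def needs_phases(requirement_count: int) -> bool:
    """True when the BRD is large enough to be organised into phases."""
    return requirement_count >= PHASE_THRESHOLD
```

A story written this way doubles as an acceptance criterion: the "Then" clause is what the verify step later checks.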

You review the features. Approve the phase plan.

Step 4: Design and Approve

Say "design this feature". The agent creates a design document — architecture, interfaces, file layout, constitution compliance check. Stops. You review — the design looks right, but the agent missed the idempotency requirement. You add it. Say "approved".

Step 5: Build

Say "break this into tasks". The agent writes granular tasks with dependencies. Say "build it". The agent works through each task: test first, code, commit, report. After three failed test attempts, it stops and escalates.
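The build discipline described here (tasks in order, stop after three failed attempts) can be sketched as a loop. This is an assumption-laden illustration: `build`, `TaskResult`, and the `run_tests` callback are hypothetical, and in practice the agent follows these rules from the harness files rather than from a script.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ATTEMPTS = 3  # from the text: stop and escalate after three failed attempts


@dataclass
class TaskResult:
    task: str
    status: str  # "done" or "escalated"
    attempts: int


def build(tasks: List[str], run_tests: Callable[[str, int], bool]) -> List[TaskResult]:
    """Work through tasks in order; stop at the first task that cannot be made green."""
    results = []
    for task in tasks:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if run_tests(task, attempt):
                results.append(TaskResult(task, "done", attempt))
                break
        else:
            # No attempt passed: hand control back to the human.
            results.append(TaskResult(task, "escalated", MAX_ATTEMPTS))
            return results
    return results
```

The early return matters: once a task escalates, later tasks are not attempted, because they may depend on the one that failed.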

Step 6: Verify and Release

Say "verify the feature". The agent runs the full test suite, fills the verification report. Say "release it". The agent creates a PR, merges, updates the dashboard, appends to the build narrative.

Pick the next feature. Repeat.

The Result: After Phase 1, you have a working transaction engine with tests, verified acceptance criteria, a build narrative explaining every decision, and a live dashboard showing project status. The agent built the code. You validated every step. Neither worked alone.

Worked Example — Brownfield Project

An existing Go service. 15,000 lines of code. No documentation. No tests. The original developer left. You need to add features without breaking what exists.

Step 1: Initialise the Harness

Open your agent in the project directory and paste:

Read the harness rules from https://github.com/AvonS/harness-eng and initialise this project using the harness.

The agent scans your existing code, reads any docs (README, ADRs), and reverse-engineers the constitution, BRD, and architecture from what exists. It derives naming conventions from the actual code, not an ideal. The architecture rules come from the actual module boundaries, not a diagram.

Step 2: Review and Refine

The agent flags gaps: "The Customer type in internal/billing/ is different from the Customer in internal/support/ — are these the same concept?" You answer: different bounded contexts. The agent updates the architecture. You review the derived documents. Some rules need updating — the agent found inconsistencies in the existing code. You decide which patterns to keep and which to fix. Approve.

Step 3: Check Status

Say "show status". The agent shows project status — version, active features, phase progress. No phases yet — just a snapshot of what exists. You use this to plan the first phase: what needs to be built, what needs to be fixed, what needs to be documented.

Step 4: Build on Top

From here, the workflow is identical to greenfield. Define features, design, approve, tasks, build, verify, release. The difference is that the constitution and architecture are grounded in reality, not aspiration. The agent does not guess about how the existing code works, because its rules were reverse-engineered from the actual codebase.

The Brownfield Advantage: The harness does not replace your existing code. It wraps it in a process that prevents new features from breaking old behaviour. Every change goes through the same gates. Every change is verified against the test suite, which may start empty in a codebase like this one and grows with every test-first task. Every change is documented in the build narrative. The agent extends the codebase without erasing its history.

After the First Release — CR and Bug Workflows

The main feature loop handles new features. But once software ships, most work is not new features — it is change requests and bug fixes. The harness has separate, lighter workflows for these.

A request comes in. The agent triages it: is this a bug (something broken per spec), a CR (modification to existing behaviour), a new feature (scope addition), or deferred (not for this phase)? Bugs and CRs get a shortened path — no full design cycle, but still require approval before building.

graph TD
    REQUEST["Incoming Request"]
    TRIAGE{{"/h:triage<br/>Classify Request"}}
    BUG["Bug / CR Path"]
    FEATURE["Feature Path"]
    DEFERRED["Log to BACKLOG"]
    SIMPLIFIED["Simplified Spec"]
    BUG_APPROVE{{"Human Approval ✋"}}
    BUG_TASKS["Tasks: Regression → Fix → Verify"]
    BUG_BUILD["TDD Build"]
    BUG_VERIFY["Full Test Suite"]
    BUG_RELEASE["PR · Merge · Archive"]

    REQUEST --> TRIAGE
    TRIAGE -->|"Bug or CR"| BUG
    TRIAGE -->|"New feature"| FEATURE
    TRIAGE -->|"Not this phase"| DEFERRED
    BUG --> SIMPLIFIED
    SIMPLIFIED --> BUG_APPROVE
    BUG_APPROVE -->|"Approved"| BUG_TASKS
    BUG_APPROVE -->|"Rejected"| TRIAGE
    BUG_TASKS --> BUG_BUILD
    BUG_BUILD --> BUG_VERIFY
    BUG_VERIFY -->|"Passing"| BUG_RELEASE
    BUG_VERIFY -->|"Failing"| BUG_BUILD

    style BUG_APPROVE fill:#ffd700,stroke:#b8860b,color:#000
    style FEATURE fill:#87ceeb,stroke:#4682b4
    style DEFERRED fill:#ddd,stroke:#999
    style REQUEST fill:#f0e6ff,stroke:#9370db

The shortened path skips full design but keeps the approval gate. The regression test comes first: prove the bug exists, then fix it. The full test suite runs before release. The discipline is lighter but the accountability is the same.
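The routing that follows triage can be sketched as a simple mapping from request kind to path. The classification itself stays a judgment made by the agent and confirmed by the human; the `Kind` enum, `SHORT_PATH`, and `route` below are hypothetical names used only to illustrate the flow in the diagram.

```python
from enum import Enum
from typing import List


class Kind(Enum):
    BUG = "bug"              # something broken per spec
    CR = "change request"    # modification to existing behaviour
    FEATURE = "new feature"  # scope addition
    DEFERRED = "deferred"    # not for this phase


# The shortened path from the diagram, in order.
SHORT_PATH = [
    "simplified spec",
    "human approval",
    "tasks: regression -> fix -> verify",
    "TDD build",
    "full test suite",
    "PR, merge, archive",
]


def route(kind: Kind) -> List[str]:
    """Return the ordered steps a triaged request goes through."""
    if kind in (Kind.BUG, Kind.CR):
        return list(SHORT_PATH)
    if kind is Kind.FEATURE:
        return ["feature path"]  # the full define/design/approve loop
    return ["log to BACKLOG"]
```

Bugs and CRs share one path in this sketch, and "human approval" always precedes "TDD build", which mirrors the rule that the lighter workflow still requires approval before building.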

Context Recovery — How the Agent Resumes

Agents have no memory across sessions. The harness solves this with three layers:

The narrative log — last three entries of the build story. What was built, what was decided, what comes next. Fastest context recovery.

The compact snapshot — one file per feature with relevant rules, interfaces, acceptance criteria, current task, files modified. One file instead of six.

The task list — checklist of what is done and what is next.

Together, these three layers let the agent resume in under thirty seconds without re-reading the constitution, architecture, or full design documents.
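A rough sketch of the first and third layers, under stated assumptions: the narrative log is assumed to be text whose entries each start with a `## ` heading line, and the task list is assumed to use `- [ ]` and `- [x]` checkboxes. Neither format is confirmed by the harness documentation, and `last_entries` and `next_task` are hypothetical helper names.

```python
from typing import List, Optional


def last_entries(log_text: str, n: int = 3) -> List[str]:
    """Return the last n entries of an append-only log (assumed '## ' headings)."""
    entries: List[List[str]] = []
    for line in log_text.splitlines():
        if line.startswith("## "):
            entries.append([line])    # a new entry begins
        elif entries:
            entries[-1].append(line)  # body line of the current entry
    return ["\n".join(e).strip() for e in entries[-n:]]


def next_task(task_list_text: str) -> Optional[str]:
    """Return the first unchecked task (assumed '- [ ]' checkbox format), if any."""
    for line in task_list_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- [ ]"):
            return stripped[len("- [ ]"):].strip()
    return None
```

Reading only the tail of the log and the first open task keeps the recovery cost roughly constant as the project grows, which is the point of layering context this way.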

Project Visibility — Status and Health

In multi-phase projects, visibility is critical. You need to know where you are, what's done, what's next, and whether the agent is following the rules. Two commands provide this:

Status — Where Am I?

Say "show status". The agent runs the status script and shows the project version, the active features, and progress through the phases.

In a six-phase project, this is your daily standup. One command, full picture. No scrolling through folders, no guessing what the agent did last session.

Health — Is the Agent Following the Rules?

Say "check health". The agent runs compliance checks and reports any violations it finds.

When violations are detected, the health check uses strong language: "VIOLATION: Agent did not read CONSTITUTION.md before action. This is a CRITICAL failure." This is intentional — the health check is a corrective mechanism, not a gentle reminder.

In multi-phase projects, the health check catches drift early. The agent starts skipping steps, cutting corners, forgetting to log. The health check notices before you do.

The Multi-Phase Reality: In a project with 4 phases, 15 features, and 200+ tasks, the agent will drift. It will forget to update SLICE_LOG. It will skip the approval gate. It will commit without the right message format. The status command shows you where you are. The health command shows you whether the agent is still following the rules. Together, they keep the project on track without you reading every file.
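One way to picture the health check is as a list of named rules evaluated against a record of the session. The sketch below is hypothetical throughout: the session flags, `CHECKS`, and `check_health` are invented for illustration, and only the first message is quoted from the harness; the other messages are placeholders.

```python
from typing import Dict, List

# (session flag, message reported when the flag is missing or False)
CHECKS = [
    ("read_constitution",
     "Agent did not read CONSTITUTION.md before action. This is a CRITICAL failure."),
    ("approval_recorded", "Agent started building without a recorded approval."),
    ("log_updated", "Agent did not update SLICE_LOG."),
    ("commit_format_ok", "Agent committed without the required message format."),
]


def check_health(session: Dict[str, bool]) -> List[str]:
    """Return one VIOLATION line per failed check; an empty list means healthy."""
    return [f"VIOLATION: {message}"
            for flag, message in CHECKS
            if not session.get(flag, False)]
```

A missing flag counts as a failure here, so a check passes only when the session explicitly records that the rule was followed.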

Git Flow — Branches and Commits

The harness manages git branches automatically. Every feature, bug fix, and change request gets its own branch. Commits follow a convention that ties every change back to the spec.

gitGraph
    commit id: "init"
    branch phase-1
    checkout phase-1
    commit id: "phase plan"
    branch feature/F001-session-mgr
    checkout feature/F001-session-mgr
    commit id: "spec(F001)"
    commit id: "design(F001)"
    commit id: "feat(F001): task 1.1"
    commit id: "feat(F001): task 1.2"
    commit id: "test(F001): acceptance"
    commit id: "verify(F001)"
    checkout phase-1
    merge feature/F001-session-mgr
    commit id: "chore: archive F001"
    branch feature/F002-streaming
    checkout feature/F002-streaming
    commit id: "spec(F002)"
    commit id: "feat(F002): SSE streaming"
    commit id: "verify(F002)"
    checkout phase-1
    merge feature/F002-streaming
    commit id: "chore: archive F002"
    branch bugfix/BUG-001-auth
    checkout bugfix/BUG-001-auth
    commit id: "fix(BUG-001): regression test"
    commit id: "fix(BUG-001): auth fix"
    checkout phase-1
    merge bugfix/BUG-001-auth

Commit convention: type(ID): description. Every commit traces back to a spec, a design, or a task. Six months later, git log --oneline tells the story of what was built and why.

The Convention: spec(F001), design(F001), feat(F001): task 1.1, test(F001), verify(F001), fix(BUG-001): description. The ID in the commit maps to the ID in the spec. Traceability from requirement to code, enforced by convention.
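The convention is regular enough to check mechanically. The following is a minimal sketch, assuming the commit types are exactly the six listed above and that IDs look like `F001` or `BUG-001`; the pattern and the `parse_commit` helper are illustrative and not part of harness-eng.

```python
import re
from typing import Dict, Optional

# type(ID) with an optional ": description" suffix.
# Types and ID shapes are inferred from the examples in the text (assumption).
COMMIT_RE = re.compile(
    r"^(?P<type>spec|design|feat|test|verify|fix)"
    r"\((?P<id>F\d{3}|BUG-\d{3})\)"
    r"(?:: (?P<description>.+))?$"
)


def parse_commit(message: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a commit subject into type, id, and description; None if it does not match."""
    match = COMMIT_RE.match(message)
    return match.groupdict() if match else None
```

Housekeeping commits such as "chore: archive F001" in the graph carry the ID in the description rather than in parentheses, so this sketch would reject them; a real check would need a second pattern or an explicit exemption.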

Limitations — What This Does Not Guarantee

No harness guarantees that a model will behave as expected. The agent may skip steps, forget to update logs, or deviate from the approved design. The harness makes the right path clear, but it cannot force the agent to follow it.

The human must steer. The health command catches drift. The status command shows where you are. But the human is still responsible for watching, correcting, and deciding. The harness is a guide rail, not an autopilot.

This is a reference implementation — not a production-tested tool. It works for the cases we have tested. It may need adaptation for your context. Use it as a starting point, not a finished product.

Where This Fits

harness-eng is Layer 3 in the four-layer sequence. It assumes Layers 1 and 2 have been done — the domain is understood, the vocabulary is agreed, the boundaries are drawn. It provides the execution discipline that prevents the confident wrong answer.

Layer 4 — the dark factory — is the horizon. It becomes viable when the domain understanding from Layers 1 and 2, combined with the execution discipline from Layer 3, produces specifications precise enough for full autonomy. That is not where most teams are today. Layer 3 is.

What Comes Next

The harness is the foundation. Two extensions are emerging that will complement it:

A context hub for enterprise applications. Andrew Ng's context-hub helps agents with the latest API documentation. The same approach applies to ERP systems and enterprise applications such as SAP, Oracle, and Salesforce. During large projects, the implementation team builds a context repository: vanilla application behaviour plus organisation-specific extensions and customisations. The harness constitution reads this repository. The agent starts with accurate domain knowledge, not generic training data.

Vector search and code intelligence. Tools such as Sourcegraph (code search) and Tree-sitter (syntax-aware parsing), exposed to the agent over MCP, can help coding agents find what they need quickly, reducing token burn and increasing accuracy. The harness does not replace these tools. It complements them. The harness controls the workflow. The tools improve the agent's ability to execute within that workflow.

Neither is a harness problem. Both enrich the harness. The workflow stays the same. The agent gets better at following it.

What You Get

harness-eng ships as files you copy into your project — no runtime, no server, no API keys. Clone the repo and you have:

Agent Compatibility

The harness works with any agent that can read and write files. We have tested it with open agents such as MiMo Code and OpenCode.

Other agents — Claude Code, Codex, Gemini CLI, Copilot, Cursor — have their own built-in conventions (CLAUDE.md, project rules, system prompts). These agents may partially follow the harness while defaulting to their own workflows, particularly at transition points between commands. We have not tested harness-eng with these agents and cannot claim compatibility. An agent with its own harness has a competing context; an agent without one adopts yours cleanly.

The harness is a protocol, not a plugin. It works because the files are the instructions. Any agent that can read files can follow it. The question is whether it follows your files or its own defaults when the two conflict.

Open for Validation

This repository is shared openly for external validation. We have tested it with a handful of agents and projects. Positive feedback from early users is encouraging, but limited. The real test is broader use — by teams with different contexts, different agents, different problems.

If you try it, let us know what breaks. The harness improves through honest challenge — which is, appropriately, what the AI Manifesto asks of its readers.

Companion Resources

Prefer email? Reach out at avonslabs [at] gmail [dot] com (AvonSlabs)  ·  Follow github.com/AvonS for future updates and related work in agentic engineering.