Harness Engineering

Harness engineering is the discipline of improving everything around an AI model that turns it into a useful agent: prompts, tools, context policies, hooks, sandboxes, subagents, feedback loops, memory, observability, and recovery paths. Addy Osmani summarizes the core idea as an equation, Agent = Model + Harness: a raw model becomes an agent only when the harness gives it state, tool execution, feedback loops, and enforceable constraints.^{source: addy-osmani-agent-harness-engineering-2026.md}

The central habit is a ratchet: every observed agent failure should become a durable harness improvement, not just a retry. Examples include adding a convention to AGENTS.md, blocking destructive shell commands with hooks, splitting long tasks into planner/executor roles, or wiring type checks and tests back into the agent loop.^{source: addy-osmani-agent-harness-engineering-2026.md}
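As one illustration of the ratchet, a failure such as "the agent deleted a working directory" could become a pre-tool hook that refuses similar commands from then on. The sketch below is a minimal, hypothetical example: the function name `pre_tool_hook` and the deny-list patterns are assumptions for illustration, not part of any particular agent framework.

```python
import re

# Hypothetical deny-list. A real harness would grow this from its own
# observed failures rather than start from a fixed set.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-\w*r\w*f\w*\b",       # rm -rf and variants such as -rfv
    r"\brm\s+-\w*f\w*r\w*\b",       # rm -fr
    r"\bgit\s+push\b.*--force\b",   # force pushes
    r"\bgit\s+reset\s+--hard\b",    # discarding local work
]


def pre_tool_hook(command: str) -> tuple[bool, str]:
    """Return (allowed, message); the message is empty when allowed."""
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return False, f"blocked by harness rule {pattern!r}: {command}"
    return True, ""
```

Each new incident adds a pattern instead of a one-off retry, so the protection persists across sessions. A regex deny-list is easy to bypass and is only a starting point; a sandbox remains the stronger boundary.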

Important design patterns:

Behavior-first design: start from the behavior wanted, then add only the harness component that produces it.
Filesystem and Git: durable state, coordination surface, versioning, branching, and rollback.
Bash/code execution: general-purpose tool creation and verification within a ReAct loop.
Sandboxes: safe execution with useful defaults for language runtimes, test CLIs, and browsers.
Memory/search: inject stable project knowledge and retrieve current external facts.
Context management: compaction, tool-call offloading, and progressive disclosure to fight context rot.
Long-horizon execution: loops, planning artifacts, and separate generator/evaluator agents.
Hooks: deterministic enforcement before/after tools, edits, and commits; ideally silent on success and verbose on failure.^{source: addy-osmani-agent-harness-engineering-2026.md}
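The tool-call offloading item in the list above can be sketched as follows, assuming a harness that writes large tool outputs to disk and puts only a preview plus a file path into the model's context. The threshold `MAX_INLINE_CHARS` and the function name `offload_tool_output` are hypothetical.

```python
from pathlib import Path

MAX_INLINE_CHARS = 200  # hypothetical threshold; tune to the context budget


def offload_tool_output(output: str, directory: Path, name: str) -> str:
    """Return what should enter the context for one tool result.

    Short outputs pass through unchanged. Long outputs are saved to a
    file and replaced by a preview and a pointer the agent can follow.
    """
    if len(output) <= MAX_INLINE_CHARS:
        return output
    path = directory / f"{name}.txt"
    path.write_text(output, encoding="utf-8")
    preview = output[:MAX_INLINE_CHARS]
    return (
        f"{preview}\n"
        f"[truncated: {len(output)} chars total, full output saved to {path}]"
    )
```

The agent can still read the full file later with its normal file tools, which is a simple form of progressive disclosure.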

For this wiki and hermes-agent, harness engineering is directly relevant because skills, AGENTS.md, persistent memory, tool discipline, cron jobs, and linting form the harness that lets the agent maintain a self-improving-knowledge-base. The llm-wiki-pattern itself is an example of turning repeated knowledge-management failures into durable scaffolding.

Mark Erikson gives a concrete software-engineering version of the same thesis: LLM non-determinism becomes useful only when surrounded by deterministic scaffolding such as tests, typechecking, linting, CI, static analysis, explicit plans, prompt/context files, and human review. In this framing, ai-assisted-software-development works best when the harness reduces what the model has to invent and turns repeated knowledge into scripts, tools, and guardrails.^{source: mark-erikson-ai-thoughts-part-1-2026.md}
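A minimal sketch of that deterministic scaffolding, assuming the harness runs a set of check commands after each agent edit and feeds back only the failures. The function name `collect_feedback` and the shape of the `checks` mapping are assumptions; the commands themselves would be whatever test, typecheck, and lint tools the project already uses.

```python
import subprocess


def collect_feedback(checks: dict[str, list[str]]) -> str:
    """Run each check command and return a failure report for the agent.

    Returns an empty string when every check passes, so successful runs
    add nothing to the agent's context.
    """
    failures = []
    for name, argv in checks.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            report = f"[{name}] exit {result.returncode}\n{result.stdout}{result.stderr}"
            failures.append(report.rstrip())
    return "\n\n".join(failures)
```

A call might look like `collect_feedback({"tests": ["pytest", "-q"], "lint": ["ruff", "check", "."]})`, assuming those tools are installed in the project.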

Peter Yang's personal-agents framing adds a UX constraint: the harness must become powerful enough to do real work while becoming invisible enough that users do not need to understand APIs, MCP servers, CLIs, worktrees, or tool plumbing.^{source: peter-yang-chat-era-ending-2026.md}

Garry Tan adds a compounding-system version of harness engineering: the harness should stay thin, while skills, code, and data become fat. In his framing, skillification is the ratchet that turns repeated workflows into reusable skills, and the gbrain data layer gives those skills enough personal context to behave like an operating system rather than a chatbot.^{source: garry-tan-meta-meta-prompting-ai-agents-2026.md}

Osmani's cognitive-surrender article reframes verification as a cognitive-safety requirement, not just a QA ritual. Evidence-based exits, anti-rationalization tables, smaller PRs, conceptual inquiry before generation, and deliberate friction all preserve the human engineer's independent model while still using agents for speed.^{source: addy-osmani-cognitive-surrender-2026.md}

Output format is part of the harness. Thariq argues that html-artifacts can be a stronger coordination surface than markdown for large agent outputs because they support diagrams, layout, interactivity, annotated diffs, and sharing; this can keep humans in the loop during planning, review, design, and verification.^{source: thariq-unreasonable-effectiveness-html-2026.md}

shopify-river adds an organizational harness pattern: force agent use into public, searchable channels so conversations become training material, reusable context, and social review. In that design, Slack visibility and channel-specific instructions are not incidental UI choices; they are harness components that make the agent and the organization learn together.^{source: tobi-lutke-learning-shop-floor-river-2026.md}

Nakazawa's modern-engineering-values essay emphasizes the repo as a harness boundary. If each agent session is like a new engineer arriving without organizational memory, then tests, lint rules, fast changed-file tools, design docs, product principles, and taste encoded in the repository are not bureaucracy; they are the local context and feedback loops that let agents and humans move quickly without flooding the codebase with slop.^{source: matt-van-horn-wtf-is-a-loop-2026.md}

The retail autonomy case makes the same harness problem visible outside software development. In agentic-ai-in-retail, agents that can change prices, move inventory, or communicate with customers need a semantic business backbone, permission boundaries, governance, EU AI Act compliance, and human escalation for risky decisions; otherwise "agentic" becomes unsafe automation rather than reliable delegated judgment.^{source: ft-sopra-steria-agentic-ai-retail-2026.md}

Van Horn's agent-loops article makes the long-horizon execution pattern more explicit: the loop is now a harness object in its own right, with scheduling, durable state, self-verification, budget ceilings, no-progress detection, and sometimes supervisor agents coordinating worker agents. In this framing, the expensive engineering work moves from the prompt to the feedback and stopping structure around repeated agent calls.^{source: matt-van-horn-wtf-is-a-loop-2026.md}
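The stopping structure described here can be sketched as a small loop controller with three exits: verification passed, budget exhausted, or no progress. This is an illustrative sketch, not Van Horn's implementation; `run_loop`, `LoopResult`, and the `(cost, score)` contract for `step` are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    reason: str       # "verified", "budget", or "no_progress"
    iterations: int
    spent: float


def run_loop(
    step: Callable[[], tuple[float, float]],
    verify: Callable[[], bool],
    budget: float,
    patience: int = 3,
) -> LoopResult:
    """Call step() until verify() passes, the budget is spent, or the
    score has failed to improve for `patience` consecutive iterations.

    step() stands in for one agent call and returns (cost, score).
    """
    spent = 0.0
    best = float("-inf")
    stale = 0
    iterations = 0
    while True:
        if spent >= budget:
            return LoopResult("budget", iterations, spent)
        cost, score = step()
        iterations += 1
        spent += cost
        if verify():
            return LoopResult("verified", iterations, spent)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return LoopResult("no_progress", iterations, spent)
```

The budget is checked before each step, so the final step may overshoot the ceiling by at most one step's cost; a stricter harness would estimate cost before calling.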

Osmani later distinguishes agent-loops as the layer above the harness: if the harness makes a single agent reliable, loop engineering adds cadence, work discovery, worktree isolation, skill calls, connectors, sub-agent verification, and durable state so the system can prompt agents and resume across runs without the human typing every next step.^{source: nvidia-enpire-agentic-robot-policy-self-improvement-2026.md}

Noam Brown's test-time-compute-evaluations article adds a benchmark-level version of the harness thesis: scaffolds and loops can unlock additional capability simply by spending more inference at test time. That means a model-plus-harness system may benchmark very differently from the same base model at a fixed small budget, and evaluation reports should state or sweep the allowed token, cost, or time budget.^{source: nvidia-enpire-agentic-robot-policy-self-improvement-2026.md}
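One way to follow that reporting advice is to sweep the budget and record it next to every score. The sketch below assumes a hypothetical `solve(task, budget)` callable that runs the model-plus-harness system under a token budget and returns whether the task passed; no real benchmark or result is implied.

```python
from typing import Callable, Sequence


def sweep_budgets(
    solve: Callable[[str, int], bool],
    tasks: Sequence[str],
    budgets: Sequence[int],
) -> list[dict]:
    """Evaluate the same tasks at each budget and report both together."""
    if not tasks:
        raise ValueError("need at least one task")
    rows = []
    for budget in budgets:
        solved = sum(1 for task in tasks if solve(task, budget))
        rows.append({"budget_tokens": budget, "pass_rate": solved / len(tasks)})
    return rows
```

Reporting the full list of rows, rather than a single number, makes it visible when a score depends on how much inference the harness was allowed to spend.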

ENPIRE makes the same thesis concrete in robotics: a frontier coding model is not enough to improve a physical policy unless the harness exposes reset, safety, verification, rollout, logging, and branch comparison as reliable interfaces. Its Environment, Policy Improvement, Rollout, and Evolution modules show that real-world autonomy emerges from model plus agent-operable environment plus evaluation loop, not from model capability alone.^{source: nvidia-enpire-agentic-robot-policy-self-improvement-2026.md}

Fan's ENPIRE follow-up is a physical-world harness-engineering checklist: safety constraints must be physical and programmatic, /done must be frozen before the agent can optimize, and telemetry must expose robot, GPU, and token bottlenecks to the loop. That makes physical-autoresearch a particularly demanding case of harness engineering because bad harness design wastes scarce robot-seconds or creates unsafe physical behavior.^{source: jim-fan-physical-autoresearch-loopcraft-2026.md}

Ronacher's "The Coming Loop" adds a cautionary harness requirement: as task queues, durable sessions, subagents, and orchestration become normal, harnesses must not merely run more machines. They need ways to make loop behavior legible over time, jolt humans back into meaningful review, preserve strong invariants, and prevent the human role from collapsing into a messenger between one machine that says "done" and another machine that judges it.^{source: armin-ronacher-the-coming-loop-2026.md}

The bayer-prince case provides a production enterprise version of the same idea. PRINCE's harness is not just its prompts: LangGraph defines the workflow, PostgreSQL checkpoints agent state, DynamoDB stores application state, retries happen at both call and node level, model providers have fallbacks, Langfuse traces production traffic, and citations plus intermediate-step visibility keep users able to verify answers. In this framing, production-llm-reliability is harness engineering applied to regulated research.^{source: martin-fowler-bayer-reliable-agentic-ai-systems-2026.md}
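The call-level retry and provider-fallback pieces can be sketched in plain Python. This is not PRINCE's code and does not use the LangGraph API; `call_with_retry`, `call_with_fallback`, and the provider callables are hypothetical stand-ins, and node-level retries and checkpointing are left out.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int) -> T:
    """Call fn up to `attempts` times and re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    attempts_per_provider: int = 2,
) -> str:
    """Try each provider in order, retrying each before moving on."""
    errors = []
    for provider in providers:
        try:
            return call_with_retry(lambda: provider(prompt), attempts_per_provider)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"all providers failed: {errors}")
```

A production version would typically retry only on transient errors and add backoff between attempts; both are omitted here to keep the sketch short.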

Agentic RAG also makes context engineering part of the harness. PRINCE deliberately separates planning context, retrieval context, evidence context, and synthesis context instead of stuffing every available document into one large prompt; this reduces context pollution and makes agentic-rag easier to debug and evaluate.^{source: martin-fowler-bayer-reliable-agentic-ai-systems-2026.md}
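A minimal sketch of that separation, assuming each stage reads only from its own bucket when a prompt is built. The four stage names come from the description above; the `StageContexts` class and `build_prompt` function are hypothetical and do not reflect PRINCE's actual data model.

```python
from dataclasses import dataclass, field

STAGES = ("planning", "retrieval", "evidence", "synthesis")


@dataclass
class StageContexts:
    planning: list[str] = field(default_factory=list)
    retrieval: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    synthesis: list[str] = field(default_factory=list)


def build_prompt(contexts: StageContexts, stage: str, question: str) -> str:
    """Build a prompt that includes only the named stage's context."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    items = getattr(contexts, stage)
    body = "\n".join(f"- {item}" for item in items)
    return f"[{stage}]\nQuestion: {question}\nContext:\n{body}"
```

Because each prompt draws from one bucket, a bad answer can be traced to the stage whose context produced it, which is the debugging benefit the paragraph describes.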

Resources