Working With Coding Agents: Principles for Reliable AI-Built Software

Nish · July 5, 2026

In late 2025, Andrej Karpathy described going from roughly 80% manual coding to 80% agent coding in a matter of weeks, calling it the biggest change to his workflow in two decades of programming. Coding agents like Claude Code and Codex can now explore a codebase, plan a change, write it, run it, and test it with minimal supervision. Yet the people getting reliably good output are not the ones with the cleverest prompts. They are the ones who noticed that the bottleneck has moved: writing code is now cheap, and deciding whether to trust it is the expensive part. This post distils what top practitioners converge on into seven working principles, aimed at anyone using these tools to build real things.

TL;DR: give the agent a check it can run, plan before you prompt, keep context small and specific, hunt the unknowns your spec missed, review by risk with a fresh-eyed reviewer, invest in the fundamentals that agents multiply, and stay accountable for what ships.

The bottleneck has moved

The data behind the shift is sobering. In his essay on agentic code review, Addy Osmani collects 2026 industry numbers: AI assistance produces around 4x the raw code output but only about 10% more delivered value, and defect rates in some studies jumped from 9% to 54% as review discipline failed to keep pace. As he puts it, “the hard part of engineering moved from writing code to deciding whether to trust it.”

So the principles below are not prompt tricks. They are the habits that separate casual vibe coding, fine for throwaway prototypes, from what Karpathy and others now call agentic engineering: using agents to ship software you would put your name on.

1. Give the agent a way to verify its work

This is the single highest-leverage habit. Karpathy’s observation is that modern agents are exceptionally good at looping until they meet a goal, so you should give them success criteria rather than step-by-step instructions. Anthropic’s Claude Code guidance frames it the same way: give the agent a check it can run, such as a test suite, a build exit code, a linter, or a screenshot compared against a design. Without one, “looks done” is the only signal available, and you become the verification loop; every mistake waits for you to notice it.

The corollary is to demand evidence, not assertions. Ask for the test output, the command that was run and what it returned, or a screenshot of the result. Kun Chen, the ex-Meta principal engineer behind the widely shared agentic workflow video, found that around 68% of his agent-written changes contained bugs before he built a validation pipeline that forces end-to-end evidence rather than trusting unit tests alone.

Rule of thumb: if you cannot verify it, do not ship it.
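A runnable check with captured evidence might look like the following minimal sketch. The names CheckResult, run_check, verify, and all_passed are hypothetical, invented for illustration; the only assumption is that each check is a command whose exit code is zero on success.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Evidence for one check: what ran, how it exited, what it printed."""
    command: list[str]
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        # Convention assumed here: exit code 0 means the check succeeded.
        return self.exit_code == 0


def run_check(command: list[str], timeout: float = 300.0) -> CheckResult:
    """Run one command and keep its exit code and combined output.

    Raises subprocess.TimeoutExpired if the command exceeds the timeout.
    """
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return CheckResult(command, proc.returncode, proc.stdout + proc.stderr)


def verify(commands: list[list[str]]) -> list[CheckResult]:
    """Run every check, even after a failure, so the report is complete."""
    return [run_check(command) for command in commands]


def all_passed(results: list[CheckResult]) -> bool:
    """True only if every check exited with code 0."""
    return all(result.passed for result in results)


# Illustrative usage with two trivial commands standing in for real checks.
demo = verify([
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
])
for result in demo:
    print(result.exit_code, " ".join(result.command))
```

In a real project the command list would hold the project's own test, build, and lint commands. The point of the sketch is that the agent hands back the command, the exit code, and the output rather than a claim that things work.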

2. Plan before you prompt

Every serious practitioner separates thinking from typing. Anthropic’s recommended workflow is explore, plan, implement, commit: let the agent read the relevant code and produce a written plan you can edit before any change is made. Simon Willison notes that longer changes, especially refactors, go much better when you have the model write a plan first and iterate on it, treating the plan as a kind of meta program.

Kun Chen makes the payoff concrete: plan quality determines how long an agent can run autonomously. A one-line prompt buys you minutes of useful work; a detailed, agreed spec can keep agents productive for hours. A related technique from Anthropic’s guide is to invert the flow entirely: give the agent a rough idea and have it interview you about edge cases, tradeoffs, and constraints until a complete spec exists, then execute that spec in a fresh session.

The caveat: planning has overhead. If you can describe the diff in one sentence, skip the plan and just ask.

3. Treat context as a scarce resource

An agent’s context window holds everything it can currently see, and performance degrades as it fills. Most practical guidance follows from this one constraint. Keep persistent instruction files like CLAUDE.md or AGENTS.md short and high-signal; for every line, ask whether removing it would actually cause mistakes, because bloated instruction files get ignored. Matt Pocock’s guide to these files takes the idea to its logical end: the ideal file is “as small as possible”, holding only the core, high-level facts the agent cannot derive from the code, and otherwise acting as an index that gives the agent “only what it needs right now” and points it to skills, rules, and docs it can load on demand. Treat it as a living file with a budget: every token in it is loaded on every request, models follow only a couple of hundred instructions consistently, and stale lines actively poison the context, so prune it as ruthlessly as dead code. Beyond the standing files, be specific in prompts: point at files, name the edge case you care about, and reference an existing pattern to imitate rather than describing style in the abstract.
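The budget idea for standing instruction files can be made mechanical. The sketch below is one possible approach, not a recommendation from any of the sources: the function name, the default limits of 150 lines and 2,000 tokens, and the four-characters-per-token estimate are all assumptions chosen for illustration, and the estimate is a crude heuristic rather than a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    instruction_lines: int  # non-blank lines in the file
    approx_tokens: int      # rough estimate, not a tokenizer count
    over_budget: bool


def check_instruction_budget(
    text: str,
    max_lines: int = 150,    # hypothetical limit; tune for your own project
    max_tokens: int = 2000,  # hypothetical limit; tune for your own project
) -> BudgetReport:
    """Report whether an instruction file exceeds a size budget."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: roughly four characters per token for English prose.
    approx_tokens = len(text) // 4
    over = len(lines) > max_lines or approx_tokens > max_tokens
    return BudgetReport(len(lines), approx_tokens, over)


sample = "Use pnpm.\n\nRun tests with make test.\n"
report = check_instruction_budget(sample)
print(report.instruction_lines, report.over_budget)
```

Run in CI against the instruction file, a check like this turns "keep it short" into something that fails loudly when the file grows. It cannot judge whether the remaining lines are still accurate, so pruning stale lines stays a human job.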

Hygiene matters too. Watch the opening moves of a task and interrupt early if the direction is wrong, because a bad start compounds. Clear context between unrelated tasks, and if you have corrected the agent twice on the same issue, stop correcting: a fresh session with a better prompt that incorporates what you learned almost always beats a long session polluted with failed attempts. Osmani goes further and argues the boundary between always-loaded and on-demand context is a real architectural decision that should be reviewed and versioned like code.

4. The map is not the territory

The X post that partly prompted this article is Thariq’s “A Field Guide to Fable”, and its central idea generalises to every agent: your prompt, spec, and context files are a map, while the actual work is the territory, and the quality of the output is bottlenecked by your ability to close the gap between them. He sorts the gaps into known unknowns you can ask about, unknown knowns you would recognise but never thought to write down, and unknown unknowns nobody has considered yet.

The practical moves are simple. Before implementation, run a blind-spot pass: ask the agent what is ambiguous, what it is assuming, and what could go wrong. During implementation, have it keep notes on where reality forced it to deviate from the plan. Afterwards, have it explain the change back to you, because Karpathy’s warning applies here: today’s models rarely make syntax errors, but they make the subtle conceptual mistakes of a hasty junior developer, including confidently building on assumptions they never checked.

5. Review by risk, not by author

Reviewing every agent diff line-by-line does not scale, and rubber-stamping is worse. Osmani’s answer is to tier review depth by blast radius and code lifespan: a config tweak on an internal tool needs a glance, while anything touching payments, auth, or data deletion gets types, tests, multiple reviewers, and a human sign-off.

Two mechanics make agent review much stronger. First, separate the writer from the reviewer: a fresh agent in a fresh context, seeing only the diff and the acceptance criteria, catches problems the session that wrote the code is blind to. This is exactly Kun Chen’s setup, and Anthropic ships it as an adversarial review step. Second, be suspicious of test changes. A known agent failure mode is rewriting a failing assertion to match broken behaviour instead of fixing the bug, so diffs that touch tests deserve your closest human attention. And beware what Osmani calls borrowed confidence: several AI reviewers agreeing “looks good” with no human anywhere in the loop means nobody actually understood the change.
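As a rough illustration of tiering by blast radius and flagging test changes, a small script can classify a diff from the paths it touches. Everything specific here is an assumption: the directory names, the file suffixes, the three tier names, and the idea that paths alone are a good enough proxy for risk. A real setup would encode the team's own rules.

```python
from dataclasses import dataclass
from enum import IntEnum
from pathlib import PurePosixPath


class ReviewTier(IntEnum):
    GLANCE = 1    # e.g. a config tweak on an internal tool
    STANDARD = 2  # ordinary code changes
    DEEP = 3      # high blast radius: extra reviewers and human sign-off


# Hypothetical conventions; adjust to the repository's real layout.
HIGH_RISK_DIRS = {"payments", "auth", "migrations"}
LOW_RISK_SUFFIXES = {".md", ".yaml", ".yml", ".toml"}
TEST_DIRS = {"test", "tests"}


@dataclass
class ReviewPlan:
    tier: ReviewTier
    test_files: list[str]  # changed tests that a human should read closely


def is_test_file(path: str) -> bool:
    """True if the path sits in a test directory or is named test_*."""
    p = PurePosixPath(path)
    return bool(TEST_DIRS & set(p.parts[:-1])) or p.name.startswith("test_")


def tier_for_path(path: str) -> ReviewTier:
    """Classify one changed path; high-risk directories win over suffixes."""
    p = PurePosixPath(path)
    if HIGH_RISK_DIRS & set(p.parts[:-1]):
        return ReviewTier.DEEP
    if p.suffix in LOW_RISK_SUFFIXES:
        return ReviewTier.GLANCE
    return ReviewTier.STANDARD


def plan_review(changed_paths: list[str]) -> ReviewPlan:
    """The whole diff gets the tier of its riskiest file."""
    tiers = [tier_for_path(path) for path in changed_paths]
    return ReviewPlan(
        tier=max(tiers, default=ReviewTier.GLANCE),
        test_files=[path for path in changed_paths if is_test_file(path)],
    )


plan = plan_review(["src/payments/refund.py", "tests/test_refund.py"])
print(plan.tier.name, plan.test_files)
```

The diff inherits the tier of its riskiest file, and changed tests are listed separately so they are read by a person whatever the tier.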

6. Agents amplify your engineering fundamentals

Willison’s “vibe engineering” essay makes a point that is easy to miss in the hype: agents are a force multiplier on exactly the boring practices good teams already had. A robust test suite lets agents iterate safely. Good CI catches their regressions. Clear documentation lets them navigate your APIs without reading every source file. Tight version control makes their mistakes cheap to undo, and agents are genuinely good at tools like git bisect. Preview environments let you exercise their work before users do.

A codebase with none of these does not become productive by adding an agent; it becomes chaotic faster. Willison also names a newer skill on the same list: judgment about what to delegate at all. Knowing which tasks suit an agent and which still need your own hands is itself a fundamental, and one you have to keep updating as the tools improve. He is honest about the learning curve too: using LLMs to write code well is difficult and unintuitive, and anyone who tells you it is easy is misleading you.

The amplifier also runs in reverse: used carelessly, agents erode the very fundamentals they depend on. In a randomized trial at Anthropic, junior engineers who learned a new library with AI assistance scored 17% lower on a follow-up comprehension quiz than those who coded by hand, with the widest gap in debugging. Crucially, the damage tracked how the tool was used, not whether it was used: engineers who purely delegated code generation understood the least, while those who asked follow-up questions and requested explanations alongside the code stayed fast and kept learning. That is the deeper case for the explain-it-back checks in principle 4: they protect your understanding, not just the diff.

7. Stay accountable for what ships

Every source ends in the same place: the human role shifts from typist to owner. You provide the judgment about whether this is the right thing to build, catch the unwritten requirements no spec captured, and own the merge decision. Osmani’s framing for production work is “set the bar at the eval, not the demo”: a feature that works once in a happy-path demo is not the same as one that passes a verification standard you defined in advance.

Ownership is also much easier to keep when the why is written down. Osmani calls the missing piece intent debt: technical debt lives in the code and cognitive debt in your head, but intent debt lives in the artifacts you never wrote, the goals, constraints, and rationale for why the system is the way it is. Agents make this debt expensive because they start every session cold, and when the why is missing they will invent a confident-sounding reason rather than admit they do not know. The good news is that the habits above pay it down as a side effect: specs that capture intent rather than implementation, implementation notes from the map-territory loop, and decisions logged when they are made become a ledger the next session, agent or human, can actually read.
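One lightweight way to log decisions when they are made is an append-only file with one JSON object per line. This is only a sketch under stated assumptions: the field names, the JSON Lines format, and the function names log_decision and read_decisions are hypothetical choices, not something the sources prescribe.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_decision(
    log_path: Path,
    decision: str,
    rationale: str,
    constraints: tuple[str, ...] = (),
) -> dict:
    """Append one decision, with its why, to a JSON Lines ledger."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "rationale": rationale,
        "constraints": list(constraints),
    }
    # Append mode keeps earlier entries intact: the ledger only grows.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_decisions(log_path: Path) -> list[dict]:
    """Load every logged decision, oldest first; empty if no log exists."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A session, agent or human, can call read_decisions at the start of a task to recover why the system looks the way it does, and call log_decision whenever a tradeoff is settled. Because the file is plain text, it can be versioned and reviewed alongside the code.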

Karpathy’s closing observation is worth carrying with you. The main effect of these tools is not that existing work gets faster, but that far more becomes worth attempting, including work in domains you would previously never have touched. The leverage, though, flows to people with engineering taste: those who know what good looks like, insist on verification, and treat the agent as a tireless but overconfident junior colleague rather than an oracle.

Citation Information

If you find this content useful, please cite this work as:

Bhana, Nish. "Working With Coding Agents: Principles for Reliable AI-Built Software". Nish Blog (July 2026). https://www.nishbhana.com/Working-With-Coding-Agents/

Or use the BibTeX citation:

@article{bhana2026workingwith,
  title   = {Working With Coding Agents: Principles for Reliable AI-Built Software},
  author  = {Bhana, Nish},
  journal = {nishbhana.com},
  year    = {2026},
  month   = {July},
  url     = {https://www.nishbhana.com/Working-With-Coding-Agents/}
}
