Secret Agent


Last week I started working on a project, and this time I wanted to see what would happen if I did what all the “cool kids” are doing and “vibe coded” most of it. Let me just say it was an absolute disaster. I ended up with about 18k lines of iffy spaghetti code that only kind of does what I asked, is chock full of bugs, and isn’t anything I would ever willingly publish in my name. Just awful.

But how everything turned out that way, and what I learned in the process, is what makes it interesting. Everyone who thinks coding is “over” is partially right, but there’s a secret to using LLMs that nobody really talks about, so I’m here to spill the beans.

The Setup

I’m not a complete imbecile; I know better than to simply give an LLM some vague instructions and let it loose. I spent two days painstakingly crafting a very thorough specification for what I wanted. Full instructions on how the configuration file looks, procedural transition and end states, important business logic, API endpoints, everything I could possibly conjure. I asked Claude Code to check it for anything I may have missed, or areas that were too vague, or anything else it may have needed to proceed, until it was solid as a rock.

The full specification ended up about 30 pages in length. It truly is a thing of beauty. Then I had Claude produce a full checklist of everything it would need to build in order to fully realize this wondrous new creation. The checklist consisted of over 600 items that Claude would need to complete before the project could be considered an alpha, at which point I’d spend some time cleaning things up. Or so I’d hoped.

Everything was ready. And then I sat on it for the rest of the week because I had other things I needed to finish. I wanted few distractions while I worked on my masterpiece. And then I started, and everything went sideways.

The Problem

If you’re not familiar with how LLMs work, I won’t go into detail about that here. But I can tell you they’re absolute idiots. They’re also lightning-fast coding savants when given simple tasks. It’s actually kind of terrifying just how capable an LLM is at one-shotting some basic procedure. If I wanted a small library, I could spend a day researching the other libraries I’d need to use, build a small scaffold, run it through a few tests until it does what I want, clean it up, and call it a day.

An LLM can do that in about five minutes.

Ironically, it’s this very power that completely cripples current LLM adaptability. Two relatively recent studies cover how and why this happens.

This “prior response anchoring” basically means any long or complicated interaction is doomed to eventually degrade into nonsense. I vaguely understood this before I started, but didn’t give it the attention it deserved. I had (wrongly) assumed Claude Code, a product built specifically for long complicated interactions, would have accounted for this. Even then, I leveraged my checklist to combat the effect. I just needed to start a new shell every once in a while to prevent the context from getting too polluted, and everything would turn out alright.

No. Wrong. Absolutely not. But why?

What I Missed

Coding isn’t just complicated. Every project library, sub-module, API, and shred of documentation covers dozens of topics, sometimes in a vast multitude of skill domains. That’s why an LLM can competently produce a small library or simple app in a few minutes—there’s far less opportunity for conceptual drift.

I often equate this skill level with that of a junior developer. I would have to be insane to hand my specification and checklist to a junior developer and expect them to produce the final product. Even assuming they built a workable library scaffold, moderately sane APIs, understandable modularity, and full test coverage, the result would be a hellish spaghetti of criss-crossing dependencies that barely worked, if at all. More than likely, they would give up halfway and run to me for help.

The primary difference with an LLM is that this happens in a few days rather than several weeks or months. Oh, and unlike the junior dev, the LLM won’t give up. Ever. Even when they’re completely wrong, they’re tenacious to a fault. Was there something I missed in the spec? Don’t stop and ask, just do exactly what the spec said anyway, even if that means breaking some fundamental functionality. Did you get lost in the code or need something where it wasn’t implemented? That’s fine, just add a function call and include a completely unrelated library from another part of the code that would be ridiculous in normal circumstances.

If an LLM is lost in a maze of its own creation, it will simply smash through an existing wall to reach its destination. It forgot how it got to its current position and it sees an impediment, so it will perform miracles to proceed anyway. This kind of focus is great when harnessed properly. In a long project that’s halfway done, it’s an utter disaster.

This screwed me precisely because I treated the LLM more like a senior-level dev, expecting competency where there was none. LLMs require precise and constant hand-holding.

The Fix

Luckily the fix is shockingly simple: just do less work.

I ran some experiments last night until about midnight because this whole scenario had really triggered my autism. I often use Ollama to run models locally on my GPU. Google recently released a model named Gemma4, and I spent some time comparing it to other models.

Some are definitely better than others, though Gemma4 and Qwen3.5 were nearly always at the head of the pack. I pushed them up to 64k or even 128k of context to make sure they had a good token budget for big prompts, their chain-of-thought, and the response. Then I ran four tests.
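Pushing the context window up like that can be done per-request through Ollama’s HTTP API, which accepts a `num_ctx` value in its `options` field. Here’s a minimal Go sketch that only builds the request body for `/api/generate` — the model name comes from my tests above, and the prompt is a placeholder:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GenerateRequest mirrors the subset of Ollama's /api/generate
// request body needed to raise the context window.
type GenerateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Options map[string]any `json:"options"`
}

func buildRequest(model, prompt string, numCtx int) ([]byte, error) {
	req := GenerateRequest{
		Model:  model,
		Prompt: prompt,
		Options: map[string]any{
			// num_ctx is Ollama's context-length option; 65536 = 64k tokens.
			"num_ctx": numCtx,
		},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildRequest("qwen3.5", "Refactor the following Go files...", 65536)
	if err != nil {
		panic(err)
	}
	// POST this body to http://localhost:11434/api/generate to run the model.
	fmt.Println(string(body))
}
```

The same option can also be baked into a model permanently with a Modelfile, but the per-request form is handier for experiments like these.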

  1. Write a short Go library to manage three files in a configuration folder. The library should rename an existing foo.conf to foo.base.conf, create a new project.conf, and a new foo.conf that included both the base and project config. It should bootstrap a settings struct with a passed map, and write them “ini style” to the project conf when requested.
  2. Refactor the project/config/*.go files from Claude Code. This was three files at about 500 lines.
  3. Refactor the project/engine/*.go files from Claude Code. This was seven files at about 1200 lines.
  4. Refactor the project/agent/*.go files from Claude Code. This was 17 files at about 2500 lines.

Keep in mind that these are all local models running on a 24GB GPU at or below 30B parameters. That’s child’s play compared to the frontier models. Here’s how it went for each test:

  1. All models did an OK job. I actually used the output from Qwen3.5, though I did tweak it a bit. This was the “easy” test, and the result ended up being around 100-120 lines in all cases. Gemma4 actually produced a better result that addressed a potential threading race condition it foresaw, but I don’t plan on allowing that kind of usage, so I skipped its suggestions.
  2. Another test they all passed, but quality started to suffer. Gemma4 was particularly notable here because it actually recommended an enhancement that essentially cut the code in half. A local model on my GPU produced better code than Claude Opus 4.6. Cogito and devstral-small basically just fixed some hallucinated misspellings in the code comments and called it a day. Uh, thanks I guess.
  3. Only Qwen3.5 and Gemma4 passed this one. All other models basically responded with the pasted code as if they’d accomplished something. I modified the prompt several times to repeatedly remind them to actually refactor the code rather than just reorganize it, but none seemed capable of doing so. The fact that Qwen3-Coder failed here was particularly embarrassing.
  4. All failed. Gemma4 was the closest to passing, and its excuse for not producing the full result is that doing so would exceed the context window. Even at 128k tokens, it steadfastly refused to fully rewrite the 2500 lines into something more coherent. But it did significantly improve the design. The structure was cleaner, it had better role separation, and even better thread safety. It’s just that I would have to apply that design to the code myself. A valiant attempt, but alas.

Less code consistently produced better results. The newer and more capable models could push a bit further, but they all started to falter at around 1000 lines of code. My suspicion is that even frontier models can only really produce and maintain 3000-5000 LoC before things start to get iffy. My still-unfinished project is around 18k lines, and that clearly exceeded Claude’s capabilities, or at least how I was using it.

Less is More

So how do you “just do less work”? The answer was in the directory structure Claude came up with before it started writing any code: sub-modules. There are about 20 subfolders in this particular codebase, none of which are above 3000 lines. And remember, some of that code is excessive braindead spaghetti from Claude flailing around once it got overwhelmed.

My spec wasn’t enough because it was for the project as a whole. The checklist, too, was too comprehensive. I couldn’t realistically restart Claude between every checklist item to avoid context drift, and even if I did, that’s not the right approach. What I should have done was write some basic pseudo-code for each sub-module.

Remember, LLMs are great at single-purpose output. As a senior dev, I should have not only set functionality expectations but also defined the basic code framework: which structs exist as the system interfaces, what each module literally does, and a minimal library scaffolding. Then even a small local LLM can address each individual item on a per-function basis if necessary. Something like Claude or Codex should be able to handle the whole module.
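As a sketch of what “defining the basic code framework” could look like, here’s a hypothetical sub-module skeleton — the structs, interface, and names are all illustrative, not from my actual project. The human writes this part; each stubbed method body is then a single-purpose task an LLM can one-shot:

```go
package main

import "fmt"

// Event is the unit of work flowing between modules (illustrative).
type Event struct {
	ID      string
	Payload map[string]string
}

// Store is the persistence seam other modules depend on. Defining it
// up front means the LLM fills in behavior, not architecture.
type Store interface {
	Save(e Event) error
	Load(id string) (Event, error)
}

// Engine owns the business logic; it only sees the Store interface.
type Engine struct {
	store Store
}

func NewEngine(s Store) *Engine { return &Engine{store: s} }

// Process is a stub the LLM fills in per the module's slice of the spec.
func (e *Engine) Process(ev Event) error {
	return e.store.Save(ev) // TODO: real transition logic goes here
}

// memStore is a throwaway in-memory Store used to wire up the scaffold.
type memStore struct{ events map[string]Event }

func (m *memStore) Save(e Event) error { m.events[e.ID] = e; return nil }
func (m *memStore) Load(id string) (Event, error) {
	e, ok := m.events[id]
	if !ok {
		return Event{}, fmt.Errorf("not found: %s", id)
	}
	return e, nil
}

func main() {
	eng := NewEngine(&memStore{events: map[string]Event{}})
	if err := eng.Process(Event{ID: "e1"}); err != nil {
		panic(err)
	}
	fmt.Println("scaffold wired up")
}
```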

The trick is separation. Every bit of code is small and easily digestible. What I don’t really understand is why this isn’t the default operating mode of tools like Claude Code. The main conversation should never write any code, ever. It shouldn’t even be an option. The devs at Anthropic should know about the anchoring limitation and build the fix into the platform. It should always launch a sub-agent to handle any minor task. Item in the checklist? Sub-agent. One-shot the item and exit. Write a test? Sub-agent. Document a module? Sub-agent. It should only ever coordinate the output from sub-agents.

The principal coordination thread keeps the “big picture” from the spec at the forefront, and sub-agents succeed or fail without polluting the original goal. No more mostly irrelevant minutiae from micro-managing the code itself. No more getting lost in hundreds of individual files and tens of thousands of lines of code. Yes, you can create skills and dedicated sub-agents for various tasks, but this is basic functionality that it should just do out of the box. Sub-agents are fresh context without anchoring bias, and it should always use them for everything.
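The coordination pattern itself is easy to sketch. In this Go toy version, spawning a sub-agent is mocked with a plain function value — a real implementation would launch a fresh LLM session per task; everything here is illustrative:

```go
package main

import "fmt"

// Task is one checklist item handed to a sub-agent.
type Task struct {
	Module string
	Goal   string
}

// SubAgent is a hypothetical stand-in for launching a fresh LLM
// session that sees only this task's context.
type SubAgent func(t Task) (string, error)

// coordinate keeps only the big-picture plan; every task runs in a
// fresh sub-agent so one task's detours never pollute another's context.
func coordinate(checklist []Task, spawn SubAgent) []string {
	var results []string
	for _, t := range checklist {
		out, err := spawn(t) // fresh context per item
		if err != nil {
			results = append(results, fmt.Sprintf("%s: FAILED (%v)", t.Module, err))
			continue
		}
		results = append(results, fmt.Sprintf("%s: %s", t.Module, out))
	}
	return results
}

func main() {
	checklist := []Task{
		{Module: "config", Goal: "write the ini loader"},
		{Module: "engine", Goal: "implement state transitions"},
	}
	// Mock sub-agent; a real one would call out to an LLM.
	mock := func(t Task) (string, error) { return "done: " + t.Goal, nil }
	for _, r := range coordinate(checklist, mock) {
		fmt.Println(r)
	}
}
```

The point of the pattern is that `coordinate` never touches code itself; it only aggregates results, which is exactly the behavior I wish the tools shipped with by default.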

So that’s the answer: just a bit more planning before handing the rest to the LLM. Then it will essentially build the same boilerplate output it’s produced a billion times before, one file at a time if necessary. It’s a cleaner, more maintainable, and more scalable approach. Never let any of the current models work on an entire project, just a subfolder. If there’s a missing field or function somewhere in another semi-related part of the project, start a new instance to address that specific shortcoming.

I suspect that eventually coding tools will catch up and start doing this natively. They’ll transform the spec into many sub-folders, populate each sub-folder with the relevant portion of the spec, launch a sub-agent to plan the interface and structure for each library, and then allocate a sub-agent for each file in that design. Until that happens, you kind of have to do it yourself. That’s how young this technology is right now; we’re really on the bleeding edge here. It’s just as exhausting as it is invigorating.

Until Tomorrow