I love the concept of using the AI to draft its own spec from a high-level brief!
I like the TOC for specs, good idea. Thanks for such a detailed article.
I came up with a solution to this problem over my Christmas break and built a free, open-source tool that works with any LLM. It's built around the concept of needs, observations, and outputs: a rule-based system the LLM reflects on before it codes. It's a brief, succinct notation system that sits as a knowledge graph the LLM refers to before and after it codes. The tool generates glyph chains, which form the knowledge graph; each chain represents a need all the way down to an output, and the LLM reflects on these explicit rules. I'm able to get my 30-billion-parameter Qwen3 model to act like a frontier model while using one fifth of the tokens. This has removed architectural drift from the equation and eliminates the need to have the LLM constantly reference spec documents and task lists, which bloats the context window and can still generate drift. (A rough sketch of the idea follows after the links below.)
https://github.com/danaia/archeonGUI
Walkthrough:
https://youtu.be/YtNKRKn5FEs?si=O8PKV3EtC7Gc5RG8
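To make the idea concrete, here's a minimal sketch of what a glyph chain might look like as a data structure. The names (`Glyph`, `render`, the example chain) are illustrative only, not the tool's actual notation; see the repo above for the real format.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    kind: str  # "need", "observation", or "output"
    text: str  # terse rule text, not prose

# A chain traces one need all the way down to an output.
Chain = list[Glyph]

auth_chain: Chain = [
    Glyph("need", "users must authenticate"),
    Glyph("observation", "sessions are stored server-side"),
    Glyph("output", "POST /login sets an http-only session cookie"),
]

def render(chain: Chain) -> str:
    """Compact notation the model re-reads before and after coding;
    far fewer tokens than restating a full spec document."""
    return " -> ".join(f"[{g.kind}: {g.text}]" for g in chain)

print(render(auth_chain))
```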
I also find it interesting that it was an artist/painter who knows how to code, not a scientist or AI engineer, who came up with this idea. Sometimes an artist thinks a little differently than your average computer scientist, right?
Addy, these are the suggestions I wish I'd had two years ago when I started doing agentic coding; they are incredibly based!
I'd add one verification layer, which I call the Clarity Gate.
In practice: can a different AI, or even just a fresh session, generate functionally equivalent code from the same spec? If the answer is no, then the spec has implicit assumptions baked in, and those will cause drift.
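Here's a minimal sketch of what that gate could look like, assuming a hypothetical generate_code() wrapper around whatever model client you use, and a shared pytest suite at tests/ that imports the generated module:

```python
import pathlib
import subprocess

def generate_code(spec: str, session: str) -> str:
    """Hypothetical: run the spec through a fresh model session.
    Swap in your actual client (API call, local model, etc.)."""
    raise NotImplementedError

def passes_suite(code: str) -> bool:
    """Drop the generated code where the shared tests expect it,
    then run the behavioural test suite."""
    pathlib.Path("impl.py").write_text(code)
    return subprocess.run(["pytest", "tests/"]).returncode == 0

def clarity_gate(spec: str, sessions=("session_a", "session_b")) -> bool:
    """Pass only if independent fresh sessions produce functionally
    equivalent code (judged by the same tests) from the same spec.
    A failure means the spec has implicit assumptions baked in."""
    return all(passes_suite(generate_code(spec, s)) for s in sessions)
```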
One more thing on living documents... when code fails, I've found the problem usually originates in the spec itself. Fixing the docs usually fixes the code as well. Otherwise you're just patching symptoms while the real problem keeps generating new bugs downstream.
I’ve realized that I’ve been following a very similar workflow for a while now—without consciously naming it—and it’s yielded great results.
One extra thing I do for personal projects is to build the app incrementally and modularly by asking the LLM to generate prompts directly from the specs. Sometimes I include instructions that assume previous prompts have already been executed and the implementation is in place, or that certain tasks can be done in parallel.
I review and refine those prompts before using them, then review the resulting implementation and move on to the next set. I also document the prompts as I go, so I always know what was done and where I am in the process.
I’m curious how others think about this last step—treating prompt generation from specs as part of the workflow itself.
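For what it's worth, here's a rough sketch of that loop, with a hypothetical ask_llm() standing in for whatever model call you use:

```python
import json
import pathlib

LOG = pathlib.Path("prompt_log.jsonl")

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around your model client."""
    raise NotImplementedError

def next_prompts(spec: str, done: list[str]) -> list[str]:
    """Ask for the next batch of implementation prompts, telling the
    model which prompts are assumed to have already been executed."""
    reply = ask_llm(
        "From this spec, write the next 1-3 implementation prompts "
        f"as a JSON list. Assume these are already done: {done}\n\n{spec}"
    )
    return json.loads(reply)

def record(prompt: str, note: str = "") -> None:
    """Document each prompt as you go, so you always know what was
    done and where you are in the process."""
    with LOG.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "note": note}) + "\n")
```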
The attention budget point is critical. New research quantified it: agents skip the middle of long files, and auto-generated specs made things worse in 5 of 8 settings tested. Your plan-first approach helps because it forces agents to discover context instead of being told it. I covered the research here: https://sulat.com/p/agents-md-hurting-you
The modular prompts approach saved me weeks of frustration. I used to dump everything into one massive context block. Now I break specs into focused modules.
Your six core areas (commands, testing, structure, style, workflow, boundaries) map perfectly to how I structure agent instructions. The boundaries section especially: clarity on what NOT to do prevents so many failures.
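As a sketch of how those modules come together in practice (the file names are just my convention, not anything from the article): one focused file per area, assembled only as needed so the agent never gets one massive block.

```python
from pathlib import Path

SPEC_DIR = Path("specs")
# One module per core area.
MODULES = ["commands.md", "testing.md", "structure.md",
           "style.md", "workflow.md", "boundaries.md"]

def load_spec(modules: list[str] = MODULES) -> str:
    """Concatenate only the spec modules a given task needs."""
    return "\n\n".join((SPEC_DIR / m).read_text() for m in modules)

# A test-focused task might only need two of the six:
context = load_spec(["testing.md", "boundaries.md"])
```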
Wrote about agent orchestration patterns: https://thoughts.jock.pl/p/claude-code-vs-codex-real-comparison-2026
Do you version your specs? Finding it hard to track what changes actually improved output.
this was a genuinely useful read. thank you for putting it together.
the "curse of instructions" part hit us. we've been testing a lot of AI tools lately and noticed the same thing firsthand. the more you ask in a single prompt, the less reliably anything gets done. it sounds obvious when you say it out loud, but so many people (us included) still fall into that trap.
the three-tier boundary system (always / ask first / never) is probably the most practical takeaway here. it's simple enough to actually use, which is what matters. a lot of "best practices" sound great in theory but nobody applies them because they're too complicated. this one sticks.
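for what it's worth, here's how the three tiers might look encoded as a policy table (the action names are made up; the tiers are the point):

```python
from enum import Enum

class Boundary(Enum):
    ALWAYS = "always"        # just do it
    ASK_FIRST = "ask_first"  # pause for human approval
    NEVER = "never"          # refuse outright

# hypothetical policy table; the real entries depend on your project
POLICY = {
    "run_tests": Boundary.ALWAYS,
    "edit_source": Boundary.ALWAYS,
    "modify_ci_config": Boundary.ASK_FIRST,
    "delete_files": Boundary.ASK_FIRST,
    "push_to_main": Boundary.NEVER,
    "rotate_secrets": Boundary.NEVER,
}

def check(action: str) -> Boundary:
    # unknown actions default to the safest interactive tier
    return POLICY.get(action, Boundary.ASK_FIRST)
```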
one thing we've been thinking about: most of this advice applies beyond coding agents too. if you're using AI to help with content, research, or even CRM workflows, the same principles hold. give it focused context. don't dump everything at once. check the output. iterate, and then iterate again.
the Willison quote about AI agents feeling like "managing an intern" is painfully accurate. you really do have to spell things out. and that's okay. it just means the spec (or whatever you call your instructions) is doing the real work. the AI is just executing.
appreciate you sharing this. bookmarked.
Everyone will have to get increasingly better at conveying a system in plain English to models.
The best specs I've seen for AI agents treat them like junior developers who are brilliant but have zero context about your business. They need explicit constraints, not just goals. What if we wrote specs that assumed the AI would interpret everything in the worst possible way, then worked backwards from there?