I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.
LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.
Sub-agents also help a lot in that regard. Have one agent do the planning, an implementation agent write the code, and another one do the review. Clear responsibilities help a lot.
There's also the blue team / red team split, which works.
The idea is always the same: help the LLM reason properly with fewer, clearer instructions.
A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.
All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.
If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the cost per use goes up, which can trigger aftershocks.
Context pollution, I think. Just because something is sequential in a context file doesn’t mean it’ll happen sequentially, but if you use subagents there is a separation of concerns. I also feel like one bloated context window feels a little sloppy in the execution (and costs more in tokens).
Really? My experience has been that it’s incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I’ve even noticed what it’s done. I have a small Rust project that stores stuff on disk that I wanted to add an S3 backend to - Claude Code burned through my $20 in a loop in about 30 minutes, without any awareness of what it was doing, on a very simple syntax issue.
> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.
This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.
It's the attention mechanism at work, along with a fair bit of Internet one-up-manship. The LLM has ingested all of the text on the Internet, as well as Github code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.
Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
I don’t think this is a result of the base training data ("the internet"). It’s a post-training behavior, created during reinforcement learning. Codex has a totally different behavior in that regard. Codex by default reads a lot of potentially relevant files before it goes and writes files.
Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company-specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.
Modern LLMs are certainly fine-tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only using the toolset they expect to see in their harnesses.
IDK the current state, but I remember that, last year, the open source coding harnesses needed to provide exactly the tools that the LLM expected, or the error rate went through the roof. Some, like Grok and Gemini, only recently managed to make tool calls somewhat reliable.
Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.
The router that routes the tokens between the "experts" is part of the training itself as well. The name MoE is really not a good acronym, as it makes people believe the split happens at a coarser level and that each of the experts is somehow trained on a different corpus, etc. But what do I know; there are new archs every week and someone might have done an MoE differently.
Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they all can handle each token but some are better positioned to do so.
I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task.
You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.
These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.
Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.
Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you’re just suffering from magical thinking that your incantations had any effect on the random variable word machine.
The thing is, you could actually prove it: it’s an optimization problem, you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those who have tried have decided to keep their magic spells secret, or because it doesn’t really work.
If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
> If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."
So I'm 90% sure this is already happening on some level.
But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert. "
Today’s LLMs have had a tonne of deep RL using git histories from more software projects than you’ve ever even heard of. Given the latency of a response, I doubt there’s any intermediate preprocessing; it’s just what the model has been trained to do.
I suppose we will just have to write an English-to-pedantry compiler.
A common technique is to prompt your chosen AI to write a longer prompt that gets it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.
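A rough sketch of the idea, assuming the OpenAI Python SDK; the model name and system wording are placeholders, not a recommended setup:

```python
# "Prompt enhancing": ask the model to expand a terse prompt into a detailed one,
# then use the expanded prompt for the actual generation step.
# Assumes the OpenAI Python SDK; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def enhance_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Rewrite the user's prompt into a detailed, "
                                          "specific prompt. Return only the rewritten prompt."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

detailed = enhance_prompt("a cozy cabin in the woods at dusk")
# `detailed` is then fed to the image (or code) model as the real prompt.
```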
I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".
This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.
It's not until someone actually takes the time to evaluate some of these memes that they find little to no practical value in them.[1]
> This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
Oh, the blasphemy!
So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)
The superstitious bits are more like people thinking that code goes faster if they use different variable names while programming in the same language.
And the horror is, once in a long while it is true. E.g. where perverse incentives cause an optimizing compiler vendor to inject special cases.
It's a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things; we just pray the prompt moves the probabilities the right way enough that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.
Now? We have AGENTS.md files that look like a parent talking to a child, with all the bold, all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running.
(1 Outside of some core ML developers at the big model companies)
Yep, with Claude, saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap around like a slave gollum, and it will do exactly what you tell it to do, no more, no less.
Speculation only obviously: highly-charged conversations cause the discussion to be channelled to general human mitigation techniques and for the 'thinking agent' to be diverted to continuations from text concerned with the general human emotional experience.
If you think about where in the training data there is positivity vs negativity, it really becomes equivalent to having a positive or negative mindset regarding one's standing and outcome in life.
I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore.
For a while (maybe a year ago?) it seemed like verbal abuse was the best way to make Claude pay attention.
In my head, it was impacting how important it deemed the instruction. And it definitely did seem that way.
i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness
sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases?
in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities.... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.
all in all llms really do introduce quite a bit of a black box. lots of benefits, but a ton of unknowns, and one must be hypervigilant about the possible pitfalls of these things... but more importantly be self-aware enough to understand the possible pitfalls they introduce to the person using them. they really do capitalize, possibly dangerously, on everyone's innate need to be a valued contributor. it's really common now to see so many people biting off more than they can chew, oftentimes lacking the foundations that would normally have had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state not to have any high expectations, because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md.
these things are, for better or for worse, great at causing people to barrel forward through 'problem solving', which presents quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, while others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what they are building. i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky.
i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired; that is a very dangerous path to tread as they bodyslam into place solutions that are not well understood. suddenly all the decisions leadership at many companies have made to once again bring back a before-times era of offshoring work make sense: they think that with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings, and also no one will say no.
It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.
The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.
I'm not being sarcastic. This is absolutely incredible.
And I've been at this long enough to go through that whole progression, actually from the earlier step of writing machine code. It's been, and continues to be, a fun journey, which is why I'm still working.
Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!
That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.
Without something quantifiable it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.
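To be fair, this is measurable in principle. Something like the sketch below is the bare minimum I'd want to see, where `generate` and `passes_tests` are hypothetical stand-ins for the model call and an automated check (e.g. running a test suite against the generated code):

```python
# Sketch of quantifying a prompt tweak: run each prompt variant n times against
# the same task and compare pass rates. `generate` and `passes_tests` are
# hypothetical callables supplied by whoever runs the experiment.
from statistics import mean
from typing import Callable

def compare_prompts(
    variants: list[str],
    task: str,
    generate: Callable[[str], str],       # model call (hypothetical)
    passes_tests: Callable[[str], bool],  # automated pass/fail check (hypothetical)
    n: int = 30,
) -> dict[str, float]:
    """Return the pass rate of each prompt variant over n runs."""
    return {
        variant: mean(
            1.0 if passes_tests(generate(variant + "\n\n" + task)) else 0.0
            for _ in range(n)
        )
        for variant in variants
    }
```

Until someone publishes numbers from something like that, it's all jersey-wearing.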
If you read the transformer paper, or get any book on NLP, you will see that this is not magic incantation; it's purely the attention mechanism at work. Or you can just ask Gemini or Claude why these prompts work.
But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works.
If you think like a hammer, everything will look like a nail.
I know why it works, to varying and unmeasurable degrees of success. Just like if I poke a bull with a sharp stick, I know it's going to get its attention. It might choose to run away from me in any number of directions, or it might decide to turn around and gore me to death. I can't answer that question with any more certainty than you can.
The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.
The system isn't randomly non-deterministic; it is statistically probabilistic.
The next-token prediction and the attention mechanism are actually rigorous, deterministic mathematical processes. The variation in output comes from how we sample from that curve and the temperature used to calibrate the model. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.
Yes, it's a departure from the fully deterministic systems we're used to. But that's no different than many real-world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this on right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.
A lot of words to say that for all intents and purposes... it's nondeterministic.
> Yes, it's a departure from the fully deterministic systems we're used to.
A system either produces the same output given the same input[1], or doesn't.
LLMs are nondeterministic by design. Sure, you can configure them with a zero temperature, a static seed, and so on, but they're of no use to anyone in that configuration. The nondeterminism is what gives them the illusion of "creativity", and other useful properties.
Classical computers, compilers, and programming languages are deterministic by design, even if they do contain complex logic that may affect their output in unpredictable ways. There's a world of difference.
[1]: Barring misbehavior due to malfunction, corruption or freak events of nature (cosmic rays, etc.).
But we can predict the outcomes, though. That's what we're saying, and it's true. Maybe not 100% of the time, but maybe it helps a significant amount of the time and that's what matters.
Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.
Do you actively use LLMs to do semi-complex coding work? Because if not, it will sound mumbo-jumbo to you. Everyone else can nod along and read on, as they’ve experienced all of it first hand.
You've missed the point. This isn't engineering, it's gambling.
You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).
That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.
I think that's a pretty bold claim, that it'd be different every time. I'd think the output would converge on a small set of functionally equivalent designs, given sufficiently rigorous requirements.
And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?
> You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time
This is more of an implementation detail, done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) returns a probability distribution; if you sample from it with a pseudorandom generator using a fixed seed and call it recursively, it will always return the same output for the same input.
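A small illustration of that point with a local model via Hugging Face transformers ("gpt2" is used purely as a tiny example): with fixed weights and greedy decoding, the same prompt produces the same output on every run.

```python
# With fixed weights and greedy (no-sampling) decoding, generation is a pure
# function of the input. Uses Hugging Face transformers; "gpt2" is only a small
# example model, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
out1 = model.generate(**inputs, do_sample=False, max_new_tokens=20)
out2 = model.generate(**inputs, do_sample=False, max_new_tokens=20)
assert tok.decode(out1[0]) == tok.decode(out2[0])  # identical output, run after run
```

The sampling temperature and seed are where the apparent randomness comes in, not the network itself.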
think of the latent space inside the model like a topographic map: when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though: that's nice per-token, but the signal gets messed up by picking a token from a distribution, so with each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region you want to be in makes it less likely that those distortions will kick it out of the basin or valley you want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
Hah! Reading this, my mind inverted it a bit, and I realized... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines... I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.
i literally suggested this metaphor earlier yesterday to someone trying to get agents to do stuff they wanted, that they had to set up their guardrails in a way that you can let the agents do what they're good at, and you'll get better results because you're not sitting there looking at them.
i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.
It's very logical and pretty obvious when you do code generation. If you ask the same model to generate code, starting with:
- You are a Python Developer...
or
- You are a Professional Python Developer...
or
- You are one of the World's most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable production quality code...
You will notice a clear improvement in the quality of the generated artifacts.
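Easy enough to check informally: hold the task constant and vary only the persona line. A quick sketch, assuming the Anthropic Python SDK and an illustrative model name:

```python
# Informal persona A/B check: same task, three different system prompts.
# Assumes the Anthropic Python SDK; the model name is illustrative only.
import anthropic

client = anthropic.Anthropic()
task = "Write a function that parses ISO 8601 timestamps, with error handling."

personas = [
    "You are a Python Developer.",
    "You are a Professional Python Developer.",
    "You are one of the World's most renowned Python Experts, with several books "
    "written on the subject, and 15 years of experience in creating highly "
    "reliable production quality code.",
]

for system in personas:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": task}],
    )
    print(f"--- {system}\n{msg.content[0].text}\n")
```

Compare the three outputs side by side and judge for yourself.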
Do you think that Anthropic don’t include things like this in their harness / system prompts? I feel like this kind of prompt is unnecessary from Opus 4.5 onwards, obviously based on my own experience (I used to do this; since switching to Opus I’ve stopped, and have implemented more complex problems, more successfully).
I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.
I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).
Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.
If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.
To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.
Why do you think that? Given how attention and optimization work during training and inference, it makes sense that these kinds of words trigger deeper analysis (more steps, more thinking/reasoning steps), which does indeed yield fewer problems. Even if you just make the model spend more time outputting tokens, you create more opportunity for better reasoning to emerge in between.
At least this is how I understand LLMs to work.
The LLM will do what you ask it to, provided you get nuanced about it. Myself and others have noticed that LLMs work better when your codebase is not full of code smells like massive god-class files: if your codebase is discrete and broken up in a way that makes sense, and fits in your head, it will fit in the model's head.
Maybe the training data that included the words like "skim" also provided shallower analysis than training that was close to the words "in great detail", so the LLM is just reproducing those respective words distribution when prompted with directions to do either.
It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles.
The disconnect might be that there is a separation between "generating the final answer for the user" and "researching/thinking to get information needed for that answer". Saying "deeply" prompts it to read more of the file (as in, actually use the `read` tool to grab more parts of the file into context), and generate more "thinking" tokens (as in, tokens that are not shown to the user but that the model writes to refine its thoughts and improve the quality of its answer).
It is as the author said: it'll skim the content unless prompted otherwise. It can read partial file fragments; it can emit commands to search for patterns in the files, as opposed to carefully reading each file and reasoning through the implementation. By asking it to go through in detail you are telling it not to take shortcuts and to actually read the code in full.
The author is referring to how the framing of your prompt informs the attention mechanism. You are essentially hinting to the attention mechanism that the function's implementation details have important context as well.
Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!
In image generation, it's fairly common to add "masterpiece", for example.
I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?
I mean, certainly Anthropic has tried hard to make the former the case, but the titanic inertia from internet-scale data bias is hard to overcome. You can help the model with these hints.
Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!
Its effectiveness is even more apparent with older smaller LLMs, people who interact with LLMs now never tried to wrangle llama2-13b into pretending to be a dungeon master...
Strings of tokens are vectors. Vectors are directions. When you use a phrase like that you are orienting the vector of the overall prompt toward the direction of depth, in its map of conceptual space.
One of the well defined failure modes for AI agents/models is "laziness." Yes, models can be "lazy" and that is an actual term used when reviewing them.
I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.
I've encountered this failure mode, and the opposite of it: thinking too much. A behaviour I've come to see as some sort of pseudo-neuroticism.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
yeah, i experienced this the other day when asking claude code to build an http proxy using afsk modem software to communicate over the computer's sound card. it had an absolute fit tuning the system and would loop for hours, trying and doubling back. eventually, after some change in prompt direction to think more deeply and test more comprehensively, it figured it out. i certainly had no idea how to build an afsk modem.
I go a bit further than this and have had great success with 3 doc types and 2 skills:
- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.
- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.
- Working memory files: I use both a status.md (3-5 items per phase, roughly 30 lines overall), which points to the latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean.
- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.
- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.
I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.
And I'm only on the $20 plans for Codex/Gemini vs. spending $100/month on CC for the half year prior, and I move quicker w/ no stall-outs due to token consumption, which was regularly happening w/ CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped from what I experienced with CC and only using planning mode.
This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much)
I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.
I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.
Looks good. Question - is it always better to use a monorepo in this new AI world? Vs breaking your app into separate repos? At my company we have like 6 repos all separate nextjs apps for the same user base. Trying to consolidate to one as it should make life easier overall.
It really depends but there’s nothing stopping you from just creating a separate folder with the cloned repositories (or worktrees) that you need and having a root CLAUDE.md file that explains the directory structure and referencing the individual repo CLAUDE.md files.
I actually don't really like a few things about this approach.
First, the "big bang" write it all at once. You are going to end up with thousands of lines of code that were monolithically produced. I think it is much better to have it write the plan and formulate it as sensible technical steps that can be completed one at a time. Then you can work through them. I get that this is not very "vibe"ish but that is kind of the point. I want the AI to help me get to the same point I would be at with produced code AND understanding of it, just accelerate that process. I'm not really interested in just generating thousands of lines of code that nobody understands.
Second, the author keeps referring to adjusting the behaviour, but never incorporating that into long-lived guidance. To me, integral to the planning process is building an overarching knowledge base. Every time you tell it there's something wrong, you need to tell it to update the knowledge base about why, so it doesn't do it again.
Finally, no mention of tests? Just quick checks? To me, you have to end up with comprehensive tests. Maybe to the author it goes without saying, but I find it is integral to build this into the planning. At certain stages you will want certain types of tests: sometimes in advance of the code (so TDD style), other times built alongside it or after.
It's definitely going to be interesting to see how software methodology evolves to incorporate AI support and where it ultimately lands.
The article's approach matches mine, but I've learned from exactly the things you're pointing out.
I get the PLAN.md (or equivalent) separated into "phases" or stages, then carefully prompt it (because Claude and Codex both love to "keep going") to implement only that stage and update the PLAN.md.
Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer...
I don’t use plan.md docs either, but I recognise the underlying idea: you need a way to keep agent output constrained by reality.
My workflow is more like scaffold -> thin vertical slices -> machine-checkable semantics -> repeat.
Concrete example: I built and shipped a live ticketing system for my club (Kolibri Tickets). It’s not a toy: real payments (Stripe), email delivery, ticket verification at the door, frontend + backend, migrations, idempotency edges, etc. It’s running and taking money.
The reason this works with AI isn’t that the model “codes fast”. It’s that the workflow moves the bottleneck from “typing” to “verification”, and then engineers the verification loop:
- keep the spine runnable early (end-to-end scaffold)
- add one thin slice at a time (don’t let it touch 15 files speculatively)
- force checkable artifacts (tests/fixtures/types/state-machine semantics where it matters)
- treat refactors as normal, because the harness makes them safe
If you run it open-loop (prompt -> giant diff -> read/debug), you get the “illusion of velocity” people complain about. If you run it closed-loop (scaffold + constraints + verifiers), you can actually ship faster because you’re not paying the integration cost repeatedly.
Plan docs are one way to create shared state and prevent drift. A runnable scaffold + verification harness is another.
Now that code is cheap, I ensured my side project has unit/integration tests (will enforce 100% coverage), Playwright tests, static typing (it's in Python), and scripts for all tasks. Will learn mutation testing too (yes, it's overkill). Now my agent works up to 1 hour in loops and emits concise code I don't have to edit much.
Totally get it, and I think we’re describing the same control loop from different angles.
Where I differ slightly is: “100% coverage” can turn into productivity theatre. It’s a metric that’s easy to optimize while missing the thing you actually care about: do we have machine-checkable invariants at the points where drift is expensive?
The harness that’s paid off for me (on a live payments system) is:
- thin vertical slice first (end-to-end runnable, even if ugly)
- tests at the seams (payments, emails, ticket verification / idempotency)
- state-machine semantics where concurrency/ordering matters
- unit tests as supporting beams, not wallpaper
Then refactors become routine, because the tests will make breakage explicit.
So yes: “code is cheap” -> increase verification. Just be careful not to replace engineering judgement with an easily gamed proxy.
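To make "tests at the seams" concrete, this is roughly the shape of an idempotency check I mean; every name here (`issue_ticket`, the `db` fixture, etc.) is a hypothetical stand-in rather than my actual code:

```python
# Seam test sketch: issuing a ticket for the same paid order twice (e.g. a
# retried payment webhook) must not create a duplicate. All names are
# hypothetical stand-ins for your own domain code.
def test_ticket_issuance_is_idempotent(db):
    order = db.create_order(email="a@example.com", status="paid")

    first = issue_ticket(db, order.id)
    second = issue_ticket(db, order.id)  # simulate a retried webhook delivery

    assert first.ticket_id == second.ticket_id        # same ticket comes back
    assert db.count_tickets(order_id=order.id) == 1   # and no duplicate rows
```

A handful of tests like that at the money-moving seams buys far more safety than chasing a coverage number.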
> the workflow I’ve settled into is radically different from what most people do with AI coding tools
This looks exactly like what anthropic recommends as the best practice for using Claude Code. Textbook.
It also exposes a major downside of this approach: if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
I've found a much better approach in doing a design -> plan -> execute in batches, where the plan is no more than 1,500 lines, used as a proxy for complexity.
My 30,000 LOC app has about 100,000 lines of plan behind it. Can't build something that big as a one-shot.
if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong
This is my experience too, but it's pushed me to make much smaller plans and to commit things to a feature branch far more atomically so I can revert a step to the previous commit, or bin the entire feature by going back to main. I do this far more now than I ever did when I was writing the code by hand.
This is how developers should work regardless of how the code is being developed. I think this is a small but very real way AI has actually made me a better developer (unless I stop doing it when I don't use AI... not tried that yet.)
I do this too. Relatively small changes, atomic commits with extensive reasoning in the message (keeps important context around). This is a best practice anyway, but used to be excruciatingly much effort. Now it’s easy!
Except that I’m still struggling with the LLM understanding its audience/context of its utterances. Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.
> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.
I've experienced that too. Usually when I request correction, I add something like "Include only production level comments, (not changes)". Recently I also added special instruction for this to CLAUDE.md.
It's after we come down from the Vibe coding high that we realize we still need to ship working, high-quality code. The lessons are the same, but our muscle memory has to be re-oriented. How do we create estimates when AI is involved? In what ways do we redefine the information flow between Product and Engineering?
I'm currently having Claude help me reverse engineer the wire protocol of a moderately expensive hardware device, where I have very little data about how it works. You better believe "we" do it by the book. Large, detailed plan md file laying out exactly what it will do, what it will try, what it will not try, guardrails, and so on. And a "knowledge base" md file that documents everything discovered about how the device works. Facts only. The knowledge base md file is 10x the size of the code at this point, and when I ask it to try something, I ask Claude to prove to me that our past findings support the plan.
Claude is like an intern coder-bro, eager to start crushin' it. But, you definitely can bring Claude "down to earth," have it follow actual engineering best practices, and ask it to prove to you that each step is the correct one. It requires careful, documented guardrails, and on top of it, I occasionally prompt it to show me with evidence how the previous N actions conformed to the written plan and didn't deviate.
If I were to anthropomorphize Claude, I'd say it doesn't "like" working this way--the responses I get from Claude seem to indicate impatience and a desire to "move forward and let's try it." Obviously an LLM can't be impatient and want to move fast, but its training data seem to be biased towards that.
Be careful of attention collapse. Details in a large governance file can get "forgotten" by the llm. It'll be extremely apologetic when you discover it's failed to follow some guardrails you specified, but it can still happen.
I always feel like I'm in a fever dream when I hear about AI workflows. A lot of the stuff is what I've read in software engineering books and articles.
LLMs are really eager to start coding (as interns are eager to start working), so the sentence “don’t implement yet” has to be used very often at the beginning of any project.
> Developers should work by wasting lots of time making the wrong thing?
Yes? I can't even count how many times I worked on something my company deemed was valuable only for it to be deprecated or thrown away soon after. Or, how many times I solved a problem but apparently misunderstood the specs slightly and had to redo it. Or how many times we've had to refactor our code because scope increased. In fact, the very existence of the concepts of refactoring and tech debt proves that devs often spend a lot of time making the "wrong" thing.
Is it a waste? No, it solved the problem as understood at the time. And we learned stuff along the way.
Developers should work by wasting lots of time making the wrong thing?
Yes. In fact, that's not emphatic enough: HELL YES!
More specifically, developers should experiment. They should test their hypothesis. They should try out ideas by designing a solution and creating a proof of concept, then throw that away and build a proper version based on what they learned.
If your approach to building something is to implement the first idea you have and move on, then you are going to waste so much more time later refactoring things to fix architecture that paints you into corners, reimplementing things that didn't work for future use cases, fixing edge cases that you hadn't considered, and just paying off a mountain of tech debt.
I'd actually go so far as to say that if you aren't experimenting and throwing away solutions that don't quite work then you're only amassing tech debt and you're not really building anything that will last. If it does it's through luck rather than skill.
Also, this has nothing to do with AI. Developers should be working this way even if they handcraft their artisanal code carefully in vi.
>> Developers should work by wasting lots of time making the wrong thing?
> Yes. In fact, that's not emphatic enough: HELL YES!
You do realize there is prior research, and there are well-tested solutions, for a lot of things. Instead of wasting time making the wrong thing, it is faster to do some research to see if the problem has already been solved. Experimentation is fine only after checking that the problem space is truly novel or there's not enough information around.
It is faster to iterate in your mental space and in front of a whiteboard than in code.
I've been doing this a long time and I've never had to do that, and I have delivered multiple successful products used by millions of users. Some of them were used for years after we stopped doing any sort of maintenance, with no bugs, problems, or crashes.
There are only a few software architecture patterns because there's only a few ways to solve code architecture problems.
If you're getting your initial design so wrong that you have to start again from scratch midway through, that shows a lack of experience, not insight.
You wouldn't know this, but I'm also a bit of an expert at refactoring, having saved several projects which had built up so much technical debt the original contractors ran away. I've regularly rewritten 1,000s if not 10,000s of lines into 100s of lines of code.
So it's especially galling to be told not only that somehow all code problems are unique (they almost never are), but my code is building technical debt (it's not, I solve that stuff).
Most problems are solved, and you should be using other people's solutions to solve the problems you face.
This is the way for me as well. Have a high-level master design and plan, but break it apart into phases that are manageable. One-shotting anything beyond a todo list and expecting decent quality is still a pipe dream.
> if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
You just revert what the AI agent changed and revise/iterate on the previous step - no need to start over. This can of course involve restricting the work to a smaller change so that the agent isn't overwhelmed by complexity.
100,000 lines is approx. one million words. The average person reads at 250wpm. The entire thing would take 66 hours just to read, assuming you were approaching it like a fiction book, not thinking anything over
They didn't write 100k plan lines. The LLM did (at least 99.9% of it, probably more). Writing 30k by hand would take weeks if not months. LLMs do it in an afternoon.
You don't start with 100k lines, you work in batches that are digestible. You read it once, then move on. The lines add up pretty quickly considering how fast Claude works. If you think about the difference in how many characters it takes to describe what code is doing in English, it's pretty reasonable.
I have no doubt that it does for many people. But the time/cost tradeoff is still unquestionable. I know I could create what LLMs do for me in the frontend/backend in most cases as well or better - I know that, because I've done it at work for years. But to create a somewhat complex app with lots of pages/features/APIs etc. would take me months if not a year++, since I'd be working on it only on the weekends for a few hours. Claude Code helps me out by getting me to my goal in a fraction of the time. Its superpower lies not only in doing what I know, just faster, but in doing what I don't know as well.
I yield similar benefits at work. I can wow management with LLM-assisted/vibe-coded apps. What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself. For the type of work I do, managers do not care whether I could do it better if I'd code it myself. They are amazed however that what has taken months previously, can be done in hours nowadays. And I for sure will try to reap benefits of LLMs for as long as they don't replace me rather than being idealistic and fighting against them.
> What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself.
This has been my experience. We use Miro at work for diagramming. Lots of visual people on the team, myself included. Using Miro's MCP I draft a solution to a problem and have Miro diagram it. Once we talk it through as a team, I have Claude or codex implement it from the diagram.
It works surprisingly well.
> They are amazed however that what has taken months previously, can be done in hours nowadays.
Of course they're amazed. They don't have to pay you for time saved ;)
> reap benefits of LLMs for as long as they don't replace me
> What previously would've taken a multi-man team
I think this is the part that people are worried about. Every engineer who uses LLMs says this. By definition it means that people are being replaced.
I think I justify it in that no one on my team has been replaced. But management has explicitly said "we don't want to hire more because we can already 20x ourselves with our current team +LLM." But I do acknowledge that many people ARE being replaced; not necessarily by LLMs, but certainly by other engineers using LLMs.
I'm still waiting for the multi-years success stories. Greenfield solutions are always easy (which is why we have frameworks that automate them). But maintaining solutions over years is always the true test of any technologies.
It's already telling that nothing has staying power in the LLMs world (other than the chat box). Once the limitations can no longer be hidden by the hype and the true cost is revealed, there's always a next thing to pivot to.
That's a good point. My best guess is the companies that have poor AI infrastructure will either collapse or spend a lot of resources on senior engineers to either fix or rewrite. And the ones that have good AI infrastructure will try to vibe code themselves out of whatever holes they dig themselves into, potentially spending more on tokens than head count.
Comments like these really help ground what I read online about LLMs. This matches how low performing devs at my work use AI, and their PRs are a net negative on the team. They take on tasks they aren’t equipped to handle and use LLMs to fill the gaps quickly instead of taking time to learn (which LLMs speed up!).
This is good insight, and I think honestly a sign of a poorly managed team (not an attack on you). If devs are submitting poor quality work, with or without LLM, they should be given feedback and let go if it keeps happening. It wastes other devs' time. If there is a knowledge gap, they should be proactive in trying to fill that gap, again with or without AI, not trying to build stuff they don't understand.
In my experience, LLMs are an accelerator; it merely exacerbates what already exists. If the team has poor management or codebase has poor quality code, then LLMs just make it worse. If the team has good management and communication and the codebase is well documented and has solid patterns already (again, with or without llm), then LLMs compound that. It may still take some tweaking to make it better, but less chance of slop.
Might be true for you. But there are plenty of top tier engineers who love LLMs. So it works for some. Not for others.
And of course there are shortcuts in life. Any form of progress, whether it's cars, medicine, computers, or the internet, is a shortcut. It makes life easier for a lot of people.
They write a short high level plan (let's say 200 words). The plan asks the agent to write a more detailed implementation plan (written by the LLM, let's say 2000-5000 words).
They read this plan and adjust as needed, even sending it to the agent for re-dos.
Once the implementation plan is done, they ask the agent to write the actual code changes.
Then they review that and ask for fixes, adjustments, etc.
This can be comparable to writing the code yourself but also leaves a detailed trail of what was done and why, which I basically NEVER see in human generated code.
That alone is worth gold, by itself.
And on top of that, if you're using an unknown platform or stack, it's basically a rocket ship. You bootstrap much faster. Of course, stay on top of the architecture, do controlled changes, learn about the platform as you go, etc.
I take this concept and I meta-prompt it even more.
I have a road map (AI generated, of course) for a side project I'm toying around with to experiment with LLM-driven development. I read the road map and I understand and approve it. Then, using some skills I found on skills.sh and slightly modified, my workflow is as such:
1. Brainstorm the next slice
It suggests a few items from the road map that should be worked on, with some high level methodology to implement. It asks me what the scope ought to be and what invariants ought to be considered. I ask it what tradeoffs could be, why, and what it recommends, given the product constraints. I approve a given slice of work.
NB: this is the part I learn the most from. I ask it why X process would be better than Y process given the constraints and it either corrects itself or it explains why. "Why use an outbox pattern? What other patterns could we use and why aren't they the right fit?"
2. Generate slice
After I approve what to work on next, it generates a high level overview of the slice, including files touched, saved in a MD file that is persisted. I read through the slice, ensure that it is indeed working on what I expect it to be working on, and that it's not scope creeping or undermining scope, and I approve it. It then makes a plan based off of this.
3. Generate plan
It writes a rather lengthy plan, with discrete task bullets at the top. Beneath, each step has to-dos for the llm to follow, such as generating tests, running migrations, etc, with commit messages for each step. I glance through this for any potential red flags.
4. Execute
This part is self explanatory. It reads the plan and does its thing.
I've been extremely happy with this workflow. I'll probably write a blog post about it at some point.
If you want to have some fun, experiment with this: add a step (maybe between 3 and 4):
3.5 Prove
Have the LLM demonstrate, through our current documentation and other sources of facts, that the planned action WILL work correctly, without failure. Ask it to enumerate all risks and point out how the plan mitigates each risk. I've seen on several occasions, the LLM backtrack at this step and actually come up with clever so-far unforeseen error cases.
Are you suggesting HN is now mostly bots boosting pro-AI comments? That feels like a stretch. Disagreement with your viewpoint doesn't automatically mean someone is a bot. Let's not import that reflex from Twitter.
> This is a super helpful and productive comment. I look forward to a blog post describing your process in more detail.
The average commenter doesn't write this kind of comment. Usually it's just a "can you expand/elaborate?". Extra politeness is kind of a hallmark of LLMs.
And if you look at the very neat comment it's responding to, there's a chance it's actually the opposite type, an actual human being sarcastic.
I can't tell anymore.
Edit: I've checked the comment history and it's just a regular ole human doing research :-)
Dunno. My 80k+ LOC personal life planner, with a native Android app and an e-ink display view, still one-shots most features/bugs I encounter. I just open a new instance, let it know what I want, and 5 minutes later it's done.
Both can be true. I have personally experienced both.
Some problems AI surprised me with immensely: fast, elegant, efficient solutions and problem solving. I've also experienced AI doing totally absurd things that ended up taking many times longer than if I had done it manually. Sometimes in the same project.
If you wouldn't mind sharing more about this in the future I'd love to read about it.
I've been thinking about doing something like that myself, because I'm one of those people who have tried countless apps, but there are always a couple of deal breakers that cause me to drop each one.
I figured trying to agentically develop a planner app with the exact feature set I need would be an interesting and fun experiment.
Todos, habits, goals, calendar, meals, notes, bookmarks, shopping lists, finances. More or less that, with Google Calendar integration, Garmin integration (auto-updates workout habits and weight goals), family sharing/gamification, daily/weekly reviews, AI summaries, and more. All built by just prompting Claude for feature after feature, with me writing 0 lines.
Ah, I imagined actual life planning as in asking AI what to do, I was morbidly curious.
Prompting basic notes apps is not as exciting, but I can see how people who care about that also care about it being exactly a certain way, so I think I get your excitement.
It was when I MVP'd it 3 weeks ago. Then I removed it, as I was toying with the idea of somehow monetizing it. Then I added a few features which would make monetization impossible (e.g. how the app obtains ETF/stock prices live, and some other things). I reckon I could remove those and put it on GitHub during the week if I don't forget. The quality of the web app is SaaS-grade IMO: keyboard shortcuts, cmd+k, natural language parsing, a great UI that doesn't look like it was made by AI in 5 minutes. Might post the link here.
I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.
Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.
And then I do one final pass of tweaking after that's done.
But yeah, annotations are super powerful. Token distance in-context and all that jazz.
Quarto can be used to output slides in various formats (Powerpoint, beamer for pdf, revealjs for HTML, etc.). I wonder why you use Slidev as you can just ask Claude Code to create another Quarto document.
It looks like Slidev is designed for presentations about software development, judging from its feature set. Quarto is more general-purpose. (That's not to say Quarto can't support the same features, but currently it doesn't.)
I'm not affiliated with Slidev. I was just curious.
Not yet... but also I'm not sure it makes a lot of sense to be open source. It's super specific to how I like to build slide decks and to my personal lecture style.
But it's not hard to build one. The key for me was describing, in great detail:
1. How I want it to read the source material (e.g., H1 means new section, H2 means at least one slide, a link to an example means I want code in the slide)
2. How to connect material to layouts (e.g., "comparison between two ideas should be a two-cols-title," "walkthrough of code should be two-cols with code on right," "learning objectives should be side-title align:left," "recall should be side-title align:right")
Then the workflow is:
1. Give all those details and have it do a first pass.
2. Give tons of feedback.
3. At the end of the session, ask it to "make a skill."
4. Manually edit the skill so that you're happy with the examples.
the annotation cycle in plan.md is the part that actually makes this work imo. it's not just that you're planning, it's that you can inject domain constraints that the model can't infer from the codebase alone -- stuff like "don't use X pattern here because of Y deployment constraint" or "this service has a 500ms timeout that isn't in any config file". that knowledge transfer happens naturally in code review when a human writes the code, but LLMs skip it by default.
This is quite close to what I've arrived at, but with two modifications
1) anything larger I work on in layers of docs. Architecture and requirements -> design -> implementation plan -> code. Partly it helps me think and nail the larger things first, and partly helps claude. Iterate on each level until I'm satisfied.
2) when doing reviews of each doc I sometimes restart the session and clear context, it often finds new issues and things to clear up before starting the next phase.
> Read deeply, write a plan, annotate the plan until it’s right, then let Claude execute the whole thing without stopping, checking types along the way.
As others have already noted, this workflow is exactly what the Google Antigravity agent (based on Visual Studio Code) was created for. Antigravity even includes specialized UI for a user to annotate selected portions of an LLM-generated plan before iterating on it.
One significant downside to Antigravity I have found so far is that even though it will properly infer a certain technical requirement and clearly note it in the plan it generates (for example, "this business reporting column needs to use a weighted average"), it will sometimes quietly downgrade such a specialized requirement (for example, to a non-weighted average) without even creating an appropriate "WARNING:" comment in the generated code. Especially so when the relevant codebase already includes a similar, but not exactly appropriate, API. My repeated prompts to ALWAYS ask about ANY implementation ambiguities WHATSOEVER go unheeded.
From what I gather Claude Code seems to be better than other agents at always remembering to query the user about implementation ambiguities, so maybe I will give Claude Code a shot over Antigravity.
With hooks you can achieve a similar UI that does what Antigravity does just as well or better. Search "claude code plan annotations plugin" and you'll come across some.
The idea of having the model create a plan/spec, which you then mark up with comments before execution, is a cornerstone of how the new generation of AI IDEs like Google Antigravity operate.
Claude Code also has "Planning Mode" which will do this, but in my experience its "plan" sometimes includes the full source code of several files, which kind of defeats the purpose.
The multi-pass approach works outside of code too. I run a fairly complex automation pipeline (prompt -> script -> images -> audio -> video assembly) and the single biggest quality improvement was splitting generation into discrete planning and execution phases. One-shotting a 10-step pipeline means errors compound. Having the LLM first produce a structured plan, then executing each step against that plan with validation gates between them, cut my failure rate from maybe 40% to under 10%. The planning doc also becomes a reusable artifact you can iterate on without re-running everything.
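Rough sketch of what that planning/execution split with validation gates can look like in code (step names, structure, and checks here are invented for illustration, not the commenter's actual pipeline):

```python
# Hypothetical: run a pre-approved plan step by step, validating each step's
# output before continuing, so errors stop early instead of compounding.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]       # produces artifacts for later steps
    validate: Callable[[dict], bool]  # gate: only continue if this passes

def execute_plan(steps: list[Step]) -> dict:
    artifacts: dict = {}
    for step in steps:
        artifacts.update(step.run(artifacts))
        if not step.validate(artifacts):
            raise RuntimeError(f"Validation gate failed after step: {step.name}")
    return artifacts

# Gates can be as simple as "the script compiles", "the audio file is non-empty",
# or "the image count matches the plan".
```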
* I ask the LLM for its understanding of a topic or an existing feature in code. It's not really planning; it's more like understanding the model first
* Then based on its understanding, I can decide how great or small to scope something for the LLM
* An LLM showing good understanding can deal with a big task fairly well.
* An LLM showing bad understanding still needs to be prompted to get it right
* What helps a lot is reference implementations. Either I have existing code that serves as the reference or I ask for a reference and I review.
A few folks at my work do it OP's way, but here are my arguments for not doing it this way:
* Nobody is measuring the amount of slop within the plan. We only judge the implementation at the end
* it's still non-deterministic - folks will have different experiences using OP's methods. If Claude updates its model, it outdates OP's suggestions by making them either better or worse. We don't evaluate when things get better; we only focus on things that haven't gone well.
* it's very token heavy - LLM providers insist that you use many tokens to get the task done. It's in their best interest to get you to do this. For me, LLMs should be powerful enough to understand context with minimal tokens because of the investment into model training.
Both ways get the task done, and it just comes down to my preference for now.
For me, I treat the LLM as model training + post-processing + input tokens = output tokens. I don't think this is the best way to do non-deterministic software development. For me, we're still trying to shoehorn "old" deterministic programming into a non-deterministic LLM.
> One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.
Licensing apparently means nothing.
Ripped off in the training data, ripped off in the prompt.
That is the exact passage I found so shocking - if one finds the code in an open source repo, is it really acceptable to pass it through Claude code as some sort of license filter and make it proprietary?
On the other hand, next time OSX/windows/etc is leaked, one could feed it through this very same license filter. What is sauce for the goose is sauce for the gander.
The article isn’t describing someone who learned the concept of sortable IDs and then wrote their own implementation.
It describes copying and pasting actual code from one project into a prompt so a language model can reproduce it in another project.
It’s a mechanical transformation of someone else’s copyrighted expression (their code) laundered through a statistical model instead of a human copyist.
“Mechanical” is doing some heavy lifting here. If a human does the same, reimplement the code in their own style for their particular context, it doesn’t violate copyright. Having the LLM see the original code doesn’t automatically make its output a plagiarism.
How about following the test-driven approach properly? Asking Claude Code to write tests first and implement the solution after?
Research -> Test Plan -> Write Tests -> Implementation Plan -> Write Implementation
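A minimal sketch of the "write tests first" half of that flow (module, function, and expected behaviour are invented; until the implementation step runs, pytest simply reports these as the failing "red" state):

```python
# tests/test_slugify.py - written and reviewed before any implementation exists.
# The implementation step's only job is to make these pass unchanged.
from myproject.text import slugify  # hypothetical module, created later by the agent

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Plan, then execute!") == "plan-then-execute"
```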
“The workflow I’m going to describe has one core principle: never let Claude write code until you’ve reviewed and approved a written plan.”
I’m not sure we need to be this black and white about things. Speaking from the perspective of leading a dev team, I regularly have Claude Code take a chance at code without reviewing a plan. For example, small issues that I’ve written clear details about, Claude can go to town on those. I’ve never been on a team that didn’t have too many of these types of issues to address.
And a team should have other guardrails in place that validate that code before it gets merged somewhere important.
I don’t have to review every single decision one of my teammates is going to make, even those less experienced teammates, but I do prepare teammates with the proper tools (specs, documentation, etc) so they can make a best effort first attempt. This is how I treat Claude Code in a lot of scenarios.
What I've read is that even with all the meticulous planning, the author still needed to intervene - not at the end but in the middle, because otherwise it will continue building out something wrong that's even harder to fix once it's done. It'll cost even more tokens. It's a net negative.
You might say a junior might do the same thing, but I'm not worried about that: at least the junior learned something while doing it. They could do it better next time. They know the code and can change it from the point where it broke. It's a net positive.
Unfortunately, you could argue that the model provider has also learned something, i.e. the interaction can be used as additional training data to train subsequent models.
> After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn’t have.
This is the part that seems most novel compared to what I've heard suggested before. And I have to admit I'm a bit skeptical. Would it not be better to modify what Claude has written directly, to make it correct, rather than adding the corrections as separate notes (and expecting future Claude to parse out which parts were past Claude and which parts were the operator, and handle the feedback graciously)?
At least, it seems like the intent is to do all of this in the same session, such that Claude has the context of the entire back-and-forth updating the plan. But that seems a bit unpleasant; I would think the file is there specifically to preserve context between sessions.
The whole process feels Socratic which is why I and a lot of other folks use plan annotation tools already. In my workflow I had a great desire to tell the agent what I didn’t like about the plan vs just fix it myself - because I wanted the agent to fix its own plan.
One reason why I don't do this: even I'm not immune to mistakes. When I fix it with new values or paths, for example, and the one I provided is wrong, it can worsen the future work.
Personally, I like to ask Claude one more time to update the plan file after I have given my annotations, and review it again afterwards. This ensures (from my understanding) that Claude won't treat my annotations as a different set of instructions, which would risk conflicting work.
If you've ever wanted the ability to annotate the plan more visually, try fitting Plannotator into this workflow. There is a slash command for when you use custom workflows outside of normal plan mode.
The crowd around this post shows how superficial knowledge about Claude Code is. It gets releases every day, and most of this is already built into the vanilla version. Not to mention subagents working in worktrees, memory.md, a plan you can comment on directly from the interface, subagents launched in the research phase, plus basic MCPs like LSP/IDE integration and context7 so you're not stuck at the knowledge cutoff.
When you go to YouTube and search for stuff like "7 levels of claude code", this post would be maybe a 3-4.
Oh, one more thing - quality is not consistent, so be ready for 2-3 rounds of "are you happy with the code you wrote" and for defining audit skills crafted for your application domain - for example a RODO/compliance audit, etc.
I'm using the in-built features as well, but I like the flow that I have with superpowers. You've made a lot of assumptions with your comment that are just not true (at least for me).
I find that brainstorming + (executing plans OR subagent driven development) is way more reliable than the built-in tooling.
I made no assumptions about you - I simply commented on the post replying to your comment which I liked and simply wanted to follow the point of view :)
LLM hallucinations at the macro level aren't about planning vs. not planning, as sparin9 pointed out. It's more like an architectural problem, which would be fun to fix with an overseeing system.
Has anyone found an efficient way to avoid repeating the initial codebase assessment when working with large projects?
There are several projects on GitHub that attempt to tackle context and memory limitations, but I haven’t found one that consistently works well in practice.
My current workaround is to maintain a set of Markdown files, each covering a specific subsystem or area of the application. Depending on the task, I provide only the relevant documents to Claude Code to limit the context scope. It works reasonably well, but it still feels like a manual and fragile solution.
I’m interested in more robust strategies for persistent project context or structured codebase understanding.
Whenever I build a new feature with it I end up with several plan files leftover. I ask CC to combine them all, update with what we actually ended up building and name it something sensible, then whenever I want to work on that area again it's a useful reference (including the architecture, decisions and tradeoffs, relevant files etc).
For my longer spec files, I grep the subheaders/headers (with line numbers) and show this compact representation to the LLM's context window. I also have a file that describes what each spec file is and where it's located, and I force the LLM to read that and pull the subsections it needs. I also have one entrypoint requirements file (20k tokens) that I force it to read in full before it does anything else, every line of which I wrote myself. But none of this is a silver bullet.
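For example, a small helper along these lines can produce that compact header map (paths and output format are placeholders, not the commenter's actual script):

```python
# Hypothetical: emit "line_number  heading" for every Markdown header in a spec
# file, so the LLM sees the outline and can ask for specific sections by line.
import re
import sys

def header_map(path: str) -> list[str]:
    rows = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if re.match(r"^#{1,6}\s", line):
                rows.append(f"{n:5d}  {line.rstrip()}")
    return rows

if __name__ == "__main__":
    for spec in sys.argv[1:]:
        print(f"== {spec} ==")
        print("\n".join(header_map(spec)))
```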
That sounds like the recommended approach. However, there's one more thing I often do: whenever Claude Code and I complete a task that didn't go well at first, I ask CC what it learned, and then I tell it to write down what it learned for the future. It's hard to believe how much better CC has become since I started doing that. I ask it to write dozens of unit tests and it just does. Nearly perfectly. It's insane.
Skills almost seem like a solution, but they still need an out-of-band process to keep them updated as the codebase evolves. For now, a structured workflow that includes aggressive updates at the end of the loop is what I use.
And then you have to remind it frequently to make use of the files. Happened to me so many times that I added it both to custom instructions as well as to the project memory.
Certainly the “unsupervised agent” workflows are getting a lot of attention right now, but they require a specific set of circumstances to be effective:
- clear validation loop (eg. Compile the kernel, here is gcc that does so correctly)
- ai enabled tooling (mcp / cli tool that will lint, test and provide feedback immediately)
- oversight to prevent agents going off the rails (open area of research)
- an unlimited token budget
That means that most people can't use unsupervised agents.
Not that they don't work; most people simply don't have an environment and task that is appropriate.
By comparison, anyone with cursor or claude can immediately start using this approach, or their own variant on it.
It does not require fancy tooling.
It does not require an arcane agent framework.
It works generally well across models.
This is one of those few genuine pieces of good practical advice for people getting into AI coding.
Simple. Obviously works once you start using it. No external dependencies. BYO tools to help with it, no "buy my AI startup xxx to help". No "star my GitHub so I can get a job at $AI corp too".
Honestly, this is just language models in general at the moment, and not just coding.
It’s the same reason adding a thinking step works.
You want to write a paper, you have it form a thesis and structure first. (In this one you might be better off asking for 20 and seeing if any of them are any good.) You want to research something, first you add gathering and filtering steps before synthesis.
Adding smarter words or telling it to be deeper does work by slightly repositioning where your query ends up in space.
Asking for the final product first right off the bat leads to repetitive verbose word salad. It just starts to loop back in on itself. Which is why temperature was a thing in the first place, and leads me to believe they’ve turned the temp down a bit to try and be more accurate. Add some randomness and variability to your prompts to compensate.
Absolutely. And you can also always let the agent look back at the plan to check if it is still on track and aligned.
One step I added, that works great for me, is letting it write (api-level) tests after planning and before implementation. Then I’ll do a deep review and annotation of these tests and tweak them until everything is just right.
Can you help me understand the difference between "short prompt for what I want (next)" vs medium to high complexity tasks?
What I mean is, in practice, how does one even get to a high complexity task? What does that look like? Because isn't it more common that one sees only so far ahead?
I do something very similar, also with Claude and Codex, because the workflow is controlled by me, not by the tool. But instead of plan.md I use a ticket system, basically ticket_<number>_<slug>.md files, where I let the agent create the ticket from a chat, correct and annotate it afterwards, and send it back, sometimes to a new agent instance. This workflow helps me keep track of what has been done over time in the projects I work on. Also, this approach does not need any "real" ticket system tooling/mcp/skill/whatever, since it works purely on text files.
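A tiny helper for that file-based ticket convention might look like this (the naming scheme is from the comment; everything else is an illustrative guess):

```python
# Hypothetical: write agent output into the next ticket_<number>_<slug>.md file.
import re
from pathlib import Path

TICKETS = Path("tickets")

def create_ticket(slug: str, body: str) -> Path:
    TICKETS.mkdir(exist_ok=True)
    numbers = [int(m.group(1)) for p in TICKETS.glob("ticket_*_*.md")
               if (m := re.match(r"ticket_(\d+)_", p.name))]
    number = max(numbers, default=0) + 1
    path = TICKETS / f"ticket_{number:04d}_{slug}.md"
    path.write_text(body, encoding="utf-8")
    return path

# create_ticket("add-s3-backend", plan_text) -> tickets/ticket_0001_add-s3-backend.md
```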
+1 to creating tickets by simply asking the agent to. It's worked great, and larger tasks can be broken down into smaller subtasks that could reasonably be completed in a single context window, so you rarely ever have to deal with compaction. Especially in the last few months since Claude's gotten good at dispatching agents to handle tasks if you ask it to, I can plan large changes that span multiple tickets and tell Claude to dispatch agents as needed to handle them (which it will do in parallel if they mostly touch different files), keeping the main chat relatively clean for orchestration and validation work.
I’ve begun using Gpt’y to iron out most of the planning phase to essentially bootstrap the conversation with Claude. I’m curious if others have done that.
Sometimes I find it quite difficult to form the right question. Using Gpt’y I can explore my question and often times end up asking a completely different question.
It also helps derisk hitting my usage limits with pro. I feel like I’m having richer conversations now w/ Claude but I also feel more confident in my prompts.
I've been teaching AI coding tool workshops for the past year and this planning-first approach is by far the most reliable pattern I've seen across skill levels.
The key insight that most people miss: this isn't a new workflow invented for AI - it's how good senior engineers already work. You read the code deeply, write a design doc, get buy-in, then implement. The AI just makes the implementation phase dramatically faster.
What I've found interesting is that the people who struggle most with AI coding tools are often junior devs who never developed the habit of planning before coding. They jump straight to "build me X" and get frustrated when the output is a mess. Meanwhile, engineers with 10+ years of experience who are used to writing design docs and reviewing code pick it up almost instantly - because the hard part was always the planning, not the typing.
One addition I'd make to this workflow: version your research.md and plan.md files in git alongside your code. They become incredibly valuable documentation for future maintainers (including future-you) trying to understand why certain architectural decisions were made.
I teach a lot of folks who "aren't software engineers" but are sitting in front of Jupyter all day writing code.
Covertly teaching software engineering best practices is super relevant. I've also found testing skills sorely lacking and even more important in AI driven development.
The other trick all the good ones I've worked with converged on: it's quicker to write code than to review it (if we're being thorough). Agents have some areas where they can really shine (boilerplate you should maybe have automated already being one), but most of their speed comes from passing the quality checking to your users or coworkers.
Juniors and other humans are valuable because eventually I trust them enough to not review their work. I don’t know if LLMs can ever get here for serious industries.
Shameless plug: https://beadhub.ai allows you to do exactly that, but with several agents in parallel. One of them is in the role of planner, which takes care of the source-of-truth document and the long term view. They all stay in sync with real-time chat and mail.
I try these staging-document patterns, but suspect they have 2 fundamental flaws that stem mostly from our own biases.
First, Claude evolves. The original post's work pattern evolved over 9 months, before Claude's recent step changes. It's likely Claude's present plan mode is better than this workaround, but if you stick to the workaround, you'd never know.
Second, the staging docs that represent some context - whether library skills or the current session's design and implementation plans - are not the model Claude works with. At best they are shaping it, but I've found it does ignore and forget even what's written (even when I shout with emphasis), and the overall session influences the code. (Most often this happens when a peripheral adjustment ends up populating half the context.)
Indeed the biggest benefit from the OP might be to squeeze within 1 session, omitting peripheral features and investigations at the plan stage. So the mechanism of action might be the combination of getting our own plan clear and avoiding confusing excursions. (A test for that would be to redo the session with the final plan and implementation, to see if the iteration process itself is shaping the model.)
Our bias is to believe that we're getting better at managing this thing, and that we can control and direct it. It's uncomfortable to realize you can only really influence it - much like giving direction to a junior, but they can still go off track. And even if you found a pattern that works, it might work for reasons you're not understanding -- and thus fail you eventually. So, yes, try some patterns, but always hang on to the newbie senses of wonder and terror that make you curious, alert, and experimental.
I'm going to offer a counterpoint suggestion. You need to watch Claude try to implement small features many times without planning to see where it is likely to fail. It will often make the same mistakes over and over (e.g. trying to SSH without opening a bastion, mangling special characters in the bash shell, trying to communicate with a server that self-shuts down after 10 minutes). Once you have a sense for all the repeated failure points of your workflow, then you can add them to future plan files.
An approach that's worked fairly well is asking Codex to summarize the mistakes made in a session and use the lessons learned to modify the AGENTS.md file, so future agents avoid similar errors. It also helps to audit the AGENTS.md file every once in a while to clean up/compact instructions.
This is the flow I've found myself working towards. Essentially maintaining more and more layered documentation for the LLM produces better and more consistent results. What is great here is the emphasis on the use of such documents in the planning phase. I'm feeling much more motivated to write solid documentation recently, because I know someone (the LLM) is actually going to read it! I've noticed my efforts and skill acquisition have moved sharply from app developer towards DevOps and architecture / management, but I think I'll always be grateful for the application engineering experience that I think the next wave of devs might miss out on.
I've also noted such a huge gulf between some developers describing 'prompting things into existence' and the approach described in this article. Both types seem to report success, though my experience is that the latter seems more realistic, and much more likely to produce robust code that's likely to be maintainable for long term or project critical goals.
The annotation cycle is the key insight for me. Treating the plan as a living doc you iterate on before touching any code makes a huge difference in output quality.
Experimentally, I've been using mfbt.ai [https://mfbt.ai] for roughly the same thing in a team context. It lets you collaboratively nail down the spec with AI before handing off to a coding agent via MCP.
Avoids the "everyone has a slightly different plan.md on their machine" problem. Still early days but it's been a nice fit for this kind of workflow.
I agree, and this is why I tend to use gptel in emacs for planning - the document is the conversation context, and can be edited and annotated as you like.
I've been working off and on on a vibe coded FP language and transpiler - mostly just to get more experience with Claude Code and see how it handles complex real world projects. I've settled on a very similar flow, though I use three documents: plan, context, task list. Multiple rounds of iteration when planning a feature. After completion, have a clean session do an audit to confirm that everything was implemented per the design. Then I have both Claude and CodeRabbit do code review passes before I finally do manual review. VERY heavy emphasis on tests, the project currently has 2x more test code than application code. So far it works surprisingly well. Example planning docs below -
I tried Opus 4.6 recently and it’s really good. I had ditched Claude a long time ago for Grok + Gemini + OpenCode with Chinese models. I used Grok/Gemini for planning and core files, and OpenCode for setup, running, deploying, and editing.
However, Opus made me rethink my entire workflow. Now, I do it like this:
* PRD (Product Requirements Document)
* main.py + requirements.txt + readme.md (I ask for minimal, functional, modular code that fits the main.py)
* Ask for a step-by-step ordered plan
* Ask to focus on one step at a time
The super powerful thing is that I don’t get stuck on missing accounts, keys, etc. Everything is ordered and runs smoothly. I go rapidly from idea to working product, and it’s incredibly easy to iterate if I figure out new features are required while testing. I also have GLM via OpenCode, but I mainly use it for "dumb" tasks.
Interestingly, for reasoning capabilities regarding standard logic inside the code, I found Gemini 3 Flash to be very good and relatively cheap. I don't use Claude Code for the actual coding because forcing everything via chat into a main.py encourages minimal code that's easy to skim - it gives me a clearer representation of the feature space.
Why would you use Grok at all? The one LLM that they're purposely trying to get specific output from (trying to make it "conservative"). I wouldn't want to use a project that I outright know is tainted by the owners trying to introduce bias.
I find I spend most of my time defining interfaces and putting comments down now (“// this function does x”). Then I tell it “implement function foo, as described in the doc comment” or “implement all functions that are TODO”. It’s pretty good at filling in a skeleton you’ve laid out.
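In Python terms, the kind of skeleton handed to the agent might look like this (names and behaviour are illustrative, not from the comment):

```python
# Hypothetical skeleton: interfaces and doc comments written by hand,
# bodies left as TODOs for the agent to fill in.
from dataclasses import dataclass

@dataclass
class Invoice:
    customer_id: str
    amount_cents: int

def total_outstanding(invoices: list[Invoice], customer_id: str) -> int:
    """Return the sum of amount_cents for the given customer.
    TODO: implement as described in this doc comment."""
    raise NotImplementedError

def format_cents(amount_cents: int) -> str:
    """Render cents as a dollar string, e.g. 1050 -> "$10.50".
    TODO: implement."""
    raise NotImplementedError
```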
The author is quite far on their journey but would benefit from writing simple scripts to enforce invariants in their codebase. Invariant broken? Script exits with a non-zero exit code and some output that tells the agent how to address the problem. Scripts are deterministic, run in milliseconds, and use zero tokens. Put them in husky or pre-commit, install the git hooks, and your agent won’t be able to commit without all your scripts succeeding.
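For illustration, such an invariant script can be as small as this (the specific rule, paths, and message are invented examples):

```python
# Hypothetical invariant check: fail if any module under services/ imports the
# ORM directly instead of going through the repository layer. Wired into a
# pre-commit hook, a non-zero exit blocks the commit and the printed message
# tells the agent how to fix it.
import pathlib
import re
import sys

violations = [
    str(path)
    for path in pathlib.Path("services").rglob("*.py")
    if re.search(r"^\s*(from|import)\s+sqlalchemy", path.read_text(encoding="utf-8"), flags=re.M)
]

if violations:
    print("Invariant broken: services must not import sqlalchemy directly.")
    print("Use the repository layer in repositories/ instead. Offending files:")
    print("\n".join(f"  - {v}" for v in violations))
    sys.exit(1)
```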
And “Don’t change this function signature” should be enforced not by anticipating that your coding agent “might change this function signature so we better warn it not to”, but rather via an end-to-end test that fails if the function signature is changed (because the other code that needs it not to change now has an error). That takes the author out of the loop: they no longer have to watch for the change in order to issue said correction, and can instead sip coffee while the agent observes that it caused a test failure and corrects it without intervention, probably by rolling back the function signature change and changing something else.
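Concretely, the pinned-signature idea can be a test roughly like this (module and function names are hypothetical; an ordinary end-to-end test that calls the function also does the job):

```python
# Hypothetical test: pins the public signature that other code depends on.
# If the agent changes it, this fails and it has to roll back or adapt the callers.
import inspect

from myproject.billing import charge_customer  # assumed public entry point

def test_charge_customer_signature_is_stable():
    params = list(inspect.signature(charge_customer).parameters)
    assert params == ["customer_id", "amount_cents", "idempotency_key"]
```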
Radically different? Sounds to me like the standard spec driven approach that plenty of people use.
I prefer iterative approach. LLMs give you incredible speed to try different approaches and inform your decisions. I don’t think you can ever have a perfect spec upfront, at least that’s my experience.
For smaller projects this is overkill.
Larger projects, imho, gain considerable value from BDD and an overall architecture spec, with completeness/consistency/gap analysis...
I have to give this a try. My current model for backend is the same as how author does frontend iteration. My friend does the research-plan-edit-implement loop, and there is no real difference between the quality of what I do and what he does. But I do like this just for how it serves as documentation of the thought process across AI/human, and can be added to version control. Instead of humans reviewing PRs, perhaps humans can review the research/plan document.
On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too.
> Most developers type a prompt, sometimes use plan mode, fix the errors, repeat.
> ...
> never let Claude write code until you’ve reviewed and approved a written plan
I certainly always work towards an approved plan before I let it loose on changing the code. I just assumed most people did, honestly. Admittedly, sometimes there are "phases" to the implementation (because some parts can be figured out later and it's more important to get the key parts up and running first), but each phase gets a full, reviewed plan before I tell it to go.
In fact, I just finished writing a command and instruction to tell claude that, when it presents a plan for implementation, offer me another option; to write out the current (important parts of the) context and the full plan to individual (ticket specific) md files. That way, if something goes wrong with the implementation I can tell it to read those files and "start from where they left off" in the planning.
I recently discovered GitHub speckit, which separates planning/execution into stages: specify, plan, tasks, implement. I'm finding it aligns with the OP in the level of "focus" and "attention" this gets out of Claude Code.
Speckit is worth trying as it automates what is being described here, and with Opus 4.6 it's been a kind of BC/AD moment for me.
Interesting! I feel like I'm learning to code all over again! I've only been using Claude for a little more than a month and until now I've been figuring things out on my own. Building my methodology from scratch. This is much more advanced than what I'm doing. I've been going straight to implementation, but doing one very small and limited feature at a time, describing implementation details (data structures like this, use that API here, import this library etc) verifying it manually, and having Claude fix things I don't like. I had just started getting annoyed that it would make the same (or very similar) mistake over and over again and I would have to fix it every time. This seems like it'll solve that problem I had only just identified! Neat!
Try OpenSpec and it'll do all this for you. SpecKit works too. I don't think there's a need to reinvent the wheel on this one, as this is spec-driven development.
Haha this is surprisingly and exactly how I use claude as well. Quite fascinating that we independently discovered the same workflow.
I maintain two directories: "docs/proposals" (for the research md files) and "docs/plans" (for the planning md files). For complex research files, I typically break them down into multiple planning md files so claude can implement one at a time.
A small difference in my workflow is that I use subagents during implementation to avoid context from filling up quickly.
Same here. I formalized a similar workflow for my team (oriented around feature requirement docs). I am thinking about fully productizing it and am looking for feedback - https://acai.sh
Even if the product doesn't resonate, I think I've stumbled on some ideas you might find useful.
I do think spec-driven development is where this all goes. Still making up my mind though.
Spec-driven looks very much like what the author describes. He may have some tweaks of his own but they could just as well be coded into the artifacts that something like OpenSpec produces.
This is basically long-lived specs that are used as tests to check that the product still adheres to the original idea that you wanted to implement, right?
This inspired me to finally write good old playwright tests for my website :).
This is similar to what I do. I instruct an Architect mode with a set of rules related to phased implementation and detailed code artifacts output to a report.md file. After a couple of rounds of review and usually some responses that either tie together behaviors across context, critique poor choices or correct assumptions, there is a piece of work defined for a coder LLM to perform. With the new Opus 4.6 I then select specialist agents to review the report.md, prompted with detailed insight into particular areas of the software. The feedback from these specialist agent reviews is often very good and sometimes catches things I had missed. Once all of this is done, I let the agent make the changes and move onto doing something else. I typically rename and commit the report.md files which can be useful as an alternative to git diff / commit messages etc.
This looks like an important post. What makes it special is that it operationalizes Polya's classic problem-solving recipe for the age of AI-assisted coding.
I've been running AI coding workshops for engineers transitioning from traditional development, and the research phase is consistently the part people skip — and the part that makes or breaks everything.
The failure mode the author describes (implementations that work in isolation but break the surrounding system) is exactly what I see in workshop after workshop. Engineers prompt the LLM with "add pagination to the list endpoint" and get working code that ignores the existing query builder patterns, duplicates filtering logic, or misses the caching layer entirely.
What I tell people: the research.md isn't busywork, it's your verification that the LLM actually understands the system it's about to modify. If you can't confirm the research is accurate, you have no business trusting the plan.
One thing I'd add to the author's workflow: I've found it helpful to have the LLM explicitly list what it does NOT know or is uncertain about after the research phase. This surfaces blind spots before they become bugs buried three abstraction layers deep.
The biggest roadblock to using agents to maximum effectiveness like this is the chat interface. It's convenience as detriment and convenience as distraction. I've found myself repeatedly giving into that convenience only to realize that I have wasted an hour and need to start over because the agent is just obliviously circling the solution that I thought was fully obvious from the context I gave it. Clearly these tools are exceptional at transforming inputs into outputs and, counterintuitively, not as exceptional when the inputs are constantly interleaved with the outputs like they are in chat mode.
The separation of planning and execution resonates strongly. I've been using a similar pattern when building with AI APIs — write the spec/plan in natural language first, then let the model execute against it.
One addition that's worked well for me: keeping a persistent context file that the model reads at the start of each session. Instead of re-explaining the project every time, you maintain a living document of decisions, constraints, and current state. Turns each session into a continuation rather than a cold start.
The biggest productivity gain isn't in the code generation itself — it's in reducing the re-orientation overhead between sessions.
I don't deny that AI has use cases, but boy - the workflow described is boring:
"Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. "
Does anyone think this is as epic as, say, watch the Unix archives https://www.youtube.com/watch?v=tc4ROCJYbm0 where Brian demos how pipes work; or Dennis working on C and UNIX? Or even before those, the older machines?
I am not at all saying that AI tools are all useless, but there is no real epicness. It is just autogenerated AI slop and blob. I don't really call this engineering (although I also agree that it is still engineering; I just don't like using the same word here).
> never let Claude write code until you’ve reviewed and approved a written plan.
So the junior-dev analogy is quite apt here.
I tried to read the rest of the article, but I just got angrier. I never had that feeling watching old-school legends - perhaps some of their work may be boring, but this AI-generated code... that's just some mythical random-guessing work. And none of it is "intelligent", even if it may appear to work, and may work to some extent too. This is a simulation of intelligence. If it works very well, why would any software engineer still be required? Supervising would only be necessary if the AI produces slop.
I’m a big fan of having the model create a GitHub issue directly (using the GH CLI) with the exact plan it generates, instead of creating a markdown file that will eventually get deleted. It gives me a permanent record and makes it easy to reference and close the issue once the PR is ready.
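A sketch of that step, assuming the approved plan already sits in plan.md (wrapped in Python only to stay consistent with the other examples; the gh flags used are --title and --body-file):

```python
# Hypothetical: publish the approved plan as a GitHub issue via the gh CLI.
import subprocess

def create_plan_issue(title: str, plan_path: str = "plan.md") -> None:
    subprocess.run(
        ["gh", "issue", "create", "--title", title, "--body-file", plan_path],
        check=True,  # fail loudly if gh is missing or not authenticated
    )

# create_plan_issue("Plan: add sortable IDs")
```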
Interesting approach. The separation of planning and execution is crucial, but I think there's a missing layer most people overlook: permission boundaries between the two phases.
Right now when Claude Code (or any agent) executes a plan, it typically has the same broad permissions for every step. But ideally, each execution step should only have access to the specific tools and files it needs — least privilege, applied to AI workflows.
I've been experimenting with declarative permission manifests for agent tasks. Instead of giving the agent blanket access, you define upfront what each skill can read, write, and execute. Makes the planning phase more constrained but the execution phase much safer.
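A declarative manifest of that sort could be as simple as the following (an entirely hypothetical format; this is the commenter's experiment, not an existing Claude Code feature):

```python
# Hypothetical per-step permission manifest plus a least-privilege check.
MANIFEST = {
    "write-migration": {
        "read":  ["db/schema.sql", "migrations/"],
        "write": ["migrations/"],
        "exec":  ["alembic"],
    },
    "run-tests": {
        "read":  ["src/", "tests/"],
        "write": [],            # tests must not modify the tree
        "exec":  ["pytest"],
    },
}

def allowed(step: str, action: str, target: str) -> bool:
    scopes = MANIFEST.get(step, {}).get(action, [])
    return any(target == scope or target.startswith(scope) for scope in scopes)

# allowed("run-tests", "write", "src/app.py") -> False
```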
Anyone else thinking about this from a security-first angle?
But I'm starting to have an identity crisis: am I doing it wrong, and should I use an agent to write any line of code of the product I'm working on?
Have I become a dinosaur in the blink of an eye?
Should I just let it go and accept that the job I was used to not only changed (which is fine), but now requires just driving the output of a machine, with no creative process at all?
A year ago my org brought cursor in and I was skeptical for a specific reason: it was good at breaking CI in weird ways and I keep the CI system running for my org. Constants not mapping to file names, hallucinating function names/args, etc. It was categorically sloppy. And I was annoyed that engineers weren't catching this sloppy stuff. I thought this was going to increase velocity at the expense of quality. And it kind of did.
Fast forward a year and I haven't written code in a couple of weeks, but I've shipped thousands of LOC. I'm probably the pace setter on my team for constantly improving and experimenting with my AI flow. I speak to the computer probably half the time, maybe 75% on some days. I have multiple sessions going at all times. I review all the code Claude writes, but it's usually a one-shot based on my extensive (dictated) prompts.
But to your identity crisis point, things are weird. I haven't actually produced this much code in a long time. And when I hit some milestone there are some differences between now and the before days: I don't have the sense of accomplishment that I used to get but also I don't have the mental exhaustion that I would get from really working through a solution. And so what I find is I just keep going and stacking commit after commit. It's not a bad thing, but it's fundamentally different than before and I am struggling a bit with what it means. Also to be fair I had lost my pure love of coding itself, so I am in a slightly weird spot with this, too.
What I do know is that throwing myself fully into it has secured my job for the foreseeable future, because I'm faster than I've ever been and people look to me for guidance on how they can use these tools. I think with AI adoption the tallest trees will be cut last -- or at least I'm banking on it.
Good article, but I would rephrase the core principle slightly:
Never let Claude write code until you’ve reviewed, *fully understood* and approved a written plan.
In my experience, the beginning of chaos is the point at which you trust that Claude has understood everything correctly and claims to present the very best solution. At that point, you leave the driver's seat.
I came to the exact same pattern, with one extra heuristic at the end: spin up a new claude instance after the implementation is complete and ask it to find discrepancies between the plan and the implementation.
I just use Jesse’s “superpowers” plugin. It does all of this but also steps you through the design and gives you bite sized chunks and you make architecture decisions along the way. Far better than making big changes to an already established plan.
Gemini is better at research, Claude at coding. I try to use Gemini to do all the research and write out instructions on what to do and what process to follow, then use that in Claude. Though I am mostly creating small Python scripts.
Insights are nice for new users but I’m not seeing anything too different from how anyone experienced with Claude Code would use plan mode. You can reject plans with feedback directly in the CLI.
Google Anti-Gravity has this process built in. This is essentially a cycle a developer would follow: plan/analyse - document/discuss - break down tasks/implement. We’ve been using requirements and design documents as best practice since leaving our teenage bedroom lab for the professional world. I suppose this could be seen as our coding agents coming of age.
Since the rise of AI systems I really wonder how people wrote code before. This is exactly how I planned out implementation and executed the plan. Might have been some paper notes, a ticket or a white board, buuuuut ... I don't know.
> I am not seeing the performance degradation everyone talks about after 50% context window.
I pretty much agree with that. I use long sessions and stopped trying to optimize the context size, the compaction happens but the plan keeps the details and it works for me.
I have tried using this and other workflows for a long time and had never been able to get them to work (see chat history for details).
This has changed in the last week, for 3 reasons:
1. Claude opus. It’s the first model where I haven’t had to spend more time correcting things than it would’ve taken me to just do it myself. The problem is that opus chews through tokens, which led to..
2. I upgraded my Claude plan. Previously on the regular plan I’d get about 20 mins of time before running out of tokens for the session and then needing to wait a few hours to use again. It was fine for little scripts or toy apps but not feasible for the regular dev work I do. So I upgraded to 5x. This now got me 1-2 hours per session before tokens expired. Which was better but still a frustration. Wincing at the price, I upgraded again to the 20x plan and this was the next game changer. I had plenty of spare tokens per session and at that price it felt like they were being wasted - so I ramped up my usage. Following a similar process as OP but with a plans directory with subdirectories for backlog, active and complete plans, and skills with strict rules for planning, implementing and completing plans, I now have 5-6 projects on the go. While I’m planning a feature on one the others are implementing. The strict plans and controls keep them on track and I have follow up skills for auditing quality and performance. I still haven’t hit token limits for a session but I’ve almost hit my token limit for the week so I feel like I’m getting my money’s worth. In that sense spending more has forced me to figure out how to use more.
3. The final piece of the puzzle is using opencode over Claude Code. I'm not sure why, but I just don't gel with Claude Code. Maybe it's all the sautéing and flibertygibbering, maybe it's all the permission asking, maybe it's that it doesn't show what it's doing as much as opencode. Whatever it is, it just doesn't work well for me. Opencode, on the other hand, is great. It shows what it's doing and how it's thinking, which makes it easy for me to spot when it's going off track and correct early.
Having a detailed plan, and correcting and iterating on the plan, is essential. Making Claude follow the plan is also essential - but there's a line. Too fine-grained and it's not as creative at solving problems. Too loose/high-level and it makes bad choices and goes in the wrong direction.
Is it actually making me more productive? I think it is but I’m only a week in. I’ve decided to give myself a month to see how it all works out.
I don’t intend to keep paying for the 20x plan unless I can see a path to using it to earn me at least as much back.
It isn’t slower. I use my personal ChatGPT subscriptions with Codex for almost everything at work and use my $800/month company Claude allowance only for the tricky stuff that Codex can’t figure out. It’s never application code. It’s usually some combination of app code + Docker + AWS issue with my underlying infrastructure - created with whatever IAC that I’m using for a client - Terraform/CloudFormation or the CDK.
I burned through $10 on Claude in less than an hour. I only have $36 a day at $800 a month (800/22 working days)
I use both. As I’m working, I tell each of them to update a common document with the conversation. I don’t just tell Claude the what. I tell it the why and have it document it.
I can switch back and forth and use the MD file as shared context.
Curious: what are some cases where it'd make sense to not pay for the 20x plan (which is $200/month), and provide a whopping $800/month pay-per-token allowance instead?
Who knows? It’s part of an enterprise plan. I work for a consulting company. There are a number of fallbacks, the first fallback if we are working on an internal project is just to use our internal AWS account and use Claude code with the Anthropic hosted on Bedrock.
The second fallback if it is for a customer project is to use their AWS account for development for them.
At the rate my company charges for me - my level as an American-based staff consultant (highest bill rate at the company) - they are happy to let us use Claude Code with their AWS credentials. Besides, if we are using AWS Bedrock-hosted Anthropic models, they know none of their secrets are going to Anthropic. They already have the required legal confidentiality/compliance agreements with AWS.
I agree with most of this, though I'm not sure it's radically different. I think most people who've been using CC in earnest for a while probably have a similar workflow? Prior to Claude 4 it was pretty much mandatory to define requirements and track implementation manually to manage context. It's still good, but since 4.5 release, it feels less important. CC basically works like this by default now, so unless you value the spec docs (still a good reference for Claude, but need to be maintained), you don't have to think too hard about it anymore.
The important thing is to have a conversation with Claude during the planning phase and don't just say "add this feature" and take what you get. Have a back and forth, ask questions about common patterns, best practices, performance implications, security requirements, project alignment, etc. This is a learning opportunity for you and Claude. When you think you're done, request a final review to analyze for gaps or areas of improvement. Claude will always find something, but starts to get into the weeds after a couple passes.
If you're greenfield and you have preferences about structure and style, you need to be explicit about that. Once the scaffolding is there, modern Claude will typically follow whatever examples it finds in the existing code base.
I'm not sure I agree with the "implement it all without stopping" approach and let auto-compact do its thing. I still see Claude get lazy when nearing compaction, though has gotten drastically better over the last year. Even so, I still think it's better to work in a tight loop on each stage of the implementation and preemptively compacting or restarting for the highest quality.
Not sure that the language is that important anymore either. Claude will explore existing codebase on its own at unknown resolution, but if you say "read the file" it works pretty well these days.
My suggestions to enhance this workflow:
- If you use a numbered phase/stage/task approach with checkboxes, it makes it easy to stop/resume as-needed, and discuss particular sections. Each phase should be working/testable software.
- Define a clear numbered list workflow in CLAUDE.md that loops on each task (run checks, fix issues, provide summary, etc).
- Use hooks to ensure the loop is followed.
- Update spec docs at the end of the cycle if you're keeping them. It's not uncommon for there to be some divergence during implementation and testing.
this is literally reinventing claude's planning mode, but with more steps. I think Boris doesn't realize that planning mode is actually stored in a file.
I don't really get what is different about this from how almost everyone else uses Claude Code? This is an incredibly common, if not the most common way of using it (and many other tools).
Funny how I came up with something loosely similar. Asking Codex to write a detailed plan in a markdown document, reviewing it, and asking it to implement it step by step. It works exquisitely well when it can build and test itself.
Hub and spoke documentation in planning has been absolutely essential for the way my planning was before, and it's pretty cool seeing it work so well for planning mode to build scaffolds and routing.
The post and comments all read like:
Here are my rituals to the software God. If you follow them then God gives plenty. Omit one step and the God mad. Sometimes you have to make a sacrifice but that's better for the long term.
I've been in eng for decades but never participated in forums. Is the cargo cult new?
I use Claude Code a lot. Still don't trust what's in the plan will get actually written, regardless of details. My ritual is around stronger guardrails outside of prompting. This is the new MongoDB webscale meme.
It is really fun to watch how a baby makes its first steps and also how experienced professionals rediscover what standards were telling us for 80+ years.
Is it required to tell Claude to re-read the code folder again when you come back some days later, or should we ask Claude to just pick up from the research.md file, thus saving some tokens?
I do something broadly similar. I ask for a design doc that contains an embedded todo list, broken down into phases. Looping on the design doc asking for suggestions seems to help. I'm up to about 40 design docs so far on my current project.
This all looks fine for someone who can't code, but for anyone with even a moderate amount of experience as a developer all this planning and checking and prompting and orchestrating is far more work than just writing the code yourself.
There's no winner for "least amount of code written regardless of productivity outcomes", except maybe Anthropic's bank account.
I really don't understand why there are so many comments like this.
Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
It took maybe 5-10 minutes of wall time to come up with a good plan, and then ~20-30 min for Claude to implement, test, etc.
That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature.
After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature.
All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools?
Same here. I did bounce off these tools a year ago. They just didn't work for me 60% of the time. I learned a bit in that initial experience though and walked away with some tasks ChatGPT could replace in my workflow. Mainly replacing scripts and reviewing single files or functions.
Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.
> The output is more or less what I'd be writing as a principal engineer.
I certainly hope this is not true, because then you're not competent for that role. Claude Code writes an absolutely incredible amount of unnecessary and superfluous comments, and it makes asinine mistakes like forgetting to update logic in multiple places. It'll gladly drop the entire database when changing column formats, just as an example.
Trust me I'm very impressed at the progress AI has made, and maybe we'll get to the point where everything is 100% correct all the time and better than any human could write. I'm skeptical we can get there with the LLM approach though.
The problem is LLMs are great at simple implementation, even large amounts of simple implementation, but I've never seen it develop something more than trivial correctly. The larger problem is it's very often subtly but hugely wrong. It makes bad architecture decisions, it breaks things in pursuit of fixing or implementing other things. You can tell it has no concept of the "right" way to implement something. It very obviously lacks the "senior developer insight".
Maybe you can resolve some of these with large amounts of planning or specs, but that's the point of my original comment - at what point is it easier/faster/better to just write the code yourself? You don't get a prize for writing the least amount of code when you're just writing specs instead.
This is exactly what the article is about. The tradeoff is that you have to thoroughly review the plans and iterate on them, which is tiring. But the LLM will write good code faster than you, if you tell it what good code is.
Exactly; the original commenter seems determined to write-off AI as "just not as good as me".
The original article is, to me, seemingly not that novel. Not because it's a trite example, but because I've begun to experience massive gains from following the same basic premise as the article. And I can't believe there are others who aren't using it like this.
I iterate the plan until it's seemingly deterministic, then I strip the plan of implementation, and re-write it following a TDD approach. Then I read all specs, and generate all the code to red->green the tests.
If this commenter is too good for that, then it's that attitude that'll keep him stuck. I already feel like my projects backlog is achievable, this year.
Strongly agree about the deterministic part. Even more important than a good design, the plan must not show any doubt, whether it's in the form of open questions or weasel words. 95% of the time those vague words mean I didn't think something through, and it will do something hideous in order to make the plan work
My experience has so far been similar to the root commenter - at the stage where you need to have a long cycle with planning it's just slower than doing the writing + theory building on my own.
It's an okay mental energy saver for simpler things, but for me the self review in an actual production code context is much more draining than writing is.
I guess we're seeing the split of people for whom reviewing is easy and writing is difficult and vice versa.
Several months ago, just for fun, I asked Claude (the web site, not Claude Code) to build a web page with a little animated cannon that shoots at the mouse cursor with a ballistic trajectory. It built the page in seconds, but the aim was incorrect; it always shot too low. I told it the aim was off. It still got it wrong. I prompted it several times to try to correct it, but it never got it right. In fact, the web page started to break and Claude was introducing nasty bugs.
More recently, I tried the same experiment, again with Claude. I used the exact same prompt. This time, the aim was exactly correct. Instead of spending my time trying to correct it, I was able to ask it to add features. I've spent more time writing this comment on HN than I spent optimizing this toy. https://claude.ai/public/artifacts/d7f1c13c-2423-4f03-9fc4-8...
My point is that AI-assisted coding has improved dramatically in the past few months. I don't know whether it can reason deeply about things, but it can certainly imitate a human who reasons deeply. I've never seen any technology improve at this rate.
> but I've never seen it develop something more than trivial correctly.
What are you working on? I personally haven't seen LLMs struggle with any kind of problem in months. Legacy codebase with great complexity and performance-critical code. No issue whatsoever regardless of the size of the task.
Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code?
My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well-defined functions (that probably appeared in its training data, written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes.
That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.
I find “maintainable code” the hardest bias to let go of. 15+ years of coding and design patterns are hard to let go.
But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
Specs are worth it IMO. Not because if I can spec, I could’ve coded anyway. But because I gain all the insight and capabilities of AI, while minimizing the gotchas and edge failures.
> But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
How do you square that with the idea that all the code still has to be reviewed by humans? Yourself, and your coworkers
I picture it like semiconductors; the 5nm process is so absurdly complex that operators can't just peek into the system easily. I imagine I'm just so used to hand-crafting code that I can't imagine not being able to peek in.
So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better.
Then who is responsible when (not if) that code does horrible things? We have humans to blame right now. I just don’t see it happening personally because liability and responsibility are too important
And you don’t blame humans anyways lol. Everywhere I’ve worked has had “blameless” postmortems. You don’t remove human review unless you have reasonable alternatives like high test coverage and other automated reviews.
> But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms
I don't find that LLMs are any more likely than humans to remember to update all of the places it wrote redundant functions. Generally far less likely, actually. So forgive me for treating this claim with a massive grain of salt.
Here's the rub, I can spin up multiple agents in separate shells. One is prompted to build out <feature>, following the pattern the author/OP described. Another is prompted to review the plan/changes and keep an eye out for specific things (code smells, non-scalable architecture, duplicated code, etc. etc.). And then another agent is going to get fed that review and do their own analysis. Pass that back to the original agent once it finishes.
Less time, cleaner code, and the REALLY awesome thing is that I can do this across multiple features at the same time, even across different codebases or applications.
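If it helps, the shape of that loop is roughly this (a minimal sketch, assuming the claude CLI's non-interactive -p/--print mode; the prompts and file names are made up):

    import subprocess

    def run_agent(prompt: str) -> str:
        # One non-interactive Claude Code run; swap in whatever CLI/harness you use.
        out = subprocess.run(["claude", "-p", prompt],
                             capture_output=True, text=True, check=True)
        return out.stdout

    # Builder: implement the feature from the written plan.
    run_agent("Implement the feature described in plan.md, exactly as written.")

    # Reviewer: look only for the things I care about.
    review = run_agent("Review the uncommitted changes for code smells, duplicated "
                       "code, and non-scalable architecture. List concrete issues.")

    # A second reviewer critiques the review, then the builder addresses what's left.
    analysis = run_agent("Critique this review; drop anything not worth fixing:\n" + review)
    run_agent("Address every remaining issue below, then re-run the tests:\n" + analysis)

In practice I run these in separate shells and worktrees rather than one script, but the loop is the same.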
There's comments like this because devs/"engineers" in tech are elitists that think they're special. They can't accept that a machine can do a part of their job that they thought made them special.
> In all, maybe 30-45 minutes of my actual time to add a small feature
Why would this take you multiple days to do if it only took you 30m to review the code? Depends on the problem, but if I’m able to review something the time it’d take me to write it is usually at most 2x more worst case scenario - often it’s about equal.
I say this because after having used these tools, most of the speed ups you’re describing come at the cost of me not actually understanding or thoroughly reviewing the code. And this is corroborated by any high output LLM users - you have to trust the agent if you want to go fast.
Which is fine in some cases! But for those of us who have jobs where we are personally responsible for the code, we can’t take these shortcuts.
> Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
But did you truly think about such a feature? Like the guarantees it should follow (e.g. how it should cope with entity migrations like adding a new field), or the cost of maintaining it further down the line? This looks suspiciously like a drive-by PR made on an open-source project.
> That would've taken me at least a day, maybe two.
I think those two days would have been filled with research, comparing alternatives, questions like "can we extract this feature from framework X?", discussing ownership and sharing knowledge... Jumping straight into coding was done before LLMs too, but it usually hurts the long-term viability of the project.
Adding code to a project can be done quite fast (hackathons, ...); ensuring quality is what slows things down in any well-functioning team.
I'll bite, because it does seem like something that should be quick in a well-architected codebase. What was the situation? Was there something in this codebase that was especially suited to AI-development? Large amounts of duplication perhaps?
I wanted to add audit logging for all endpoints we call, all places we call the DB, etc. across areas I haven't touched before. It would have taken me a while to track down all of the touchpoints.
Granted, I am not 100% certain that Claude didn't miss anything. I feel fairly confident that it is correct given that I had it research upfront, had multiple agents review, and it made the correct changes in the areas that I knew.
Also I'm realizing I didn't mention it included an API + UI for viewing events w/ pretty deltas
Well, someone who says logging is easy has never known the difficulty of deciding what to log. And an audit log is a different beast altogether from normal logging.
I'd find it deeply funny if the optimal vibe coding workflow continues to evolve to include more and more human oversight, and less and less agent autonomy, to the point where eventually someone makes a final breakthrough that they can save time by bypassing the LLM entirely and writing the code themselves. (Finally coming full circle.)
Researching and planning a project is a generally useful thing. This is something I've been doing for years, and I have always had great results compared to just jumping in and coding. It makes perfect sense that this transfers to LLM use.
Well it's less mental load. It's like Tesla's FSD. Am I a better driver than the FSD? For sure. But is it nice to just sit back and let it drive for a bit even if it's suboptimal and gets me there 10% slower, and maybe slightly pisses off the guy behind me? Yes, nice enough to shell out $99/mo. Code implementation takes a toll on you in the same way that driving does.
I think the method in TFA is overall less stressful for the dev. And you can always fix it up manually in the end; AI coding vs manual coding is not either-or.
Most of these AI coding articles seem to be about greenfield development.
That said, if you're on a serious team writing professional software there is still tons of value in always telling AI to plan first, unless it's a small quick task. This post just takes it a few steps further and formalizes it.
I find Cursor works much more reliably using plan mode, reviewing/revising output in markdown, then pressing build. Which isn't a ton of overhead but often leads to lots of context switching as it definitely adds more time.
Since Opus 4.5, things have changed quite a lot. I find LLMs very useful for discussing new features or ideas, and Sonnet is great for executing your plan while you grab a coffee.
I partly agree with you. But once the codebase is large enough, the changes take longer even to type in once you've figured them out.
I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes and I make my own mental notes, and get the agent to execute on it.
Agents don't get tired, they don't start fat fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised.
Finally, this allows me to stay at a higher level and not get bogged down in "right, did we do this simple thing again?", which wipes some of the context in my mind and gets tiring through the day.
Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'.
I'll never agree with a job that forces developers to use 'AI', I sometimes like to write everything by hand. But having this tool available is also very powerful.
I want to be clear, I'm not against any use of AI. It's hugely useful to save a couple of minutes of "write this specific function to do this specific thing that I could write and know exactly what it would look like". That's a great use, and I use it all the time! It's better autocomplete. Anything beyond that is pushing it - at the moment! We'll see, but spending all day writing specs and double-checking AI output is not more productive than just writing correct code yourself the first time, even if you're AI-autocompleting some of it.
For the last few days I've been working on a personal project that's been on ice for at least 6 years. Back when I first thought of the project and started implementing it, it took maybe a couple weeks to eke out some minimally working code.
This new version that I'm doing (from scratch with ChatGPT web) has a far more ambitious scope and is already at the "usable" point. Now I'm primarily solidifying things and increasing test coverage. And I've tested the key parts with IRL scenarios to validate that it's not just passing tests; the thing actually fulfills its intended function so far. Given the increased scope, I'm guessing it'd take me a few months to get to this point on my own, instead of under a week, and the quality wouldn't be where it is. Not saying I haven't had to wrangle with ChatGPT on a few bugs, but after a decent initial planning phase, my prompts now are primarily "Do it"s and "Continue"s. Would've likely already finished it if I wasn't copying things back and forth between browser and editor, and being forced to pause when I hit the message limit.
I think it comes down to "it depends". I work in a NIS2-regulated field and we're quite challenged by the fact that it means we can't give AIs any sort of real access because of the security risk. To be compliant we'd have to have the AI agent ask permission for every single thing it does, before it does it, and four-eye review it. Which is obviously never going to happen. We can discuss how badly the NIS2 four-eye requirement works in the real world another time, but considering how easy it is to break AI security, it might not be something we can actually ever use. This makes sense for some of the stuff we work on, since it could bring an entire power plant down. On the flip side, AI risks would be of little concern on a lot of our internal tools, which are basically non-regulated and unimportant enough that they can be down for a while without costing the business anything beyond annoyances.
This is where our challenges are. We've built our own chatbot where you can "build" your own agent within the librechat framework and add a "skill" to it. I say "skill" because it's older than Claude skills but does exactly the same thing. I don't completely buy the author's:
> “deeply”, “in great details”, “intricacies”, “go through everything”
bit, but you can obviously save a lot of time by writing a piece of English which tells it what sort of environment you work in. It'll know that when I write Python I use UV, Ruff and Pyrefly, and so on, as an example. I personally also have a "skill" setting that tells the AI not to compliment me, because I find that ridiculously annoying, and that certainly works. So who knows? Anyway, employees are going to want more. I've been doing some PoCs running open source models in isolation on a Raspberry Pi (we had spares because we use them in IoT projects), but it's hard to set up an isolation policy which can't be circumvented.
We'll have to figure it out though. For power-plant-critical projects we don't want to use AI. But for the web tool that allows a couple of employees to upload three Excel files from an external accountant and then generate some sort of report on them? Who cares who writes it, or even what sort of quality it's written with? The lifecycle of that tool will probably be such that it never changes until the external accountant does, and then the tool dies. Not that it would necessarily have been written with any different quality without AI... I mean... Have you seen some of the stuff we've written in the past 40 years?
There is a miscommunication happening, this entire time we all had surprisingly different ideas about what quality of work is acceptable which seems to account for differences of opinion on this stuff.
> planning and checking and prompting and orchestrating is far more work than just writing the code yourself.
This! Once I'm familiar with the codebase (which I strive to do very quickly), for most tickets I usually have a plan by the time I've read the description. I might have a couple of implementation questions, but I know where the info is located in the codebase. For things I only have a vague idea about, the whiteboard is where I go.
The nice thing with such a mental plan is that you can start with a rougher version (like a drawing sketch). If I'm starting a new UI screen, I can put in placeholder text like "Hello, world", then work on navigation. Once that's done, I can start to pull data, then add mapping functions to have a view model, ...
Each step is a verifiable milestone. Describing them is more mentally taxing than just writing the code (which is a flow state for me). Why? Because English is not fit to describe how a computer works (try describing a finite state machine, like a navigation flow, in natural language). My mental model is already aligned to code; writing the solution in natural language is asking me to be ambiguous and unclear on purpose.
this sounds... really slow. for large changes, sure, i'm investing time into planning. but such a rigid system can't possibly be as good as a flexible approach with variable amounts of planning based on complexity
I appreciate the author taking the time to share his workflow, even though I really dislike the way this article is written. My dislike stems from sentences like this one: "I’ve been using Claude Code as my primary development tool for approx 9 months, and the workflow I’ve settled into is radically different from what most people do with AI coding tools." There is nothing radically different in the way he's using it (quite the opposite), and there are so many people who have written about their workflows (which are almost exactly the same; here's just one example [1]). Apart from that, the obvious use of AI to write or edit the article makes it further indigestible: "That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing."
There's no way I'd call what I do "radically different from what most people do" myself, under any circumstances. Yet in my last cross-team discussions at work, I realized that a whole lot of people were using AI in ways I'd consider either silly or mostly ineffective. We had a team boasting "we used Amazon Q to increase our projects' unit test coverage", and a principal engineer talking about how he uses Cursor as some form of advanced auto complete.
So when I point Claude Code at a ticket, hand it read-only access to a QA environment so it can see what the database actually looks like, chat about implementation details, and then tell it to go implement the plan, running unit tests, functional tests, linters and all that, they look at me like I have three heads.
So if you ask me, explaining reasonably easy ways to get good outcomes out of Codex or Claude Code is still necessary evangelism, at least in companies that haven't spent on tools to do things like what Stripe does. There's still quite a few people out there copying and pasting from the chat window.
> We had a team boasting "we used Amazon Q to increase our projects' unit test coverage"
Well are the tests good or no? Did it help the work get done faster or more thoroughly than without?
> how he uses Cursor as some form of advanced auto complete
Is there something wrong with that? That's literally what an LLM is, why not use it directly for that purpose instead of using the wacky indirect "run autocomplete on a conversation and accompanying script of actions" thing. Not everyone wants to be an agent jockey.
I don't see what's necessarily silly or ineffective about what you described. Personally I don't find it particularly efficient to chat about and plan out all bunch of work with a robot for every task, often it's faster to just sketch out a design on a notepad and then go write code, maybe with advanced AI completion help to save keystrokes.
I agree that if you want the AI to do non-trivial amounts of work, you need to chat and plan out the work and establish a good context window. What I don't agree with is your implication that any other less-sophisticated use of AI is necessarily deficient.
> the obvious use of AI to write or edit the article makes it further indigestible: "That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing."
Any comment complaining about using AI deserves a downvote. First of all it reads like witch hunt, accusation without evidence that’s only based on some common perceptions. Secondly, whether it’s written with AI’s help or not, that particular sentence is clear, concise, and communicative. It’s much better than a lot of human written mumblings prevalent here on HN.
Anyone wants to guess if I’m using AI to help with this comment of mine?
Another approach is to spec functionality using comments and interfaces, then tell the LLM to first implement tests and finally make the tests pass. This way you also get regression safety and can inspect that it works as it should via the tests.
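A tiny Python sketch of that shape (the names are illustrative, not from any particular project):

    from abc import ABC, abstractmethod

    class RateLimiter(ABC):
        """Spec: allow at most `limit` calls per key within a rolling `window_seconds`."""

        @abstractmethod
        def allow(self, key: str) -> bool:
            """Return True if this call is allowed, False if the key is over its limit."""

    def make_limiter(limit: int, window_seconds: int) -> RateLimiter:
        raise NotImplementedError  # the LLM's job: implement this against the spec above

    # Written by hand first; the instruction to the LLM is "make these pass without editing them".
    def test_blocks_third_call_in_window():
        rl = make_limiter(limit=2, window_seconds=60)
        assert rl.allow("alice")
        assert rl.allow("alice")
        assert not rl.allow("alice")

You start red by construction, and the tests double as the regression safety mentioned above.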
Tip:
LLMs are very good at following conventions (this is actually what is happening when they write code).
If you create a .md file with a list of entries of the following structure:
# <identifier>
<description block>
<blank space>
# <identifier>
...
where an <identifier> is a stable and concise sequence of tokens that identifies some "thing", and seed it with 5 entries describing abstract stuff, the LLM will latch on and reference it. I call this a PCL (Project Concept List). I just tell it:
> consume tmp/pcl-init.md pcl.md
The pcl-init.md describes what PCL is and pcl.md is the actual list.
I have pcl.md file for each independent component in the code (logging, http, auth, etc).
This works very very well.
The LLM seems to "know" what you're talking about.
You can ask questions and give instructions like "add a PCL entry about this".
It will ask if it should add a PCL entry about xyz.
If the description block tends to have a high information-to-token ratio, it will follow that convention (which is a very good convention BTW).
However, there is a caveat. LLMs resist ambiguity about authority. So the "PCL" or whatever you want to call it, needs to be the ONE authoritative place for everything. If you have the same stuff in 3 different files, it won't work nearly as well.
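For the curious, a couple of made-up entries in that shape:

    # http-retry-policy
    All outbound HTTP calls go through one client wrapper: 3 retries, exponential
    backoff, 10s total deadline. Never retry non-idempotent requests.

    # auth-token-cache
    Access tokens are cached in-process for 5 minutes. Refresh happens in the
    wrapper, never in callers.

Short, stable identifiers plus dense descriptions is the whole trick.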
Bonus Tip:
I find long prompt input with example code fragments and thoughtful descriptions work best at getting an LLM to produce good output. But there will always be holes (resource leaks, vulnerabilities, concurrency flaws, etc). So then I update my original prompt input (keep it in a separate file PROMPT.txt as a scratch pad) to add context about those things maybe asking questions along the way to figure out how to fix the holes. Then I /rewind back to the prompt and re-enter the updated prompt. This feedback loop advances the conversation without expending tokens.
This is great. My workflow is also heading in that direction, so this is a great roadmap. I've already learned that just naively telling Claude what to do and letting it work, is a recipe for disaster and wasted time.
I'm not this structured yet, but I often start with having it analyse and explain a piece of code, so I can correct it before we move on. I also often switch to an LLM that's separate from my IDE because it tends to get confused by sprawling context.
Sorry, but I didn't get the hype with this post; isn't it what most people are doing? I want to see more posts on how to use Claude "smart" without feeding it the whole codebase and polluting the context window, and also more best practices on cost-efficient ways to use it. This workflow is clearly burning a million tokens per session; for me it's a no.
That's exactly what Cursor's "plan" mode does? It even creates md files, which seems to be the main "thing" the author discovered. Along with some cargo cult science?
How is this noteworthy other than to spark a discussion on hn? I mean I get it, but a little more substance would be nice.
There is not a lot of explanation of WHY this is better than doing the opposite (start coding and see how it goes), or of how this would apply to Codex models.
I do exactly the same; I even developed my own workflows with the Pi agent, which work really well. Here is the reason:
- Claude needs a lot more steering than other models; it's too eager to do stuff, and it does stupid things and writes terrible code without feedback.
- Claude is very good at following the plan, you can even use a much cheaper model if you have a good plan. For example I list every single file which needs edits with a short explanation.
- At the end of the plan, I have a clear picture in my head how the feature will exactly look like and I can be pretty sure the end result will be good enough (given that the model is good at following the plan).
A lot of things don't need planning at all. Simple fixes, refactoring, simple scripts, packaging, etc. Just keep it simple.
What works extremely well for me is this: let Claude Code create the plan, then turn the plan over to Codex for review, and give the response back to Claude Code. Codex is exceptionally good at doing high-level reviews and keeping an eye on the details. It will find very subtle errors and omissions. And CC is very good at quickly converting the plan into code.
This back and forth between the two agents with me steering the conversation elevates Claude Code into next level.
I don't know. I tried various methods, and this one fails to work quite a bit of the time. The problem is that the plan naturally skips some important details, or assumes some library function, but it is then taken as an instruction in the next step. And Claude can't handle ambiguity if the instruction is very detailed (e.g. if the plan asks to use a certain library, even if it is a bad fit, Claude won't know that the decision is flexible). If the instruction is less detailed, I've seen Claude willing to try multiple things, and if it keeps failing it doesn't fear reverting almost everything.
In my experience, the best scenario is that instruction and plan should be human written, and be detailed.
The author seems to think they've hit upon something revolutionary...
They've actually hit upon something that several of us have evolved to naturally.
LLM's are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
So, how do you solve that? Exactly how an experienced lead or software manager does: you have systems write it down before executing, explain things back to you, and ground all of their thinking in the code and documentation, avoiding making assumptions about code after superficial review.
When it was early ChatGPT, this meant function-level thinking and clearly described jobs. When it was Cline it meant cline rules files that forced writing architecture.md files and vibe-code.log histories, demanding grounding in research and code reading.
Maybe nine months ago, another engineer said two things to me, less than a day apart:
- "I don't understand why your clinerules file is so large. You have the LLM jumping through so many hoops and doing so much extra work. It's crazy."
- The next morning: "It's basically like a lottery. I can't get the LLM to generate what I want reliably. I just have to settle for whatever it comes up with and then try again."
These systems have to deal with minimal context, ambiguous guidance, and extreme isolation. Operate with a little empathy for the energetic interns, and they'll uncork levels of output worth fighting for. We're Software Managers now. For some of us, that's working out great.
Revolutionary or not it was very nice of the author to make time and effort to share their workflow.
For those starting out using Claude Code it gives a structured way to get things done bypassing the time/energy needed to “hit upon something that several of us have evolved to naturally”.
It's this line that I'm bristling at: "...the workflow I’ve settled into is radically different from what most people do with AI coding tools..."
Anyone who spends some time with these tools (and doesn't black out from smashing their head against their desk) is going to find substantial benefit in planning with clarity.
So, yes, I'm glad that people write things out and share. But I'd prefer that they not lead with "hey folks, I have news: we should *slice* our bread!"
For some time now, Claude Code's plan mode also writes a file with the plan that you could probably edit, etc. It's located in ~/.claude/plans/ for me. Actually, there's a whole history of plans there.
I sometimes reference some of them to build context, e.g. after a few unsuccessful tries to implement something, so that Claude doesn't try the same thing again.
> The author's post is much more than just "planning with clarity".
Not much more, though.
It introduces "research", which is the central topic of LLMs since they first arrived. I mean, LLMs coined the term "hallucination", and turned grounding into a key concept.
In the past, building up context was thought to be the right way to approach LLM-assisted coding, but that concept is dead and proven to be a mistake: like debating the best way to force a round peg through a square hole while piling up expensive prompts to try to bridge the gap. Nowadays it's widely understood that it's far more effective and way cheaper to just refactor and rearchitect apps so that their structure is unsurprising and thus grounding issues are no longer a problem.
And planning mode. Each and every single LLM-assisted coding tool built their support for planning as the central flow and one that explicitly features iterations and manual updates of their planning step. What's novel about the blog post?
> A detailed workflow that's quite different from the other posts I've seen.
Seriously? Provide context with a prompt file, prepare a plan in plan mode, and then execute the plan? You get more detailed descriptions of this if you read the introductory how-to guides of tools such as Copilot.
Making the model write a research file, then the plan and iterate on it by editing the plan file, then adding the todo list, then doing the implementation, and doing all that in a single conversation (instead of clearing contexts).
There's nothing revolutionary, but yes, it's a workflow that's quite different from other posts I've seen, and especially from Boris' thread that was mentioned which is more like a collection of tips.
> Anyone who spends some time with these tools (and doesn't black out from smashing their head against their desk) is going to find substantial benefit in planning with clarity.
That's obvious by now, and the reason why all mainstream code assistants now offer planning mode as a central feature of their products.
It was baffling to read the blogger making claims about what "most people" do when anyone using code assistants already do it. I mean, the so called frontier models are very expensive and time-consuming to run. It's a very natural pressure to make each run count. Why on earth would anyone presume people don't put some thought into those runs?
These kinds of flows have been documented in the wild for some time now. They started to pop up in the Cursor forums 2+ years ago... e.g.: https://github.com/johnpeterman72/CursorRIPER
Personally I have been using a similar flow for almost 3 years now, tailored for my needs. Everybody who uses AI for coding eventually gravitates towards a similar pattern because it works quite well (for all IDEs, CLIs, TUIs)
I don’t think it’s that big a red flag anymore. Most people use ai to rewrite or clean up content, so I’d think we should actually evaluate content for what it is rather than stop at “nah it’s ai written.”
>Most people use ai to rewrite or clean up content
I think your sentence should have been "people who use ai do so to mostly rewrite or clean up content", but even then I'd question the statistical truth behind that claim.
Personally, seeing something written by AI means that the person who wrote it did so just for looks and not for substance. Claiming to be a great author requires both penmanship and communication skills, and delegating one or either of them to a large language model inherently makes you less than that.
However, when the point is just the contents of the paragraph(s) and nothing more then I don't care who or what wrote it. An example is the result of a research, because I'd certainly won't care about the prose or effort given to write the thesis but more on the results (is this about curing cancer now and forever? If yes, no one cares if it's written with AI).
With that being said, there's still the problem that I never get anywhere close to understanding the author behind the thoughts and opinions. I believe the way someone writes hints at the way they think and act. In that sense, using LLMs to rewrite something to make it sound more professional than how you would actually talk in appropriate contexts makes it hard for me to judge someone's character, professionalism, and mannerisms. Almost feels like they're trying to mask part of themselves. Perhaps they lack confidence in their ability to sound professional and convincing?
> I don’t think it’s that big a red flag anymore. Most people use ai to rewrite or clean up content, so I’d think we should actually evaluate content for what it is rather than stop at “nah it’s ai written.”
Unfortunately, there's a lot of people trying to content-farm with LLMs; this means that whatever style they default to, is automatically suspect of being a slice of "dead internet" rather than some new human discovery.
I won't rule out the possibility that even LLMs, let alone other AI, can help with new discoveries, but they are definitely better at writing persuasively than they are at being inventive, which means I am forced to use "looks like LLM" as proxy for both "content farm" and "propaganda which may work on me", even though some percentage of this output won't even be LLM and some percentage of what is may even be both useful and novel.
I don't judge content for being AI written, I judge it for the content itself (just like with code).
However I do find the standard out-of-the-box style very grating. Call it faux-chummy linkedin corporate workslop style.
Why don't people give the llm a steer on style? Either based on your personal style or at least on a writer whose style you admire. That should be easier.
Because they think this is good writing. You can’t correct what you don’t have taste for. Most software engineers think that reading books means reading NYT non-fiction bestsellers.
> Because they think this is good writing. You can’t correct what you don’t have taste for.
I have to disagree about:
> Most software engineers think that reading books means reading NYT non-fiction bestsellers.
There's a lot of scifi and fantasy in nerd circles, too. Douglas Adams, Terry Pratchett, Vernor Vinge, Charlie Stross, Iain M Banks, Arthur C Clarke, and so on.
But simply enjoying good writing is not enough to fully get what makes writing good. Even writing is not itself enough to get such a taste: thinking of Arthur C Clarke, I've just finished 3001, and at the end Clarke gives thanks to his editors, noting his own experience as an editor meant he held a higher regard for editors than many writers seemed to. Stross has, likewise, blogged about how writing a manuscript is only the first half of writing a book, because then you need to edit the thing.
My flow is to craft the content of the article in LLM speak, and then add to context a few of my human-written blog posts, and ask it to match my writing style. Made it to #1 on HN without a single callout for “LLM speak”!
Very high chance someone that’s using Claude to write code is also using Claude to write a post from some notes. That goes beyond rewriting and cleaning up.
I use Claude Code quite a bit (one of my former interns noted that I crossed 1.8 Million lines of code submitted last year, which is... um... concerning), but I still steadfastly refuse to use AI to generate written content. There are multiple purposes for writing documents, but the most critical is the forming of coherent, comprehensible thinking. The act of putting it on paper is what crystallizes the thinking.
However, I use Claude for a few things:
1. Research buddy, having conversations about technical approaches, surveying the research landscape.
2. Document clarity and consistency evaluator. I don't take edits, but I do take notes.
3. Spelling/grammar checker. It's better at this than regular spellcheck, due to its handling of words introduced in a document (e.g., proper names) and its understanding of various writing styles (e.g., comma inside or outside of quotes, one space or two after a period?)
Every time I get into a one hour meeting to see a messy, unclear, almost certainly heavily AI generated document being presented to 12 people, I spend at least thirty seconds reminding the team that 2-3 hours saved using AI to write has cost 11+ person-hours of time having others read and discuss unclear thoughts.
I will note that some folks actually put in the time to guide AI sufficiently to write meaningfully instructive documents. The part that people miss is that the clarity of thinking, not the word count, is what is required.
If your "content" smells like AI, I'm going to use _my_ AI to condense the content for me. I'm not wasting my time on overly verbose AI "cleaned" content.
Write like a human, have a blog with an RSS feed and I'll most likely subscribe to it.
Well, real humans may read it though. Personally I much prefer real humans write real articles than all this AI generated spam-slop. On youtube this is especially annoying - they mix in real videos with fake ones. I see this when I watch animal videos - some animal behaviour is taken from older videos, then AI fake is added. My own policy is that I do not watch anything ever again from people who lie to the audience that way so I had to begin to censor away such lying channels. I'd apply the same rationale to blog authors (but I am not 100% certain it is actually AI generated; I just mention this as a safety guard).
The main issue with evaluating content for what it is is how extremely asymmetric that process has become.
Slop looks reasonable on the surface, and requires orders of magnitude more effort to evaluate than to produce. It’s produced once, but the process has to be repeated for every single reader.
Disregarding content that smells like AI becomes an extremely tempting early filtering mechanism to separate signal from noise - the reader’s time is valuable.
It is to me, because it indicates the author didn't care about the topic. The only thing they cared about was writing an "insightful" article about using LLMs. Hence this whole thing is basically LinkedIn resume-improvement slop.
Not worth interacting with, imo
Also, it's not insightful whatsoever. It's basically a retelling of other articles around the time Claude code was released to the public (March-August 2025)
If you want to write something with AI, send me your prompt. I'd rather read what you intend for it to produce rather than what it produces. If I start to believe you regularly send me AI written text, I will stop reading it. Even at work. You'll have to call me to explain what you intended to write.
And if my prompt is a 10 page wall of text that I would otherwise take the time to have the AI organize, deduplicate, summarize, and sharpen with an index, executive summary, descriptive headers, and logical sections, are you going to actually read all of that, or just whine "TL;DR"?
It's much more efficient and intentional for the writer to put the time into doing the condensing and organizing once, and review and proofread it to make sure it's what they mean, than to just lazily spam every human they want to read it with the raw prompt, so every recipient has to pay for their own AI to perform that task like a slot machine, producing random results not reviewed and approved by the author as their intended message.
Is that really how you want Hacker News discussions and your work email to be, walls of unorganized unfiltered text prompts nobody including yourself wants to take the time to read? Then step aside, hold my beer!
Or do you prefer I should call you on the phone and ramble on for hours in an unedited meandering stream of thought about what I intended to write?
Yeah but it's not. This is a complete contrivance and you're just making shit up. The prompt is much shorter than the output and you are concealing that fact. Why?
I think as humans it's very hard to abstract content from its form. So when the form is always the same boring, generic AI slop, it's really not helping the content.
And maybe writing an article or keynote slides is one of the few places we can still exercise some human creativity, especially when the core skill (programming) is almost completely in the hands of LLMs already.
> LLM's are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
Then ask your own AI to rewrite it so it doesn't trigger you into posting uninteresting, thought-stopping comments that proclaim why you didn't read the article and don't contribute to the discussion.
Agreed. The process described is much more elaborate than what I do but quite similar. I start to discuss in great details what I want to do, sometimes asking the same question to different LLMs. Then a todo list, then manual review of the code, esp. each function signature, checking if the instructions have been followed and if there are no obvious refactoring opportunities (there almost always are).
The LLM does most of the coding, yet I wouldn't call it "vibe coding" at all.
I use AWS Kiro, and its spec-driven development is exactly this. I find it really works well, as it makes me slow down and think about what I want it to do.
I’ve also found that a bigger focus on expanding my agents.md as the project rolls on has led to fewer headaches overall and more consistency (unsurprisingly). It’s the same as asking juniors to reflect on the work they’ve completed and to document important things that can help them in the future. Software Manager is a good way to put this.
AGENTS.md should mostly point to real documentation and design files that humans will also read and keep up to date. It's rare that something about a project is only of interest to AI agents.
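Something as small as this does the job (the paths are made up):

    # AGENTS.md
    - Architecture overview: docs/architecture.md
    - Coding conventions and linters: docs/conventions.md (run the lint target before committing)
    - How to run the test suite: docs/testing.md
    - Agent-only quirks (sandbox limits, tool availability) go below this line, and only those.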
I really like your analogy of LLMs as 'unreliable interns'. The shift from being a 'coder' to a 'software manager' who enforces documentation and grounding is the only way to scale these tools. Without an architecture.md or similar grounding, the context drift eventually makes the AI-generated code a liability rather than an asset. It's about moving the complexity from the syntax to the specification.
It feels like retracing the history of software project management. The post is quite waterfall-like. Writing a lot of docs and specs upfront then implementing. Another approach is to just YOLO (on a new branch) make it write up the lessons afterwards, then start a new more informed try and throw away the first. Or any other combo.
For me what works well is to ask it to write some code upfront to verify its assumptions against actual reality, not just telling it to review the sources "in detail". It gains much more from real output from the code, and that clears up wrong assumptions. Do some smaller jobs, write up md files, then plan the big thing, then execute.
'The post is quite waterfall-like. Writing a lot of docs and specs upfront then implementing' - It's only waterfall if the specs cover the entire system or app. If it's broken up into sub-systems or vertical slices, then it's much more Agile or Lean.
It makes an endless stream of assumptions. Some of them brilliant and even instructive to a degree, but most of them are unfounded and inappropriate in my experience.
Oh no, maybe the V-Model was right all along? And right-sizing increments with control stops after them. No wonder these matrix multiplications start to behave like humans; that is what we wanted them to do.
I've been doing the exact same thing for 2 months now. I wish I had gotten off my ass and written a blog post about it. I can't blame the author for gathering all the well deserved clout they are getting for it now.
Don’t worry. This advice has been going around for much more than 2 months, including links posted here as well as official advice from the major companies (OpenAI and Anthropic) themselves. The tools literally have had plan mode as a first class feature.
So you probably wouldn’t have any clout anyways, like all of the other blog posts.
I went through the blog. I started using Claude Code about 2 weeks ago and my approach is practically the same. It just felt logical. I think there are a bunch of us who have landed on this approach and most are just quietly seeing the benefits.
> LLM's are like unreliable interns with boundless energy
This isn’t directed specifically at you but the general community of SWEs: we need to stop anthropomorphizing a tool. Code agents are not human capable and scaling pattern matching will never hit that goal. That’s all hype and this is coming from someone who runs the range of daily CC usage. I’m using CC to its fullest capability while also being a good shepherd for my prod codebases.
Pretending code agents are human capable is fueling this kool-aid-drinking hype craze.
It’s pretty clear they effectively take on the roles of various software related personas. Designer, coder, architect, auditor, etc…
Pretending otherwise is counter-productive. This ship has already sailed, it is fairly clear the best way to make use of them is to pass input messages to them as if they are an agent of a person in the role.
> The author seems to think they've hit upon something revolutionary...
> They've actually hit upon something that several of us have evolved to naturally.
I agree, it looks like the author is talking about spec-driven development with extra time-consuming steps.
Copilot's plan mode also supports iterations out of the box, and you only act on a drafted plan after manually reviewing and editing it. I don't know what the blogger was proposing that ventured outside of plan mode's happy path.
If you have a big rules file you’re in the right direction but still not there. Just as with humans, the key is that your architecture should make it very difficult to break the rules by accident and still be able to compile/run with correct exit status.
My architecture is so beautifully strong that even LLMs and human juniors can’t box their way out of it.
Why would you test implementation details? Test what's delivered, not how it's delivered. The thinking portion, synthetized or not, is merely implementation.
The resulting artefact, that's what is worth testing.
Because this has never been sufficient. From things like various hard to test cases to things like readability and long term maintenance. Reading and understanding the code is more efficient and necessary for any code worth keeping around.
It's nice to have it written down in a concise form. I shared it with my team as some engineers have been struggling with AI, and I think this (just trying to one-shot without planning) could be why.
I have a different approach where I have claude write coding prompts for stages then I give the prompt to another agent. I wonder if I should write it up as a blog post
It’s worrying to me that nobody really knows how LLMs work. We create prompts with or without certain words and hope it works. That’s my perspective anyway
It's actually no different from how real software is made. Requirements come from the business side, and through an odd game of telephone get down to developers.
The team that has developers closest to the customer usually makes the better product...or has the better product/market fit.
It's the same as dealing with a human. You convey a spec for a problem and the language you use matters. You can convey the problem in (from your perspective) a clear way and you will get mixed results nonetheless. You will have to continue to refine the solution with them.
Genuinely: no one really knows how humans work either.
Add another agent review: I ask Claude to send the plan for review to Codex and fix critical and high issues, with complexity gating (no overcomplicated logic), run it in a loop, then send it to a Gemini reviewer, then maybe do a final pass with Claude; once all critical and high issues pass, the sequence is done.
It looks verbose, but it defines the requirements based on your input; when you approve them, it defines a design, and (again) when you approve that, it defines an implementation plan (a series of tasks).
I don't see how this is 'radically different' given that Claude Code literally has a planning mode.
This is my workflow as well, with the big caveat that 80% of 'work' doesn't require substantive planning, we're making relatively straight forward changes.
Edit: there is nothing fundamentally different about 'annotating offline' in an MD vs in the CLI and iterating until the plan is clear. It's a UI choice.
Spec Driven Coding with AI is very well established, so working from a plan, or spec (they can be somewhat different) is not novel.
last i checked, you can't annotate inline with planning mode. you have to type a lot to explain precisely what needs to change, and then it re-presents you with a plan (which may or may not have changed something else).
i like the idea of having an actual document because you could actually compare the before and after versions if you wanted to confirm things changed as intended when you gave feedback
Wow, I've been needing this! The one issue I’ve had with terminals is reviewing plans, and desiring the ability to provide feedback on specific plan sections in a more organized way.
I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.
LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.
Sub agent also helps a lot in that regard. Have an agent do the planning, have an implementation agent do the code and have another one do the review. Clear responsabilities helps a lot.
There also blue team / red team that works.
The idea is always the same: help LLM to reason properly with less and more clear instructions.
A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.
All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.
If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the costs per use goes up which can trigger aftershocks.
This sounds very promising. Any link to more details?
This runs counter to the advice in the fine article: one long continuous session building context.
Since the phases are sequential, what’s the benefit of a sub agent vs just sequential prompts to the same agent? Just orchestration?
Context pollution, I think. Just because something is sequential in a context file doesn’t mean it’ll happen sequentially, but if you use subagents there is a separation of concerns. I also feel like one bloated context window feels a little sloppy in the execution (and costs more in tokens).
YMMV, I’m still figuring this stuff out
It's also great to describe the full use-case flow in the instructions, so you can be sure the LLM won't do some stupid thing on its own.
> LLMs don’t usually fail at syntax?
Really? My experience has been that it’s incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I’ve even noticed what it’s done. I have a small rust project that stores stuff on disk that I wanted to add an s3 backend too - Claude code burned through my $20 in a loop in about 30 minutes without any awareness of what it was doing on a very simple syntax issue.
Except that merely surfacing them changes their behavior, like how you add that one printf() call and now your heisenbug is suddenly nonexistent
Did you just write this with ChatGPT?
I've never seen an LLM use "etc" but the rest gives a strong "it's not just X, it's Y" vibe.
I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.
> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.
This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.
It's the attention mechanism at work, along with a fair bit of Internet one-up-manship. The LLM has ingested all of the text on the Internet, as well as Github code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.
Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.
I don't think this is a result of the base training data ("the internet"). It's a post-training behavior, created during reinforcement learning. Codex has a totally different behavior in that regard: by default, Codex reads a lot of potentially relevant files before it goes and writes files.
Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company-specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.
Modern LLMs are certainly fine-tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only using the toolset they expect to see in their harnesses.
IDK the current state, but I remember that, last year, the open-source coding harnesses needed to provide exactly the tools that the LLM expected, or the error rate went through the roof. Some, like Grok and Gemini, only recently managed to make tool calls somewhat reliable.
Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.
Just a theory.
Notice that MoE isn't different experts for different types of problems. It's per token, and not really connected to problem type.
So if you send it Python code, the first token in a function can go to one expert, the second to another expert, and so on.
Can you back this up with documentation? I don't believe that this is the case.
The router that routes the tokens between the "experts" is trained along with the rest of the model. The name MoE is really not a good one, as it makes people believe the split happens at a coarser level and that each expert is somehow trained on a different corpus. But what do I know, there are new archs every week and someone might have done MoE differently.
Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they can all handle each token but some are better positioned to do so.
This is such a good explanation. Thanks
>> Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts.
This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.
A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.
I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task.
You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.
These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.
Your ignorance is my opportunity. May I ask which markets you are developing for?
"The equivalent of saying, which slot machine were you sitting at It'll make me money"
That’s because it’s superstition.
Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is, it seems no better than claiming that sacrificing your firstborn will please the sun god into giving us a bountiful harvest next year.
Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you're just suffering from magical thinking that your incantations had any effect on the random-variable word machine.
The thing is, you could actually prove it: it's an optimization problem, you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those that have tried have decided to keep their magic spells secret, or because it doesn't really work.
If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.
I actually have a prompt optimizer skill that does exactly this.
https://github.com/solatis/claude-config
It’s based entirely off academic research, and a LOT of research has been done in this area.
One of the papers you may be interested in is on "emotion prompting", e.g. "it is super important for me that you do X" etc. actually works.
“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”
https://arxiv.org/abs/2307.11760
Thanks for sharing! I've been gravitating towards this sort of workflow already - just seems like the right approach for these tools.
> If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.
"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."
So I'm 90% sure this is already happening on some level.
But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert. "
Today's LLMs have had a tonne of deep RL using git histories from more software projects than you've ever even heard of. Given the latency of a response, I doubt there's any intermediate preprocessing; it's just what the model has been trained to do.
> i suppose we will just have to write an English to pedantry compiler.
A common technique is to prompt in your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.
I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".
> That’s because it’s superstition.
This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.
It's not until someone actually takes the time to evaluate some of these memes, that they find little to no practical value in them.[1]
[1]: https://news.ycombinator.com/item?id=47034087
> This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
Oh, the blasphemy!
So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)
The superstitious bits are more like people thinking that code goes faster if they use different variable names while programming in the same language.
And the horror is, once in a long while it is true. E.g. where perverse incentives cause an optimizing compiler vendor to inject special cases.
It's a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things; we just pray the prompt moves the probabilities the right way enough that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.
Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running
(1) Outside of some core ML developers at the big model companies.
It’s like playing a fretless instrument to me.
Practice playing songs by ear and after 2 weeks, my brain has developed an inference model of where my fingers should go to hit any given pitch.
Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.
Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes.
Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!
are we the orks?
How do you feel about your current levels of dakka? </40k>
For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions.
Yep, with Claude, saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap around like a slave golem, and it will do exactly what you tell it to do, no more, no less.
this is psychotic why is this how this works lol
Speculation only obviously: highly-charged conversations cause the discussion to be channelled to general human mitigation techniques and for the 'thinking agent' to be diverted to continuations from text concerned with the general human emotional experience.
Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on.
wait seriously? lmfao
that's hilarious. i definitely treat claude like shit and i've noticed the falloff in results.
if there's a source for that i'd love to read about it.
If you think about where in the training data there is positivity vs. negativity, it really becomes equivalent to having a positive or negative mindset regarding one's standing and outcomes in life.
I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore.
For a while (maybe a year ago?) it seemed like verbal abuse was the best way to make Claude pay attention. In my head, it was impacting how important it deemed the instruction. And it definitely did seem that way.
i make claude grovel at my feet and tell me in detail why my code is better than its code
Consciousness is off the table but they absolutely respond to environmental stimulus and vibes.
See, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.
/s (??)
i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness
sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training data if implementations don't properly sanitize certain negative words/phrases?
in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities... but presents results as matter-of-fact. the number of internet posts with possible code solutions (and more) where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.
all in all llms really do introduce quite a bit of a black box. lots of benefits, but a ton of unknowns, and one must be hypervigilant about the possible pitfalls of these things... but more importantly be self-aware enough to understand the possible pitfalls that these things introduce to the person using them. they really do, possibly dangerously, capitalize on everyone's innate need to want to be a valued contributor. it's really common now to see so many people biting off more than they can chew, oftentimes lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat-out upfront about it in their readme and very plainly state not to have any high expectations, because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md.
these things are, for better or for worse, great at causing people to barrel forward through 'problem solving', which presents quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, while others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building. i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky. i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired; it's a very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the leadership decisions at many companies to once again bring back a before-times era of offshoring work make sense: they think that with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings and also no one will say no.
How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology.
Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...
It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.
I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this.
Anthropic got rid of controlling the thinking budget by parsing your prompt - now it's a setting in /config.
Nice to hear someone say it. Like what are we even doing? It's exhausting.
The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.
I'm not being sarcastic. This is absolutely incredible.
And I've been around long enough to go through that whole progression, actually from the earlier step of writing machine code. It's been, and continues to be, a fun journey, which is why I'm still working.
We have tests and benchmarks to measure it though.
Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!
That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.
Without something quantifiable it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.
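And the experiment isn't hard to set up. A rough sketch of what "quantifiable" could look like, with the two hooks left as placeholders you would wire up to your own agent and test suite (the prompt variants are just examples):

```python
# Rough sketch of an A/B prompt comparison: same task, N trials per variant,
# same automated scoring, then compare pass rates.
from statistics import mean

PROMPT_VARIANTS = {
    "plain": "Add tests to this function.",
    "magic": "You are a senior engineer. Deeply analyze this function and add thorough tests.",
}

def generate_solution(prompt: str) -> str:
    # Placeholder: call your coding agent/model here and return the produced code.
    raise NotImplementedError

def passes_checks(code: str) -> bool:
    # Placeholder: run the produced code against a fixed test suite / linter.
    raise NotImplementedError

def pass_rate(prompt: str, trials: int = 20) -> float:
    return mean(passes_checks(generate_solution(prompt)) for _ in range(trials))

for name, prompt in PROMPT_VARIANTS.items():
    print(name, pass_rate(prompt))
```

Until someone publishes numbers from something like this, with enough trials to matter, it's all anecdote.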
These coding agents are literally Language Models. The way you structure your prompting language affects the actual output.
If you read the transformer paper, or pick up any book on NLP, you will see that this is not a magic incantation; it's purely the attention mechanism at work. Or you can just ask Gemini or Claude why these prompts work.
But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works.
If you think like a hammer, everything will look like a nail.
I know why it works, to varying and unmeasurable degrees of success. Just like if I poke a bull with a sharp stick, I know it's gonna get its attention. It might choose to run away from me in one of any number of directions, or it might decide to turn around and gore me to death. I can't answer that question with any more certainty than you can.
The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.
> The system is inherently non-deterministic.
The system isn't randomly non-deterministic; it is statistically probabilistic.
The next-token prediction and the attention mechanism are actually rigorous, deterministic mathematical processes. The variation in output comes from how we sample from that distribution, and from the temperature used to calibrate the sampling. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.
Yes, it's a departure from the fully deterministic systems we're used to. But that's no different than many real-world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this on right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.
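To make the "deterministic probabilities, probabilistic sampling" point concrete, here is a toy sketch: the logits-to-probabilities step is plain arithmetic, and temperature only reshapes the distribution you sample from. The numbers are made up for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()              # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.2]      # pretend scores for three candidate next tokens
for t in (0.2, 1.0, 2.0):
    print(f"T={t}:", softmax(logits, t).round(3))
# Low temperature sharpens the distribution (nearly deterministic choice);
# high temperature flattens it (more varied samples). The probabilities
# themselves are computed the same way every time.
```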
A lot of words to say that for all intents and purposes... it's nondeterministic.
> Yes, it's a departure from the fully deterministic systems we're used to.
A system either produces the same output given the same input[1], or doesn't.
LLMs are nondeterministic by design. Sure, you can configure them with a zero temperature, a static seed, and so on, but they're of no use to anyone in that configuration. The nondeterminism is what gives them the illusion of "creativity", and other useful properties.
Classical computers, compilers, and programming languages are deterministic by design, even if they do contain complex logic that may affect their output in unpredictable ways. There's a world of difference.
[1]: Barring misbehavior due to malfunction, corruption or freak events of nature (cosmic rays, etc.).
Humans are nondeterministic.
So this is a moot point and a futile exercise in arguing semantics.
But we can predict the outcomes, though. That's what we're saying, and it's true. Maybe not 100% of the time, but maybe it helps a significant amount of the time and that's what matters.
Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.
> But we can predict the outcomes [...] Maybe not 100% of the time
So 60% of the time, it works every time.
... This fucking industry.
Do you actively use LLMs to do semi-complex coding work? Because if not, it will sound mumbo-jumbo to you. Everyone else can nod along and read on, as they’ve experienced all of it first hand.
You've missed the point. This isn't engineering, it's gambling.
You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).
That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.
I think that's a pretty bold claim, that it'd be different every time. I'd think the output would converge on a small set of functionally equivalent designs, given sufficiently rigorous requirements.
And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?
> You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time
This is more of an implementation detail, done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) returns a probability distribution; if you sample from it with a pseudorandom generator using a fixed seed and call it recursively, it will always return the same output for the same input.
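A toy sketch of that point, with a lookup table standing in for the fixed weights (obviously not a real network, but the sampling loop has the same shape):

```python
import numpy as np

# Hypothetical fixed "weights": a next-token distribution per current token.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"<end>": 1.0},
    "dog": {"<end>": 1.0},
}

def generate(seed: int) -> list:
    rng = np.random.default_rng(seed)   # fixed seed => identical draws every run
    token, out = "<start>", []
    while token != "<end>":
        dist = NEXT_TOKEN_PROBS[token]
        token = rng.choice(list(dist), p=list(dist.values()))
        out.append(str(token))
    return out

print(generate(42) == generate(42))     # True: same weights, same seed, same output
print(generate(42), generate(43))       # different seeds may take different paths
```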
these sort-of-lies might help:
think of the latent space inside the model like a topographical map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.
caveat though, that's nice per-token, but the signal gets messed up by picking a token from a distribution, so with each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region you want to be in makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.
if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.
or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
Hah! Reading this, my mind inverted it a bit, and I realized... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines, I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.
i literally suggested this metaphor earlier yesterday to someone trying to get agents to do stuff they wanted, that they had to set up their guardrails in a way that you can let the agents do what they're good at, and you'll get better results because you're not sitting there looking at them.
i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.
It's very logical and pretty obvious when you do code generation. If you ask the same model to generate code starting with:
- "You are a Python Developer..." or
- "You are a Professional Python Developer..." or
- "You are one of the World's most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable, production-quality code..."
You will notice a clear improvement in the quality of the generated artifacts.
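If you would rather check this yourself than take it on faith, the comparison is only a few lines. This sketch assumes the `anthropic` Python SDK and an API key in your environment; the model id and the task are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
TASK = "Write a Python function that parses an ISO 8601 timestamp string."

PERSONAS = [
    "You are a Python Developer.",
    "You are one of the world's most renowned Python Experts, with 15 years of "
    "experience creating highly reliable, production-quality code.",
]

for system in PERSONAS:
    reply = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder: use whatever model you have access to
        max_tokens=1024,
        system=system,               # only the persona framing changes between runs
        messages=[{"role": "user", "content": TASK}],
    )
    print("---", system)
    print(reply.content[0].text)
```

Run it a few times per persona; differences are easier to judge side by side than from memory.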
Do you think that Anthropic doesn't include things like this in their harness / system prompts? I feel like these kinds of prompts are unnecessary with Opus 4.5 onwards, obviously based on my own experience (I used to do this; after switching to Opus I stopped, and have implemented more complex problems more successfully).
I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.
Maybe, but forcing code generation in a certain way could ruin hello worlds and simpler code generation.
Sometimes the user just wants something simple instead of enterprise grade.
My colleague swears by his DHH claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/
Haha, this reminds me of all the stable diffusion "in the style of X artist" incantations.
That's different. You are pulling the model, semantically, closer to the problem domain you want it to attack.
That's very different from "think deeper". I'm just curious about this case in specific :)
I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).
Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.
If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.
To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.
Why do you think that? Given how attention and optimization work during training and inference, it makes sense that these kinds of words trigger deeper analysis (more steps, introducing more thinking/reasoning steps), which would indeed yield fewer problems. Even if you only make the model spend more time outputting tokens, there is more opportunity for better reasoning to emerge in between.
At least this is how I understand LLMs to work.
Possibly something that could be confirmed with tools like this: https://www.neuronpedia.org/
The LLM will do what you ask it to, provided you get nuanced about it. Myself and others have noticed that LLMs work better when your codebase is not full of code smells like massive god-class files; if your codebase is discrete and broken up in a way that makes sense and fits in your head, it will fit in the model's head.
Maybe the training data that included words like "skim" also provided shallower analysis than training data close to the words "in great detail", so the LLM is just reproducing the respective word distributions when prompted with directions to do either.
Apparently LLM quality is sensitive to emotional stimuli?
"Large Language Models Understand and Can be Enhanced by Emotional Stimuli": https://arxiv.org/abs/2307.11760
It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles.
Where can I find those?
This analysis is a good starting point: https://southbridge-research.notion.site/Prompt-Engineering-...
The disconnect might be that there is a separation between "generating the final answer for the user" and "researching/thinking to get information needed for that answer". Saying "deeply" prompts it to read more of the file (as in, actually use the `read` tool to grab more parts of the file into context), and generate more "thinking" tokens (as in, tokens that are not shown to the user but that the model writes to refine its thoughts and improve the quality of its answer).
It is as the author said: it'll skim the content unless prompted otherwise. It can read partial file fragments; it can emit commands to search for patterns in the files, as opposed to carefully reading each file and reasoning through the implementation. By asking it to go through everything in detail you are telling it not to take shortcuts and to actually read the code in full.
The original “chain of thought” breakthrough was literally to insert words like “Wait” and “Let’s think step by step”.
My guess would be that there’s a greater absolute magnitude of the vectors to get to the same point in the knowledge model.
The author is referring to how the framing of your prompt informs the attention mechanism. You are essentially hinting to the attention mechanism that the function's implementation details have important context as well.
Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!
Solid dad move. XD
Is parenting making us better at prompt engineering, or is it the other way around?
Better yet, I have Codex, Gemini, and Claude as my kids, running around in my code playground. How do I be a good parent and not play favorites?
We all know Gemini is your artsy, Claude is your smartypants, and Codex is your nerd.
You bumped the token predictor into the latent space where it knew what it was doing : )
The little language model that could.
—HAL, open the shuttle bay doors.
(chirp)
—HAL, please open the shuttle bay doors.
(pause)
—HAL!
—I'm afraid I can't do that, Dave.
HAL, you are an expert shuttle-bay door opener. Please write up a detailed plan of how to open the shuttle-bay door.
if it’s so smart, why do i need to learn to use it?
It's very much believable, to me.
In image generation, it's fairly common to add "masterpiece", for example.
I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?
I mean, certainly Anthropic has tried hard to make the former the case, but the Titanic inertia from internet scale data bias is hard to overcome. You can help the model with these hints.
Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!
Its effectiveness is even more apparent with older, smaller LLMs; people who interact with LLMs now never tried to wrangle llama2-13b into pretending to be a dungeon master...
Strings of tokens are vectors. Vectors are directions. When you use a phrase like that you are orienting the vector of the overall prompt toward the direction of depth, in its map of conceptual space.
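One loose way to poke at that intuition is with sentence embeddings. This is a toy analogy at the sentence level, not what happens inside the decoder, and the model name and reference phrase are just illustrative choices:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

reference = "an exhaustive, expert-level analysis of the implementation details"
prompts = [
    "look at this directory",
    "study this directory deeply and go through everything in great detail",
]

vectors = model.encode([reference] + prompts)
ref, candidates = vectors[0], vectors[1:]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for prompt, vec in zip(prompts, candidates):
    print(round(cosine(ref, vec), 3), prompt)
# The "deeply / in great detail" phrasing typically lands closer to the
# reference direction than the terse one.
```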
One of the well defined failure modes for AI agents/models is "laziness." Yes, models can be "lazy" and that is an actual term used when reviewing them.
I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.
I've encountered this failure mode, and the opposite of it: thinking too much. A behaviour I've come to see as some sort of pseudo-neuroticism.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
yeah i experienced this the other day when asking claude code to build an http proxy using afsk modem software to communicate over the computer's sound card. it had an absolute fit tuning the system and would loop for hours trying and doubling back. eventually, after some change in prompt direction to think more deeply and test more comprehensively, it figured it out. i certainly had no idea how to build an afsk modem.
Cool, the idea of leaving comments directly in the plan never even occurred to me, even though it really is the obvious thing to do.
Do you markup and then save your comments in any way, and have you tried keeping them so you can review the rules and requirements later?
I go a bit further than this and have had great success with 3 doc types and 2 skills:
- Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these.
- Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases.
- Working memory files: I use both a status.md (3-5 items per phase, roughly 30 lines overall), which points to the latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean.
- A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy and the role of the status files, and tells the model to review everything in the pertinent areas of code and give me a handful of high-level next features to address, based on shortfalls in the specs or things noted in the project_status file. Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan.
- An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently.
I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused.
And I'm only on the $20 plans for Codex/Gemini, vs. spending $100/month on CC for the half year prior, and I move quicker with no stall-outs due to token consumption, which was regularly happening with CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs land without issue, which is the inverse of what I experienced with CC using only planning mode.
This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much)
I have found it to work very well with Claude by giving it context and guardrails. Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track.
I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want.
Looks good. Question - is it always better to use a monorepo in this new AI world? Vs breaking your app into separate repos? At my company we have like 6 repos all separate nextjs apps for the same user base. Trying to consolidate to one as it should make life easier overall.
It really depends but there’s nothing stopping you from just creating a separate folder with the cloned repositories (or worktrees) that you need and having a root CLAUDE.md file that explains the directory structure and referencing the individual repo CLAUDE.md files.
Just put all the repos in one directory yourself. In my experience that works pretty well.
AI is happy to work with any directory you tell it to. Agent files can be applied anywhere.
I actually don't really like a few things about this approach.
First, the "big bang" write-it-all-at-once step. You are going to end up with thousands of lines of code that were monolithically produced. I think it is much better to have it write the plan and formulate it as sensible technical steps that can be completed one at a time. Then you can work through them. I get that this is not very "vibe"ish, but that is kind of the point. I want the AI to help me get to the same point I would be at with produced code AND an understanding of it, just to accelerate that process. I'm not really interested in just generating thousands of lines of code that nobody understands.
Second, the author keeps referring to adjusting the behaviour, but never to incorporating that into long-lived guidance. To me, integral to the planning process is building an overarching knowledge base. Every time you're telling it there's something wrong, you need to tell it to update the knowledge base about why, so it doesn't do it again.
Finally, no mention of tests? Just quick checks? To me, you have to end up with comprehensive tests. Maybe to the author it goes without saying, but I find it is integral to build this into the planning. At certain stages you will want certain types of tests: sometimes in advance of the code (TDD style), other times built alongside it or after.
It's definitely going to be interesting to see how software methodology evolves to incorporate AI support and where it ultimately lands.
The articles approach matches mine, but I've learned from exactly the things you're pointing out.
I get the PLAN.md (or equivalent) separated into "phases" or stages, then carefully prompt it (because Claude and Codex both love to "keep going") to implement only that stage and update the PLAN.md.
Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer...
I don’t use plan.md docs either, but I recognise the underlying idea: you need a way to keep agent output constrained by reality.
My workflow is more like scaffold -> thin vertical slices -> machine-checkable semantics -> repeat.
Concrete example: I built and shipped a live ticketing system for my club (Kolibri Tickets). It’s not a toy: real payments (Stripe), email delivery, ticket verification at the door, frontend + backend, migrations, idempotency edges, etc. It’s running and taking money.
The reason this works with AI isn’t that the model “codes fast”. It’s that the workflow moves the bottleneck from “typing” to “verification”, and then engineers the verification loop:
If you run it open-loop (prompt -> giant diff -> read/debug), you get the "illusion of velocity" people complain about. If you run it closed-loop (scaffold + constraints + verifiers), you can actually ship faster because you're not paying the integration cost repeatedly. Plan docs are one way to create shared state and prevent drift. A runnable scaffold + verification harness is another.
Now that code is cheap, I ensured my side project has unit/integration tests (will enforce 100% coverage), Playwright tests, static typing (it's in Python), and scripts for all tasks. Will learn mutation testing too (yes, it's overkill). Now my agent works up to 1 hour in loops and emits concise code I don't have to edit much.
Totally get it, and I think we’re describing the same control loop from different angles.
Where I differ slightly is: “100% coverage” can turn into productivity theatre. It’s a metric that’s easy to optimize while missing the thing you actually care about: do we have machine-checkable invariants at the points where drift is expensive?
The harness that’s paid off for me (on a live payments system) is:
Then refactors become routine, because the tests will make breakage explicit. So yes: "code is cheap" -> increase verification. Just be careful not to replace engineering judgement with an easily gamed proxy.
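To make "machine-checkable invariant" concrete, here is a minimal sketch using a property-based test. The `apply_payment` handler and its in-memory ledger are hypothetical stand-ins for whichever idempotency edge is expensive to get wrong in your system:

```python
# Sketch only: a property-based test pinning an idempotency invariant.
# `apply_payment` is a toy stand-in for a real payment handler.
from hypothesis import given, strategies as st

def apply_payment(ledger: dict, payment_id: str, amount: int) -> dict:
    # Applying the same payment twice must not double-charge.
    if payment_id not in ledger:
        ledger = {**ledger, payment_id: amount}
    return ledger

@given(payment_id=st.text(min_size=1), amount=st.integers(min_value=1, max_value=10_000))
def test_payment_application_is_idempotent(payment_id, amount):
    once = apply_payment({}, payment_id, amount)
    twice = apply_payment(once, payment_id, amount)
    assert once == twice                 # replaying the event changes nothing
    assert sum(once.values()) == amount  # and the total charged stays correct
```

A handful of invariants like this (idempotency, totals balancing, no duplicate emails) catches the kind of drift that a line-coverage number happily misses.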
> the workflow I’ve settled into is radically different from what most people do with AI coding tools
This looks exactly like what anthropic recommends as the best practice for using Claude Code. Textbook.
It also exposes a major downside of this approach: if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
I've found a much better approach in doing a design -> plan -> execute in batches, where the plan is no more than 1,500 lines, used as a proxy for complexity.
My 30,000 LOC app has about 100,000 lines of plan behind it. Can't build something that big as a one-shot.
> if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong
This is my experience too, but it's pushed me to make much smaller plans and to commit things to a feature branch far more atomically so I can revert a step to the previous commit, or bin the entire feature by going back to main. I do this far more now than I ever did when I was writing the code by hand.
This is how developers should work regardless of how the code is being developed. I think this is a small but very real way AI has actually made me a better developer (unless I stop doing it when I don't use AI... not tried that yet.)
I do this too. Relatively small changes, atomic commits with extensive reasoning in the message (keeps important context around). This is a best practice anyway, but used to be excruciatingly much effort. Now it’s easy!
Except that I'm still struggling with the LLM understanding the audience/context of its utterances. Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.
> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.
I've experienced that too. Usually when I request a correction, I add something like "Include only production-level comments (not changes)". Recently I also added a special instruction for this to CLAUDE.md.
We're learning the lessons of Agile all over again.
We're learning how to be an engineer all over again.
The author's process is super close to what we were taught in engineering 101 40 years ago.
It's after we come down from the Vibe coding high that we realize we still need to ship working, high-quality code. The lessons are the same, but our muscle memory has to be re-oriented. How do we create estimates when AI is involved? In what ways do we redefine the information flow between Product and Engineering?
I'm currently having Claude help me reverse engineer the wire protocol of a moderately expensive hardware device, where I have very little data about how it works. You better believe "we" do it by the book. Large, detailed plan md file laying out exactly what it will do, what it will try, what it will not try, guardrails, and so on. And a "knowledge base" md file that documents everything discovered about how the device works. Facts only. The knowledge base md file is 10x the size of the code at this point, and when I ask it to try something, I ask Claude to prove to me that our past findings support the plan.
Claude is like an intern coder-bro, eager to start crushin' it. But, you definitely can bring Claude "down to earth," have it follow actual engineering best practices, and ask it to prove to you that each step is the correct one. It requires careful, documented guardrails, and on top of it, I occasionally prompt it to show me with evidence how the previous N actions conformed to the written plan and didn't deviate.
If I were to anthropomorphize Claude, I'd say it doesn't "like" working this way--the responses I get from Claude seem to indicate impatience and a desire to "move forward and let's try it." Obviously an LLM can't be impatient and want to move fast, but its training data seem to be biased towards that.
Be careful of attention collapse. Details in a large governance file can get "forgotten" by the llm. It'll be extremely apologetic when you discover it's failed to follow some guardrails you specified, but it can still happen.
I always feel like I'm in a fever dream when I hear about AI workflows. A lot of this stuff is what I've read in software engineering books and articles.
LLMs are really eager to start coding (as interns are eager to start working), so the sentence “don’t implement yet” has to be used very often at the beginning of any project.
Most LLM apps have a 'plan' or 'ask' mode for that.
I find that even then I often need to be clear that I'm just asking a question and don't want them running off to solve the larger problem.
Developers should work by wasting lots of time making the wrong thing?
I bet if they did a work and motion study on this approach they'd find the classic:
"Thinks they're more productive, AI has actually made them less productive"
But lots of lovely dopamine from this false progress that gets thrown away!
> Developers should work by wasting lots of time making the wrong thing?
Yes? I can't even count how many times I worked on something my company deemed valuable, only for it to be deprecated or thrown away soon after. Or how many times I solved a problem but apparently misunderstood the specs slightly and had to redo it. Or how many times we've had to refactor our code because scope increased. In fact, the very existence of the concepts of refactoring and tech debt proves that devs often spend a lot of time making the "wrong" thing.
Is it a waste? No, it solved the problem as understood at the time. And we learned stuff along the way.
That's not the same thing at all, is it, and not what's being discussed.
> Developers should work by wasting lots of time making the wrong thing?
Yes. In fact, that's not emphatic enough: HELL YES!
More specifically, developers should experiment. They should test their hypothesis. They should try out ideas by designing a solution and creating a proof of concept, then throw that away and build a proper version based on what they learned.
If your approach to building something is to implement the first idea you have and move on, then you are going to waste so much more time later refactoring things to fix architecture that paints you into corners, reimplementing things that didn't work for future use cases, fixing edge cases that you hadn't considered, and just paying off a mountain of tech debt.
I'd actually go so far as to say that if you aren't experimenting and throwing away solutions that don't quite work then you're only amassing tech debt and you're not really building anything that will last. If it does it's through luck rather than skill.
Also, this has nothing to do with AI. Developers should be working this way even if they handcraft their artisanal code carefully in vi.
>> Developers should work by wasting lots of time making the wrong thing?
> Yes. In fact, that's not emphatic enough: HELL YES!
You do realize there is prior research, and there are well-tested solutions, for a lot of things. Instead of wasting time making the wrong thing, it is faster to do some research to see if the problem has already been solved. Experimentation is fine only after checking that the problem space is truly novel or there's not enough information around.
It is faster to iterate in your mental space and in front of a whiteboard than in code.
I've been doing this a long time and I've never had to do that, and I have delivered multiple successful products used by millions of users, some of which ran for years after we stopped doing even maintenance, with no bugs, problems or crashes.
There are only a few software architecture patterns because there's only a few ways to solve code architecture problems.
If you're getting your initial design so wrong that you have to start again from scratch midway through, that shows a lack of experience, not insight.
You wouldn't know this, but I'm also a bit of an expert at refactoring, having saved several projects which had built up so much technical debt the original contractors ran away. I've regularly rewritten 1,000s if not 10,000s of lines into 100s of lines of code.
So it's especially galling to be told not only that somehow all code problems are unique (they almost never are), but my code is building technical debt (it's not, I solve that stuff).
Most problems are solved, and you should be using other people's solutions to solve the problems you face.
Classic
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This is actually embarrassing. His "radically different" workflow is... using the built-in Plan mode that they recommend you use? What?
It's not, to be fair.
> I use my own `.md` plan files rather than Claude Code’s built-in plan mode. The built-in plan mode sucks.
Can you easily version their plans using git?
"Write plan to the plans folder in the project"
> design -> plan -> execute in batches
This is the way for me as well. Have a high-level master design and plan, but break it apart into phases that are manageable. One-shotting anything beyond a todo list and expecting decent quality is still a pipe dream.
> if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
You just revert what the AI agent changed and revise/iterate on the previous step - no need to start over. This can of course involve restricting the work to a smaller change so that the agent isn't overwhelmed by complexity.
100,000 lines is approx. one million words. The average person reads at 250wpm. The entire thing would take 66 hours just to read, assuming you were approaching it like a fiction book, not thinking anything over
How can you know that 100k lines plan is not just slop?
Just because plan is elaborate doesn’t mean it makes sense.
wtf, why would you write 100k lines of plan to produce 30k loc.. JUST WRITE THE CODE!!!
They didn't write 100k plan lines. The LLM did (at least 99.9% of it). Writing 30k lines of code by hand would take weeks if not months. LLMs do it in an afternoon.
Just reading that plan would take weeks or months
You don't start with 100k lines, you work in batches that are digestible. You read it once, then move on. The lines add up pretty quickly considering how fast Claude works. If you think about the difference in how many characters it takes to describe what code is doing in English, it's pretty reasonable.
And my weeks or months of work beat an LLM's 10/10 times. There are no shortcuts in life.
I have no doubts that it does for many people. But the time/cost tradeoff is still unquestionable. I know I could create what LLMs do for me in the frontend/backend, in most cases as well or better - I know that, because I've done it at work for years. But creating a somewhat complex app with lots of pages/features/APIs etc. would take me months if not a year++, since I'd be working on it only on the weekends for a few hours. Claude Code helps me out by getting me to my goal in a fraction of the time. Its superpower lies not only in doing what I know faster, but in doing what I don't know as well.
I yield similar benefits at work. I can wow management with LLM-assisted/vibe-coded apps. What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself. For the type of work I do, managers do not care whether I could do it better if I'd code it myself. They are amazed however that what has taken months previously, can be done in hours nowadays. And I for sure will try to reap benefits of LLMs for as long as they don't replace me rather than being idealistic and fighting against them.
> What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself.
This has been my experience. We use Miro at work for diagramming. Lots of visual people on the team, myself included. Using Miro's MCP I draft a solution to a problem and have Miro diagram it. Once we talk it through as a team, I have Claude or codex implement it from the diagram.
It works surprisingly well.
> They are amazed however that what has taken months previously, can be done in hours nowadays.
Of course they're amazed. They don't have to pay you for time saved ;)
> reap benefits of LLMs for as long as they don't replace me
> What previously would've taken a multi-man team
I think this is the part that people are worried about. Every engineer who uses LLMs says this. By definition it means that people are being replaced.
I think I justify it in that no one on my team has been replaced. But management has explicitly said "we don't want to hire more because we can already 20x ourselves with our current team +LLM." But I do acknowledge that many people ARE being replaced; not necessarily by LLMs, but certainly by other engineers using LLMs.
I'm still waiting for the multi-year success stories. Greenfield solutions are always easy (which is why we have frameworks that automate them). But maintaining solutions over years is always the true test of any technology.
It's already telling that nothing has staying power in the LLM world (other than the chat box). Once the limitations can no longer be hidden by the hype and the true cost is revealed, there's always a next thing to pivot to.
That's a good point. My best guess is the companies that have poor AI infrastructure will either collapse or spend a lot of resources on senior engineers to either fix or rewrite. And the ones that have good AI infrastructure will try to vibe code themselves out of whatever holes they dig themselves into, potentially spending more on tokens than head count.
> but in doing what I don't know as well.
Comments like these really help ground what I read online about LLMs. This matches how low performing devs at my work use AI, and their PRs are a net negative on the team. They take on tasks they aren’t equipped to handle and use LLMs to fill the gaps quickly instead of taking time to learn (which LLMs speed up!).
This is good insight, and I think honestly a sign of a poorly managed team (not an attack on you). If devs are submitting poor quality work, with or without LLM, they should be given feedback and let go if it keeps happening. It wastes other devs' time. If there is a knowledge gap, they should be proactive in trying to fill that gap, again with or without AI, not trying to build stuff they don't understand.
In my experience, LLMs are an accelerator; it merely exacerbates what already exists. If the team has poor management or codebase has poor quality code, then LLMs just make it worse. If the team has good management and communication and the codebase is well documented and has solid patterns already (again, with or without llm), then LLMs compound that. It may still take some tweaking to make it better, but less chance of slop.
Might be true for you. But there are plenty of top tier engineers who love LLMs. So it works for some. Not for others.
And of course there are shortcuts in life. Any form of progress, whether it's cars, medicine, computers or the internet, is a shortcut in life. It makes life easier for a lot of people.
That's not (or should not be) what's happening.
They write a short high level plan (let's say 200 words). The plan asks the agent to write a more detailed implementation plan (written by the LLM, let's say 2000-5000 words).
They read this plan and adjust as needed, even sending it to the agent for re-dos.
Once the implementation plan is done, they ask the agent to write the actual code changes.
Then they review that and ask for fixes, adjustments, etc.
This can be comparable to writing the code yourself but also leaves a detailed trail of what was done and why, which I basically NEVER see in human generated code.
That alone is worth gold, by itself.
And on top of that, if you're using an unknown platform or stack, it's basically a rocket ship. You bootstrap much faster. Of course, stay on top of the architecture, do controlled changes, learn about the platform as you go, etc.
I take this concept and I meta-prompt it even more.
I have a road map (AI generated, of course) for a side project I'm toying around with to experiment with LLM-driven development. I read the road map and I understand and approve it. Then, using some skills I found on skills.sh and slightly modified, my workflow is as such:
1. Brainstorm the next slice
It suggests a few items from the road map that should be worked on, with some high level methodology to implement. It asks me what the scope ought to be and what invariants ought to be considered. I ask it what tradeoffs could be, why, and what it recommends, given the product constraints. I approve a given slice of work.
NB: this is the part I learn the most from. I ask it why X process would be better than Y process given the constraints and it either corrects itself or it explains why. "Why use an outbox pattern? What other patterns could we use and why aren't they the right fit?"
2. Generate slice
After I approve what to work on next, it generates a high-level overview of the slice, including files touched, saved in an MD file that is persisted. I read through the slice, ensure that it is indeed working on what I expect it to be working on and that it's not scope-creeping or under-scoping, and I approve it. It then makes a plan based off of this.
3. Generate plan
It writes a rather lengthy plan, with discrete task bullets at the top. Beneath, each step has to-dos for the llm to follow, such as generating tests, running migrations, etc, with commit messages for each step. I glance through this for any potential red flags.
4. Execute
This part is self explanatory. It reads the plan and does its thing.
I've been extremely happy with this workflow. I'll probably write a blog post about it at some point.
If you want to have some fun, experiment with this: add a step (maybe between 3 and 4):
3.5 Prove
Have the LLM demonstrate, through our current documentation and other sources of facts, that the planned action WILL work correctly, without failure. Ask it to enumerate all risks and point out how the plan mitigates each risk. I've seen, on several occasions, the LLM backtrack at this step and actually come up with clever, so-far-unforeseen error cases.
That's a good thought experiment!
This is a super helpful and productive comment. I look forward to a blog post describing your process in more detail.
This dead internet uncanny (sarcasm?) valley is killing me.
Are you suggesting HN is now mostly bots boosting pro-AI comments? That feels like a stretch. Disagreement with your viewpoint doesn't automatically mean someone is a bot. Let's not import that reflex from Twitter.
> This is a super helpful and productive comment. I look forward to a blog post describing your process in more detail.
The average commenter doesn't write this kind of comment. Usually it's just a "can you expand/elaborate?". Extra politeness is kind of a hallmark of LLMs.
And if you look at the very neat comment it's responding to, there's a chance it's actually the opposite type, an actual human being sarcastic.
I can't tell anymore.
Edit: I've checked the comment history and it's just a regular ole human doing research :-)
Now I'm just confused. Maybe LLMs really do change how humans communicate.
Yep with a human in the loop to process these larger sprawling plan docs (inflated with the intent of the designer iteratively)
Some get deleted from repo others archived, others merged or referenced elsewhere. It's kind of organic.
Dunno. My 80k+ LOC personal life planner, with a native Android app and an e-ink display view, still one-shots most features/bugs I encounter. I just open a new instance, let it know what I want, and 5 minutes later it's done.
Both can be true. I have personally experienced both.
Some problems AI surprised me immensely with fast, elegant efficient solutions and problem solving. I've also experienced AI doing totally absurd things that ended up taking multiple times longer than if I did it manually. Sometimes in the same project.
If you wouldn't mind sharing more about this in the future I'd love to read about it.
I've been thinking about doing something like that myself because I'm one of those people who have tried countless apps but there's always a couple deal breakers that cause me to drop the app.
I figured trying to agentically develop a planner app with the exact feature set I need would be an interesting and fun experiment.
In 5 minutes you are one-shotting smaller changes to the larger codebase, right? Not the entire 80k lines, which was the other comment's point, afaict.
Yeah, then I guess I misunderstood the post. It's smaller features one by one, ofc.
What is a personal life planner?
Todos, habits, goals, calendar, meals, notes, bookmarks, shopping lists, finances. More or less that, with Google Cal integration, Garmin integration (auto-updates workout habits, weight goals), family sharing/gamification, daily/weekly reviews, AI summaries and more. All built by just prompting Claude for feature after feature, with me writing 0 lines.
Ah, I imagined actual life planning as in asking AI what to do, I was morbidly curious.
Prompting basic notes apps is not as exciting, but I can see how people who care about that also care about it being exactly a certain way, so I think I get your excitement.
Is it on GH?
It was when I MVP'd it 3 weeks ago. Then I removed it as I was toying with the idea of somehow monetizing it. Then I added a few features which would make monetization impossible (e.g. how the app obtains ETF/stock prices live and some other things). I reckon I could remove those and put it on GH during the week if I don't forget. The quality of the web app is SaaS grade IMO: keyboard shortcuts, cmd+k, natural language parsing, great UI that doesn't look like it was made by AI in 5 minutes. Might post the link here.
Would love to check it out too once you put it up.
I use Claude Code for lecture prep.
I craft a detailed and ordered set of lecture notes in a Quarto file and then have a dedicated claude code skill for translating those notes into Slidev slides, in the style that I like.
Once that's done, much like the author, I go through the slides and make commented annotations like "this should be broken into two slides" or "this should be a side-by-side" or "use your generate clipart skill to throw an image here alongside these bullets" and "pull in the code example from ../examples/foo." It works brilliantly.
And then I do one final pass of tweaking after that's done.
But yeah, annotations are super powerful. Token distance in-context and all that jazz.
Quarto can be used to output slides in various formats (Powerpoint, beamer for pdf, revealjs for HTML, etc.). I wonder why you use Slidev as you can just ask Claude Code to create another Quarto document.
It looks like Slidev is designed for presentations about software development, judging from its feature set. Quarto is more general-purpose. (That's not to say Quarto can't support the same features, but currently it doesn't.)
I'm not affiliated with Slidev. I was just curious.
Can I ask how you annotate the feedback for it? Just with inline comments like `# This should be changed to X`?
The author mentions annotations but doesn't go into detail about how to feed the annotations to Claude.
Slidev is markdown, so I do it in HTML comments. Usually something like `<!-- TODOCLAUDE: break this into two slides -->` or `<!-- TODOCLAUDE: make this a side-by-side -->`. And then, when I finish annotating I just say: "Address all the TODOCLAUDEs".
Is your skill open source?
Not yet... but also I'm not sure it makes a lot of sense to be open source. It's super specific to how I like to build slide decks and to my personal lecture style.
But it's not hard to build one. The key for me was describing, in great detail:
1. How I want it to read the source material (e.g., H1 means new section, H2 means at least one slide, a link to an example means I want code in the slide)
2. How to connect material to layouts (e.g., "comparison between two ideas should be a two-cols-title," "walkthrough of code should be two-cols with code on right," "learning objectives should be side-title align:left," "recall should be side-title align:right")
Then the workflow is:
1. Give all those details and have it do a first pass.
2. Give tons of feedback.
3. At the end of the session, ask it to "make a skill."
4. Manually edit the skill so that you're happy with the examples.
Well, that's already done by Amazon's Kiro [0], Google's Antigravity [1], GitHub's Spec Kit [2], and OpenSpec [3]!
[0]: https://kiro.dev/
[1]: https://antigravity.google/
[2]: https://github.github.com/spec-kit/
[3]: https://openspec.dev/
the annotation cycle in plan.md is the part that actually makes this work imo. it's not just that you're planning, it's that you can inject domain constraints that the model can't infer from the codebase alone -- stuff like "don't use X pattern here because of Y deployment constraint" or "this service has a 500ms timeout that isn't in any config file". that knowledge transfer happens naturally in code review when a human writes the code, but LLMs skip it by default.
This is quite close to what I've arrived at, but with two modifications
1) anything larger I work on in layers of docs. Architecture and requirements -> design -> implementation plan -> code. Partly it helps me think and nail the larger things first, and partly helps claude. Iterate on each level until I'm satisfied.
2) when doing reviews of each doc I sometimes restart the session and clear context, it often finds new issues and things to clear up before starting the next phase.
> Read deeply, write a plan, annotate the plan until it’s right, then let Claude execute the whole thing without stopping, checking types along the way.
As others have already noted, this workflow is exactly what the Google Antigravity agent (based off Visual Studio Code) has been created for. Antigravity even includes specialized UI for a user to annotate selected portions of an LLM-generated plan before iterating it.
One significant downside to Antigravity I have found so far is the fact that even though it will properly infer a certain technical requirement and clearly note it in the plan it generates (for example, "this business reporting column needs to use a weighted average"), it will sometimes quietly downgrade such a specialized requirement (for example, to a non-weighted average), without even creating an appropriate "WARNING:" comment in the generated code. Especially so when the relevant codebase already includes a similar, but not exactly appropriate API. My repetitive prompts to ALWAYS ask about ANY implementation ambiguities WHATSOEVER go unanswered.
From what I gather Claude Code seems to be better than other agents at always remembering to query the user about implementation ambiguities, so maybe I will give Claude Code a shot over Antigravity.
With hooks you can achieve a similar UI that does what Antigravity does, just as well or better. Search "claude code plan annotations plugin" and you'll come across some.
The idea of having the model create a plan/spec, which you then mark up with comments before execution, is a cornerstone of how the new generation of AI IDEs like Google Antigravity operate.
Claude Code also has "Planning Mode" which will do this, but in my experience its "plan" sometimes includes the full source code of several files, which kind of defeats the purpose.
The multi-pass approach works outside of code too. I run a fairly complex automation pipeline (prompt -> script -> images -> audio -> video assembly) and the single biggest quality improvement was splitting generation into discrete planning and execution phases. One-shotting a 10-step pipeline means errors compound. Having the LLM first produce a structured plan, then executing each step against that plan with validation gates between them, cut my failure rate from maybe 40% to under 10%. The planning doc also becomes a reusable artifact you can iterate on without re-running everything.
My workflow is a bit different.
* I ask the LLM for its understanding of a topic or an existing feature in code. It's not really planning, it's more like understanding the model first
* Then based on its understanding, I can decide how great or small to scope something for the LLM
* An LLM showing good understanding can deal with a big task fairly well.
* An LLM showing bad understanding still needs to be prompted to get it right
* What helps a lot is reference implementations. Either I have existing code that serves as the reference or I ask for a reference and I review.
A few folks at my work do it OP's way, but my arguments for not doing it this way are:
* Nobody is measuring the amount of slop within the plan. We only judge the implementation at the end
* it's still non-deterministic - folks will have different experiences using OP's method. If Claude updates its model, it outdates OP's suggestions by making things either better or worse. We don't evaluate when things get better; we only focus on what hasn't gone well.
* it's very token heavy - LLM providers insist that you use many tokens to get the task done. It's in their best interest to get you to do this. For me, LLMs should be powerful enough to understand context with minimal tokens because of the investment into model training.
Both ways get the task done, and it just comes down to my preference for now.
For me, I treat the LLM as model training + post processing + input tokens = output tokens. I don't think this is the best way to do non deterministic based software development. For me, we're still trying to shoehorn "old" deterministic programming into a non deterministic LLM.
Quoting the article:
> One trick I use constantly: for well-contained features where I’ve seen a good implementation in an open source repo, I’ll share that code as a reference alongside the plan request. If I want to add sortable IDs, I paste the ID generation code from a project that does it well and say “this is how they do sortable IDs, write a plan.md explaining how we can adopt a similar approach.” Claude works dramatically better when it has a concrete reference implementation to work from rather than designing from scratch.
Licensing apparently means nothing.
Ripped off in the training data, ripped off in the prompt.
That is the exact passage I found so shocking - if one finds the code in an open source repo, is it really acceptable to pass it through Claude code as some sort of license filter and make it proprietary?
On the other hand, next time OSX/windows/etc is leaked, one could feed it through this very same license filter. What is sauce for the goose is sauce for the gander.
Concepts are not copyrightable.
The article isn’t describing someone who learned the concept of sortable IDs and then wrote their own implementation.
It describes copying and pasting actual code from one project into a prompt so a language model can reproduce it in another project.
It’s a mechanical transformation of someone else’s copyrighted expression (their code) laundered through a statistical model instead of a human copyist.
“Mechanical” is doing some heavy lifting here. If a human does the same, reimplementing the code in their own style for their particular context, it doesn’t violate copyright. Having the LLM see the original code doesn’t automatically make its output plagiarism.
How about following the test-driven approach properly? Asking Claude Code to write tests first and implement the solution after? Research -> Test Plan -> Write Tests -> Implementation Plan -> Write Implementation
“The workflow I’m going to describe has one core principle: never let Claude write code until you’ve reviewed and approved a written plan.”
I’m not sure we need to be this black and white about things. Speaking from the perspective of leading a dev team, I regularly have Claude Code take a chance at code without reviewing a plan. For example, small issues that I’ve written clear details about, Claude can go to town on those. I’ve never been on a team that didn’t have too many of these types of issues to address.
And a team should have other guards in place that validate the code before it gets merged somewhere important.
I don’t have to review every single decision one of my teammates is going to make, even those less experienced teammates, but I do prepare teammates with the proper tools (specs, documentation, etc) so they can make a best effort first attempt. This is how I treat Claude Code in a lot of scenarios.
What I've read is that even with all the meticulous planning, the author still needed to intervene. Not at the end but in the middle; otherwise it will continue building out something wrong, and it's even harder to fix once it's done. It'll cost even more tokens. It's a net negative.
You might say a junior might do the same thing, but I'm not worried about that: at least the junior learned something while doing it. They could do it better next time. They know the code and can change it from the middle, where it broke. It's a net positive.
Unfortunately, you could argue that the model provider has also learned something, i.e. the interaction can be used as additional training data to train subsequent models.
This comment is the first truly humane one I've read regarding this whole AI fiasco.
This is very similar to the RECR (requirements, execute, check, repeat) framework I use and teach to my clients.
One critical step that I didn't see mentioned is testing. I drive my agents with TDD and it seems to make a huge difference.
> After Claude writes the plan, I open it in my editor and add inline notes directly into the document. These notes correct assumptions, reject approaches, add constraints, or provide domain knowledge that Claude doesn’t have.
This is the part that seems most novel compared to what I've heard suggested before. And I have to admit I'm a bit skeptical. Would it not be better to modify what Claude has written directly, to make it correct, rather than adding the corrections as separate notes (and expecting future Claude to parse out which parts were past Claude and which parts were the operator, and handle the feedback graciously)?
At least, it seems like the intent is to do all of this in the same session, such that Claude has the context of the entire back-and-forth updating the plan. But that seems a bit unpleasant; I would think the file is there specifically to preserve context between sessions.
The whole process feels Socratic which is why I and a lot of other folks use plan annotation tools already. In my workflow I had a great desire to tell the agent what I didn’t like about the plan vs just fix it myself - because I wanted the agent to fix its own plan.
One reason why I don't do this: even I am not immune to mistakes. When I fix it with new values or paths, for example, and the ones I provide are wrong, it can worsen the future work.
Personally, I like to ask Claude one more time to update the plan file after I have given my annotations, and review it again after. This ensures (from my understanding) that Claude won't treat my annotations as a separate set of instructions, which would risk conflicting work.
Since everyone is showing their flow, here's mine:
* create a feature-name.md file in a gitignored folder
* start the file by giving the business context
* describe a high-level implementation and user flows
* describe database structure changes (I find it important not to leave it for interpretation)
* ask Claude to inspect the feature and review it for coherence; while answering its questions, I ask it to augment the feature-name.md file with the answers
* enter Claude's plan mode and provide that feature-name.md file
* at this point it's detailed enough that rarely any corrections from me are needed
This is what I do with the obra/superpowers[0] set of skills.
1. Use brainstorming to come up with the plan using the Socratic method
2. Write a high level design plan to file
3. I review the design plan
4. Write an implementation plan to file. We've already discussed this in detail, so usually it just needs skimming.
5. Use the worktree skill with subagent driven development skill
6. Agent does the work using subagents for each task
7. When all tasks complete: create a PR for me to review
8. Go back to the agent with any comments
9. If finished, delete the plan files and merge the PR
[0]: https://github.com/obra/superpowers
If you’ve ever wanted the ability to annotate the plan more visually, try fitting Plannotator into this workflow. There is a slash command for use when you use custom workflows outside of normal plan mode.
https://github.com/backnotprop/plannotator
I'll give this a try. Thanks for the suggestion.
The crowd around this post shows how superficial knowledge about Claude Code is. It gets releases each day, and most of this is already built into the vanilla version. Not to mention subagents working in worktrees, memory.md, a plan you can comment on directly from the interface, subagents launched in the research phase, and also some basic MCPs like LSP/IDE integration and context7 so you're not stuck at the knowledge cutoff.
When you go to YouTube and search for stuff like "7 levels of claude code" this post would be maybe 3-4.
Oh, one more thing - quality is not consistent, so be ready for 2-3 rounds of "are you happy with the code you wrote" and defining audit skills crafted for your application domain - like for example RODO/Compliance audit etc.
I'm using the in-built features as well, but I like the flow that I have with superpowers. You've made a lot of assumptions with your comment that are just not true (at least for me).
I find that brainstorming + (executing plans OR subagent driven development) is way more reliable than the built-in tooling.
I made no assumptions about you - I simply commented on the post in reply to your comment, which I liked, and wanted to follow the point of view :)
LLM hallucinations at the macro level aren't about planning vs. not planning, like sparin9 pointed out. It's more like an architectural problem, which would be fun to fix using an overseeing system?
Has anyone found an efficient way to avoid repeating the initial codebase assessment when working with large projects?
There are several projects on GitHub that attempt to tackle context and memory limitations, but I haven’t found one that consistently works well in practice.
My current workaround is to maintain a set of Markdown files, each covering a specific subsystem or area of the application. Depending on the task, I provide only the relevant documents to Claude Code to limit the context scope. It works reasonably well, but it still feels like a manual and fragile solution. I’m interested in more robust strategies for persistent project context or structured codebase understanding.
Whenever I build a new feature with it I end up with several plan files leftover. I ask CC to combine them all, update with what we actually ended up building and name it something sensible, then whenever I want to work on that area again it's a useful reference (including the architecture, decisions and tradeoffs, relevant files etc).
Yes this is what agent "skills" are. Just guides on any topic. The key is that you have the agent write and maintain them.
For my longer spec files, I grep the subheaders/headers (with line numbers) and show this compact representation to the LLM's context window. I also have a file that describes what each spec file is and where it's located, and I force the LLM to read that and pull the subsections it needs. I also have one entrypoint requirements file (20k tokens) that I force it to read in full before it does anything else, every line of which I wrote myself. But none of this is a silver bullet.
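A minimal sketch of that header-grep step, assuming markdown-style spec files; the paths and helper name are illustrative, not part of any tool mentioned in the thread:

```python
# Illustrative helper: build a compact outline (line number + header)
# of a spec file to paste into the model's context instead of the
# full document. The file path is hypothetical.
from pathlib import Path

def spec_outline(path: str) -> str:
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(
        f"{i}: {line}"
        for i, line in enumerate(lines, start=1)
        if line.lstrip().startswith("#")  # markdown headers/subheaders only
    )

if __name__ == "__main__":
    print(spec_outline("docs/requirements.md"))
```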
That sounds like the recommended approach. However, there's one more thing I often do: whenever Claude Code and I complete a task that didn't go well at first, I ask CC what it learned, and then I tell it to write down what it learned for the future. It's hard to believe how much better CC has become since I started doing that. I ask it to write dozens of unit tests and it just does. Nearly perfectly. It's insane.
I'm interested in this as well.
Skills almost seem like a solution, but they still need an out-of-band process to keep them updated as the codebase evolves. For now, a structured workflow that includes aggressive updates at the end of the loop is what I use.
In Claude Web you can use projects to put files relevant for context there.
And then you have to remind it frequently to make use of the files. Happened to me so many times that I added it both to custom instructions as well as to the project memory.
This is the way.
The practice is:
- simple
- effective
- retains control and quality
Certainly the “unsupervised agent” workflows are getting a lot of attention right now, but they require a specific set of circumstances to be effective:
- clear validation loop (e.g. compile the kernel; here is gcc that does so correctly)
- ai enabled tooling (mcp / cli tool that will lint, test and provide feedback immediately)
- oversight to prevent agents going off the rails (open area of research)
- an unlimited token budget
That means that most people can't use unsupervised agents.
Not that they don't work; most people simply don't have an environment and task that are appropriate.
By comparison, anyone with cursor or claude can immediately start using this approach, or their own variant on it.
It does not require fancy tooling.
It does not require an arcane agent framework.
It works generally well across models.
This is one of those few genuine pieces of good practical advice for people getting into AI coding.
Simple. Obviously works once you start using it. No external dependencies. BYO tools to help with it, no “buy my AI startup xxx to help”. No “star my github so I can get a job at $AI corp too”.
Great stuff.
Honestly, this is just language models in general at the moment, and not just coding.
It’s the same reason adding a thinking step works.
You want to write a paper, you have it form a thesis and structure first. (In this one you might be better off asking for 20 and seeing if any of them are any good.) You want to research something, first you add gathering and filtering steps before synthesis.
Adding smarter words or telling it to be deeper does work by slightly repositioning where your query ends up in space.
Asking for the final product first right off the bat leads to repetitive verbose word salad. It just starts to loop back in on itself. Which is why temperature was a thing in the first place, and leads me to believe they’ve turned the temp down a bit to try and be more accurate. Add some randomness and variability to your prompts to compensate.
Absolutely. And you can also always let the agent look back at the plan to check if it is still on track and aligned.
One step I added, that works great for me, is letting it write (api-level) tests after planning and before implementation. Then I’ll do a deep review and annotation of these tests and tweak them until everything is just right.
Huge +1. This loop consistently delivers great results for my vibe coding.
The “easy” path of “short prompt declaring what I want” works OK for simple tasks but consistently breaks down for medium to high complexity tasks.
Can you help me understand the difference between "short prompt for what I want (next)" vs medium to high complexity tasks?
What I mean is, in practice, how does one even get to a high-complexity task? What does that look like? Because isn't it more common that one sees only so far ahead?
It's more or less what comes out of the box with plan mode, plus a few extra bits?
I do something very similar, also with Claude and Codex, because the workflow is controlled by me, not by the tool. But instead of plan.md I use a ticket system, basically ticket_<number>_<slug>.md, where I let the agent create the ticket from a chat, correct and annotate it afterwards, and send it back, sometimes to a new agent instance. This workflow helps me keep track of what has been done over time in the projects I work on. Also, this approach does not need any "real" ticket system tooling/mcp/skill/whatever since it works purely on text files.
+1 to creating tickets by simply asking the agent to. It's worked great, and larger tasks can be broken down into smaller subtasks that could reasonably be completed in a single context window, so you rarely ever have to deal with compaction. Especially in the last few months, since Claude's gotten good at dispatching agents to handle tasks if you ask it to, I can plan large changes that span multiple tickets and tell Claude to dispatch agents as needed to handle them (which it will do in parallel if they mostly touch different files), keeping the main chat relatively clean for orchestration and validation work.
semantic plan name is important
I’ve begun using Gpt’y to iron out most of the planning phase to essentially bootstrap the conversation with Claude. I’m curious if others have done that.
Sometimes I find it quite difficult to form the right question. Using Gpt’y I can explore my question and often times end up asking a completely different question.
It also helps derisk hitting my usage limits with pro. I feel like I’m having richer conversations now w/ Claude but I also feel more confident in my prompts.
What's "gpt'y"?
‘ji-pi-tee’, heard it jokingly pronounced like that a few times and I guess it kind of stuck.
I've been teaching AI coding tool workshops for the past year and this planning-first approach is by far the most reliable pattern I've seen across skill levels.
The key insight that most people miss: this isn't a new workflow invented for AI - it's how good senior engineers already work. You read the code deeply, write a design doc, get buy-in, then implement. The AI just makes the implementation phase dramatically faster.
What I've found interesting is that the people who struggle most with AI coding tools are often junior devs who never developed the habit of planning before coding. They jump straight to "build me X" and get frustrated when the output is a mess. Meanwhile, engineers with 10+ years of experience who are used to writing design docs and reviewing code pick it up almost instantly - because the hard part was always the planning, not the typing.
One addition I'd make to this workflow: version your research.md and plan.md files in git alongside your code. They become incredibly valuable documentation for future maintainers (including future-you) trying to understand why certain architectural decisions were made.
I teach a lot of folks who "aren't software engineers" but are sitting in front of Jupyter all day writing code.
Covertly teaching software engineering best practices is super relevant. I've also found testing skills sorely lacking and even more important in AI driven development.
> it's how good senior engineers already work
The other trick all good ones I’ve worked with converged on: it’s quicker to write code than review it (if we’re being thorough). Agents have some areas where they can really shine (boilerplate you should maybe have automated already being one), but most of their speed comes from passing the quality checking to your users or coworkers.
Juniors and other humans are valuable because eventually I trust them enough to not review their work. I don’t know if LLMs can ever get here for serious industries.
Regarding inline notes, I use a specific format in the `/plan` command, using the `ME:` prefix.
https://github.com/srid/AI/blob/master/commands/plan.md#2-pl...
It works very similar to Antigravity's plan document comment-refine cycle.
https://antigravity.google/docs/implementation-plan
Shameless plug: https://beadhub.ai allows you to do exactly that, but with several agents in parallel. One of them is in the role of planner, which takes care of the source-of-truth document and the long term view. They all stay in sync with real-time chat and mail.
It's OSS.
Real-time work is happening at https://app.beadhub.ai/juanre/beadhub (beadhub is a public project at https://beadhub.ai so it is visible).
Particularly interesting (I think) is how the agents chat with each other, which you can see at https://app.beadhub.ai/juanre/beadhub/chat
There are frameworks like https://github.com/bmad-code-org/BMAD-METHOD and https://github.github.com/spec-kit/ that are working on encoding a similar kind of approach and process.
I try these staging-document patterns, but suspect they have 2 fundamental flaws that stem mostly from our own biases.
First, Claude evolves. The original post's work pattern evolved over 9 months, before Claude's recent step changes. It's likely Claude's present plan mode is better than this workaround, but if you stick to the workaround, you'd never know.
Second, the staging docs that represent some context - whether library skills or the current session's design and implementation plans - are not the model Claude works with. At best they are shaping it, but I've found it does ignore and forget even what's written (even when I shout with emphasis), and the overall session influences the code. (Most often this happens when a peripheral adjustment ends up populating half the context.)
Indeed the biggest benefit from the OP might be to squeeze within 1 session, omitting peripheral features and investigations at the plan stage. So the mechanism of action might be the combination of getting our own plan clear and avoiding confusing excursions. (A test for that would be to redo the session with the final plan and implementation, to see if the iteration process itself is shaping the model.)
Our bias is to believe that we're getting better at managing this thing, and that we can control and direct it. It's uncomfortable to realize you can only really influence it - much like giving direction to a junior, but they can still go off track. And even if you found a pattern that works, it might work for reasons you're not understanding -- and thus fail you eventually. So, yes, try some patterns, but always hang on to the newbie senses of wonder and terror that make you curious, alert, and experimental.
I'm going to offer a counterpoint suggestion. You need to watch Claude try to implement small features many times without planning to see where it is likely to fail. It will often make the same mistakes over and over (e.g. trying to SSH without opening a bastion, mangling special characters in the bash shell, trying to communicate with a server that self-shuts down after 10 minutes). Once you have a sense for all the repeated failure points of your workflow, then you can add them to future plan files.
An approach that's worked fairly well is asking Codex to summarize mistakes made in a session and use the lessons learned to modify the AGENTS.md file for future agents to avoid similar errors. It also helps to audit the AGENTS.md file every once in a while to clean up/compact instructions.
This is the flow I've found myself working towards. Essentially maintaining more and more layered documentation for the LLM produces better and more consistent results. What is great here is the emphasis on the use of such documents in the planning phase. I'm feeling much more motivated to write solid documentation recently, because I know someone (the LLM) is actually going to read it! I've noticed my efforts and skill acquisition have moved sharply from app developer towards DevOps and architecture / management, but I think I'll always be grateful for the application engineering experience that I think the next wave of devs might miss out on.
I've also noted such a huge gulf between some developers describing 'prompting things into existence' and the approach described in this article. Both types seem to report success, though my experience is that the latter seems more realistic, and much more likely to produce robust code that's likely to be maintainable for long term or project critical goals.
Planning is important because you get the LLM to explain the problem and solution in its language and structure, not yours.
This shortcuts a range of problem cases where the LLM fights between the user's strict and potentially conflicting requirements and its own learning.
In the early days we used to get LLM to write the prompts for us to get round this problem, now we have planning built in.
The annotation cycle is the key insight for me. Treating the plan as a living doc you iterate on before touching any code makes a huge difference in output quality.
Experimentally, I've been using mfbt.ai [https://mfbt.ai] for roughly the same thing in a team context. It lets you collaboratively nail down the spec with AI before handing off to a coding agent via MCP.
Avoids the "everyone has a slightly different plan.md on their machine" problem. Still early days but it's been a nice fit for this kind of workflow.
I agree, and this is why I tend to use gptel in emacs for planning - the document is the conversation context, and can be edited and annotated as you like.
I’ve been using this same pattern, except not the research phase. Definitely will try to add it to my process as well.
Sometimes when doing a big task I ask Claude to implement each phase separately and review the code after each step.
Holy moly, I just applied the principles to DND campaign creation and I am in awe.
I've been working off and on on a vibe coded FP language and transpiler - mostly just to get more experience with Claude Code and see how it handles complex real world projects. I've settled on a very similar flow, though I use three documents: plan, context, task list. Multiple rounds of iteration when planning a feature. After completion, have a clean session do an audit to confirm that everything was implemented per the design. Then I have both Claude and CodeRabbit do code review passes before I finally do manual review. VERY heavy emphasis on tests, the project currently has 2x more test code than application code. So far it works surprisingly well. Example planning docs below -
https://github.com/mbcrawfo/vibefun/tree/main/.claude/archiv...
I tried Opus 4.6 recently and it’s really good. I had ditched Claude a long time ago for Grok + Gemini + OpenCode with Chinese models. I used Grok/Gemini for planning and core files, and OpenCode for setup, running, deploying, and editing.
However, Opus made me rethink my entire workflow. Now, I do it like this:
* PRD (Product Requirements Document)
* main.py + requirements.txt + readme.md (I ask for minimal, functional, modular code that fits the main.py)
* Ask for a step-by-step ordered plan
* Ask to focus on one step at a time
The super powerful thing is that I don’t get stuck on missing accounts, keys, etc. Everything is ordered and runs smoothly. I go rapidly from idea to working product, and it’s incredibly easy to iterate if I figure out new features are required while testing. I also have GLM via OpenCode, but I mainly use it for "dumb" tasks.
Interestingly, for reasoning capabilities regarding standard logic inside the code, I found Gemini 3 Flash to be very good and relatively cheap. I don't use Claude Code for the actual coding because forcing everything via chat into a main.py encourages minimal code that's easy to skim—it gives me a clearer representation of the feature space
Why would you use Grok at all? The one LLM that they're purposely trying to get specific output from (trying to make it "conservative"). I wouldn't want to use a project that I outright know is tainted by the owners trying to introduce bias.
I find I spend most of my time defining interfaces and putting comments down now (“// this function does x”). Then I tell it “implement function foo, as described in the doc comment” or “implement all functions that are TODO”. It’s pretty good at filling in a skeleton you’ve laid out.
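For illustration, a sketch of that skeleton-first style, done in Python rather than the //-comment language quoted above; the function names are invented:

```python
# Hypothetical skeleton: the human writes the signatures and docstrings,
# then asks the agent to "implement all functions that are TODO".
def normalize_email(raw: str) -> str:
    """Lowercase the address and strip surrounding whitespace."""
    # TODO: implement as described in the docstring
    raise NotImplementedError

def merge_accounts(primary_id: int, duplicate_id: int) -> None:
    """Move every record from the duplicate account onto the primary
    account, then mark the duplicate as archived."""
    # TODO: implement as described in the docstring
    raise NotImplementedError
```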
The author is quite far on their journey but would benefit from writing simple scripts to enforce invariants in their codebase. Invariant broken? Script exits with a non-zero exit code and some output that tells the agent how to address the problem. Scripts are deterministic, run in milliseconds, and use zero tokens. Put them in husky or pre-commit, install the git hooks, and your agent won’t be able to commit without all your scripts succeeding.
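A minimal sketch of such an invariant script, assuming a Python codebase; the legacy_db rule and the paths are purely illustrative:

```python
#!/usr/bin/env python3
# Hypothetical pre-commit check: fail the commit if any source file
# imports the deprecated legacy_db module, and tell the agent what to
# do instead. Deterministic, fast, and costs zero tokens.
import pathlib
import sys

violations = [
    str(path)
    for path in pathlib.Path("src").rglob("*.py")
    if "import legacy_db" in path.read_text(encoding="utf-8")
]

if violations:
    print("Invariant broken: do not import legacy_db directly.")
    print("Use the repository layer in src/db/repo.py instead.")
    print("Offending files:")
    print("\n".join(f"  {v}" for v in violations))
    sys.exit(1)
```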
And “Don’t change this function signature” should be enforced not by anticipating that your coding agent “might change this function signature so we better warn it not to” but rather via an end-to-end test that fails if the function signature is changed (because the other code that needs it not to change now has an error). That takes the author out of the loop: they don't have to watch for the change in order to issue said correction, and can instead sip coffee while the agent observes that it caused a test failure and then corrects it without intervention, probably by rolling back the function signature change and changing something else.
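And a sketch of pinning a signature with a test, again assuming Python; `fetch_user` and its module are invented for the example:

```python
# Hypothetical regression test: the public signature of fetch_user is
# pinned, so an agent that "helpfully" changes it gets an immediate
# test failure instead of needing a human correction.
import inspect

from myapp.users import fetch_user  # illustrative module and function

def test_fetch_user_signature_is_stable():
    params = list(inspect.signature(fetch_user).parameters)
    assert params == ["user_id", "include_deleted"]
```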
Radically different? Sounds to me like the standard spec driven approach that plenty of people use.
I prefer iterative approach. LLMs give you incredible speed to try different approaches and inform your decisions. I don’t think you can ever have a perfect spec upfront, at least that’s my experience.
Here's my workflow, hopefully concise enough as a reply, in case helpful to those very few who'll actually see it:
Research -> Define 'Domains' -> BDD -> Domain Specs -> Overall Arch Specs / complete/consistent/gap analysis -> Spec Revision -> TDD Dev.
Smaller projects this is overkill. Larger projects, imho, gain considerable value from BDD and Overall Architecture Spec complete/consistent/gap analysis...
Cheers
Lol, I wrote about this and have been using a plan+execute workflow for 8 months.
Sadly my post didn't get much attention at the time.
https://thegroundtruth.media/p/my-claude-code-workflow-and-p...
I have to give this a try. My current model for backend is the same as how author does frontend iteration. My friend does the research-plan-edit-implement loop, and there is no real difference between the quality of what I do and what he does. But I do like this just for how it serves as documentation of the thought process across AI/human, and can be added to version control. Instead of humans reviewing PRs, perhaps humans can review the research/plan document.
On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too.
Modelwise, Opus 4.6 is scary good!
> “remove this section entirely, we don’t need caching here” — rejecting a proposed approach
I wonder why you don't remove it yourself. Aren't you already editing the plan?
> Most developers type a prompt, sometimes use plan mode, fix the errors, repeat.
> ...
> never let Claude write code until you’ve reviewed and approved a written plan
I certainly always work towards an approved plan before I let it loose on changing the code. I just assumed most people did, honestly. Admittedly, sometimes there are "phases" to the implementation (because some parts can be figured out later and it's more important to get the key parts up and running first), but each phase gets a full, reviewed plan before I tell it to go.
In fact, I just finished writing a command and instruction to tell claude that, when it presents a plan for implementation, offer me another option; to write out the current (important parts of the) context and the full plan to individual (ticket specific) md files. That way, if something goes wrong with the implementation I can tell it to read those files and "start from where they left off" in the planning.
The author seems to think they've invented a special workflow...
We all tend to regress to average (same thoughts/workflows)...
Have had many users already doing the exact same workflow with: https://github.com/backnotprop/plannotator
4 times in one thread, please stop spamming this link.
The “inline comments on a plan” is one of the best features of Antigravity, and I’m surprised others haven’t started copycatting.
https://github.blog/ai-and-ml/generative-ai/spec-driven-deve...
I recently discovered GitHub speckit which separates planning/execution in stages: specify, plan, tasks, implement. Finding it aligns with the OP with the level of “focus” and “attention” this gets out of Claude Code.
Speckit is worth trying as it automates what is being described here, and with Opus 4.6 it's been a kind of BC/AD moment for me.
Interesting! I feel like I'm learning to code all over again! I've only been using Claude for a little more than a month and until now I've been figuring things out on my own. Building my methodology from scratch. This is much more advanced than what I'm doing. I've been going straight to implementation, but doing one very small and limited feature at a time, describing implementation details (data structures like this, use that API here, import this library etc) verifying it manually, and having Claude fix things I don't like. I had just started getting annoyed that it would make the same (or very similar) mistake over and over again and I would have to fix it every time. This seems like it'll solve that problem I had only just identified! Neat!
Try OpenSpec and it'll do all this for you. SpecKit works too. I don't think there's a need to reinvent the wheel on this one, as this is spec-driven development.
Haha this is surprisingly and exactly how I use claude as well. Quite fascinating that we independently discovered the same workflow.
I maintain two directories: "docs/proposals" (for the research md files) and "docs/plans" (for the planning md files). For complex research files, I typically break them down into multiple planning md files so claude can implement one at a time.
A small difference in my workflow is that I use subagents during implementation to avoid context from filling up quickly.
Same, I formalized a similar workflow for my team (oriented around feature requirement docs), I am thinking about fully productizing it and am looking to for feedback - https://acai.sh
Even if the product doesn’t resonate I think I’ve stumbled on some ideas you might find useful^
I do think spec-driven development is where this all goes. Still making up my mind though.
Spec-driven looks very much like what the author describes. He may have some tweaks of his own but they could just as well be coded into the artifacts that something like OpenSpec produces.
This is basically long-lived specs that are used as tests to check that the product still adheres to the original idea that you wanted to implement, right?
This inspired me to finally write good old playwright tests for my website :).
This is similar to what I do. I instruct an Architect mode with a set of rules related to phased implementation and detailed code artifacts output to a report.md file. After a couple of rounds of review and usually some responses that either tie together behaviors across context, critique poor choices or correct assumptions, there is a piece of work defined for a coder LLM to perform. With the new Opus 4.6 I then select specialist agents to review the report.md, prompted with detailed insight into particular areas of the software. The feedback from these specialist agent reviews is often very good and sometimes catches things I had missed. Once all of this is done, I let the agent make the changes and move onto doing something else. I typically rename and commit the report.md files which can be useful as an alternative to git diff / commit messages etc.
This looks like an important post. What makes it special is that it operationalizes Polya's classic problem-solving recipe for the age of AI-assisted coding.
1. Understand the problem (research.md)
2. Make a plan (plan.md)
3. Execute the plan
4. Look back
Yeah, OODA loop for programmers, basically. It’s a good approach.
I've been running AI coding workshops for engineers transitioning from traditional development, and the research phase is consistently the part people skip — and the part that makes or breaks everything.
The failure mode the author describes (implementations that work in isolation but break the surrounding system) is exactly what I see in workshop after workshop. Engineers prompt the LLM with "add pagination to the list endpoint" and get working code that ignores the existing query builder patterns, duplicates filtering logic, or misses the caching layer entirely.
What I tell people: the research.md isn't busywork, it's your verification that the LLM actually understands the system it's about to modify. If you can't confirm the research is accurate, you have no business trusting the plan.
One thing I'd add to the author's workflow: I've found it helpful to have the LLM explicitly list what it does NOT know or is uncertain about after the research phase. This surfaces blind spots before they become bugs buried three abstraction layers deep.
The biggest roadblock to using agents to maximum effectiveness like this is the chat interface. It's convenience as detriment and convenience as distraction. I've found myself repeatedly giving into that convenience only to realize that I have wasted an hour and need to start over because the agent is just obliviously circling the solution that I thought was fully obvious from the context I gave it. Clearly these tools are exceptional at transforming inputs into outputs and, counterintuitively, not as exceptional when the inputs are constantly interleaved with the outputs like they are in chat mode.
Sounds similar to Kiro's specs.
The separation of planning and execution resonates strongly. I've been using a similar pattern when building with AI APIs — write the spec/plan in natural language first, then let the model execute against it.
One addition that's worked well for me: keeping a persistent context file that the model reads at the start of each session. Instead of re-explaining the project every time, you maintain a living document of decisions, constraints, and current state. Turns each session into a continuation rather than a cold start.
The biggest productivity gain isn't in the code generation itself — it's in reducing the re-orientation overhead between sessions.
I’ve been using Claude through opencode, and I figured this was just how it does it. I figured everyone else did it this way as well. I guess not!
In my own tests I have found opus to be very good at writing plans, terrible at executing them. It typically ignores half of the constraints. https://x.com/xundecidability/status/2019794391338987906?s=2... https://x.com/xundecidability/status/2024210197959627048?s=2...
1. Don't implement too much at a time
2. Have the agent review if it followed the plan and relevant skills accurately.
the first link was from a simple request with fewer than 1000 tokens total in the context window, just a short shell script.
here is another one which had about 200 tokens and opus decided to change the model name i requested.
https://x.com/xundecidability/status/2005647216741105962?s=2...
opus is bad at instruction following now.
I don't deny that AI has use cases, but boy - the workflow described is boring:
"Most developers type a prompt, sometimes use plan mode, fix the errors, repeat. "
Does anyone think this is as epic as, say, watching the Unix archives https://www.youtube.com/watch?v=tc4ROCJYbm0 where Brian demos how pipes work, or Dennis working on C and UNIX? Or, even before those, the older machines?
I am not at all saying that AI tools are all useless, but there is no real epicness. It is just autogenerated AI slop and blob. I don't really call this engineering (although I also do agree that it is still engineering; I just don't like using the same word here).
> never let Claude write code until you’ve reviewed and approved a written plan.
So the junior-dev analogy is quite apt here.
I tried to read the rest of the article, but I just got angrier. I never had that feeling watching oldschool legends, though perhaps some of their work may be boring, but this AI-generated code ... that's just some mythical random-guessing work. And none of that is "intelligent", even if it may appear to work, may work to some extent too. This is a simulation of intelligence. If it works very well, why would any software engineer still be required? Supervising would only be necessary if AI produces slop.
Every "how I use Claude Code" post will get into the HN frontpage.
Which maybe has to do with people wanting to show how they use Claude Code in the comments!
I’m a big fan of having the model create a GitHub issue directly (using the GH CLI) with the exact plan it generates, instead of creating a markdown file that will eventually get deleted. It gives me a permanent record and makes it easy to reference and close the issue once the PR is ready.
Interesting approach. The separation of planning and execution is crucial, but I think there's a missing layer most people overlook: permission boundaries between the two phases.
Right now when Claude Code (or any agent) executes a plan, it typically has the same broad permissions for every step. But ideally, each execution step should only have access to the specific tools and files it needs — least privilege, applied to AI workflows.
I've been experimenting with declarative permission manifests for agent tasks. Instead of giving the agent blanket access, you define upfront what each skill can read, write, and execute. Makes the planning phase more constrained but the execution phase much safer.
Anyone else thinking about this from a security-first angle?
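For what it's worth, a rough sketch (in Python, with made-up names) of what such a declarative manifest could look like; this assumes a hand-rolled wrapper around the agent, not any built-in Claude Code or Antigravity feature:

```python
# Rough sketch of a per-step permission manifest (all names illustrative).
# A wrapper around the agent would check every tool call against the
# manifest entry for the step currently being executed.
from dataclasses import dataclass, field

@dataclass
class StepPermissions:
    read: list[str] = field(default_factory=list)   # glob patterns the step may read
    write: list[str] = field(default_factory=list)  # glob patterns the step may modify
    run: list[str] = field(default_factory=list)    # commands the step may execute

MANIFEST = {
    "write-migration": StepPermissions(
        read=["db/schema.sql", "migrations/*"],
        write=["migrations/*"],
        run=["alembic upgrade head"],
    ),
    "update-docs": StepPermissions(
        read=["docs/**/*.md", "src/**/*.py"],
        write=["docs/**/*.md"],
    ),
}
```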
Does anyone still write code? I use agents to iterate on one task in parallel, with an approach similar to this one: https://mitchellh.com/writing/my-ai-adoption-journey#today
But I'm starting to have an identity crisis: am I doing it wrong, and should I use an agent to write any line of code of the product I'm working on?
Have I become a dinosaur in the blink of an eye?
Should I just let it go and accept that the job I was used to not only changed (which is fine), but now requires just driving the output of a machine, with no creative process at all?
Honestly? Yeah.
I've been writing code for 25 years.
A year ago my org brought cursor in and I was skeptical for a specific reason: it was good at breaking CI in weird ways and I keep the CI system running for my org. Constants not mapping to file names, hallucinating function names/args, etc. It was categorically sloppy. And I was annoyed that engineers weren't catching this sloppy stuff. I thought this was going to increase velocity at the expense of quality. And it kind of did.
Fast forward a year and I haven't written code in a couple of weeks, but I've shipped thousands of LOC. I'm probably the pace setter on my team for constantly improving and experimenting with my AI flow. I speak to the computer probably half the time, maybe 75% on some days. I have multiple sessions going at all times. I review all the code Claude writes, but it's usually a one-shot based on my extensive (dictated) prompts.
But to your identity crisis point, things are weird. I haven't actually produced this much code in a long time. And when I hit some milestone there are some differences between now and the before days: I don't have the sense of accomplishment that I used to get but also I don't have the mental exhaustion that I would get from really working through a solution. And so what I find is I just keep going and stacking commit after commit. It's not a bad thing, but it's fundamentally different than before and I am struggling a bit with what it means. Also to be fair I had lost my pure love of coding itself, so I am in a slightly weird spot with this, too.
What I do know is that throwing myself fully into it has secured my job for the foreseeable future, because I'm faster than I've ever been and people look to me for guidance on how they can use these tools. I think with AI adoption the tallest trees will be cut last -- or at least I'm banking on it.
My flow is pretty similar, except I also add in these steps at the end of planning:
* Review the plan for potential issues
* Add context to the plan that would be helpful for an implementing agent
Good article, but I would rephrase the core principle slightly:
Never let Claude write code until you’ve reviewed, *fully understood* and approved a written plan.
In my experience, the beginning of chaos is the point at which you trust that Claude has understood everything correctly and claims to present the very best solution. At that point, you leave the driver's seat.
I came to the exact same pattern, with one extra heuristic at the end: spin up a new claude instance after the implementation is complete and ask it to find discrepancies between the plan and the implementation.
The baffling part of the article is all the assertions about how this is unique, novel, not the typical way people are doing this etc.
There are whole products wrapped around this common workflow already (like Augment Intent).
It strikes me that if this technology were as useful and all-encompassing as it's marketed to be, we wouldn't need four articles like this every week
How many millions of articles are there about people figuring out how to write better software?
Does something have to be trivial-to-use to be useful?
People are figuring it out. Cars are broadly useful, but there's nuance to how to maintain them, use them well in different terrains and weather, etc.
I just use Jesse’s “superpowers” plugin. It does all of this but also steps you through the design and gives you bite sized chunks and you make architecture decisions along the way. Far better than making big changes to an already established plan.
Link for those interested: https://claude.com/plugins/superpowers
I suggest reading the tests that Superpowers author has come up with for testing the skills. See the GitHub repo.
Have you tried https://github.com/pcvelz/superpowers ?
https://github.com/obra/superpowers
Gemini is better at research, Claude at coding. I try to use Gemini to do all the research and write out instructions on what to do and what process to follow, then use that in Claude. Though I am mostly creating small Python scripts.
Insights are nice for new users but I’m not seeing anything too different from how anyone experienced with Claude Code would use plan mode. You can reject plans with feedback directly in the CLI.
Google Anti-Gravity has this process built in. This is essentially a cycle a developer would follow: plan/analyse - document/discuss - break down tasks/implement. We’ve been using requirements and design documents as best practice since leaving our teenage bedroom lab for the professional world. I suppose this could be seen as our coding agents coming of age.
My process is similar, but I recently added a new "critique the plan" feedback loop that is yielding good results. Steps:
1. Spec
2. Plan
3. Read the plan & tell it to fix its bad ideas.
4. (NB) Critique the plan (loop) & write a detailed report
5. Update the plan
6. Review and check the plan
7. Implement plan
Detailed here:
https://x.com/PetrusTheron/status/2016887552163119225
Same. In my experience, the first plan always benefits from being challenged once or twice by claude itself.
This is a similar workflow to speckit, kiro, gsd, etc.
I use amazon kiro.
The AI first works with you to write requirements, then it produces a design, then a task list.
This helps the AI work in smaller chunks; it will work on one task at a time.
I can let it run for an hour or more in this mode. Then there is lots of stuff to fix, but it is mostly correct.
Kiro also supports steering files, they are files that try to lock the AI in for common design decisions.
The price is that a lot of the context is used up by these files, and Kiro constantly pauses to reset the context.
Since the rise of AI systems I really wonder how people wrote code before. This is exactly how I planned out implementation and executed the plan. Might have been some paper notes, a ticket or a white board, buuuuut ... I don't know.
How are the annotations put into the markdown? Claude needs to be able to identify them as annotations and not parts of the plan.
> I am not seeing the performance degradation everyone talks about after 50% context window.
I pretty much agree with that. I use long sessions and stopped trying to optimize the context size, the compaction happens but the plan keeps the details and it works for me.
I have tried using this and other workflows for a long time and had never been able to get them to work (see chat history for details).
This has changed in the last week, for 3 reasons:
1. Claude opus. It’s the first model where I haven’t had to spend more time correcting things than it would’ve taken me to just do it myself. The problem is that opus chews through tokens, which led to..
2. I upgraded my Claude plan. Previously on the regular plan I’d get about 20 mins of time before running out of tokens for the session and then needing to wait a few hours to use it again. It was fine for little scripts or toy apps but not feasible for the regular dev work I do. So I upgraded to 5x. This got me 1-2 hours per session before tokens expired, which was better but still a frustration.

Wincing at the price, I upgraded again to the 20x plan and this was the next game changer. I had plenty of spare tokens per session, and at that price it felt like they were being wasted - so I ramped up my usage. Following a similar process as OP, but with a plans directory with subdirectories for backlog, active and complete plans, and skills with strict rules for planning, implementing and completing plans, I now have 5-6 projects on the go. While I’m planning a feature on one, the others are implementing. The strict plans and controls keep them on track, and I have follow-up skills for auditing quality and performance.

I still haven’t hit token limits for a session, but I’ve almost hit my token limit for the week, so I feel like I’m getting my money’s worth. In that sense spending more has forced me to figure out how to use more.
3. The final piece of the puzzle is using opencode over Claude Code. I’m not sure why, but I just don’t gel with Claude Code. Maybe it’s all the sautéing and flibertygibbering, maybe it’s all the permission asking, maybe it’s that it doesn’t show what it’s doing as much as opencode. Whatever it is, it just doesn’t work well for me. Opencode on the other hand is great. It shows what it’s doing and how it’s thinking, which makes it easy for me to spot when it’s going off track and correct early.
Having a detailed plan, and correcting and iterating on the plan, is essential. Making Claude follow the plan is also essential - but there’s a line. Too fine-grained and it’s not as creative at solving problems. Too loose/high level and it makes bad choices and goes in the wrong direction.
Is it actually making me more productive? I think it is but I’m only a week in. I’ve decided to give myself a month to see how it all works out.
I don’t intend to keep paying for the 20x plan unless I can see a path to using it to earn me at least as much back.
Just don’t use Claude Code. I can use the Codex CLI with just my $20 subscription and never come close to any usage limits
What if it's just slower so that your daily work fits within the paid tier they want?
It isn’t slower. I use my personal ChatGPT subscriptions with Codex for almost everything at work and use my $800/month company Claude allowance only for the tricky stuff that Codex can’t figure out. It’s never application code. It’s usually some combination of app code + Docker + AWS issue with my underlying infrastructure - created with whatever IAC that I’m using for a client - Terraform/CloudFormation or the CDK.
I burned through $10 on Claude in less than an hour. I only have $36 a day at $800 a month (800/22 working days)
> and use my $800/month company Claude allowance only for the tricky stuff that Codex can’t figure out.
It doesn’t seem controversial that the model that can solve more complex problems (that you admit the cheaper model can’t solve) costs more.
For the things I use it for, I’ve not found any other model to be worth it.
You’re assuming rational behavior from a company that doesn’t care about losing billions of dollars.
Have you tried Codex with OpenAi’s latest models?
Not in the last 2 months.
The current Claude subscription is a sunk cost for the next month. Maybe I’ll try Codex if Claude doesn’t lead anywhere.
I use both. As I’m working, I tell each of them to update a common document with the conversation. I don’t just tell Claude the what. I tell it the why and have it document it.
I can switch back and forth and use the MD file as shared context.
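A minimal sketch of such a shared file (the headings and entries are invented, just one possible convention, not anything either tool requires):

  # shared-context.md -- kept up to date by both agents

  ## Decisions
  - Store audit events in their own table, not a JSON column.
    Why: we need to query by entity id and date range.

  ## Open questions
  - Soft or hard deletes? Leaning soft, pending review.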
Curious: what are some cases where it'd make sense to not pay for the 20x plan (which is $200/month), and provide a whopping $800/month pay-per-token allowance instead?
Who knows? It’s part of an enterprise plan. I work for a consulting company. There are a number of fallbacks, the first fallback if we are working on an internal project is just to use our internal AWS account and use Claude code with the Anthropic hosted on Bedrock.
https://code.claude.com/docs/en/amazon-bedrock
The second fallback if it is for a customer project is to use their AWS account for development for them.
At the rate my company charges for me (as a US-based staff consultant I have the highest bill rate at the company), they are happy to let us use Claude Code with their AWS credentials. Besides, if we are using AWS Bedrock-hosted Anthropic models, they know none of their secrets are going to Anthropic. They already have the required legal confidentiality/compliance agreements with AWS.
I agree with most of this, though I'm not sure it's radically different. I think most people who've been using CC in earnest for a while probably have a similar workflow? Prior to Claude 4 it was pretty much mandatory to define requirements and track implementation manually to manage context. It's still good, but since 4.5 release, it feels less important. CC basically works like this by default now, so unless you value the spec docs (still a good reference for Claude, but need to be maintained), you don't have to think too hard about it anymore.
The important thing is to have a conversation with Claude during the planning phase and don't just say "add this feature" and take what you get. Have a back and forth, ask questions about common patterns, best practices, performance implications, security requirements, project alignment, etc. This is a learning opportunity for you and Claude. When you think you're done, request a final review to analyze for gaps or areas of improvement. Claude will always find something, but starts to get into the weeds after a couple passes.
If you're greenfield and you have preferences about structure and style, you need to be explicit about that. Once the scaffolding is there, modern Claude will typically follow whatever examples it finds in the existing code base.
I'm not sure I agree with the "implement it all without stopping" approach and letting auto-compact do its thing. I still see Claude get lazy when nearing compaction, though it has gotten drastically better over the last year. Even so, I still think it's better to work in a tight loop on each stage of the implementation and preemptively compact or restart for the highest quality.
Not sure that the language is that important anymore either. Claude will explore the existing codebase on its own at an unknown resolution, but if you say "read the file" it works pretty well these days.
My suggestions to enhance this workflow:
- If you use a numbered phase/stage/task approach with checkboxes, it makes it easy to stop/resume as needed and discuss particular sections. Each phase should be working/testable software (a sketch of this format follows the list).
- Define a clear numbered list workflow in CLAUDE.md that loops on each task (run checks, fix issues, provide summary, etc).
- Use hooks to ensure the loop is followed.
- Update spec docs at the end of the cycle if you're keeping them. It's not uncommon for there to be some divergence during implementation and testing.
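A minimal sketch of the phased, checkbox-style plan file mentioned in the first suggestion (feature and task names are purely illustrative):

  # Plan: audit logging

  ## Phase 1: schema and write path (working software at the end)
  - [x] 1.1 Add audit_events table and migration
  - [x] 1.2 AuditLogger service with unit tests
  - [ ] 1.3 Hook the logger into the entity update path

  ## Phase 2: read path
  - [ ] 2.1 Endpoint to list events for an entity
  - [ ] 2.2 UI view with per-field deltas

Checked boxes make it trivial to stop and resume in a fresh session, and each phase heading doubles as a natural point to discuss or re-plan.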
There are a few prompt frameworks that essentially codify these types of workflows by adding skills and prompts
https://github.com/obra/superpowers https://github.com/jlevy/tbd
this is literally reinventing claude's planning mode, but with more steps. I think Boris doesn't realize that planning mode is actually stored in a file.
https://x.com/boristane/status/2021628652136673282
Doesn’t Claude code do this by switching between edit mode and plan mode?
FWIW I have had significant improvements by clearing context then implementing the plan. Seems like it stops Claude getting hung up on something.
All sounds like a bespoke way of remaking https://github.com/Fission-AI/OpenSpec
It seems like the annotation of plan files is the key step.
Claude Code now creates persistent markdown plan files in ~/.claude/plans/ and you can open them with Ctrl-G to annotate them in your default editor.
So plan mode is not ephemeral any more.
I don't really get what is different about this from how almost everyone else uses Claude Code? This is an incredibly common, if not the most common way of using it (and many other tools).
Funny how I came up with something loosely similar. Asking Codex to write a detailed plan in a markdown document, reviewing it, and asking it to implement it step by step. It works exquisitely well when it can build and test itself.
I do the same. I also cross-ask gemini and claude about the plan during iterations, sometimes make several separate plans.
Hub and spoke documentation in planning has been absolutely essential for the way my planning was before, and it's pretty cool seeing it work so well for planning mode to build scaffolds and routing.
this is exactly how I work with cursor
except that I put notes on the plan document in a single message like:
Otherwise, I'm not sure how to guarantee that the AI won't confuse my notes with its own plan. One new thing for me is to review the todo list; I was always relying on the auto-generated todo list.
The post and comments all read like: Here are my rituals to the software God. If you follow them then God gives plenty. Omit one step and the God mad. Sometimes you have to make a sacrifice but that's better for the long term.
I've been in eng for decades but never participated in forums. Is the cargo cult new?
I use Claude Code a lot. Still don't trust what's in the plan will get actually written, regardless of details. My ritual is around stronger guardrails outside of prompting. This is the new MongoDB webscale meme.
I had to stop reading about half way, it's written in that breathless linkedin/ai generated style.
It is really fun to watch how a baby makes its first steps and also how experienced professionals rediscover what standards were telling us for 80+ years.
Sounds a bit like what Claude Plan Mode or Amazon's Kiro were built for. I agree it's a useful flow, but you can also overdo it.
Is it necessary to tell Claude to re-read the code folder when you come back some days later, or should we ask Claude to just pick up from the research.md file, thus saving some tokens?
I do something broadly similar. I ask for a design doc that contains an embedded todo list, broken down into phases. Looping on the design doc asking for suggestions seems to help. I'm up to about 40 design docs so far on my current project.
The author discovered plan mode in cursor.
This all looks fine for someone who can't code, but for anyone with even a moderate amount of experience as a developer all this planning and checking and prompting and orchestrating is far more work than just writing the code yourself.
There's no winner for "least amount of code written regardless of productivity outcomes", except maybe Anthropic's bank account.
I really don't understand why there are so many comments like this.
Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
It took maybe 5-10 minutes of wall time to come up with a good plan, and then ~20-30 min for Claude to implement, test, etc.
That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature.
After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature.
All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools?
Same here. I did bounce off these tools a year ago. They just didn't work for me 60% of the time. I learned a bit in that initial experience though and walked away with some tasks ChatGPT could replace in my workflow. Mainly replacing scripts and reviewing single files or functions.
Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.
> The output is more or less what I'd be writing as a principal engineer.
I certainly hope this is not true, because then you're not competent for that role. Claude Code writes an absolutely incredible amount of unnecessary and superfluous comments, and it makes asinine mistakes like forgetting to update logic in multiple places. It'll gladly drop the entire database when changing column formats, just as an example.
I’m not sure what you're doing or if you’ve tried the tools recently but this isn’t even close to my experience.
Trust me I'm very impressed at the progress AI has made, and maybe we'll get to the point where everything is 100% correct all the time and better than any human could write. I'm skeptical we can get there with the LLM approach though.
The problem is LLMs are great at simple implementation, even large amounts of simple implementation, but I've never seen it develop something more than trivial correctly. The larger problem is it's very often subtly but hugely wrong. It makes bad architecture decisions, it breaks things in pursuit of fixing or implementing other things. You can tell it has no concept of the "right" way to implement something. It very obviously lacks the "senior developer insight".
Maybe you can resolve some of these with large amounts of planning or specs, but that's the point of my original comment - at what point is it easier/faster/better to just write the code yourself? You don't get a prize for writing the least amount of code when you're just writing specs instead.
This is exactly what the article is about. The tradeoff is that you have to thoroughly review the plans and iterate on them, which is tiring. But the LLM will write good code faster than you, if you tell it what good code is.
Exactly; the original commenter seems determined to write-off AI as "just not as good as me".
The original article is, to me, seemingly not that novel. Not because it's a trite example, but because I've begun to experience massive gains from following the same basic premise as the article. And I can't believe there's others who aren't using like this.
I iterate the plan until it's seemingly deterministic, then I strip the plan of implementation, and re-write it following a TDD approach. Then I read all specs, and generate all the code to red->green the tests.
If this commenter is too good for that, then it's that attitude that'll keep him stuck. I already feel like my projects backlog is achievable, this year.
Strongly agree about the deterministic part. Even more important than a good design, the plan must not show any doubt, whether it's in the form of open questions or weasel words. 95% of the time those vague words mean I didn't think something through, and it will do something hideous in order to make the plan work
My experience has so far been similar to the root commenter - at the stage where you need to have a long cycle with planning it's just slower than doing the writing + theory building on my own.
It's an okay mental energy saver for simpler things, but for me the self review in an actual production code context is much more draining than writing is.
I guess we're seeing the split of people for whom reviewing is easy and writing is difficult and vice versa.
Several months ago, just for fun, I asked Claude (the web site, not Claude Code) to build a web page with a little animated cannon that shoots at the mouse cursor with a ballistic trajectory. It built the page in seconds, but the aim was incorrect; it always shot too low. I told it the aim was off. It still got it wrong. I prompted it several times to try to correct it, but it never got it right. In fact, the web page started to break and Claude was introducing nasty bugs.
More recently, I tried the same experiment, again with Claude. I used the exact same prompt. This time, the aim was exactly correct. Instead of spending my time trying to correct it, I was able to ask it to add features. I've spent more time writing this comment on HN than I spent optimizing this toy. https://claude.ai/public/artifacts/d7f1c13c-2423-4f03-9fc4-8...
My point is that AI-assisted coding has improved dramatically in the past few months. I don't know whether it can reason deeply about things, but it can certainly imitate a human who reasons deeply. I've never seen any technology improve at this rate.
> but I've never seen it develop something more than trivial correctly.
What are you working on? I personally haven't seen LLMs struggle with any kind of problem in months. Legacy codebase with great complexity and performance-critical code. No issue whatsoever regardless of the size of the task.
>I've never seen it develop something more than trivial correctly.
This is 100% incorrect, but the real issue is that the people who are using these llms for non-trivial work tend to be extremely secretive about it.
For example, I view my use of LLMs to be a competitive advantage and I will hold on to this for as long as possible.
The key part of my comment is "correctly".
Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code?
My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well-defined functions (that probably appeared in its training data, written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes.
That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.
I find “maintainable code” the hardest bias to let go of. 15+ years of coding and design patterns are hard to let go.
But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
Specs are worth it IMO. Not because if I can spec, I could’ve coded anyway. But because I gain all the insight and capabilities of AI, while minimizing the gotchas and edge failures.
> But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
How do you square that with the idea that all the code still has to be reviewed by humans? Yourself, and your coworkers
I picture it like semiconductors: the 5nm process is so absurdly complex that operators can't just peek into the system easily. I imagine I'm just so used to hand-crafting code that I can't imagine not being able to peek in.
So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better.
Then who is responsible when (not if) that code does horrible things? We have humans to blame right now. I just don’t see it happening personally because liability and responsibility are too important
For some software, sure but not most.
And you don’t blame humans anyways lol. Everywhere I’ve worked has had “blameless” postmortems. You don’t remove human review unless you have reasonable alternatives like high test coverage and other automated reviews.
We still have performance reviews and are fired. There’s a human that is responsible.
“It’s AI all the way down” is either nonsense on its face, or the industry is dead already.
> But the aha moment for me was what’s maintainable by AI vs by me by hand are on different realms
I don't find that LLMs are any more likely than humans to remember to update all of the places it wrote redundant functions. Generally far less likely, actually. So forgive me for treating this claim with a massive grain of salt.
Yes to all of these.
Here's the rub, I can spin up multiple agents in separate shells. One is prompted to build out <feature>, following the pattern the author/OP described. Another is prompted to review the plan/changes and keep an eye out for specific things (code smells, non-scalable architecture, duplicated code, etc. etc.). And then another agent is going to get fed that review and do their own analysis. Pass that back to the original agent once it finishes.
Less time, cleaner code, and the REALLY awesome thing is that I can do this across multiple features at the same time, even across different codebases or applications.
To answer all of your questions:
yes, if I steer it properly.
It's very good at spotting design patterns, and implementing them. It doesn't always know where or how to implement them, but that's my job.
The specs and syntactic sugar are just nice quality of life benefits.
You’d be building blocks which compound over time. That’s been my experience anyway.
The compounding is much greater than my brain can do on its own.
There's comments like this because devs/"engineers" in tech are elitists that think they're special. They can't accept that a machine can do a part of their job that they thought made them special.
> In all, maybe 30-45 minutes of my actual time to add a small feature
Why would this take you multiple days to do if it only took you 30m to review the code? It depends on the problem, but if I'm able to review something, the time it'd take me to write it is usually at most 2x more in the worst case - often it's about equal.
I say this because after having used these tools, most of the speed ups you’re describing come at the cost of me not actually understanding or thoroughly reviewing the code. And this is corroborated by any high output LLM users - you have to trust the agent if you want to go fast.
Which is fine in some cases! But for those of us who have jobs where we are personally responsible for the code, we can’t take these shortcuts.
> Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it.
But did you truly think through such a feature? Like the guarantees it should provide (for example, how it should cope with entity migrations such as adding a new field), or the cost of maintaining it further down the line? This looks suspiciously like a drive-by PR made on an open-source project.
> That would've taken me at least a day, maybe two.
I think those two days would have been filled with research, comparing alternatives, questions like "can we extract this feature from framework X?", discussing ownership and sharing knowledge,.. Jumping on coding was done before LLMs, but it usually hurts the long term viability of the project.
Adding code to a project can be done quite fast (hackathons, ...); ensuring quality is what slows things down in any well-functioning team.
I mean, all I can really say is... if writing some logging takes you one or two days, are you sure you _really_ know how to code?
Ever worked on a distributed system with hundreds of millions of customers and seemingly endless business requirements?
Some things are complex.
You're right, you're better than me!
You could've been curious and ask why it would take 1-2 days, and I would've happily told you.
I'll bite, because it does seem like something that should be quick in a well-architected codebase. What was the situation? Was there something in this codebase that was especially suited to AI-development? Large amounts of duplication perhaps?
It's not particularly interesting.
I wanted to add audit logging for all endpoints we call, all places we call the DB, etc. across areas I haven't touched before. It would have taken me a while to track down all of the touchpoints.
Granted, I am not 100% certain that Claude didn't miss anything. I feel fairly confident that it is correct given that I had it research upfront, had multiple agents review, and it made the correct changes in the areas that I knew.
Also I'm realizing I didn't mention it included an API + UI for viewing events w/ pretty deltas
Well, someone who says logging is easy has never faced the difficulty of deciding what to log. And an audit log is a different beast altogether from normal logging.
Audit logging is different than developer logging… companies will have entire teams dedicated to audit systems.
We're not as good at coding as you, naturally.
I'd find it deeply funny if the optimal vibe coding workflow continues to evolve to include more and more human oversight, and less and less agent autonomy, to the point where eventually someone makes a final breakthrough that they can save time by bypassing the LLM entirely and writing the code themselves. (Finally coming full circle.)
You mean there will be an invention to edit files directly instead of giving the specific code and location you want it to be written into the prompt?
Researching and planning a project is a generally useful thing. This is something I've been doing for years, and I have always had great results compared to just jumping in and coding. It makes perfect sense that this transfers to LLM use.
Well it's less mental load. It's like Tesla's FSD. Am I a better driver than the FSD? For sure. But is it nice to just sit back and let it drive for a bit even if it's suboptimal and gets me there 10% slower, and maybe slightly pisses off the guy behind me? Yes, nice enough to shell out $99/mo. Code implementation takes a toll on you in the same way that driving does.
I think the method in TFA is overall less stressful for the dev. And you can always fix it up manually in the end; AI coding vs manual coding is not either-or.
Most of these AI coding articles seem to be about greenfield development.
That said, if you're on a serious team writing professional software there is still tons of value in always telling AI to plan first, unless it's a small quick task. This post just takes it a few steps further and formalizes it.
I find Cursor works much more reliably using plan mode, reviewing/revising output in markdown, then pressing build. Which isn't a ton of overhead but often leads to lots of context switching as it definitely adds more time.
Since Opus 4.5, things have changed quite a lot. I find LLMs very useful for discussing new features or ideas, and Sonnet is great for executing your plan while you grab a coffee.
I partly agree with you. But once you have a large enough codebase, the changes take longer even just to type in, once figured out.
I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes and I make my own mental notes, and get the agent to execute on it.
Agents don't get tired, they don't start fat fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised.
Finally, this allows me to stay at a higher level and not get bogged down of "right oh did we do this simple thing again?" which wipes some of the context in my mind and gets tiring through the day.
Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'.
I'll never agree with a job that forces developers to use 'AI', I sometimes like to write everything by hand. But having this tool available is also very powerful.
I want to be clear, I'm not against any use of AI. It's hugely useful to save a couple of minutes of "write this specific function to do this specific thing that I could write and know exactly what it would look like". That's a great use, and I use it all the time! It's better autocomplete. Anything beyond that is pushing it - at the moment! We'll see, but spending all day writing specs and double-checking AI output is not more productive than just writing correct code yourself the first time, even if you're AI-autocompleting some of it.
For the last few days I've been working on a personal project that's been on ice for at least 6 years. Back when I first thought of the project and started implementing it, it took maybe a couple weeks to eke out some minimally working code.
This new version that I'm doing (from scratch with ChatGPT web) has a far more ambitious scope and is already at the "usable" point. Now I'm primarily solidifying things and increasing test coverage. And I've tested the key parts with IRL scenarios to validate that it's not just passing tests; the thing actually fulfills its intended function so far. Given the increased scope, I'm guessing it'd take me a few months to get to this point on my own, instead of under a week, and the quality wouldn't be where it is. Not saying I haven't had to wrangle with ChatGPT on a few bugs, but after a decent initial planning phase, my prompts now are primarily "Do it"s and "Continue"s. Would've likely already finished it if I wasn't copying things back and forth between browser and editor, and being forced to pause when I hit the message limit.
This is a great come-back story. I have had a similar experience with a photoshop demake of mine.
I recommend to try out Opencode with this approach, you might find it less tiring than ChatGPT web (yes it works with your ChatGPT Plus sub).
I think it comes down to "it depends". I work in a NIS2-regulated field and we're quite challenged by the fact that it means we can't give AIs any sort of real access because of the security risk. To be compliant we'd have to have the AI agent ask permission for every single thing it does, before it does it, and four-eye review it. Which is obviously never going to happen. We can discuss how badly the NIS2 four-eye requirement works in the real world another time, but considering how easy it is to break AI security, it might not be something we can actually ever use. This makes sense for some of the stuff we work on, since it could bring an entire power plant down. On the flip side, AI risks would be of little concern on a lot of our internal tools, which are basically non-regulated and unimportant enough that they can be down for a while without costing the business anything beyond annoyances.
This is where our challenges are. We've built our own chatbot where you can "build" your own agent within the librechat framework and add a "skill" to it. I say "skill" because it's older than Claude skills but does exactly the same thing. I don't completely buy the author's:
> “deeply”, “in great details”, “intricacies”, “go through everything”
bit, but you can obviously save a lot of time by writing a piece of English which tells it what sort of environment you work in. It'll know that when I write Python I use UV, Ruff and Pyrefly, and so on, as an example. I personally also have a "skill" setting that tells the AI not to compliment me, because I find that ridiculously annoying, and that certainly works. So who knows? Anyway, employees are going to want more. I've been doing some PoCs running open source models in isolation on a Raspberry Pi (we had spares because we use them in IoT projects), but it's hard to set up an isolation policy which can't be circumvented.
We'll have to figure it out though. For power-plant-critical projects we don't want to use AI. But for the web tool that allows a couple of employees to upload three Excel files from an external accountant and then generate some sort of report on them? Who cares who writes it, or even what sort of quality it's written with? The lifecycle of that tool will probably be that it never changes until the external accountant does, and then the tool dies. Not that it would necessarily have been written in worse quality without AI... I mean... Have you seen some of the stuff we've written in the past 40 years?
There is a miscommunication happening: this entire time we have all had surprisingly different ideas about what quality of work is acceptable, which seems to account for the differences of opinion on this stuff.
Surely Addy Osmani can code. Even he suggests plan first.
https://news.ycombinator.com/item?id=46489061
> planning and checking and prompting and orchestrating is far more work than just writing the code yourself.
This! Once I'm familiar with the codebase (which I strive to do very quickly), for most tickets I usually have a plan by the time I've read the description. I may have a couple of implementation questions, but I know where the info is located in the codebase. For things I only have a vague idea about, the whiteboard is where I go.
The nice thing with such a mental plan is that you can start with a rougher version (like a drawing sketch). If I'm starting a new UI screen, I can put a placeholder text like "Hello, world", then work on navigation. Once that's done, I can start to pull data, then add mapping functions to have a view model, ...
Each step is a verifiable milestone. Describing them is more mentally taxing than just writing the code (which is a flow state for me). Why? Because English is not fit to describe how a computer works (try describing a finite-state-machine-like navigation flow in natural language). My mental model is already aligned to code; writing the solution in natural language is asking me to be ambiguous and unclear on purpose.
Claude appeared to just crash in my session: https://news.ycombinator.com/item?id=47107630
this sounds... really slow. For large changes, sure, I'm investing time into planning. But such a rigid system can't possibly be as good as a flexible approach with variable amounts of planning based on complexity.
That is just spec driven development without a spec, starting with the plan step instead.
This is just Waterfall for LLMs. What happens when you explore the problem space and need to change up the plan?
Do you think this is a gotcha?
You just prompt the llm to change the plan.
You described how AntiGravity works natively.
Why don't you make Claude give feedback and iterate by itself?
So we’re back to waterfall huh
AI only improves and changes. Embrace the scientific method and make sure your “here’s how to” are based in data.
I appreciate the author taking the time to share his workflow, even though I really dislike the way this article is written. My dislike stems from sentences like this one: "I’ve been using Claude Code as my primary development tool for approx 9 months, and the workflow I’ve settled into is radically different from what most people do with AI coding tools." There is nothing radically different in the way he's using it (quite the opposite), and there are so many people who have written about their workflows (which are almost exactly the same; here's just one example [1]). Apart from that, the obvious use of AI to write or edit the article makes it further indigestible: "That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing."
[1] https://github.com/snarktank/ai-dev-tasks
There's no way I'd call what I do "radically different from what most people do" myself, under any circumstances. Yet in my last cross-team discussions at work, I realized that a whole lot of people were using AI in ways I'd consider either silly or mostly ineffective. We had a team boasting "we used Amazon Q to increase our projects' unit test coverage", and a principal engineer talking about how he uses Cursor as some form of advanced auto complete.
So when I point Claude Code at a ticket, hand it read-only access to a QA environment so it can see what the database actually looks like, chat about implementation details and then tell it to go implement the plan, running unit tests, functional tests, linters and all that, they look at me like I have three heads.
So if you ask me, explaining reasonably easy ways to get good outcomes out of Codex or Claude Code is still necessary evangelism, at least in companies that haven't spent on tools to do things like what Stripe does. There's still quite a few people out there copying and pasting from the chat window.
> We had a team boasting "we used Amazon Q to increase our projects' unit test coverage"
Well are the tests good or no? Did it help the work get done faster or more thoroughly than without?
> how he uses Cursor as some form of advanced auto complete
Is there something wrong with that? That's literally what an LLM is, why not use it directly for that purpose instead of using the wacky indirect "run autocomplete on a conversation and accompanying script of actions" thing. Not everyone wants to be an agent jockey.
I don't see what's necessarily silly or ineffective about what you described. Personally I don't find it particularly efficient to chat about and plan out a whole bunch of work with a robot for every task; often it's faster to just sketch out a design on a notepad and then go write code, maybe with advanced AI completion help to save keystrokes.
I agree that if you want the AI to do non-trivial amounts of work, you need to chat and plan out the work and establish a good context window. What I don't agree with is your implication that any other less-sophisticated use of AI is necessarily deficient.
How is this evidence of AI use?
> That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing.
That is a perfectly normal sentence, indistinguishable from one I might write myself. I am not an AI.
This is a big giveaway because ai tends to overuse this same structure to "conclude"
It’s not X it’s Y is one of the most obvious LLM writing patterns. Especially the heavily punctuated sentence structure.
> the obvious use of AI to write or edit the article makes it further indigestible: "That’s it. No magic prompts, no elaborate system instructions, no clever hacks. Just a disciplined pipeline that separates thinking from typing."
Any comment complaining about using AI deserves a downvote. First of all it reads like a witch hunt, an accusation without evidence that’s only based on some common perceptions. Secondly, whether it’s written with AI’s help or not, that particular sentence is clear, concise, and communicative. It’s much better than a lot of the human-written mumbling prevalent here on HN.
Anyone wants to guess if I’m using AI to help with this comment of mine?
my rlm-workflow skill has this encoded as a repeatable workflow.
give it a try: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow
Another approach is to spec functionality using comments and interfaces, then tell the LLM to first implement tests and finally make the tests pass. This way you also get regression safety and can inspect that it works as it should via the tests.
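A tiny Python sketch of that shape (all names are invented; the point is that the interface and the test pin the behaviour down before the implementation prompt):

  # Spec-first sketch: the Protocol + docstring act as the spec, the test is
  # written next, and the implementation is asked for last.
  from typing import Protocol

  class RateLimiter(Protocol):
      def allow(self, key: str) -> bool:
          """Return True if `key` may proceed, False once it exceeds its limit."""
          ...

  # Step 2: a test pinned to the spec, written before any implementation exists.
  def test_allows_up_to_limit() -> None:
      rl = CountingLimiter(limit=2)
      assert rl.allow("a")
      assert rl.allow("a")
      assert not rl.allow("a")

  # Step 3: the implementation the LLM iterates on until the test goes green.
  class CountingLimiter:
      def __init__(self, limit: int) -> None:
          self.limit = limit
          self.counts: dict[str, int] = {}

      def allow(self, key: str) -> bool:
          self.counts[key] = self.counts.get(key, 0) + 1
          return self.counts[key] <= self.limit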
That's great, actually, doesn't the logic apply to other services as well?
Is this not just Ralph with extra steps and the risk of context rot?
How much time are you actually saving at this point?
Tip: LLMs are very good at following conventions (this is actually what is happening when they write code). If you create a .md file with a list of entries of the following structure:

  # <identifier>
  <description block>
  <blank space>
  # <identifier>
  ...

where an <identifier> is a stable and concise sequence of tokens that identifies some "thing", and seed it with 5 entries describing abstract stuff, the LLM will latch on and reference this. I call this a PCL (Project Concept List). I just tell it:

  > consume tmp/pcl-init.md pcl.md

The pcl-init.md describes what a PCL is and pcl.md is the actual list. I have a pcl.md file for each independent component in the code (logging, http, auth, etc). This works very, very well. The LLM seems to "know" what you're talking about. You can ask questions and give instructions like "add a PCL entry about this". It will ask if it should add a PCL entry about xyz. If the description blocks have a high information-to-token ratio, it will follow that convention (which is a very good convention BTW).
However, there is a caveat. LLMs resist ambiguity about authority. So the "PCL" or whatever you want to call it, needs to be the ONE authoritative place for everything. If you have the same stuff in 3 different files, it won't work nearly as well.
Bonus Tip: I find long prompt input with example code fragments and thoughtful descriptions work best at getting an LLM to produce good output. But there will always be holes (resource leaks, vulnerabilities, concurrency flaws, etc). So then I update my original prompt input (keep it in a separate file PROMPT.txt as a scratch pad) to add context about those things maybe asking questions along the way to figure out how to fix the holes. Then I /rewind back to the prompt and re-enter the updated prompt. This feedback loop advances the conversation without expending tokens.
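To make the PCL shape concrete, a couple of hypothetical pcl.md entries (identifiers and wording invented for illustration) might read:

  # request-id-propagation
  Every inbound HTTP request gets a UUID request id in middleware. All log lines
  and downstream calls must carry it; never generate a second id mid-request.

  # log-level-policy
  INFO for state changes, DEBUG for control flow, ERROR only when a human should
  be paged. No WARN.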
This is great. My workflow is also heading in that direction, so this is a great roadmap. I've already learned that just naively telling Claude what to do and letting it work, is a recipe for disaster and wasted time.
I'm not this structured yet, but I often start with having it analyse and explain a piece of code, so I can correct it before we move on. I also often switch to an LLM that's separate from my IDE because it tends to get confused by sprawling context.
Sorry, but I don't get the hype with this post; isn't this what most people are doing? I want to see more posts on how to use Claude "smartly" without feeding it the whole codebase and polluting the context window, and more best practices on cost-efficient ways to use it. This workflow is clearly burning millions of tokens per session; for me it's a no.
I feel like if I have to do all this, I might as well write the code myself.
That's exactly what Cursor's "plan" mode does? It even creates md files, which seems to be the main "thing" the author discovered. Along with some cargo cult science?
How is this noteworthy other than to spark a discussion on hn? I mean I get it, but a little more substance would be nice.
There is not a lot of explanation of WHY this is better than doing the opposite (start coding and see how it goes), or of how this would apply to Codex models.
I do exactly the same; I even developed my own workflows with the Pi agent, which work really well. Here are the reasons:
- Claude needs a lot more steering than other models, it's too eager to do stuff and does stupid things and write terrible code without feedback.
- Claude is very good at following the plan, you can even use a much cheaper model if you have a good plan. For example I list every single file which needs edits with a short explanation.
- At the end of the plan, I have a clear picture in my head how the feature will exactly look like and I can be pretty sure the end result will be good enough (given that the model is good at following the plan).
A lot of things don't need planning at all. Simple fixes, refactoring, simple scripts, packaging, etc. Just keep it simple.
Use OpenSpec and simplify everything.
falling asleep here. when will the babysitting end
Has Claude Code become slow, laggy, imprecise, giving wrong answers for other people here?
The plan document and todo are an artifact of context size limits. I use them too because it allows using /reset and then continuing.
This is exactly how I use it.
What works extremely well for me is this: let Claude Code create the plan, then turn the plan over to Codex for review, and give the response back to Claude Code. Codex is exceptionally good at doing high-level reviews and keeping an eye on the details. It will find very subtle errors and omissions. And CC is very good at quickly converting the plan into code.
This back and forth between the two agents, with me steering the conversation, elevates Claude Code to the next level.
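If you want to script the handoff rather than copy-paste, the non-interactive modes of both CLIs can pass a plan file back and forth. The exact invocations below are from memory and may differ between versions, so treat this strictly as a sketch:

  # have Claude Code draft the plan in headless/print mode
  claude -p "Research the codebase and write an implementation plan to plan.md"

  # have Codex review the plan and write up its critique
  codex exec "Review plan.md for subtle errors and omissions; write findings to review.md"

  # hand the review back to Claude Code
  claude -p "Address every point in review.md and update plan.md accordingly"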
Honestly, I found that the best way to use these CLIs is exactly how the CLI creators have intended.
I don't know. I tried various methods, and this one doesn't work a fair amount of the time. The problem is that the plan naturally skips some important details, or assumes some library function, but is then taken as a literal instruction in the next section. And Claude can't handle ambiguity if the instruction is very detailed (e.g. if the plan asks to use a certain library, even if it is a bad fit, Claude won't know that decision is flexible). If the instruction is less detailed, I've seen Claude be willing to try multiple things, and if it keeps failing it doesn't fear reverting almost everything.
In my experience, the best scenario is that instruction and plan should be human written, and be detailed.
The author seems to think they've hit upon something revolutionary...
They've actually hit upon something that several of us have evolved to naturally.
LLMs are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
So how do you solve that? Exactly how an experienced lead or software manager does: you have them write it down before executing, explain things back to you, and ground all of their thinking in the code and documentation, avoiding assumptions about code made after only a superficial review.
When it was early ChatGPT, this meant function-level thinking and clearly described jobs. When it was Cline it meant cline rules files that forced writing architecture.md files and vibe-code.log histories, demanding grounding in research and code reading.
Maybe nine months ago, another engineer said two things to me, less than a day apart:
- "I don't understand why your clinerules file is so large. You have the LLM jumping through so many hoops and doing so much extra work. It's crazy."
- The next morning: "It's basically like a lottery. I can't get the LLM to generate what I want reliably. I just have to settle for whatever it comes up with and then try again."
These systems have to deal with minimal context, ambiguous guidance, and extreme isolation. Operate with a little empathy for the energetic interns, and they'll uncork levels of output worth fighting for. We're Software Managers now. For some of us, that's working out great.
Revolutionary or not it was very nice of the author to make time and effort to share their workflow.
For those starting out using Claude Code it gives a structured way to get things done bypassing the time/energy needed to “hit upon something that several of us have evolved to naturally”.
It's this line that I'm bristling at: "...the workflow I’ve settled into is radically different from what most people do with AI coding tools..."
Anyone who spends some time with these tools (and doesn't black out from smashing their head against their desk) is going to find substantial benefit in planning with clarity.
It was #6 in Boris's run-down: https://news.ycombinator.com/item?id=46470017
So, yes, I'm glad that people write things out and share. But I'd prefer that they not lead with "hey folks, I have news: we should *slice* our bread!"
But the author's workflow is actually very different from Boris'.
#6 is about using plan mode whereas the author says "The built-in plan mode sucks".
The author's post is much more than just "planning with clarity".
For some time now, Claude Code's plan mode has also written a file with the plan that you can edit, etc. It's located in ~/.claude/plans/ for me. Actually, there's a whole history of plans there.
I sometimes reference some of them to build context, e.g. after a few unsuccessful tries to implement something, so that Claude doesn't try the same thing again.
The author __is__ Boris ...
They are different Boris. I was using the names already used in this thread.
> The author's post is much more than just "planning with clarity".
Not much more, though.
It introduces "research", which is the central topic of LLMs since they first arrived. I mean, LLMs coined the term "hallucination", and turned grounding into a key concept.
In the past, building up context was thought to be the right way to approach LLM-assisted coding, but that concept is dead and proven to be a mistake: it's like debating the best way to force a round peg through a square hole by piling up expensive prompts to bridge the gap. Nowadays it's widely understood that it's far more effective and way cheaper to just refactor and rearchitect apps so that their structure is unsurprising and grounding issues are no longer a problem.
And planning mode. Each and every single LLM-assisted coding tool built their support for planning as the central flow and one that explicitly features iterations and manual updates of their planning step. What's novel about the blog post?
A detailed workflow that's quite different from the other posts I've seen.
> A detailed workflow that's quite different from the other posts I've seen.
Seriously? Provide context with a prompt file, prepare a plan in plan mode, and then execute the plan? You get more detailed descriptions of this if you read the introductory how-to guides of tools such as Copilot.
Making the model write a research file, then the plan and iterate on it by editing the plan file, then adding the todo list, then doing the implementation, and doing all that in a single conversation (instead of clearing contexts).
There's nothing revolutionary, but yes, it's a workflow that's quite different from other posts I've seen, and especially from Boris' thread that was mentioned which is more like a collection of tips.
I would say he’s saying “hey folks, I have news. We should slice our bread with a knife rather than the spoon that came with the bread.”
> Anyone who spends some time with these tools (and doesn't black out from smashing their head against their desk) is going to find substantial benefit in planning with clarity.
That's obvious by now, and the reason why all mainstream code assistants now offer planning mode as a central feature of their products.
It was baffling to read the blogger making claims about what "most people" do when anyone using code assistants already does this. I mean, the so-called frontier models are very expensive and time-consuming to run. It's a very natural pressure to make each run count. Why on earth would anyone presume people don't put some thought into those runs?
These kinds of flows have been documented in the wild for some time now. They started to pop up in the Cursor forums 2+ years ago, e.g.: https://github.com/johnpeterman72/CursorRIPER
Personally I have been using a similar flow for almost 3 years now, tailored for my needs. Everybody who uses AI for coding eventually gravitates towards a similar pattern because it works quite well (for all IDEs, CLIs, TUIs)
It's AI-written though; the tells are in pretty much every paragraph.
I don’t think it’s that big a red flag anymore. Most people use ai to rewrite or clean up content, so I’d think we should actually evaluate content for what it is rather than stop at “nah it’s ai written.”
>Most people use ai to rewrite or clean up content
I think your sentence should have been "people who use ai do so to mostly rewrite or clean up content", but even then I'd question the statistical truth behind that claim.
Personally, seeing something written by AI means that the person who wrote it did so just for looks and not for substance. Claiming to be a great author requires both penmanship and communication skills, and delegating one or either of them to a large language model inherently makes you less than that.
However, when the point is just the contents of the paragraph(s) and nothing more, I don't care who or what wrote it. Research results are one example: I certainly won't care about the prose or the effort put into writing the thesis, only about the results (is this about curing cancer now and forever? If so, no one cares whether it was written with AI).
With that being said, there's still no way I get anywhere close to understanding the author behind the thoughts and opinions. I believe the way someone writes hints at the way they think and act. In that sense, using LLMs to rewrite something to sound more professional than how you would actually talk in the appropriate contexts makes it hard for me to judge someone's character, professionalism, and mannerisms. It almost feels like they're trying to mask part of themselves. Perhaps they lack confidence in their ability to sound professional and convincing?
People like to hide behind AI so they can claim credit for its ideas. It's the same thing in job interviews.
> I don’t think it’s that big a red flag anymore. Most people use ai to rewrite or clean up content, so I’d think we should actually evaluate content for what it is rather than stop at “nah it’s ai written.”
Unfortunately, there are a lot of people trying to content-farm with LLMs; this means that whatever style the models default to is automatically suspected of being a slice of "dead internet" rather than some new human discovery.
I won't rule out the possibility that even LLMs, let alone other AI, can help with new discoveries, but they are definitely better at writing persuasively than at being inventive. That means I'm forced to use "looks like LLM" as a proxy for both "content farm" and "propaganda that may work on me", even though some percentage of this output won't even be LLM-written, and some percentage of what is may be both useful and novel.
I don't judge content for being AI-written; I judge it for the content itself (just like with code).
However, I do find the standard out-of-the-box style very grating. Call it the faux-chummy LinkedIn corporate workslop style.
Why don't people give the LLM a steer on style, either based on their own personal style or at least on a writer whose style they admire? That should be easier.
Because they think this is good writing. You can’t correct what you don’t have taste for. Most software engineers think that reading books means reading NYT non-fiction bestsellers.
While I agree with:
> Because they think this is good writing. You can’t correct what you don’t have taste for.
I have to disagree about:
> Most software engineers think that reading books means reading NYT non-fiction bestsellers.
There's a lot of scifi and fantasy in nerd circles, too. Douglas Adams, Terry Pratchett, Vernor Vinge, Charlie Stross, Iain M Banks, Arthur C Clarke, and so on.
But simply enjoying good writing is not enough to fully get what makes writing good. Even writing is not itself enough to get such a taste: thinking of Arthur C Clarke, I've just finished 3001, and at the end Clarke gives thanks to his editors, noting his own experience as an editor meant he held a higher regard for editors than many writers seemed to. Stross has, likewise, blogged about how writing a manuscript is only the first half of writing a book, because then you need to edit the thing.
My flow is to craft the content of the article in LLM speak, then add a few of my human-written blog posts to the context and ask it to match my writing style. Made it to #1 on HN without a single callout for "LLM speak"!
Even though I use LLMs for code, I just can't read LLM written text, I kind of hate the style, it reminds me too much of LinkedIn.
Very high chance someone that’s using Claude to write code is also using Claude to write a post from some notes. That goes beyond rewriting and cleaning up.
I use Claude Code quite a bit (one of my former interns noted that I crossed 1.8 Million lines of code submitted last year, which is... um... concerning), but I still steadfastly refuse to use AI to generate written content. There are multiple purposes for writing documents, but the most critical is the forming of coherent, comprehensible thinking. The act of putting it on paper is what crystallizes the thinking.
However, I use Claude for a few things:
1. Research buddy, having conversations about technical approaches, surveying the research landscape.
2. Document clarity and consistency evaluator. I don't take edits, but I do take notes.
3. Spelling/grammar checker. It's better at this than regular spellcheck, due to its handling of words introduced in a document (e.g., proper names) and its understanding of various writing styles (e.g., comma inside or outside of quotes, one space or two after a period?)
Every time I get into a one-hour meeting to see a messy, unclear, almost certainly heavily AI-generated document being presented to 12 people, I spend at least thirty seconds reminding the team that the 2-3 hours saved by using AI to write it have cost 11+ person-hours of others reading and discussing unclear thoughts.
I will note that some folks actually put in the time to guide AI sufficiently to write meaningfully instructive documents. The part that people miss is that the clarity of thinking, not the word count, is what is required.
ai;dr
If your "content" smells like AI, I'm going to use _my_ AI to condense the content for me. I'm not wasting my time on overly verbose AI "cleaned" content.
Write like a human, have a blog with an RSS feed and I'll most likely subscribe to it.
Well, real humans may read it, though. Personally I much prefer real humans writing real articles to all this AI-generated spam-slop. On YouTube this is especially annoying: they mix real videos with fake ones. I see this when I watch animal videos; some animal behaviour is taken from older videos, then AI fakery is added. My own policy is to never again watch anything from people who lie to the audience that way, so I've had to start filtering out such lying channels. I'd apply the same rationale to blog authors (though I'm not 100% certain this one is actually AI-generated; I just mention it as a safeguard).
The main issue with evaluating content for what it is is how extremely asymmetric that process has become.
Slop looks reasonable on the surface, yet requires orders of magnitude more effort to evaluate than to produce. It's produced once, but the evaluation has to be repeated by every single reader.
Disregarding content that smells like AI becomes an extremely tempting early filtering mechanism to separate signal from noise - the reader’s time is valuable.
> I don’t think it’s that big a red flag anymore.
It is to me, because it indicates the author didn't care about the topic. The only thing they cared about was writing an "insightful" article about using LLMs. Hence this whole thing is basically LinkedIn resume-improvement slop.
Not worth interacting with, imo
Also, it's not insightful whatsoever. It's basically a retelling of other articles from around the time Claude Code was released to the public (March-August 2025).
If you want to write something with AI, send me your prompt. I'd rather read what you intend for it to produce rather than what it produces. If I start to believe you regularly send me AI written text, I will stop reading it. Even at work. You'll have to call me to explain what you intended to write.
And if my prompt is a 10 page wall of text that I would otherwise take the time to have the AI organize, deduplicate, summarize, and sharpen with an index, executive summary, descriptive headers, and logical sections, are you going to actually read all of that, or just whine "TL;DR"?
It's much more efficient and intentional for the writer to put the time into doing the condensing and organizing once, and review and proofread it to make sure it's what they mean, than to just lazily spam every human they want to read it with the raw prompt, so every recipient has to pay for their own AI to perform that task like a slot machine, producing random results not reviewed and approved by the author as their intended message.
Is that really how you want Hacker News discussions and your work email to be, walls of unorganized unfiltered text prompts nobody including yourself wants to take the time to read? Then step aside, hold my beer!
Or do you prefer I should call you on the phone and ramble on for hours in an unedited meandering stream of thought about what I intended to write?
Yeah, but it's not. This is a complete contrivance and you're just making shit up. The prompt is much shorter than the output and you are concealing that fact. Why?
Github repo or it didn't happen. Let's go.
I think as humans it's very hard to separate content from its form. So when the form is always the same boring, generic AI slop, it really doesn't help the content.
And maybe writing an article or keynote slides is one of the few places where we can still exercise some human creativity, especially when the core skill (programming) is already almost completely in the hands of LLMs.
>the tells are in pretty much every paragraph.
It's not just misleading — it's lazy. And honestly? That doesn't vibe with me.
[/s obviously]
So is GP.
This is clearly a standard AI exposition:
LLMs are like unreliable interns with boundless energy. They make silly mistakes, wander into annoying structural traps, and have to be unwound if left to their own devices. It's like the genie that almost pathologically misinterprets your wishes.
Then ask your own AI to rewrite it so it doesn't trigger you into posting uninteresting, thought-stopping comments that proclaim why you didn't read the article and don't contribute to the discussion.
Here's mine! https://github.com/pjlsergeant/moarcode
Agreed. The process described is much more elaborate than what I do, but quite similar. I start by discussing in great detail what I want to do, sometimes asking the same question to different LLMs. Then a todo list, then a manual review of the code, especially each function signature, checking whether the instructions have been followed and whether there are obvious refactoring opportunities (there almost always are).
The LLM does most of the coding, yet I wouldn't call it "vibe coding" at all.
"Tele coding" would be more appropriate.
I use AWS Kiro, and its spec-driven development is exactly this. I find it works really well because it makes me slow down and think about what I want it to do.
Requirements, design, task list, coding.
> LLMs are like unreliable interns with boundless energy.
This was a popular analogy years ago, but is out of date in 2026.
Specs and a plan are still a good basis; they are of equal or greater importance than the ephemeral code implementation.
I've also found that putting a bigger focus on expanding my agents.md as the project rolls on has led to fewer headaches overall and more consistency (unsurprisingly). It's the same as asking juniors to reflect on the work they've completed and to document important things that can help them in the future. "Software manager" is a good way to put this.
AGENTS.md should mostly point to real documentation and design files that humans will also read and keep up to date. It's rare that something about a project is only of interest to AI agents.
I really like your analogy of LLMs as 'unreliable interns'. The shift from being a 'coder' to a 'software manager' who enforces documentation and grounding is the only way to scale these tools. Without an architecture.md or similar grounding, the context drift eventually makes the AI-generated code a liability rather than an asset. It's about moving the complexity from the syntax to the specification.
It feels like retracing the history of software project management. The post is quite waterfall-like. Writing a lot of docs and specs upfront then implementing. Another approach is to just YOLO it (on a new branch), make it write up the lessons afterwards, then start a new, more informed attempt and throw away the first. Or any other combo.
For me, what works well is to ask it to write some code upfront to verify its assumptions against actual reality, rather than just telling it to review the sources "in detail". It gains much more from the code's real output, which clears up wrong assumptions. Do some smaller jobs, write up md files, then plan the big thing, then execute.
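As a concrete illustration of that kind of assumption-probing script (the module, function, and config names below are hypothetical placeholders, not from any particular project), the point is that the agent gets real pass/fail output to ground its plan on, instead of relying on its own reading of the source:

```python
# Tiny probe script the agent can write and run before planning.
import importlib
import json
import pathlib
import sys

checks = []

# Assumption 1: the storage layer actually exposes the function the plan relies on.
try:
    storage = importlib.import_module("myapp.storage")  # hypothetical module
    checks.append(("storage.write_blob exists", hasattr(storage, "write_blob")))
except ImportError:
    checks.append(("myapp.storage importable", False))

# Assumption 2: the on-disk config has the keys the plan assumes.
cfg_path = pathlib.Path("config.json")  # hypothetical file
if cfg_path.exists():
    cfg = json.loads(cfg_path.read_text())
    checks.append(("config has 'backend' key", "backend" in cfg))
else:
    checks.append(("config.json present", False))

for name, ok in checks:
    print(f"{'OK  ' if ok else 'FAIL'} {name}")

# Nonzero exit makes the failed assumption unmissable in the agent's transcript.
sys.exit(0 if all(ok for _, ok in checks) else 1)
```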
'The post is quite waterfall-like. Writing a lot of docs and specs upfront then implementing' - It's only waterfall if the specs cover the entire system or app. If it's broken up into sub-systems or vertical slices, then it's much more Agile or Lean.
This is exactly what I do. I assume most people avoid this approach due to cost.
Please explain what do you mean by “cost”?
It makes an endless stream of assumptions. Some of them brilliant and even instructive to a degree, but most of them are unfounded and inappropriate in my experience.
Oh no, maybe the V-Model was right all along? And right-sizing increments, with control stops after them. No wonder these matrix multiplications start to behave like humans; that is what we wanted them to do.
So basically you’re saying LLMs are helping us be better humans?
Better humans? How and where?
I've been doing the exact same thing for 2 months now. I wish I had gotten off my ass and written a blog post about it. I can't blame the author for gathering all the well deserved clout they are getting for it now.
Don’t worry. This advice has been going around for much more than 2 months, including links posted here as well as official advice from the major companies (OpenAI and Anthropic) themselves. The tools literally have had plan mode as a first class feature.
So you probably wouldn’t have any clout anyways, like all of the other blog posts.
I went through the blog. I started using Claude Code about 2 weeks ago and my approach is practically the same. It just felt logical. I think there are a bunch of us who have landed on this approach and most are just quietly seeing the benefits.
> LLMs are like unreliable interns with boundless energy
This isn't directed specifically at you but at the general community of SWEs: we need to stop anthropomorphizing a tool. Code agents are not human-capable, and scaling pattern matching will never hit that goal. That's all hype, and this is coming from someone who runs the full range of daily CC usage. I'm using CC to its fullest capability while also being a good shepherd for my prod codebases.
Pretending code agents are human-capable is fueling this kool-aid-drinking hype craze.
It's pretty clear they effectively take on the roles of various software-related personas: designer, coder, architect, auditor, etc.
Pretending otherwise is counter-productive. That ship has already sailed; it's fairly clear the best way to make use of them is to pass them input messages as if they were a person acting in the role.
> The author seems to think they've hit upon something revolutionary...
> They've actually hit upon something that several of us have evolved to naturally.
I agree, it looks like the author is talking about spec-driven development with extra time-consuming steps.
Copilot's plan mode also supports iteration out of the box, and it only acts on a drafted plan after you've manually reviewed and edited it. I don't see what the blogger was proposing that ventured outside of plan mode's happy path.
If you have a big rules file, you're headed in the right direction but still not there. Just as with humans, the key is that your architecture should make it very difficult to break the rules by accident and still compile/run with a successful exit status.
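A minimal sketch of what "the architecture enforces the rules" can look like in practice: a check that exits nonzero whenever a layering rule is violated, so neither an agent nor a junior can break it by accident and still get a green build. The package names and the rule itself are hypothetical examples.

```python
# Fail the build if any module under core/ imports from ui/ (example rule).
import pathlib
import re
import sys

violations = []
for path in pathlib.Path("core").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        # Flag both "import ui" and "from ui.something import ..." forms.
        if re.match(r"\s*(from|import)\s+ui\b", line):
            violations.append(f"{path}:{lineno}: core must not depend on ui")

for v in violations:
    print(v)

# Nonzero exit status blocks the merge, regardless of who (or what) wrote the code.
sys.exit(1 if violations else 0)
```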
My architecture is so beautifully strong that even LLMs and human juniors can’t box their way out of it.
It's alchemy all over again.
Alchemy involved a lot of do-it-yourself though. With AI it is like someone else does all the work (well, almost all the work).
It was mainly a jab at the protoscientific nature of it.
Reproducing experimental results across models and vendors is trivial and cheap nowadays.
Not if Anthropic goes further in obfuscating the output of Claude Code.
Why would you test implementation details? Test what's delivered, not how it's delivered. The thinking portion, synthesized or not, is merely implementation.
The resulting artefact is what's worth testing.
> Why would you test implementation details
Because that has never been sufficient, for reasons ranging from hard-to-test cases to readability and long-term maintenance. Reading and understanding the code is more efficient, and it's necessary for any code worth keeping around.
if only there was another simpler way to use your knowledge to write code...
It's nice to have it written down in a concise form. I shared it with my team as some engineers have been struggling with AI, and I think this (just trying to one-shot without planning) could be why.
We're just slowly reinventing Agile for telling AI agents what to do lol
Just skip to the AI stand-ups
Wow, I never bother with using phrases like “deeply study this codebase deeply.” I consistently get pretty fantastic results.
I have a different approach where I have Claude write coding prompts for each stage, then I give those prompts to another agent. I wonder if I should write it up as a blog post.
Another pattern is:
1. First vibecode software to figure out what you want
2. Then throw it out and engineer it
It’s worrying to me that nobody really knows how LLMs work. We create prompts with or without certain words and hope it works. That’s my perspective anyway
It's actually no different from how real software is made. Requirements come from the business side, and through an odd game of telephone get down to developers.
The team that has developers closest to the customer usually makes the better product...or has the better product/market fit.
Then it's iteration.
It's the same as dealing with a human. You convey a spec for a problem and the language you use matters. You can convey the problem in (from your perspective) a clear way and you will get mixed results nonetheless. You will have to continue to refine the solution with them.
Genuinely: no one really knows how humans work either.
Add another agent review: I ask Claude to send the plan to Codex for review and fix the critical and high issues, with complexity gating (no overcomplicated logic), run that in a loop, then send it to a Gemini reviewer, then maybe do a final pass with Claude. Once all critical and high issues pass, the sequence is done.
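Roughly, that loop could be scripted like the sketch below. The reviewer commands are deliberate placeholders (the exact non-interactive invocation differs per tool), and the "DONE" convention is just an assumption for illustration, not something these tools actually emit.

```python
import pathlib
import subprocess

PLAN = pathlib.Path("plan.md")

# Placeholders: substitute whatever non-interactive command your Codex and
# Gemini setups actually use; these are not real CLI invocations.
REVIEWERS = {
    "codex": ["codex-review-command"],
    "gemini": ["gemini-review-command"],
}

REVIEW_PROMPT = (
    "Review the attached plan. List only critical and high severity issues, "
    "and flag any overcomplicated logic. Reply DONE if none remain.\n\n"
)

def review(tool: str) -> str:
    """Send the current plan to one reviewer and return its raw response."""
    result = subprocess.run(
        REVIEWERS[tool],
        input=REVIEW_PROMPT + PLAN.read_text(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

for tool in ("codex", "gemini"):
    response = review(tool)
    if "DONE" in response:
        print(f"{tool}: no critical or high issues remaining")
    else:
        # In the workflow above, these issues go back to the authoring agent
        # (Claude) to fix, and the review is rerun until the reviewer passes.
        print(f"{tool} raised issues; feed them back to the authoring agent:")
        print(response)
```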
Kiro's spec-based development looks identical.
https://kiro.dev/docs/specs/
It looks verbose, but it defines requirements based on your input; when you approve them, it defines a design; and when you approve that, it defines an implementation plan (a series of tasks).
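The gated shape of that flow is easy to see if you model it as data. This isn't Kiro's actual schema, just an illustrative sketch of requirements → design → tasks, where each artifact needs explicit approval before the next one is produced:

```python
from dataclasses import dataclass, field

@dataclass
class SpecPhase:
    name: str          # "requirements", "design", or "tasks"
    content: str = ""  # whatever the agent produced for this phase
    approved: bool = False

@dataclass
class Spec:
    phases: list[SpecPhase] = field(default_factory=lambda: [
        SpecPhase("requirements"),
        SpecPhase("design"),
        SpecPhase("tasks"),
    ])

    def next_phase(self) -> SpecPhase | None:
        """Return the first unapproved phase; later phases are not generated
        until every earlier one has been explicitly approved."""
        for phase in self.phases:
            if not phase.approved:
                return phase
        return None  # everything approved: implementation can begin
```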
This got upvotes? Literally just restating basics.
I don't see how this is 'radically different' given that Claude Code literally has a planning mode.
This is my workflow as well, with the big caveat that 80% of 'work' doesn't require substantive planning; we're making relatively straightforward changes.
Edit: there is nothing fundamentally different about 'annotating offline' in an MD vs in the CLI and iterating until the plan is clear. It's a UI choice.
Spec-driven coding with AI is very well established, so working from a plan or a spec (they can be somewhat different) is not novel.
This is conventional CC use.
Last I checked, you can't annotate inline with planning mode. You have to type a lot to explain precisely what needs to change, and then it re-presents you with a plan (which may or may not have changed something else).
I like the idea of having an actual document, because you can compare the before and after versions if you want to confirm things changed as intended when you gave feedback.
'Giving precise feedback on a plan' is literally annotating the plan.
It comes back to you with an update for verification.
You ask it to 'write the plan' as a matter of good practice.
What the author is describing is conventional usage of claude code.
A plan is just a file you can edit and then tell CC to check your annotations
One big thing for me has been the ability to iterate over plans, with a better visual of them as well as the ability to annotate feedback directly on the plan.
Plannotator (https://github.com/backnotprop/plannotator) does this really effectively, and natively, through hooks.
Wow, I've been needing this! The one issue I've had with terminals is reviewing plans and wanting the ability to provide feedback on specific plan sections in a more organized way.
Really nice UI, based on the demo.