I've hit this point with AI where it's not a simple process, but a long, drawn-out back and forth.
I'll use AI to design the implementation of a medium-sized, cross-cutting feature. Review all the details, maybe iterate on just that. Then implement with Claude 4.7 Max - which runs slower, but does a better job. Then review the implementation, then have Codex GPT 5.5 xhigh fast review it - which almost always finds corner cases. Have Claude fix those - Claude is better at writing intuitive, maintainable code versus Codex's overengineered/shortcut-filled code. (Codex is better at finding/fixing bugs and doing reviews - it's annoyingly pedantic.)
Then repeat with fresh Claude/Codex instances, having them both review the current staged changes, getting feedback, and handling the feedback. Then covering it in tests. I mean, overall I still implement the feature faster than coding it manually, but I spend a majority of the time going back and forth with reviews and handling corner cases, and at the finish I end up with what I feel is a really solid implementation of whatever feature I'm working on. The v1 feature feels more like a v3 given the amount of iteration it already went through.
Talking the problem to death with the AI before implementation is a nice zone for me. I feel productive, get good results out of the AI, and still largely understand the code. That’s the part of the AI revolution that I feel has made me a better engineer because I argue about design and architecture all day with a robot.
I follow the same process. I have a design in mind for the problem at hand, but I don't reveal it to Codex. I go back and forth a bit to see if its proposals are better than mine. I go back and forth on tradeoffs of various approaches. And then I ask it to compare its proposals with mine. I "win" most of the time but there are many times where it shows me a better or simpler approach, or makes me rethink the solution altogether.
Once this is done, the mechanical coding parts are mostly routine (for Codex).
I really like this pattern and use it often, this 'not showing my cards'. The second I hint towards the LLM what I prefer it will become sycophantic and invent nonsense why my preferred solution is better.
I'm sure there's an interesting study on how users 'leak' their preference unintentionally to the LLM; perhaps when users list their options, they often put their preferred option first. But not showing the cards in my hand has been very useful when thinking through a problem with LLMs.
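One cheap guard against leaking a preference through option order is to shuffle the options and give them neutral labels before pasting them into the chat. A minimal sketch, assuming a hypothetical helper named build_neutral_prompt (my own name, not from any library) and a made-up closing instruction:

```python
import random


def build_neutral_prompt(problem, options, seed=None):
    """Return a prompt listing design options in random order under neutral labels.

    Hypothetical helper for illustration: neither the order nor the wording
    hints at which option the author prefers. Supports up to 26 options.
    """
    rng = random.Random(seed)
    shuffled = list(options)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)

    lines = [f"Problem: {problem}", "", "Candidate approaches:"]
    for index, option in enumerate(shuffled):
        label = chr(ord("A") + index)  # A, B, C, ...
        lines.append(f"{label}. {option}")
    lines.append("")
    lines.append("Compare the tradeoffs of each approach before recommending one.")
    return "\n".join(lines)
```

Passing a seed makes the order reproducible, so the same shuffled prompt can be given to a second model for comparison.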
I have noticed this as well, but I think it's somewhat a good thing. I know what I want for my application more than Claude does for example, especially when it comes to what's in production.
An example from earlier: Claude strongly suggested a migration that would run a full vacuum (VACUUM FULL) on Postgres. However, in production this would lock tables, which would grind the application to a halt. After I informed Claude that there were millions of rows in production, it accepted that and helped me get to the right thing.
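For context, VACUUM FULL rewrites the whole table and holds an ACCESS EXCLUSIVE lock while it does so, whereas a plain VACUUM does not block normal reads and writes. A rough sketch of a pre-merge check that flags it in migration files, assuming a hypothetical function name and a simple line-by-line scan (it will miss statements split across lines and will also match text inside SQL comments):

```python
import re

# Matches "VACUUM FULL ..." and the parenthesized form "VACUUM (FULL, ...) ...".
# Illustrative pattern only; not a complete SQL parser.
VACUUM_FULL = re.compile(
    r"\bVACUUM\s*(?:\([^)]*\bFULL\b[^)]*\)|FULL\b)",
    re.IGNORECASE,
)


def find_vacuum_full(sql):
    """Return the 1-based line numbers that look like a VACUUM FULL statement."""
    hits = []
    for number, line in enumerate(sql.splitlines(), start=1):
        if VACUUM_FULL.search(line):
            hits.append(number)
    return hits
```

Something like this could run in CI over the migrations directory and fail the build until a human confirms the lock is acceptable for the production table sizes.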
Another example: I'm developing a TOTP authentication app because I'm dissatisfied with all those that I've tried. I want something strictly local, and with a very easy use case when you have dozens or even a hundred or more accounts on there, that is also efficient when left open for long periods of time. Claude strongly suggested that we force users to encrypt their vault with a passphrase all the time. However, this makes the CLI extremely painful to use if you are using a strong passphrase. I told Claude about the user experience impacts and that I wanted to allow users to optionally use a vault with no passphrase encryption, and it accepted that and suggested as a middle ground that we have a checkbox for the user to explicitly acknowledge that they're creating an unencrypted vault on disk. This is the right thing IMHO.
Interesting thing about sycophancy is it’s asymmetric. If an LLM is used to train an LLM, it may not have the same level of aggressiveness that humans do when pushing back on the trainee. Human pushback has specific patterns which we might be able to compensate for due to the asymmetry.
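For readers unfamiliar with TOTP: the code generation itself is small and needs nothing beyond the standard library, which is part of why a strictly local app is feasible. A minimal sketch following RFC 4226 (HOTP) and RFC 6238 (TOTP) with HMAC-SHA-1; the function names are my own, and a real app would also have to decode base32 secrets and handle vault storage, which this sketch leaves out:

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    """HOTP value (RFC 4226) for a raw byte secret and an integer counter."""
    # HMAC-SHA-1 over the counter encoded as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, now=None, step=30, digits=6):
    """TOTP value (RFC 6238): HOTP over the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)
```

With the RFC 6238 test secret b"12345678901234567890", now=59 and digits=8, this yields 94287082, which matches the RFC's published SHA-1 test vector.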
I don't think that "fixes" the problem, but it does seem to help. I also have found adding "please feel free to ask questions" seems to help it stop from making an assumption and spinning merrily onward for tens of thousands of tokens based on a bad idea rather than asking you something. I theorize this is because the training and refinement data overprioritize one-shot solutions, both because that's easier to evaluate at training time and improves their benchmarks. But I emphasize the italicized words because that's all gut feel and I can't prove any of it.
Tangentially related, but I’ve been using Claude to practice interviewing on system design problems, and it’s actually pretty great. But even when it likes my answers it always finds something, however small, to push on. Once it actually was completely wrong and admitted it after I got it to realize that. So maybe you have to prime it to be contrary and not agree with everything you say; putting it in the role of a tough interviewer seems to do this implicitly.
Take a look at hellointerview.com; their model is very stubborn, similar to some interviewers who refuse to acknowledge even valid solutions that differ from the canon.
Same. Alternatively (or in addition), I sometimes present my preferred idea as being a "bad/naive/stupid option" (or a suggestion from someone who can't be trusted) to see how it stands up when the sycophancy is pointed at it being bad. As expected the LLM will usually say "yeah it's bad!" and give plausible-sounding reasons for it, but if these reasons are nonsensical it's a good sign that I'm not missing anything.
LLMs are very prone to priming in my experience. That is the human psychology name for what you are describing; whether it should be applied to LLMs I don't know, but it describes the phenomenon perfectly.
It's not limited to arguing with LLMs, but if you want an honest opinion you should remember to push back even when it agrees with your hidden preference at first. Sometimes it is only being contrarian or supporting the underdog. Steelman the opposition.
I think this approach is more common than the hype for actual work. I do something similar: a lot of back and forth, then settle on something, often with now-known tradeoffs, written by hand as a final guard to spot issues, keep naming consistent, etc.
Despite the cynical sibling reply, I also feel like there's real value here. Contrary to the meme, I don't think Claude just tells me I'm brilliant, but really does push back on directions that are unproductive, helps identify when a part is overcomplicated or a dependency has become redundant, etc. Those are important things to have at least a sightline on before getting too deep into the code, even (or maybe especially) in a world where an awful lot of code can be created basically for free.
I'm usually the one spotting redundancies and dead branches in Claude's code, not the other way around. But I think either way, what's important is questioning the process and understanding the way the code is working so that you retain a full mental model.
>> and still largely understand the code [...] ,that, I feel has made me a better engineer
The cynic in me would say that a good engineer should fully understand the code they write.
I'm not suggesting that AI is the problem here - you could vibe code with the AI and have it explain the reasoning and patterns - or else tell it to use 'simpler' patterns from the outset. For any one problem in software engineering, there are always multiple solutions; some slower, some faster, some more flexible, etc. The code you produce should, imo, be at the level that you can understand it.
How can you reason about code you don't fully understand? How can you judge the future impact (technical debt and the cost of maintenance) of your projects?
AI makes it easier to get yourself into problems early on.
> How can you reason about code you don't fully understand?
We all do, though. It takes months for a human to really get to know a project and, unless you’re working at a small startup, you’ll probably never know most of the code outside the corner you work in.
One strategy I use in the planning phase: even when I know how I'd implement the solution, I ask Claude/Codex how they would solve the problem or implement the feature without giving them any clues - and then compare their solutions to my own. Often I am pleasantly surprised by alternative ways of doing things and ideas that we integrate into the final design.
Same. I've been creating "research" documents where I let it do a freeform survey of possible solutions and have it sketch out its own solution. I'll then sketch out a plan based on what I think is good or what I think it missed, and then I'll have it interrogate me for a final PRD document. It then implements the feature in reviewable chunks, and I'll give it feedback or tweak the PRD doc as needed.
Finally feel like I have a good workflow where I can fully benefit from these things without sacrificing my understanding of what they're doing.
Same here. Step 1 is usually a research doc where I simply describe the task and tell it to research the relevant parts of the codebase. This gets refined to a high-level plan, which gets distilled to a detailed step-by-step implementation plan.
When it comes to the actual implementation I prefer to work through it in small steps, where the AI explains to me exactly what it's about to do and why (and I approve) along the way. This enables me to catch it if it's about to do something I disagree with beforehand. And reduces the time I need to spend reviewing in the end.
How would you approach this problem if you are, let's say, token constrained due to per-month limits set by your company?
What I've tried to do is make the bot write detailed spec documents, slowly building it over time as I explain the full problem.
It works for the most part, but if you have some non-standard requirement, the agent seems to skip over that part of the spec document when it starts to code. Or it adds needless checks for situations that I said will never happen.
In my book, the single most effective way to spend tokens is having it review code/specs you've written. One advantage to putting the AI in that position is that unreliable competence isn't much of a problem, as you can ignore bad suggestions.
I would also recommend explaining the specs and doing a lot of your back and forth with a lower-end model, and switching to a higher-end model only once the conversation history has all the context you feel the higher-end model needs.
As the post says, after an agent implements the plan, have another agent review it. Make sure to mention it must ensure the plan is fully executed. It works wonders!
This is what I tell people (including non-programmers interested in vibe coding): the results you get are a product of... process. Formal process.
From this naturally emerges the other thing I tell people: domain expertise (or at least familiarity and/or a capacity for learning) is still determinative of the outcome.
I don't touch the code. But I do push back on expedience, laziness, inconsistency, and all the other recurring unsolved problems of generated code... and continue to play whack-a-mole in pursuit of process that whacks the moles.
I also like doing this exact thing. I really don't like using any AI-powered IDEs, but AI is still too useful, so what I do is just open up a Claude or Gemini chat, explain the project, and start talking about implementations, feature additions, and how systems should be structured. Most of the time, as long as you don't let the AI be too biased towards your answers, it'll give actually good answers that help immensely for the project.
I think this is OK though. We can still micromanage[0] the code generation part for a useful productivity boost, I think.
[0] At least, in my experience, "micromanaging" the AI is what gives me the best results. Iterating on the initial design, then iterating on the plan, then reviewing the proposed code changes (including tests), then getting an independent code review from another LLM, etc. If you give an LLM too much latitude that's when the really shitty code and ill-considered breaking changes/obliteration of existing functionality starts to creep in.
I feel like there's an overly negative vibe to this response when it just seems like rubber duck debugging - I would assume the user isn't trying to argue like how you might have to argue specs, but is merely trying to clarify their own ideas and learn possible alternatives.
What you guys don't understand is that you don't argue with people or robots to teach them. You argue to teach yourself. Until you get out of that mindset, indeed a lot of conversation will seem useless, be it people or robots.
Oh. I am aware. It is not that deep. But who you argue with still matters. There was a point where I abandoned Reddit and HN. I came back to HN because people here also seem to have grown up. Reddit stays mostly the same.
I credit the moderation here for that, I mean allowing people to grow out of the echo chamber.
It does to an extent. One thing I will give AI, because of the nature of LLMs, you are essentially arguing with the median level of the input that trained the model. So, for someone new to the subject, you get access to patterns that will bring them up to a certain level.
It's like that phase people go through where they argue with morons on Reddit, and then one day grow up and realize that most of these people are unemployed/underemployed terminally online nobodies who aren't ever going to learn anything, and even if they did it wouldn't impact the world since they were just some below-average hobbyists anyway and aren't in charge of anything more important than a box of paperclips.
Mostly with you, though in recent years I have wondered whether those people are part of what caused the latest boom of political populism. If there is no one there to debate the problematic ideas, problematic ideas will become the rhetoric after all.
That might be true on general-population social media, but the opposite is the case in niche groups, and in particular, this very industry we're in - software - was largely built on terminally online hobbyists.
I think that many AIs nowadays have a similar process incorporated in their thinking blocks; you can see there how the model discusses implementation details with itself, so such discussions happen even when no human participates in the loop.
Yeah, me too. I argue with multiple models at the same time via a markdown doc to coordinate the discussion. I feel like it makes me less anxious about the final output if nothing else.
I think this is honestly the #1 best use case for AI in development. If you use it right it can be exactly the annoying junior who questions every decision you make that you need.
Yes, exactly. Too many people ask AI to one-shot complex tasks, and wonder why it behaves like a junior asked to rush something.
I have my own skill: 5 rounds of research/planning/test-planning. Interactive with me in loop for all important decisions. Starts with high level shape, then details. Planning can take 2-3 days of my time, then the implementation agent can take many hours (Opus 4.7). It splits the implementation across many phases/commits, each with its own code-review fix loop. Deep code review at the end can take another hour or two. It opens a PR, Gemini reviews, it reads out and resolves those issues.
Projects still take days or weeks, but 5x faster than doing it all myself.
"Yes, exactly. Too many people ask AI to one-shot complex tasks, and wonder why it behaves like a junior asked to rush something."
Because this version of AI is worth 10 trillion dollars.
While the pragmatic versions from realists you can find all over this thread are ultimately probably less of a speed boost than just having your CEO/local micromanager be conveniently on vacation during critical periods when the work actually gets done.
Even fully planned it’s still no better than a junior dev. You’re leaving out how much back and forth you have the AI do on itself, which you’d have with a junior dev too. In the end does it matter if it’s giving you what you want? Guess not really. But let’s not act like it’s crazy good when you’re still doing a lot of rounds of revisions on something an experienced dev would know to do right the first time.
My personal experience with trying to front-load tons of planning and speccing out with LLMs is that at best it's a small improvement on code quality but with considerably more time spent.
As a result I've abandoned the idea of having LLMs generate code except for very small, localized and tightly scoped things. They really can't produce much more than a function or a small module without shitting the bed (last time I vibecoded was with Opus 4.6, Composer 2 and GPT-5.4). I use it almost entirely as another signal in analysis, which naturally makes it fit in better because all the other signals (reading the code, stepping through the code, writing the code myself) are already there so when the LLM points things out the information it actually renders can be taken in much more easily (and seen through more easily when it's false or irrelevant).
I think it's neat that people find fun ways to develop, but I think dressing up vibecoding in a fancy dress and layering SpecLang, sometimes in multiple steps, on top of it, is an exercise in trying to use the tool more instead of trying to use it in its most useful capacity.
I expect you'll be told to try Opus 4.7, and in short, JuSt WaiT FoR ThE NexT MoDel, BRo.
This has been my experience every time I've suggested that there are any sort of inherent ontological/conceptual or computational limits to the sophistication of LLM mimicry.
When I use AI to code this is pretty close to my workflow too, but I find it ends up taking at best just as long as if I were to write the code myself. In some cases I’ve thrown away what the AI has done and just done it myself. I think that’s just a skill people need to learn - at a certain point you have to cut your losses. I’ve seen some coworkers argue back and forth with an LLM trying to get it to do something. Especially true on simpler changes.
I've stumbled upon that too! Funnily I see it having two forms:
1. Some bad idea gets embedded into the context that you just can't argue away
2. Some important idea gets lost in compression and the AI veers off into funland without recourse.
In both cases it is often better to start over or just do it yourself. I sometimes find myself asking for a summary, editing it and then using the edited one to seed a new session.
And then Anthropic has an outage and you what...have a coffee break until then?
All that time babysitting the AIs just to be a little faster but probably with less knowledge/control over what they did?
I don’t think you’re quite getting what OP is describing. I work in a similar way… I am aware of all the code being written. If Claude had an outage I could write it myself. It would just take longer.
You say “all that time” babysitting AIs but in my experience it isn’t that much time, if anything the back and forth at the planning stages is more productive than when I’m doing it by myself because I’m being asked questions and having to think things through from different angles.
Define 'aware'. The volume of code for a feature/system that makes a more complex workflow such as this one worthwhile is definitely larger than what a human can even briefly review, and build a mental model of, within a reasonable amount of time. Reasonable meaning not considerably delaying the process. When deadlines loom and management adds pressure, this 'awareness' is the first thing that goes out the window.
Maybe it’s just me, but I’ve never understood how one understands from reading code. Yes you can understand what that code does, but not why it was done that way instead of a different way. In the end I only understand it deeply if I end up writing it. Chatting through it is helpful to me, but having AI crank out code loses all of that context pretty quickly.
I’m not disagreeing. Just curious how you think about this, and if there are key parts of your process that help you stay contexted in.
If you can't understand why the code is done in a certain way from reading it then the code is missing comments or needs to be refactored.
Even code you write yourself, given enough time, you will forget the why unless you wrote comments. In a way comments are as much for you as they are for others.
Even before AI, understanding code you didn't write is essential to working on a team of other developers. If you can't understand the code from reading it, then that's part of the feedback loop - too complex, needs comments, etc.
On large teams you'll spend as much time reading code as you do writing it. And long term when it comes to writing maintainable code - the ability for others to read and understand it, including the why of it, is paramount. Your code could literally be around for decades.
> If you can't understand why the code is done in a certain way from reading it then the code is missing comments or needs to be refactored.
Code is never missing contexts. If what your code is doing is not obvious to the reader, it is bad code that needs to be fixed. Things like cryptic low-level expressions should be extracted to helper functions with descriptive names or even extracted into a class, and classes need to comply with the single responsibility principle.
Ah the classic thinking that 'code documents itself'. It does not. Some devs are so full of themselves they think their code is so good that it is obvious what their intent was. It never is obvious, and just ends up as tech debt. Write comments.
yeah that's how a simple algorithm that would fit on a napkin gets broken up into a soup of ravioli that I have no hope to understand. I often end up refactoring it into a simple function in a branch so I can figure out wtf is going on.
> yeah that's how a simple algorithm that would fit on a napkin gets broken up into a soup of ravioli that I have no hope to understand.
No, not really. You get spaghetti code by being unable to refactor your code to follow a consistent level of detail across calls. That's the textbook definition.
Once you start to follow basic code quality and software engineering principles, you'll notice right away that your code becomes both easier to understand and to test.
Yes exactly. I don't like Codex not writing comments - and even proactively removing useful comments! There was some change in the last month that causes Claude to write crazy long comments. I routinely have to ask Claude to 'tighten' up the comments before the final commit.
I think it's just like reading a book. Will you get more context & understanding if you write the book? You most probably will. But that doesn't mean that you don't get anything just by reading it.
And if I already know the material explained by the book, then yes, I don't need to write it to understand it.
People get into being amazing at code by being interested in what it does rather than what it is. It's a whole area that I can see but can't get to, where it's all about DRY and elegance and what's being done is relatively unimportant because it's web stuff or whatever, just widgets and sadness.
As a result there's a whole universe of code where the how of it, the elegance, is the main thing, and what it's doing is putting characters on the screen a bit slower than the next thing, but there are some amazing concepts that are supposed to make it all an axiomatic synthesis of how to think about code forever, replacing all previous concepts of thinking about code.
Now AI can think about code forever while doing nothing.
If you only have one AI window open, you’re doing it wrong. You task swap to another window/agent, get it working on something, rinse and repeat. I can keep 4 busy most of the time. When I task swap I also check in on what the other agents are doing to make sure they’re on track, not blocked and not struggling.
If you have ever TL'd a team, it doesn't sound too crazy. I have 8 folks I generally talk to very consistently throughout the day. If I'm not in 1:1s with them I'm usually reviewing their changes or chatting with them over chat. I don't think I can do all of that and work with a bunch of AI windows, but I do think they could likely do something similar to me with several agents running in parallel.
I suppose it depends how hands-off the tasks are - I max out at 2 parallel sessions working on different parts and it's fairly exhausting once done. I can see the amount of parallel work increasing if there's a good dev/test loop. But at $WORK, that's not usually an option.
So, hands-off meaning "just let the AI cook and don't check it"?
Either you follow everything it does, revise the plans, do the code review, manual adjustments, etc, or you run sessions in parallel, not being that attentive and constantly context-switch (also resulting in less attention I guess).
Nap while you can. The baseline is slowly rising; AI fed with organization context will hunt you down and lay you off, as it has done at multiple companies this spring already.
> congratulations on your soon to be coming burnout.
Multitasking does not mean burnout. It just means you are not wasting time while idling. Multitasking was not invented for AI coding assistants. What do you think feature branches are used for?
The constant context changes, mental overload, inability to focus on one thing and do it well is exactly what every software developer has been fighting against for the past thirty years because it leads to shit quality and burns you out. You're automating the burnout. Idling is a necessity, not an illness.
Your feature branch is to put things aside and send them to CI, or wait and think on them. Not to have four of them running in parallel in your head frying you.
After you put together a plan, today's models can take well over a minute to execute it. Also, your work shifts to code review and executing acceptance tests, followed by either tweaking your current change or moving on to the next change.
This is really not about context changes. This is about not having to switch contexts because your focus stays on architecture+review instead of having to do deep dives to type code around.
> Your feature branch is to put things aside and send them to CI, or wait and think on them.
No, not really. Feature branches, as well as most types of branches, are for setting aside work fronts that are in progress and run in parallel.
> today's models can take well over a minute to execute it.
A full, whole, entire _minute_ ?! Sixty seconds ! Oh no, they must be optimized away, we do not deserve our free time like so, we should toil until we fall over because... Growth?
It's still context switching. Either what you're doing is surface enough that you don't give a shit, it doesn't matter and you don't review it anyways (so the only context is basically the prompt you wrote or the nth SELECT * FROM table CRUD piece of crap), or you're context switching and it's fucking you over. The context isn't about remembering how you write if err != nil, it's the expected behaviour of what you're working on.
You're not getting a promotion from doing this, you're getting burnout.
> Feature branches, as well as most types of branches, are for setting aside work fronts that are in progress and run in parallel
They're not running in parallel, unless you use work trees. They were put to the side, because you can't continue or finish the work they're about. Even just three branches in parallel in a modestly active repo that happen to be long lived drift enough that just keeping them up to date with develop makes it a waste of time.
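For anyone who hasn't used them, `git worktree add -b <branch> <path> <start-point>` checks out a new branch into its own directory, so several branches can have live working copies side by side. A small sketch that only builds the commands; the helper names, the `../worktrees` layout and the `develop` default are assumptions for illustration:

```python
def worktree_add_command(branch, path, start_point="develop"):
    """Build a `git worktree add` command that creates `branch` in its own directory."""
    return ["git", "worktree", "add", "-b", branch, path, start_point]


def plan_worktrees(branches, root="../worktrees", start_point="develop"):
    """Map each branch name to the command that would create its worktree.

    Slashes in branch names are replaced so each branch gets a flat directory.
    """
    commands = {}
    for branch in branches:
        directory = f"{root}/{branch.replace('/', '-')}"
        commands[branch] = worktree_add_command(branch, directory, start_point)
    return commands
```

Each command list can be passed to `subprocess.run(cmd, check=True)` from inside the repository. This does nothing about the drift problem described above; every worktree still has to be kept up to date with its base branch.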
The scientific study of multitasking over the past few decades has revealed important principles about the operations, and processing limitations, of our minds and brains. One critical finding to emerge is that we inflate our perceived ability to multitask: there is little correlation with our actual ability. In fact, multitasking is almost always a misnomer, as the human mind and brain lack the architecture to perform two or more tasks simultaneously. By architecture, we mean the cognitive and neural building blocks and systems that give rise to mental functioning. We have a hard time multitasking because of the ways that our building blocks of attention and executive control inherently work. To this end, when we attempt to multitask, we are usually switching between one task and another. The human brain has evolved to single task.
Fair enough, so it's a misnomer. Let's call it task switching then, since we don't actually do tasks at the same time, but switch from one to the other. A Claude Code session helpfully prints a small tldr summary of the ongoing session, so that one can quickly onboard again to the task at hand. I do not find that draining, personally.
If you honestly had any concern about losing focus and being forced to context switch, a 1 minute pause idling while waiting for something to happen would represent the root cause of your context switch problems.
As the AI is working, I am working - reviewing, regression testing, thinking about whether the current implementation is too complex and how to simplify it, etc. I totally review and understand everything the AI is generating and often push back, have it re-do something, or do it myself. In the end I feel like the quality of the work is at a v3 level in the time it took to do a v1. The productivity and quality increase is real.
Yes get a coffee. Being able to execute 5 things at once is amazing, but it's a recipe for burnout. We have to be more careful and explicit about how we spend our time, and that means more explicit time away. If this thing makes you 10x more effective (I truly believe it can), you can afford to spend 20% less time behind the desk and more time doing whatever it is that actually makes you happy. Hopefully your manager understands that calculus.
It’s a fragile equilibrium and it depends on the kind of project you’re working on. If the knowledge debt is OK then yes, it’s just like a delivery job: if the truck has an engine problem, I won’t continue delivering the packages on foot or go find and set up another truck from wherever the breakdown happened. I’ll just wait, because waiting is still faster than the alternatives; with that much knowledge debt, it would take too long to pick the work up by hand and continue.
Now if it’s my job then I can’t have a knowledge debt, and if Claude is down I’ll continue working manually because I know and understand the code and can continue without having to understand a lot of logic first.
Whenever Anthropic is down, I switch to my other alternative AI provider. If that is also unavailable, or no more tokens left, then I can switch to my local AI. Not the same in terms of quality and speed, but good enough for an experienced engineer to still be more productive than falling back to doing it by hand. For my principal activity I do not want to be dependent on a sole provider. Besides that, I expect that the pending token price increases are going to hurt a lot of people/companies.
We're already having coffee breaks when AWS and CloudFlare are down. What's another break in the mix? If anything, we might be lucky that they're down at the same time, so we can consolidate the breaks.
"All that time babysitting the AIs just to be a little faster" doesn't seem like an accurate/unbiased portrayal of what they said: "The v1 feature feels more like a v3 given the amount of iteration it already went through."
Similar approach, but I also go a step further with some basic manual architecture/high level contract/stubs setups, just to keep it consistent with other systems (and easier reading as well).
I've been doing the same thing lately and I definitely feel like stubbing out the high level architecture at the beginning makes a difference. The codebase I'm in now has very particular ways of doing things and claude doesn't always pick that up.
Style can be as important as substance.
I still do a lot of back and forth about the plan - have it written to a file. Read through the file, make changes by hand and have claude read my changes and on and on. But starting with the basic architecture there's less ambiguity.
$200/month split between Claude Code Max and Codex Pro. Given how many hours a month I spend programming, my hourly rate, the amount of time saved, and the productivity/quality boost - I would pay a whole lot more if I had to.
You are definitely going to have to. I see these massive skills as soon-to-be artefacts of the past, they will be unwieldy in the non-subsidised world. I won't pretend to know what replaces them.
We have lots of open-weight models like DeepSeek V4 Pro that are very close to SOTA and we know the cost of running them.
This helps keep the other players honest: there's a limit to how far they can raise prices when there are already alternatives today and when there's zero lock-in.
That those companies can make revenue only at the cost of burning investors' money: that's not my problem.
My take on it is simple: "Give me something MUCH better than the best open-weight models at a price that's not crazy or you're not getting my money".
And it happens to be the take of many devs.
I'm still paying Anthropic, Google and OpenAI (OpenAI because I didn't manage to cancel my subscription and now their model is competitive vs Anthropic's models again) but eyeing a "Pi + open weights" solution.
Raise the prices too much and those companies selling access to private models aren't getting my money anymore.
Check out jwillmer/ai-status at GitHub @bottlepalm. It helps keep track of all the small fixes that are going on simultaneously. I created the tool for myself since I have similar workflows.
I think you need a skill that has the agent review the code itself, but in a different role from the one that wrote it. I did some research on this and developed a skill to do that. So far it works well, though I plan to prove and improve it with more tests. Dog food is not always delicious, but it's not too bad either.
The problem is that I manually review the code before/after the review, as well as review the items to review themselves. You could easily put AI into a review infinite loop if you let it, and you also risk the code base going off the rails if you let AI go wild.
It's actually happened a few times where I need to back out entire features because AI went too far and I lost control/understanding of what the code is doing. Many people will give up at that point and let AI do everything - that is a mistake, at least right now and how you end up with unmaintainable vibe spaghetti slop.
I follow a similar approach and use multiple LLMs per task. The quality improvement is surprisingly large.
Lately I’ve been experimenting with adding an explicit reward function so the models optimize for measurable output quality.
This creates a generate, critique, revise loop where candidate answers compete for a higher score. It feels promising because it reduces the amount of handholding for every task. It is also more fun because part of the review process is embedded in the scoring function, which simplifies the review effort.
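A rough sketch of what such a generate, critique, revise loop could look like, assuming the models are hidden behind plain functions. Everything here (`refine`, the toy `generate`/`critique`/`revise`/`score` stand-ins, the `REQUIRED` checklist) is hypothetical; in practice the first three would call models and the score would be something measurable like passing tests.

```python
from typing import Callable

REQUIRED = ("tests", "docs", "errors")  # toy checklist used by the scoring function


def score(candidate: str) -> int:
    """Toy reward: how many required items the candidate mentions."""
    words = candidate.split()
    return sum(item in words for item in REQUIRED)


def generate(task: str) -> str:
    return "code"  # stand-in for a model producing a first draft


def critique(task: str, candidate: str) -> list[str]:
    """Stand-in critic: list the checklist items that are still missing."""
    return [item for item in REQUIRED if item not in candidate.split()]


def revise(task: str, candidate: str, feedback: list[str]) -> str:
    """Stand-in reviser: address only the first piece of feedback."""
    return f"{candidate} {feedback[0]}" if feedback else candidate


def refine(task: str, score_fn: Callable[[str], int],
           rounds: int = 3, n_candidates: int = 2) -> str:
    """Keep the best-scoring candidate; a revision replaces it only if it scores higher."""
    best = max((generate(task) for _ in range(n_candidates)), key=score_fn)
    for _ in range(rounds):
        challenger = revise(task, best, critique(task, best))
        if score_fn(challenger) > score_fn(best):
            best = challenger
    return best


result = refine("add upload retry", score)
print(result, score(result))
```

The fixed `rounds` cap is deliberate: the loop stops even if the critic would keep producing feedback.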
I have a very similar workflow, and experience similar temperaments from the agents. I also find anecdotally that they are moderately competitive - you get very different attention from them when you say "competitor X wrote this - please find all bugs" than when you say "you just wrote this - please find all bugs".
Hah yea I just told them I wrote it, or I reviewed it. I don't want to get the AI's in a pissing contest with each other because they will get distracted and try to show off.
maybe it's dumb question, but how do you feed the results of one agent to another? do you copy and paste manually? or how do you do it programmatically?
When I pair Claude and Codex, I use claude-co-commands [0] to drive from Claude and talk to Codex via MCP. Lately I've found Codex has been far more consistent for my specific projects, so I've just been almost entirely inside Codex. YMMV
Yea I'll take the review feedback from one, validate it, and then copy/paste it into the other session saying like, "hey I got this feedback, what do you think?" So I'm not even telling the other AI the feedback is valid, I want it to independently validate it. Often the feedback is not like a bug, but a red flag, design consideration, or trade off.
Often depending on how complex the feedback, I'll do it one at a time addressing each one individually. And after the feedback is addressed, I'll go back to the AI that generated the feedback and say like, "I handled 4/5 items you found, can you double check."
It's similar to handling PR feedback, where you do it, validate it, but then still have to submit it for peer review.
Of course it's not black and white, there are many shades of grey in how detailed the design of a feature can be. Often even if I know low level details, I'll only give the AI high level requirements because I want to see how it would do it. Often it comes up with alternative/better ways of doing what I planned and I incorporate those ideas into the final design.
My sample size is pretty small but when I've witnessed people (both PMs and engineers) "design through AI" I have seen two flavors:
- aimless AI wandering, leading to pretty, frankly, useless design docs
- using AI to "expand" upon a bullet pointed/shorthanded design doc. To which I feel like saying "the bullet points are already a good design doc!"
I understand that teams sometimes have specific formats that they have to make deliverables for, but having a nice 5 point bulletpoint list turn into 5 paragraphs... all for me to turn the 5 paragraphs back into 5 bullet points in my notes is depressing.
I do think you can get a lot of value in the mechanics, I just have had so much success leaving the thinking to me and the rote stuff to the AI. I'm going to have to think about the design eventually anyways right?
I've noticed the following really helps (most important at end):
1. Have claude form the plan and converse with a simple "Note any concerns with this plan" type plan-critic agent.
2. Let it run.
3. After (with everything in context) have it make a future_recommendations.md.
4. Have it make a plan.md to implement those future recommendations, conversing with the plan critic.
5. Clear context. Repeat with 1. Do this loop a few times, with some feedback from actual review thrown in.
But, most importantly, because Claude will aggressively try to maintain code "as is", and happily build on its previous crap, while preferring to hand roll implementations of everything, add something like this to memories/directives:
* When evaluating designs, default to "pull in the library" over "hand-roll it." Hand-rolling is much worse than a dependency.
* "Precedent" / "matches house style" / "reuses existing pattern" / "consistent with what we already do" are not valid engineering arguments.
* This project is still in the development stage with no real deployments. Mitigation costs and existing precedence are not a concern.
With these, in the last week that I've started using them (after inspecting the insane justifications for leaving crap design decisions in the plans), Claude went from junior level slop that required more oversight than it was worth to something very reasonable, using standard libraries, requiring nudges for architecture rather than pure "wtf!?".
I think they've fine-tuned heavily towards "don't rewrite the codebase", which is completely rational from multiple perspectives, but also not appropriate for new code.
I do enjoy a considerable daily token allowance, so this may not apply to everyone.
Have you tried telling claude to review with a subagent? It too almost always finds corner cases (usually nothing serious, but most of it is stuff a good coder would have thought of).
This is exactly my process as well. Although interestingly I swap Codex and Claude; having found Claude way more pedantic in its reviews and codex more pragmatic in its implementation. Maybe it differs per programming language.
Unfortunately the projects are still too big. Projects with hundreds of thousands to millions of lines of code can't be maintained by a single person reviewing all the changes. And AI only increases the speed of iteration and the amount of code to review.
We may need some sort of paradigm shift - like more powerful frameworks or even higher level languages that allow us to review less, but more functional code blocks.
> I've hit this point with AI where it's not a simple process, but a long drawn out back and forth.
In my experience, even on a relatively trivial task, you can ask an LLM at least 20 times:
Is this actually done, or only partially implemented? Did you finish x, y, z?
And the LLM will say, no, I'm not done and keep working.
After that, I'll feed the branch to a different LLM, and ask if the implementation matched the design, where it's weak and needs improvements.
Same thing - that feedback will usually only be partially finished for several rounds.
When they all agree it's done - I'll finally look at the code, and there's still typically glaringly obvious problems - duplicate systems that reinvent the wheel, etc - that will take typically more than one prompt to get right...
Getting things right takes almost ~100x as long as getting things almost right with LLMs.
You can tell an LLM to "make me Rust, but easier. Make no mistakes," and it'll plan out a 100 commit process and get something that - somehow - sort of works... but isn't even close to complete.
Still, on a cost basis, you're still able to get features that would take yourself several times longer and cost orders of magnitude more money, and - if you're doing it right - they'll probably do a better job than you would've done (at least for me).
This is where the human element is critical, because it'll infinite loop review feedback if you let it and the code will easily go off the rails into an over engineered mess. That's why I review the code before/after as well as review the actual feedback itself - and often give the feedback to a different AI to get its opinion, as the other AI doesn't have a vested interest in it and can be more critical. At some point though you do have to cut them off and ship.
You've essentially promoted yourself from coder to engineering manager, trading syntax fatigue for the mental marathon of refereeing specialized AI developers to ship v3-quality code on the first try.
Indeed. AI is bumping everyone up to manager level, and having dealt with long PR feedback cycles with humans for years - I don't mind the promotion. Also shipping a v3 is so much nicer than shipping a v1 and dealing with all the corner cases in production.
Before AI, myself and everyone else I knew was drowning in tech debt. And now with AI we are treading water.
It's bumping to manager level, except without the 1:1s, quarterly/yearly planning, headcount and budget reviews, org/reorg discussions, performance calibration, and OKR planning. No complaints about the last review cycle or about the upcoming one.
Totally!
But you know what? There are many, oh so many developers who are not ready for, don't like, and probably are not even cut out for this kind of position.
Some see it as a promotion, others (like me) as a demotion. I still prefer to do it myself, although I like code reviews done by AI, they do help to make code a bit better.
That sounds too much like three weeks of work saving you three hours of planning.
In my experience, software engineering is a matter of knowledge. Understanding it and then coming up with a solution. The latter is a flash of insight that comes mostly from experience. Then you gather more information to flesh it out, or brainstorm it with your colleagues.
What you're describing sounds more like a ritual of doing busy work than anything practical. Because tasks vary so much. A feature may be huge, but you take care of it in a day with copy pasting because you already have all the building blocks in other files. And something may be twenty lines of code, but you spent the whole week sweating on it (concurrency stuff maybe). Those ritualistic workflows sound more like someone imagining software development than actually doing it.
A lot of people say you need to go through at least three versions of something before it is mature - and v3 is not something you can design upfront. You need to see v1 both in code, and at runtime. Use it, get the feedback, and iterate. This is where AI tightens that loop immensely.
Lost you in the last paragraph - features are not "copy pasting because you already have all the building blocks" and "something may be twenty lines of code". Mid sized features often mean tearing up many layers of code across the stack to add in some sort of new capability. Tearing up existing code means there are all sorts of add-on considerations in addition to the feature you are working on.
This likely assumes you have a mature and well designed (architected) code base. That is not always the case, and as features get added and removed, that won't be the case at all until there is a refactor.
Nothing wrong at all. Some features you can bolt on, and some features fundamentally change how a system works requiring changes at many different levels of the stack. Happens all the time.
It happens in poorly factored codebases. If you find it happening that's a sign you need to refactor. If you find it happening repeatedly in the same codebase that means you failed to refactor properly the first time.
Not many industries can afford refactoring code that is not supposed to be changed - additional (unexpected) regression testing costs, risk of downtime, etc. You learn that if it works and is in production - don't touch it.
Refactoring is the natural evolution of a growing application. Refactoring too soon, too fast is what we call over engineering. Too little refactoring and your code becomes spaghetti slop. Regardless - the application will change across all layers across its lifetime.
Even with clean architecture, you only have 4 fundamental layers. And once you have v1, you’re mostly doing tweaking and copy pasting. Any huge refactoring is the business switching its main strategy.
Take an OS like OpenBSD. It has three main layers. The syscall layer, the kernel layer, and the machine dependent code. But an OS is more spread horizontally with various subsystems (process and memory, io and other device, ipc,…)
If you’ve captured your problem’s domain and adopted a pragmatic architecture, you will rarely have to change across all layers. That’s costly and happens mostly due to business reasons.
Let's see: front end presentation, front end service, frontend api, backend to front end (BFF) api/routing, BFF logic, BFF api, backend routing, backend logic, backend database, worker routing, worker logic, worker storage.
And then each of the service layers can be broken into layers themselves, depending on the complexity of the business logic. So yea, a change in a worker can potentially bubble up through all the layers.
This all sounds insane. If it requires so much back and forth with the AI why on earth wouldn't you just write the code yourself? At least then you build the mental model of the code and keep your brain healthy. Reading the comments in here about all the hoops people are having to jump through just to do the same thing they were doing a year ago without AI... and spending a fortune to do it! I think you've all got AI psychosis.
I would never imagine this is where programming would be five years ago, but at the end of day having the AI write the code is easier, faster, and results in higher quality.
The mental model is still in my head, my brain is overloaded, but only from the amount of code reviews - like I said, I'm building v3 of a feature in the time it takes to build v1, but I am in a way doing 3x the code reviews going back and forth. That's the fall out of the iteration speed enabled by AI.
Between submitting PRs, getting feedback, iterating, re-submitting, repeat - there used to be breathing room. Now it's all compressed into an afternoon. Productivity is through the roof, but it can be draining.
Semantics. In reality yes it is the v3 version equivalent in terms of maturity and iteration. I know because I've been doing this for a long time. We are getting to v3 and beyond faster than ever before.
In the new world there is no time to put out v1 quality code and it is borderline reckless given how easily things are getting hacked now. You need to be putting out heavily reviewed code that covers all the corner cases on the first release.
No, you're getting to v1 in the same or more amount of time. I know v3 sounds better, but coding and throwing it away is literally just redoing it. If you're not releasing it, it's not a new version.
There's no such thing as "v1 quality code", you just haven't finished it yet.
> If it requires so much back and forth with the AI why on earth wouldn't you just write the code yourself?
Maybe I'm too far gone down the AI rabbit hole, but that seems a really strange take to have. If you replaced 'back and forth with the AI' with 'pair programming' or 'brainstorming' this phrase would be really strange, after all these are all techniques to sharpen your ideas.
Even 'rubber ducking' is widely accepted as an effective way to go through a problem, and you can definitely use AI as a rubber duck.
For me the idea of chatting with the AI about a problem/solution is just another tool to help us work. It's not the best solution because it has a lot of downsides you should be aware while using it, but that is true for any technique including 'writing the code yourself'.
You may be right, but quite often it helps keep the focus on the forest rather than getting lost in the trees - at least for me. Boilerplate steals a lot of attention and focus, and can just be mentally exhausting.
Can someone explain these complaints about boilerplate to me? What are y’all doing where boilerplate is the majority of your code? Am I the only one mostly writing concise business logic where most lines are important in one way or another?
When I first read the comment I thought this must be satire, it sure does sound like a Silicon Valley episode, but in modern times. I've been a skeptic for quite some time, but managed to get quite good results with Claude in general, not even going through the normal limits for a Pro account, but what people are describing here seems like just tokenmaxing, brute forcing a solution, I don't understand what code people need to write and what projects people are building, is everyone just constantly rewriting systems from scratch, or what is everyone spending these insane amounts of tokens on?
I honestly don't get it, either. Most of them just flat out can't code at all, but for the ones who can, the only explanation I got is it feels like productivity.
I will say, it does help me get over procrastination lol. I get annoyed by the robot doing dumb shit and finish it myself.
I’ve found that it’s a lot like discovering a feature instead of designing it all up front. Like chiseling marble.
I’ve found it useful to write out a list of feedback / issues and have a bunch of sub agents work on them in worktrees, with a loop bringing them all back together. That way it can work for a few hours while I just review the results in bulk.
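One way that setup might be planned, as a sketch only: one branch and one git worktree per feedback item, plus the merge that brings each branch back. The helper names, branch naming scheme, and `../worktrees` location are assumptions, and the code only builds the git commands rather than running them or launching any agents.

```python
import re


def slugify(text: str, max_len: int = 40) -> str:
    """Turn an issue title into a branch-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")[:max_len]


def plan_worktrees(issues: list[str], root: str = "../worktrees") -> list[dict]:
    """Build one branch + worktree per issue, and the merge that brings it back."""
    plans = []
    for index, issue in enumerate(issues, start=1):
        branch = f"fix/{index}-{slugify(issue)}"
        path = f"{root}/issue-{index}"
        plans.append({
            "issue": issue,
            "branch": branch,
            "path": path,
            # `git worktree add -b <branch> <path>` creates the branch and checks it out at <path>.
            "create": ["git", "worktree", "add", "-b", branch, path],
            # Run from the main checkout after reviewing the sub agent's work.
            "merge": ["git", "merge", "--no-ff", branch],
        })
    return plans


for plan in plan_worktrees(["Fix null check in parser", "Add retry to uploader"]):
    print(" ".join(plan["create"]))
```

Each sub agent would then be pointed at its own `path`, so parallel edits never touch the same checkout; the merge step is where the bulk review happens.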
I do multi-task a bit while AI is running, sometimes working on another feature with AI in parallel, but jumping between reviewing different feature iterations is draining, though not much different than the real world juggling PR reviews for a team of devs.
No, I find it stimulating. With AI I'm moving faster and producing code at a higher quality than ever before.
Don't get me wrong I used to enjoy writing code by hand, but I don't think I would anymore. I don't like writing code for the sake of writing code - I like building things, I like being productive.
The funny thing is that you've just described an idealised development process as would be used by effective, skilled humans in a heterogenous team where everyone has a speciality.
If only things were so! If only code was discussed, reviewed, iterated on! If only the "manager" actually read the code, provided actionable feedback, and disseminated PRs to multiple people with diverse skill sets.
(If you can't tell, I'm a jaded consultant desperately trying to make the horse drink the water.)
I've worked in large teams for many years and yes it's just like that, but without the time constraint. PR's can only go back and forth so many times. Depending on the reviewer they may phone it in, or focus on different things depending on the person. You yourself aren't able to implement every piece of feedback due to constraints and it ends up as tech debt.
So AI definitely changes the game. I feel like we almost need something higher level for reviewers to review changes faster. Today's code is starting to feel like assembler. Too much of it, too low level. We need even higher level constructs to be able to do more in less time. I'm just not sure what that is.
The Claude/Codex loop is the current state of the art in my opinion. I've got a silly little harness that glues them together that I have spent all day, every day in for months: https://github.com/pjlsergeant/moarcode
I am not switching the different LLMs as much, but my approach is similar:
1. I write a list of things I want to have without AI support
2. I discuss the list with an LLM, which occasionally reveals obviously missing things I hadn't thought about or just things that would be smart to have. Or sometimes the LLM doesn't get it and wants to funnel me down a commonly walked path, which is a non-goal
3. From that list I draft an implementation plan containing things like how the code shall be structured, which language, libraries, build systems, etc to use. This may even contain some data models and considerations that are more detailed, like for example ideas about how a specific interaction shall be event sourced. I work on that, till I feel a satisfactory level of clarity has been reached
4. Actual writing of code as a back and forth between manual writing, letting an LLM write something and so on. LLMs suck at writing CSS that feels like good UX design to me, so usually templates, layout and CSS will be (re)written entirely by hand
5. Bug-hunting and guessing potential edge cases is one thing where LLMs really shine. Often if the work before that was quality the LLM has an okay time coming up with fixes that are no worse than what I would have done.
Heh it feels like that in a way, and the more complex the feature, the more endless the back and forth reviews can be - there seems to be always some feedback, so you need to decide when to be done with it and commit. You can easily get into review paralysis.
As a junior, I do actually enjoy going back and forth with the AI discussing different ways to implement something and exploring alternatives.
More often than not, I'd have an architectural idea that I'm not that confident in. The process of talking with the LLM takes a long time but it helps me sharpen the initial approach or even come up with a new one depending on the requirements.
In this vein, I have a system level memory for Claude to push back and give me direct feedback when possible. So far a success as it helps cut through the sycophancy.
This approach works well, and leads to better code being shipped. The key disconnect for me is not always the code being of high quality, but ensuring my understanding of the "Why" of the bug and the fix is good enough to justify it and then also explain it if the time comes.
That said, I'm learning to let go as much as I can and trust these things when it's "safe" and seeing how that shakes out. The risk is something falls over and I don't know how to fix it (of course) but I know it's a risk and I'm trying to avoid it so it probably won't be as bad as I catastrophize.
This article doesn't address writing code with AI, just code review. My issue with agentic coding is that I make numerous micro-architectural decisions while programming. I almost never have a full spec up front and develop one as I consider what I am writing.
When using Claude Code or Codex, that is all gone. Claude Code is extremely eager to reach the end goal to the point that it feels like a fever dream to write code with it. In the end, I have low confidence about edge cases and fit into the project's architectural and design goals.
On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away. I'm trying really hard to find a workflow with them that I'm confident in, but I fear that workflow is just chat, search, and being a rubber duck for my thoughts.
> If the only impact of LLMs professionally was causing people to "think out loud" in a way which was routinely captured by computer systems and then could be operated on by computer systems, that would by itself be one of the most consequential changes in practice in 100 years
> This article doesn't address writing code with AI, just code review. My issue with agentic coding is that I make numerous micro-architectural decisions while programming. I almost never have a full spec up front and develop one as I consider what I am writing.
Working with AI forced me to write better specs, but the way I write today is very different. I typically open Codex with the Linear MCP connected, so my chat with the AI ends up writing the issue. It's a lot of back-and-forth where I say what I want, the AI does all the code scanning and writes something, I correct something, etc.
The value for me is exactly that I say what I want and the AI verifies in the actual code whether that's the path that makes the most sense. In the end I have a pretty detailed spec that I'm much more confident is the correct path.
I find the spec easier to review than a huge PR, so the execution is typically much faster and more aligned with what I want.
> On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away.
Same, I prefer asking one or multiple very technical questions to Gemini, analyze, compare and understand the responses then implement it myself based on what I learned (or just integrate it to the codebase as it is, if I asked it to write a function) than delegating away all the fun to an agent.
I find using the LLM to generate different git repo skeletons for the same class of project using the 4-5 different programming languages I’m familiar with is really interesting and helpful. Then I ask it to explicitly describe its design decisions for different parts of the small codebase, i.e. what do the internal APIs look like, so that if you make changes in one section of the codebase, you can be sure you don’t accidentally generate problems in another section of the codebase. Only once you’ve worked out all such constraints, clarified dependencies, etc. do you start generating code in each subsection and that’s done using the specific constraints for that section in each prompt, and reviewing all the code. This is also when you generate the tests for each subsection. Finally this is where using a different LLM(s) for code review after the code is written becomes important. It’s a slow process certainly but it seems to work pretty well.
A lot of programming work is well represented in the training data. For that kind of stuff there’s not much to do regarding architectural decisions. I love to run the LLMs on auto for that work. But for anything not well represented in the training data, which could be anything from mundane stuff in PyQT or a truly novel application, keep them on a short leash or forget them altogether.
This isn’t a binary is/isn’t thing though. What if only 80% of my task is? How would I know that the other part isn’t, if I haven’t worked it through fully?
What if my task is generally represented, but for my specific context, there are specific details that aren’t?
How would I know until I’ve reasoned through it myself? At that point having the LLM do the work doesn’t add much value
I find myself spending on average more time in LLM review/resolution loops than it would take for me to write the code by hand. Partially because once I'm in the flow I write very very quickly and the code pours out sometimes faster than I can type. But also because the LLM code on the first few tries is generally really really bad. What I find interesting though is that spending the time to personally review and direct the LLM through several iterations of review and revision on average results in higher quality code written in about the same time as I would have written it. This might be particular to me, but seeing several iterations of someone else's code helps me better understand holistically my objective as opposed to whatever happens to come out of my flow-state consciousness.
This sounds like a subjective assessment. I counter with the opinion that most LLMs write technically correct, but bad code. When I read it, it makes me want to gag or poke my eyes out. I spend a lot of time wondering about what kind of person would write it like that, then I realize it’s an LLM
The tool is important, but so is the way you use it. I've seen small LLMs produce good code and frontier LLMs produce poor quality code. It depends on context.
What? I think this is either over exaggerating model capabilities or you haven't seen much good code from humans?
My experience is that my colleagues who have bought into model-first development have regressed in the quality of the PRs they send out. LLMs are not better coders, in my experience. They lack holistic understanding and often need course correction for that reason.
At least in medium to highly complicated systems.
Over my time in the industry I've become increasingly convinced most people haven't seen what good human programmers are like. Otherwise we wouldn't have the popularity of things like Scrum, Clean Code (the book, not the concept), etc.
I was lucky enough to see some good teams when I was a student (both at Berkeley itself and by interning at Jane Street), and it totally changed my intuition for what good programming is like. It's gotten to the point where I'm convinced there are two incommensurable paradigms in programming, and we're constantly talking past each other.
Like, if you have an ongoing project where the codebase has grown over time, do you expect it to get easier to do things or harder? I've worked on projects where it's obvious that things are always getting harder (old code is hard to change, you have to deal with lots of complexity and edge cases and workarounds). I've also worked in codebases where things got easier over time: you get better abstractions, more libraries, more capabilities. That can be a lot of fun; you think of a new thing to try, and you have the pieces to just do it.
Or another point of comparison: do people think that writing good code slows you down (so it only makes sense to avoid bugs), or do people think that writing good code lets you move faster? I've talked to people for whom one or the other is totally and obviously true. (I'm solidly in the second camp myself.)
But the surprising thing was how "obvious" the dynamic was in both cases, even though the two cases are exact opposites of each other! If you ask one group or the other they'd just tell you that, well, that's simply how programming works. Of course things get (easier|harder) over time. That's built into people's fundamental understanding of what programming is and how to do it. And that's exactly what I mean by incommensurable paradigms.
Anyway, this is a bit of a tangent from the main discussion, but it's something I've been thinking about a bunch over the last few years, partly inspired by the advent of AI-powered programming, but largely thanks to experiencing some very different projects and teams...
That's an interesting point, and maybe that also explains the difference between people who believe that AI is making them more productive and people who believe it isn't; if you never think about the architecture, then it becomes more slop over time, and it becomes harder to do anything. If you do think about architecture, development becomes easier and faster all the time. AI just accelerates both processes.
I’ve seen both sides of the fence and if there’s one quality that seems to define that fence, it’s caring about the process. Both sides want to achieve the same goal, but one cares about the process (enough to make it less tedious) and the other doesn’t (whatever seems to work is OK).
That’s why they say the best programmers are lazy. Not in the sense of avoiding any kind of work, but avoiding the kind of senseless stuff that’s sure to come down the line if you’ve not taken care of the process.
Fair enough; I'm talking about relatively "small" snippets where, with reasoning algorithms, you can quickly get a better result than a mediocre or even senior developer would give you in an hour.
Managing a complete codebase, making architectural decisions, designing business logic; that is not something you should let your agent do.
“A lot of people seem convinced that the point of AI coding is to write low-quality code as fast as possible.”
A lot of people think a lot of things, but I don’t think the majority of people think the point of using LLMs is so they can produce low-quality code. Do they produce low-quality code sometimes or often? Of course. But they also produce high-quality code very often. And sometimes they just do a “fine” job.
One of the promises - and there are plenty of cases where it’s met and where it falls drastically short - is that agentic coding tools can help us produce code faster that is just as good as or better than what a human can write. One of the other big ideal payoffs is that agentic coding can allow non-programmers to create things that previously required programmers to create.
We can debate as to how successful we’ve been toward the two goals above, but I think it’s misguided to say that the majority of people think LLMs should produce lower quality code.
> We can debate as to how successful we’ve been toward the two goals above, but I think it’s misguided to say that the majority of people think LLMs should produce lower quality code.
Guessing you’re not at a FAANG or similar company. For the last 6 months at least there’s been tremendous pressure from leadership (including highly experienced IC engineers) to let AI take the reins, the assumption being that future AI assistants will be able to deal with any level of complexity and tech debt created today.
Given that everyone agrees that reviewing all AI-generated code is impractical (if you let the agents rip at maximum available bandwidth), and that “harness engineering” is at best immature and at worst complete snake oil when it comes to ensuring system stability, maintainability, and quality, I do believe that it’s fair to claim that most engineers are, in fact, supportive of low quality code generated by LLMs.
Fwiw I do see pushback here and there, but only from the lowest rungs on the career ladder - ICs with enough experience to see where this train is headed, but no ability to save it. Management needs to see the results of their policies first, and that will take months or even years to fully play out.
Hopefully not, but there was a recent thread with multiple posters arguing that code quality doesn't matter, and that quality produced by humans in the past was often terrible. So who cares, ship it was the sentiment. Let the AIs handle the growing maintenance cost, I guess?
Kind of a shocking thing to see argued on HN. Maybe it's just the vibe coders.
The vast majority of corporate-employed programmers write bad code. I think maybe 10% of the people I’ve come across have shown any interest or care in the quality of code they write.
There will be a large majority of people who hold these opinions, because they weren’t capable of or didn’t care enough to write good code in the before times
The real problem with these conversations is that code quality isn’t something we have any kind of consensus on.
To a lot of engineers code quality means upper-case C Clean Code. Other engineers are in the Grug brain camp where they think that premature abstraction is the worst kind of code.
But to your point, I think the majority of engineers think that high-quality code is anything that compiles or passes their (almost definitely insufficient) test suite.
> arguing that code quality doesn't matter, and quality produced by humans in the past was often terrible
You're conflating two different things. I'm one of the people arguing for the latter, not because I don't think code quality matters but as a counter to the sudden idealization of handcrafted code.
> We can debate as to how successful we’ve been toward the two goals above
No not really. These are separate questions from what the article posits. The argument is about how do we use these tools, our approach as developers, and if the results are going to be as rosy as advertised.
Eh, I definitely do think that it has become a mainstream take. Not necessarily that we want lower-quality code, but simply that humans shouldn't be reviewing AI code for quality at all - that is, that code quality doesn't really matter and what matters is that the software works.
This is the entire premise of the concept of "vibe coding", and the concept of non-programmers using coding agents. The idea that there aren't large numbers of people and companies doing these things and/or who consider it "the future" is hard to defend.
But how do I know if something works if I don't know how it works? By testing (literally) all use cases, every single permutation of variables? For complex programs there might not be enough time and energy in the universe to do that.
If I know what addition is, I can look at a line that does addition and reason about it. If I just check "if it works", for all I know, the actual code is something like
Sure, I can use an LLM to check on the first LLM, and then a third LLM to check on the second, and so on ad infinitum, but none of that, at no point, can give me what "knowing what addition is" gives me.
It's kinda like cheap/fake concrete: If you know something about concrete and what concrete is being used, you can roughly tell if it will last, what it will withstand. If you just go by "seems to work", "looks good", you get collapsing bridges and buildings after a few years, during heavy rainfall etc.
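The "just check if it works" worry above can be made concrete with a small Python sketch. Everything here is hypothetical and invented for illustration (the function, the memorized cases, the spot check); it only shows that a function can pass every check someone thought to run and still not be addition.

```python
def add(a: int, b: int) -> int:
    """Looks like addition from the outside, but is not."""
    # Hypothetical bad implementation: memorized answers for the cases
    # someone happened to test, plus a fallback that is usually wrong.
    known = {(1, 1): 2, (2, 2): 4, (2, 3): 5, (10, 5): 15}
    if (a, b) in known:
        return known[(a, b)]
    return a * b  # wrong for most other inputs


def spot_check(fn) -> bool:
    """A 'does it work?' test that only tries a handful of inputs."""
    cases = [((1, 1), 2), ((2, 2), 4), ((2, 3), 5), ((10, 5), 15)]
    return all(fn(*args) == expected for args, expected in cases)


print(spot_check(add))  # True: every spot check passes
print(add(3, 4))        # 12, not 7
```

Reading the body of `add` exposes the problem in seconds, while black-box checking only exposes it if you happen to try an input outside the memorized set.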
The linked article about getting LLMs to critique each others' code review[1], the magpie tool[2], and also this recent article from Cloudflare about their code review stack[3] are all quite compelling.
I'm fairly AI-skeptical not on grounds of "do they work" but "are they good for the world". I feel that getting AIs to do this kind of review work is a rare case that doesn't outsource thinking and deskill workers. It doesn't trigger the same alarm bells as having the AI write the code (including having the AI fix the issues it discovers). That's setting aside environmental and other ethical concerns, which are still significant to me.
I have been impressed by the recent quality of AI code reviews*, but the experience of interacting with 3 separate AI reviewers via GitHub PRs is pretty terrible. Having more local-oriented and jj/rebase-aware review rounds would be great.
*context: fairly large PHP/Laravel backend and Vue frontend
Each time a new technology comes around, we seem to forget all the lessons we learned before. We then re-learn them again, much more quickly and just as painfully.
The same thing happened with crypto. Crypto started with some correct assumptions like "online services that stop working when it's a holiday are ridiculous", "it should be possible to send money to a friend without your government knowing about it", or "money transfers cost nothing to execute, so they should cost nothing." It then promptly threw out the entire framework of banking regulations, quickly re-learning why most of those regulations existed in the first place.
Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread.
Many AI models seem biased to cutting corners by default when generating code, even when you ask them not to. But a few simple follow up prompts can address that. Simply ask for covering corner cases with tests, test all the known non happy paths, look for weaknesses, verify adherence to SOLID principles, do security audits, etc. It will find issues. With bigger projects, you can actually make it file those issues in gh with labels and priorities. And then you can make it iterate on fixing issues with separate PRs.
On a recent project, I made it implement a simple benchmark test for measuring throughput. I had a hunch it was doing very suboptimal things. I then asked it to look for potential performance bottlenecks and use the benchmark to verify improvements. At that point I already had a lot of end-to-end tests to verify correctness. So, these performance tweaks were relatively low risk. I got about two orders of magnitude improvement and a lot more graceful behavior when pushed to the limit.
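A minimal sketch of that kind of throughput benchmark, in Python. The workload (membership checks against 2,000 IDs) and the names `slow_lookup`, `fast_lookup`, and `measure_throughput` are assumptions made up for illustration, not the commenter's actual project. The point is the order of operations: confirm the optimized path returns the same results first, then compare items per second.

```python
import time
from typing import Callable, Iterable

# Hypothetical workload: membership checks against 2,000 known IDs.
KNOWN_IDS_LIST = list(range(2000))
KNOWN_IDS_SET = frozenset(KNOWN_IDS_LIST)


def slow_lookup(item: int) -> bool:
    """Baseline: linear scan through a list."""
    return item in KNOWN_IDS_LIST


def fast_lookup(item: int) -> bool:
    """Candidate optimization: hash lookup in a frozenset."""
    return item in KNOWN_IDS_SET


def measure_throughput(process: Callable[[int], bool], items: Iterable[int]) -> float:
    """Return items processed per second for a single pass over `items`."""
    batch = list(items)
    start = time.perf_counter()
    for item in batch:
        process(item)
    elapsed = time.perf_counter() - start
    return len(batch) / elapsed if elapsed > 0 else float("inf")


workload = list(range(0, 4000, 2))  # half hits, half misses

# Correctness first: the optimization must not change any result.
assert [slow_lookup(i) for i in workload] == [fast_lookup(i) for i in workload]

baseline = measure_throughput(slow_lookup, workload)
candidate = measure_throughput(fast_lookup, workload)
print(f"baseline:  {baseline:,.0f} items/s")
print(f"candidate: {candidate:,.0f} items/s")
```

A single pass like this is noisy; for real decisions you would likely repeat the measurement and look at the spread rather than trust one number.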
If you have a bit of experience engineering systems, just treat these tools like they are junior developers. Competent but likely to skip some essential steps. So, just double check with a lot of pointed questions: "did you do X? If not, do it now". Anything that needs repeated asking, turn it into a guard rail / skill.
There's a bit of effort and skill involved with this. I imagine a lot of less experienced developers might struggle to get good results because they aren't asking for the right things.
My problem is that it "finds issues" all the time and it never really ends. You go through the list, decide how to handle each item, give it back to the AI, and it makes the changes. You ask for issues again, and there are now new issues, partly caused by the previous fixes. Now you again assess each issue, and it's often valid, but you have to ask yourself whether it's worth fixing right now and whether the fix is worth the complexity for a super rare edge case, depending on the type of program you make. And often the AI's assessment of what's high or low priority is not great.
So to me this loop really never properly ends so it never feels like I'm done. Which is not great from a psychological point of view.
I find that the way to stay out of this doom loop is to make sure the solution is not overengineered in the first place. AI will pile on complexity to infinity unless you actively gate it.
> Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread. Many AI models seem biased to cutting corners by default when generating code, even when you ask them not to. But a few simple follow up prompts can address that.
That's more or less all of them, they do just generate the likely combinations of tokens, there is no critical thought involved. If you want to approximate that, review iterations are probably the right way to go about it, without the full conversation context either so there's no model output like "I'm doing X because it seems like the correct way to go about Y." but rather a fresh context which allows for more critical predictions.
Here's what works for me, can be made into a skill in whatever you use:
I would like you to do a review loop!
How this works:
* once implementation is done, all tools must be run and pass: whatever is configured in the project like Ruff, Oxlint and Oxfmt, depending on the tech stack (also don't run such tools directly, look at package.json or similar project files/configurations/run scripts first; like if it's a stack that has compilation, compile the app, if there are tests, then run those; just know that you DO NOT generally need to stand up the whole app); if there is a projectlint-rules folder then that means you probably should run ProjectLint as well (local tool, use projectlint --help or projectlint --docs, or better yet, look at whether package.json or README.md have any instructions on how to run it)
* once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code (not each having a different sub-section) and looking for CRITICAL/SERIOUS issues (not nitpicks), with the goal of not missing anything and building consensus
* whatever CRITICAL/SERIOUS issues are found, if you can confirm that they're real and not false positives, you will then fix and remember to run the tools after, after which you will do another review iteration, followed by a fix iteration if needed and so on
* remember that the review and fix loop must END with an iteration of the review agents returning that there are no CRITICAL/SERIOUS issues - you cannot just do fixes and say that there is nothing remaining yourself (and also remember that the reviews are done when all of the tools pass, like when the code is linted and formatted etc.)
* at the end, produce a summary post that has a table, the rows being iterations, the columns for each of the agents (A, B, C) showing FIX/OK and then a column called Iteration summary; the goal for this is to show a summary how many iterations it took and what was fixed, you can also include text alongside the table as normally
The ProjectLint references might need to be removed (replace with whatever higher level linting/architecture tools you have, if any), but that's the overall idea. It does use a LOT of tokens though, but almost always there's something to fix. Of course, the problem is that sometimes there will be nitpicks or the fixes themselves won't be fully okay, though in general this trends towards slightly better code, even with something like Opus 4.7.
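The control flow of that prompt can be sketched in plain Python. This is only an illustration under stated assumptions: the reviewers, the fixer, and `tools_pass` are hypothetical stand-in callables rather than real agent APIs, the reviewers run one after another instead of in parallel, and `max_iterations` is a cap I added that the prompt itself does not have.

```python
from typing import Callable, List, Tuple

Reviewer = Callable[[str], List[str]]  # code in, CRITICAL/SERIOUS issues out


def review_loop(
    code: str,
    reviewers: List[Reviewer],
    fix: Callable[[str, List[str]], str],
    tools_pass: Callable[[str], bool],
    max_iterations: int = 5,  # assumption: a safety cap, not part of the prompt
) -> Tuple[str, List[List[str]]]:
    """Review and fix until one full review round comes back clean."""
    history: List[List[str]] = []
    for _ in range(max_iterations):
        if not tools_pass(code):
            raise RuntimeError("linters/tests must pass before a review round")
        findings = [reviewer(code) for reviewer in reviewers]  # each sees ALL code
        history.append(["FIX" if found else "OK" for found in findings])
        if not any(findings):
            return code, history  # ends on a clean review, never on a fix
        issues = sorted({issue for found in findings for issue in found})
        code = fix(code, issues)
    raise RuntimeError("no clean review round within max_iterations")


def make_reviewer(marker: str) -> Reviewer:
    """Stub reviewer: flags `marker` while it is still present in the code."""
    return lambda code: [marker] if marker in code else []


def fix_first(code: str, issues: List[str]) -> str:
    """Stub fixer: resolves only the first reported issue per round."""
    return code.replace(issues[0], "")


final_code, history = review_loop(
    "x = 1  # BUG1 BUG2",
    reviewers=[make_reviewer("BUG1"), make_reviewer("BUG2"), make_reviewer("BUG1")],
    fix=fix_first,
    tools_pass=lambda code: True,
)

print("Iteration | A   | B   | C")
for number, row in enumerate(history, start=1):
    print(f"{number:<9} | " + " | ".join(f"{cell:<3}" for cell in row))
```

With these stubs the loop takes three rounds: all three reviewers ask for a fix, then only B does, then all three return OK, which is the only way the loop is allowed to finish.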
This can backfire a bit on token usage, where it gets a bit too trigger-happy running expensive things for trivial changes. I tend not to use sub-agents for this reason. I actually manage to cover most of my needs on the 20$/month Codex subscription. I might switch to the 200$ plan at some point. But right now I just need to be economical as our company is fairly resource constrained. That's also why I prefer Codex over Claude Code. It seems to get the job done for less $. Another advantage is that it seems to have less need to have things like this spelled out in this level of detail.
Another thing is that unless you are doing really complicated stuff, you probably don't need the latest models running on high. I'm still on 5.4 medium with codex. I see very little reason to change that.
Part of agentic engineering is figuring out how to be economical with tokens and time. You can sacrifice one for the other of course. But there are diminishing returns as well where spending 10x more doesn't actually get you 10x more quality/results.
I just have the Anthropic 100 USD Max plan and it's enough for daily work - I sometimes do hit the 5 hour limits by mid day, but weekly ones usually cap out at around 80% or thereabout, even with this approach. I usually use xhigh, sometimes max, both still result in situations where I need to intervene plenty, not even on that complex use cases (some LLM stuff, mostly web based CRUD, some light data processing, integrations with Jira and GitLab, processing PDFs and so on, sometimes ML training and geospatial work, like the Sentinel-2 satellite data, nothing crazy).
If I had to pay per token, I'd probably look at DeepSeek. In general it feels like it's a bit early for the technology - either our software methods are wasteful, or the hardware hasn't caught up. To me, it appears that we often need to throw more tokens at these problems, not less, since otherwise it's just one-shot slop.
You can get reasonably close with fewer, however more agents give better signal: e.g. if 3/3 flag something as an issue, the outer one that orchestrates them can view it as something to give more attention to, whereas if it's just 1/3, then it probably begs more consideration. Ofc more doesn't always imply right.
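One way to picture that signal is a simple vote count over the reviewers' findings. The sketch below is an assumption-laden illustration: the issue strings and the two labels (`prioritize`, `verify first`) are invented, and matching issues by exact string is a simplification, since real reviewers rarely word the same finding identically.

```python
from collections import Counter
from typing import Dict, List


def triage(findings_per_reviewer: List[List[str]]) -> Dict[str, str]:
    """Label each issue by how many reviewers independently flagged it."""
    total = len(findings_per_reviewer)
    votes: Counter = Counter()
    for findings in findings_per_reviewer:
        votes.update(set(findings))  # at most one vote per reviewer per issue
    return {
        issue: "prioritize" if count == total else "verify first"
        for issue, count in votes.items()
    }


# Hypothetical findings from three reviewers looking at the same change.
reports = [
    ["missing null check", "race in cache refresh"],
    ["missing null check"],
    ["missing null check", "unbounded retry loop"],
]
for issue, label in sorted(triage(reports).items()):
    print(f"{label:<12} {issue}")
```

A unanimous flag is still only a signal, not proof, so the orchestrating agent would need to confirm even the `prioritize` items before acting on them.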
1) While walking, voice chat with ChatGPT about architecture and various interesting angles for a feature or product
2) Have it create summary of things we talked about
3) use that to seed spec development phase
4) write comprehensive specs using both Claude Code and Codex
5) create todos from specs
6) implement todos using both Claude Code and Codex to check each others work
7) run focused code check prompts, e.g. specifically for error handling, concurrency issues etc. They tend to find more issues in these focused passes.
We may be in the last Golden age of AI, where experienced professionals still exist who can code manually, and AI already exists that can code automatically, and when the former use the latter skillfully, wonders happen. This magical intersection may not exist in the future, or may become very rare.
I think as long as it continues to be tangibly better these people will still exist and the intersection will continue to be valuable enough to survive.
> as long as it continues to be tangibly better these people will still exist
Sure. But how long will that last? LLMs are getting better at programming much faster than I am.
Imagine a plot with time on the X axis and LLM skill on the Y axis. The line goes up and to the right. On the left is GPT3, or GPT3.5 with the very first glimmers of programming ability just a few short years ago. In the middle is Opus 4.7 now.
Where's the intersection point, where AI skill is higher than that of humans? Less than 10 years. I'd guess less than 5 years.
I think the problem is that coding is not wholly a 'writing code' problem. It's a translation from idea to outcome. Often I think the bad code generated by an LLM has less to do with its 'ability' and more to do with an instruction that hasn't adequately accounted for the range of code that could satisfy the criteria. I'm not sure how a newer model can improve on this per se. Sure, there will be improvement on outright mistakes, but for me at least, those have more or less been and gone with any model released in the last 6 months.
I was coding something with claude the other day. It got the program working by all externally observable metrics, but when I went into the code it was full of DRY violations. It made a bunch of interrelated - but separate - traits for some concepts which simply didn't fit together.
I asked it to look at the code and come up with better factorings, but it failed. I ended up manually reworking several thousand lines of code myself, via my IDE. It took days.
I'd like a claude-of-the-future to be able to come up with beautiful ways to factor the code itself. Amongst the correct solutions, pick one which is conceptually simple. Write the code in a way that it makes subsequent changes easier to write. If I were doing RL with claude, I'd consider directing it toward solutions which allow subsequent changes to be implemented with as little effort as possible.
I think a better way to think about it is - what are the invariants to our current architecture? Why can't you tell Claude to build you a 1B$ business, make no mistakes?
I have no doubt they will be better programmers than almost every human that has ever existed. But the role of a SWE will expand to fill the gaps that the LLM paradigm hasn't filled:
- Accountability
- Long term architectural vision, goal setting
- Everchanging business context
- Mercurial executives, people problems, relationships etc...
Token efficiency is going to be the next big thing.
Tokenmaxxing an army of juniors will destroy your business through slop induced tech debt and API costs. A senior that uses AI but is token efficient will be like rocket fuel.
And you act like there hasn't been a loss once we moved away from the master-craftsman style of building to the professionalized-architect style of building. We cannot make a Gothic cathedral anymore. Also, CAD homogenized the built environment significantly. And we have been losing a lot of traditional, artisanal craftsmen's art forms over the past century.
Did they? Genuine question, because I do wonder if people in some industries in the past were ever anxious about these specific things (especially skill attrition).
> I do wonder if people in some industries in the past were ever anxious about these specific things (especially skill attrition).
I've spoken with some people (now in their 60s & 70s) that worried about skill atrophy in their line of work.
First they worried about atrophy. Then they watched skill dry up. Now they know it's not available to buy anywhere. In the better cases the skills still exist, but entirely overseas.
These are people I could recognize as sharp engineers, even if I don't know their domains at all. I had to take them at their word about the value in what was lost. The problem is that it's easy to assume that business (or at least society) would prevent degradation of valuable knowledge over time.
Title of this article suggested more depth and I was expecting actual code examples. But it is like other opinion pieces. It suggests a prompt (ask AI to find bugs) that works for the author advising everyone to do it that way.
I use these tools at both work and for personal side projects and I was expecting to watch and learn. But these opinion pieces without examples are way too many now.
Have you tried his suggested workflow? I think it's a useful workflow, and if I hadn't found a workflow like this already would appreciate the pointer.
I guess he could write a code harness to do this, or gin one up really quickly, but that kind of tooling today seems like the purview of the practitioner -- you -- it's frankly faster for you to spec what you want to try this idea out if you want it automated than it would likely be to deal with his code.
One thing that's been interesting to me over the last few years is charting the edge of my coding laziness. As a coder, I'm lazy about boilerplate code -- I hate writing it, I hate maintaining it, etc. And so I design and architect (or used to) around that preference. Sometimes that's smart, sometimes that's not. But it was my preference, and I avoided something that was hard for me to do.
When LLMs started being somewhat useful for coding a few years ago, and I found they were in fact great at boilerplate, in fact pretty much only good at boilerplate ca 2023 or so, it got me thinking about all the accommodations we make in design and systems architecture that are sort of tacitly understanding who we're working with and their strengths and weaknesses.
The modern models have their own very different strengths and weaknesses compared to humans, and deploying them is a really interesting exercise of different architectural and engineering skills. I've enjoyed it, and hope I continue to.
The thing about boilerplate is that a good library or framework makes it optional, and / or automatically written.
I'd much rather django-admin startproject, npm init, or meteor create and get deterministic output than prompt an LLM and get who knows what.
In a mature web ecosystem, boilerplate is minimal. I worry that now that we've given this task to LLMs, less development effort will go into startproject-esque CLIs and good opinionated defaults.
I wonder this in general, what's the impetus for writing new frameworks and such? Are we already seeing a slow down in that space? HN front page certainly paints that picture.
You're better off plonking down an existing framework and getting all the structural boilerplate benefits the LLM can leverage.
LLMs are far better at frameworks they have a lot of training data for, the ones that have been around for a while. They write more idiomatic, ecosystem-friendly code. Does that still matter?
I’ve landed on a very similar usage in my last pet project.
I’ve used the llm mainly as a glorified refactoring tool/LSP/rubber duck.
I can define custom skills that act as specific passes over the codebase that are hard to do with traditional tools, I am using Julia, so I have a skill that is only about doing a semantic and type analysis pass to catch potential type instabilities. Or another that is just about documentation reporting.
The workflow for me is always: talk the problem to death/get a report. Triage, decide what I can and should do on my own, what can be left to the llm as mundane boring refactoring tasks, what instead needs me to figure out the correct shape first and then ask the llm to propagate the new pattern in the codebase. Then act.
A lot of the time I am implementing the llm suggestion by hand on my own to get a feel of how the codebase is shifting under my feet and stay on top of things. This indeed makes things slower, but allows for an overall higher quality codebase. Especially the refactoring part.
The anchoring thing is what gets me. Once I've seen the AI's first try, even when it's wrong, I can't really write fresh in my head anymore. I end up editing instead of starting over. Code quality usually ends up fine. Time-wise it's a wash or worse, you just don't feel it until you look at the clock at end of day.
I find that it really is effective when you iterate and plan and review, but the problem is more psychological on the human side. It's just too easily available to take the lazy option and just let it do the thing, postpone the thorough reviewing and you end up in a similar situation as tech debt. In an ideal world with no deadline pressure and infinite discipline, AI can be used in productive ways for sure. But when you actually write the code, there is more of a "do you do it or not" switch, and with AI it's a smooth ramp, you can be just a bit less involved or just a bit more. And I end up feeling like I'm not fully involved, I'm halfway working and my whole mind isn't tuned into it properly. I'm not sure how to express it. Also, now several months in, I just don't get the same feeling of accomplishment from the little wins. It's too automatic, doesn't feel earned.
I am very visual and spatial. The first investment I make in my home or even visiting somewhere for more than 3-4 days where I will need to work without coworking is buying a whiteboard.
So now I'm here with all these tools trying to use a remarkable tablet to draw and show the AI what I'm thinking. It's just not fulfilling. Cleaning toilets isn't either. Lots of jobs have felt like a full on race to software factory and it's clear we're going there with AI and the "cognitive debt" from half (or less) activated brains driving the code generation is going to be massive.
I can't comment on cleaning toilets as a job (luckily I don't have to do that), but cleaning at home does provide a sense of accomplishment similar to solving a coding task elegantly and cleanly, while uninvolved AI-assisted coding is more like up and down voting or liking posts in algo feeds. Not fully like that of course, but it's a step towards that kind of "I like this part, I don't like that part" feedback-giving that can leave me depleted/drained. Coding before AI was more like when you feel one with the machine, like when you drive your car on autopilot, and with AI it's like sitting in the passenger seat like a driving instructor saying how to go about the driving. You don't quite know what it will answer, maybe it will push back on your idea when unnecessary and then I have to expend effort in arguing in text in a chatbox with a machine, or it goes forward too easily without asking clarifying questions or pushing back when what I ask collides with previous things. Many programmers get depleted in meetings and in language-based argumentation and charge up with the more puzzlesolving-like flow state, but this AI wrangling is often more like team meetings.
"It's just too easily available to take the lazy option and just let it do the thing"
This seems to me to be one of the key problems for AI usage in general. Students have this problem where it can be incredibly helpful in actually learning but late at night with the assignment due early tomorrow the temptation is just too strong to have it do the thing.
Somehow I find that interacting with AI doesn't make me feel the same way as diving through Wikipedia rabbit holes. With AI it feels more like, it starts saying how there is indeed an answer to some science question I was unsure about, about some phenomenon or about how some technology works, and it starts explaining it but I feel disinterested in actually reading its answer. I see its general shape and I feel satisfied in the existence of the answer. It may be the glazing sycophancy too, but it seems that I get the "satisfaction" from just getting the answer, while with Wikipedia I only got it once I dug up the info that I needed. And the AI answer is ephemeral, while the Wikipedia page is there for everyone, it's a thing, even if it can change.
Same with AI images. It feels good for 2 seconds to see what I asked for and I'm immediately disinterested. Same way, I've generated many Suno songs, but I don't care about them after a few listens.
I noticed that sometimes discussing the code with a chat instance, and having it write prompts for an in-IDE-agent, then posting the result back to the instance to discuss the results and repeat this loop yields not only good results, but also makes me understand the codebase better. I let both agents know I'm proxying and take part in the conversation.
I think AI exists to make humans better, not to replace us (which it can’t anyway). I use LLMs to answer questions and tutor me on new topics (for instance, for a multivariable calculus course this spring I asked Claude to create 10 practice exercises, which I then did and it reviewed. The harder ones it did with me step by step), hopefully not needing them after a while, once I gain proficiency. Automating humans away is not going to work. There’s a reason why we are the apex predator and have ruled this planet for a million years.
That’s the first time I’ve heard that I’m optimistic about AI. I am as optimistic as I am about a hammer or a liquid scale. They are tools and they are good for particular jobs, if you know how to use them.
I am in a career that is one of the more sheltered from automation. The present tech layoffs, I suspect, are more due to insane overhiring during covid as well as outsourcing. I am sure some companies have gone into full AI-psychosis mode, but they are taking a massive risk. Time will tell.
As I read this, I'm also working through a pretty dense feature that took a fair bit of iteration. The end result is actually significantly less code than it was about halfway through. And I was wondering if the AI actually helped me at all, since surely I could have written the code in the same time it took to iterate
But! Because of AI I was able to rapidly hack out like 4 variants of this feature that I didn't like. And felt comfortable throwing them away just as quick.
This has been one of the most significant improvements of using AI for me. Before, I would have to really think through the plan of a new feature before committing to the implementation and would only catch incompatibilities with existing code after a good portion of the implementation was already written. Now I can ask AI for detailed implementation plans and find these nitty-gritty detail problems in a few hours if not less.
True. I think this is the biggest help with AI. It does not necessarily help with reaching the end goal faster all the time but it helps in trying out different iterations for quick prototypes. I find it especially useful in fast moving startups where some times we just want to validate a few ideas before fleshing them out as proper features.
Yea worth it. The original implementation ended up being the most complex, and also not a great UX. But I didn't really get it was a worse UX until I built it and tested it out a bit.
And I wasn't attached to that complex implementation in the way I would be if I architected it from scratch, so it was easy to move on.
I'm pretty intensely disinterested in "agentic" coding. However, this was very much the inspiration behind the custom Claude-backed Goose agent I deployed into some of our Gitea repositories over Christmas--the less sensitive ones, of course, I'm not sending our proprietary code to Dario.
It does short and sweet code reviews, and going back and forth with it is, as often as not, slower than just typing and merging the code.
The bottleneck moved. It used to be writing code. Now it's knowing what to ask for, in what order, and how to validate what comes back. That's a fundamentally different skill than coding.
Only if your job was to write code alone, which isn't what most people do. Most of my job for the past 10 years was explaining to people what had to be done, how and then verifying they did it.
Any senior/staff engineer was already doing this, if they weren't they were on the wrong job or had the wrong title.
So I am figuring out how to let an LLM write code automatically as long as I clarify the requirements. I have made a set of skills to deal with this, called tdd-pipeline. I eat my own dog food, and after several rounds of iteration to fix bugs, it works better and better. Now I feel much more relaxed while it is working.
I open sourced it on GitHub, you may search alexwwang/tdd-pipeline to find it if you are interested in it.
I believe this thinking can be abstracted to software design via AI in general. If you are thoroughly prepared, and keep things simple, it's incredible what help Claude or GPT can be.
I have Claude basically doing all the coding for me for a simple game I am making. However, I don't consider this vibe coding. I spent several hours thinking out the design on a piece of paper and playtesting it in person. I came up with a list of potential mechanical issues within the game, and asked Claude to come up with more. It found more issues, and we solved them all together. Once the game was mechanically sound and the edge cases were solved, it built an MVP. I ran the program and found more bugs. I came up with my own solutions, and Claude did the same, and we figured out which were best to implement. Claude then wrote more code, and raised issues when they came up, and we worked through them together. I'm incredibly happy with how it's turned out so far.
The main insight here I think is that LLMs are great tools for iterative development and iterative problem solving in general.
You can very effectively iterate alone using the LLM as a mirror, rephrasing what you put in and adding a bit.
You can use LLMs to quickly create prototypes to give to other human beings to help you with the next iteration.
If you get something from someone else to iterate on, you can use the LLM to help you understand it by rephrasing things in a way that suits you better.
But instead, all anybody seems to be talking about is one-shotting things and AI iterating with other AI.
The big problem here is that the one thing AI does not have is agency.
The naming AI agent is wishful thinking and marketing.
On the other hand, some companies are pushing the idea that engineers should build robust self-evaluating agent pipelines with human feedback in the loop so that agents write most of the production code. Creao's CEO said that they rearchitected their entire production systems in two weeks this January. He also claimed that their agents implemented so many features so fast that they had to wait for their business development to catch up.
I wonder how we can evaluate these two options: using AI to 100X the output versus using AI to advance one's craft.
In the meantime, the productivity gain of AI is real. Case in point: an engineering org at Snowflake met all its OKRs ahead of time in the first quarter, for the first time in the company's history. It had never happened before, and usually meeting 70% of the planned OKRs would be considered an achievement. I can imagine the stress of the engineers when they see such an outcome.
Hopefully we can blend those two options together so it’s not a choice.
Personally I find being able to lean on our heavily documented standards in /review gives me back time to dive into what I want to craft next.
Same with scheduling repetitive tasks that an agent can do well for me once instructed well. I am freed up to do something else I want to focus actively on because I like it and want it to be great.
Now stress about OKRs and OKRS in general… that’s a different issue
Exactly. That is what we do. We do software that can kill people and it is very sophisticated, like controlling robots and we prototype using LLMs and it is amazing.
People believe that you can only use LLMs for sloppy programming. But you can also use them to write ten times more code in the form of Swiss-cheese-model tests and domain-specific languages.
You write ten times more code than necessary and all that extra code is testing. Projects like SQLite do that because they need to be perfect.
Before LLMs we had to use engineers for that, and it was painful and repetitive work. They were always late and made many more mistakes than LLMs do, especially because it was dull and tedious for great engineers to spend their time on.
Now we write tests, and when all tests pass we write new tests to check the tests.
We divide each complex problem into small subproblems and guarantee each of them by formal means. We have multiple ways of solving the same problem, usually including one brute-force solution that is simple and guaranteed to work but inefficient, which we can use to check the more efficient methods.
Before machines could do that, the people doing it were burned out and exhausted, and always left pending work to complete.
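The brute-force-versus-efficient comparison described above is often called differential testing. A minimal sketch of the idea follows, assuming a toy problem (maximum subarray sum) chosen purely for illustration; the function names are hypothetical and nothing here comes from the commenter's actual codebase.

```python
import random


def max_subarray_brute(xs):
    """Reference solution: sum every non-empty contiguous slice. O(n^2)."""
    best = xs[0]
    for i in range(len(xs)):
        total = 0
        for j in range(i, len(xs)):
            total += xs[j]
            best = max(best, total)
    return best


def max_subarray_fast(xs):
    """Efficient candidate (Kadane's algorithm). O(n)."""
    best = current = xs[0]
    for x in xs[1:]:
        # Either extend the running slice or start a new one at x.
        current = max(x, current + x)
        best = max(best, current)
    return best


def check_against_reference(trials=1000, seed=0):
    """Compare the fast version with the reference on random inputs."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 30))]
        expected = max_subarray_brute(xs)
        actual = max_subarray_fast(xs)
        assert actual == expected, f"mismatch on {xs}: {actual} != {expected}"
    return trials
```

The slow version is short enough to review by eye, so a disagreement almost always points at the optimized code rather than at the reference.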
This is exactly the reason why I like to work with local models on a regular specced machine. The fact that the agent moves slower allows me to stay in the loop much better, compared to skimming through a huge amount of generated content and data and then going to the end really fast to make sense of it all, in the interest of time (and thus losing track and quality). The fact that I can run it locally makes it (much) cheaper too.
So your "goal" is to find an existing ai-auto completion that allows you to draft with AI then "write it yourself" by hitting tab? Sounds like the goal is actually to build that, then use it on projects....
I get pretty good results by writing specs and prototype with a LLM, through more or less managed conversations.
But once the prototype is done, I spend too much time refining the details and fixing everything that went wrong (bad design details, wrong implementation, half-done testing ...)
A full agentic setting would be too expensive for me (I wonder how much Garry Tan spends...)
So I'd like to take a more balanced approach with:
1. usual LLM specs and prototyping to get the bases of the feature and boilerplate done
2. Write the code myself, with the help of an AI autocomplete (this is where I'm looking for recommendations)
3. Use a setup as OP mentioned to review code
Optimizing for code quality over raw output speed is a great approach. The time 'lost' writing it slowly is easily made up by the time saved on debugging and maintenance later.
> But if you’re the kind of developer who uses agents to write multi-hundred-line PRs that you barely understand yourself, I’d invite you to slow down a bit and try this other, slower style of “vibe coding.” Ask an agent how your PR works and how it might fail. Have it write Markdown docs with Mermaid charts if necessary. Use Matt Pocock’s /grill-me skill until you understand the entire PR front-to-back.
Man so much work to retrofit something that obviously, simply, plainly - just does not work. How about just writing the code yourself? You can even consult AI on the libraries or whatever, but how about just building that model in your head YOURSELF and not loading up on AI slop and trying to memorise that crap. The names of the functions will ring different in your memory once you spend some time thinking over whether you picked the right and clear name vs. just going with whatever statistical median the slop machine picked for you.
I think using speed to describe the rate of progress in software development is where the frustration comes from. Software isn't a velocity thing. It's a space thing. It's memory. Information in some media. You can transfer a billion bits in less than a second. The time domain is largely irrelevant in business terms.
Having taste and the ability to author high quality prompts is still the most important thing. It was always the most important thing if you think abstractly about how all of this works.
This is one of the most sane takes on shipping code using AI where it's being actively reviewed and it respects your colleagues' time and attention. I like it.
Yes! That's what I've been doing at work for the last few weeks! And while it doesn't appear to be super fast, I'm already pretty certain that the next round of testing will come back with fewer unexpected issues because together with my agent and the right usage, I was already able to catch stuff that I would have missed otherwise.
Also feels much better than pure vibe-coding (which I still do for personal projects that aren't mission critical for anyone).
100% agree after building a production-ready platform from the ground up. It took 3-4 months, but without AI I would never have finished with a team of 3. One thing to note is that AI is weak at front end, so we did the entire front end without AI.
I used an LLM as a tutor to tackle unfamiliar terrain. That is, I write code that I know very likely doesn't work but is the best code that I could have written. The LLM will happily and tirelessly show me what I did wrong and what the correct code actually looks like. Then, at the end of it, I have code that runs. That's a tight feedback loop.
It's still very slow. It took me two hours to write code that generates JSON data and then to write a web page that displays a knowledge graph.
One thing you have to be aware of is that the LLM will happily generate code for you, and you have to discipline it from time to time. I notice that my reading comprehension begins to suffer if I don't write the code myself and have to understand what the LLM wrote for me, as opposed to the LLM correcting where I went wrong.
One thing I would like to try with an LLM is understanding a large and complex existing codebase like OpenSCAD that doesn't leverage my existing skillset (high-level programming languages, with OpenSCAD as my primary language in the past year). That has always been a barrier to contribution for me.
- Opus 4.7 writes the code
- I have GPT-5.5 in Codex review it (given context)
- I provide the review back to Opus and ask it to verify the review findings
- Make Opus plan the fixes then execute them
- Ask GPT-5.5 to review the fixes and check if they solve the problems
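The write/review/verify/fix rounds listed above can be sketched as a small loop. This is only an illustration under stated assumptions: `write`, `review`, `verify`, and `fix` are hypothetical callables that you would wire to whatever models or CLIs you use, and no real vendor API is implied.

```python
from typing import Callable, List


def review_loop(
    write: Callable[[str], str],
    review: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], List[str]],
    fix: Callable[[str, List[str]], str],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Run writer/reviewer rounds until the reviewer has nothing confirmed.

    All four callables are placeholders (assumptions), e.g. thin wrappers
    around two different coding agents.
    """
    code = write(task)  # the writer model produces a first version
    for _ in range(max_rounds):
        findings = review(code)            # the second model reviews it
        confirmed = verify(code, findings)  # the writer checks the findings
        if not confirmed:
            break  # nothing left that both models agree is a problem
        code = fix(code, confirmed)  # plan and apply fixes, then re-review
    return code
```

The `max_rounds` cap is a design choice for the sketch: two models can keep finding nits in each other's work indefinitely, so a human probably wants the loop to hand control back after a few rounds.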
Hot take: barring special edge cases, I find using dumber models (like local Qwen 3.6) to be the best balance. Smart enough to do stuff, but dumb enough that I don’t trust it and verify what it’s doing rather than letting it do the third whole-codebase refactoring of the day. It also forces me to know my codebase and give very descriptive tasks rather than go “something is wrong, fix it”.
I think my current conclusion is that AI makes <foo> more important than ever.
I’m not exactly sure what <foo> is but I feel it. I think it’s quality and authenticity and craftsmanship. That difference between an expensive tool and a cheap one that you can’t easily describe but you just know it.
Is there a word for this? I bet the Japanese or Germans have a word for this.
I use AI a lot now. But I also do it in small steps. It isn’t a craftsman, but it can help me be one.
Another thing that I feel is underappreciated about agentic coding is that you can actually learn from it. I am a programmer with 25+ years of experience and I tend to do a lot of stuff according to fixed patterns/habits. Seeing how my coding agents do stuff helps me break out of these patterns, lets me consider new approaches, helps me pick up idioms and teaches me new hacks and tricks. That is very satisfying in its own right.
I'm exactly in the same boat. 25+ years of experience and I use agentic coding exactly to learn better patterns. I often let it implement something, read the code, learn the pattern, confirm it's a good practice and then code myself manually another section of the code.
I think many people that blindly say you cannot learn anything from vibe coding have some sweeping assumptions. It obviously depends. If you just let it do everything without even reading the code and understanding it, then yes. But the act of reading code is one of the best ways to actually learn, no matter if you read code from a human or from an LLM. I tend to learn way better by example than by reading a theoretical book.
Love this. I use a similar "ralph-loop" approach that starts with an approved plan and then hand it off to a coordinator which does it across 2 sessions (build and review for simplicity), with each session getting its own model.
Another way I'm "going slower" is to have the AI implement individual sub-steps of the current task, and review each one. It's slower than having it yolo out the whole thing, but it's much smaller incremental bits to review, so my brain doesn't glaze over in a huge review, like it would if I had it do the whole task.
I'm following an Ideas -> PRD -> Issues -> Tasks methodology, where each task has a bunch of sub-tasks. I have it just do one (or a few, I'm having it do Red/Green/Refactor as separate sub-steps, so I review the Red case, and then once that's good, do the Green and Refactor steps, and review those).
The quick answer is that even in the workflow described by the author these tools don’t do the same thing AI does. And a good programmer/agent will be using these tools as well
To me the blocker with using coding agents is having to rely on a paid external service. Are there any local models that are good enough to be used for coding?
Instead of using a skill and having the agent own the flow for this, I've been building an external orchestrator that handles the process.
By default it uses pi agent core + pi ai (from the excellent pi coding agent) as a multi model runtime but also supports a Claude Agent SDK runtime.
I can have an implementation and review process of an OpenSpec change run anywhere from 2 hours to 24+ hours going through review/fix/verification rounds automatically until the implementation matches the spec and any additional reviewers are done finding issues after the fix rounds.
it's going to be fully open sourced in the next two weeks and fully free to use
Maybe we can come up with a spec for aligning ASCII diagrams. Can't really build anything with confidence when the attention to detail is lacking in these agentic systems.
Yeah, not trying to pick on any particular project, because it's quite the mark that the writer didn't proofread the documentation, and it's quite widespread.
This reminds me of the article above. People now have diverse ideas on agentic coding. Some suggest human-in-the-loop while others suggest giving a detailed specification and letting the agent run freely; some suggest leveraging LLMs' high productivity, and here we get the opinion that an LLM can actually write good code slowly.
It's good to see more practical and varied opinions emerging, turning the LLM into literally a tool instead of something to be hated or hyped.
In my own practice, I find LLMs (SOTA ones) good at medium-level tasks, the kind that need some reasoning and planning. However, their design taste in architecture is unexpectedly disgusting. Writing the interfaces myself and asking LLMs to fill in the implementations, alongside context-completing tools like context7, deepwiki, docs.rs MCPs, etc., and giving them an escape hatch (e.g. encouraging use of the AskUser tool in Claude Code), is probably what I'd consider my best practice.
Very much agreed. Something specific that has helped me a lot (beyond just automatic formatting, linting and testing) was putting a hard fail on any file with more than 1500 lines or so, with an allowlist for specific files with specific reasons for their length. I realized the agents were squirreling away code without wanting to do any sort of refactor. Every time one of these rat's nests has turned up, the codebase has been much improved with a small refactor, to the point it doesn't feel like such a pile of slop anymore.
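A hard fail on file length with an allowlist, as described above, could look roughly like the sketch below. The 1500-line threshold comes from the comment; the allowlist entry, the `.py` suffix filter, and all function names are assumptions made for illustration.

```python
from pathlib import Path

MAX_LINES = 1500  # threshold mentioned in the comment above

# Hypothetical allowlist: relative path -> reason the file may be long.
ALLOWLIST = {
    "src/generated/schema.py": "generated code, not hand-edited",
}


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


def find_oversized(root: Path, suffixes=(".py",), max_lines=MAX_LINES,
                   allowlist=ALLOWLIST):
    """Return (relative_path, line_count) for files over the limit."""
    offenders = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        rel = path.relative_to(root).as_posix()
        if rel in allowlist:
            continue  # explicitly allowed, with a recorded reason
        n = count_lines(path)
        if n > max_lines:
            offenders.append((rel, n))
    return offenders


def main(root: str = ".") -> int:
    """Print offenders and return a non-zero exit code if any exist."""
    offenders = find_oversized(Path(root))
    for rel, n in offenders:
        print(f"{rel}: {n} lines (limit {MAX_LINES})")
    return 1 if offenders else 0
```

Wiring `main()` into CI or a pre-commit hook (for example via `sys.exit(main())`) would turn the report into the hard fail. Keeping a reason next to each allowlisted path makes every exception reviewable.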
Matters of taste. I don't mind bigger files where it makes sense, and sometimes for the nature of the domain, it is nice to have more things in one file. As well, they write so many comments that 200 lines doesn't feel right to me.
How profound! Talking points are changing from "vibe coding delivers bug free software" to "slow down and enjoy the AI".
Great how the promoters are mirroring the current anti-AI sentiment. The next step is canceling all subscriptions and not using AI at all. Maybe your mind will work again.
Not so much. People are just walking things back from the Gastown/Oh My Opencode/etc peak of trying to get 10 agents working simultaneously on a project unsupervised. They've collectively realized that you still have to understand and validate what the agents produce in some way if you want to build maintainable software.
> This is the opposite of the “10x productivity” slop-cannon style of development that most people imagine when they think of vibe coding, but I find it very satisfying.
I can relate to this. When I spend time writing a unit test, even one that adds only 1% of code coverage, it is honestly a wholesome moment for me to ship it confidently.
Are we overcomplicating AI by approaching top down, so naturally there are trillions of variations and too many ways to fail? Supervising a component-level scope, with emphasis on quality control (regression, perf testing, benchmarking, etc), seems to produce great work.
I use cheaper models (Deepseek is king, but GLM and Kimi as well) and do the planning myself. I often start a task myself, write some code to get the LLM on the right track, and then have it complete parts of the implementation that are kind of boring or repetitive. LLM's are just next token predictors, I don't mean that in a demeaning way, but I've found if I can get the LLM started on the right track with my own code, it completes what I want. Having the LLM write code just from a spec ends up with poor quality slop in my experience.
I'm not 100x'ing my output like some people claim, but using it as an augmentation rather than delegating my work to it results in better code, and I don't lose context / control over my codebases. I really have read 100% of the code, because the LLM is generating smaller pieces around and inside my own written code. Works well enough for me, and open models are already both cheap enough and good enough for this workflow. This is why the big companies are so desperate to push full-on agentic hands-off workflows and developer replacement - that's the only way they won't go bankrupt.
I've been using Zed and Charm Crush. I think most work with it though, any agent designed around OpenAI completions API compat will do fine. Although Zed had some problems initially with tool calls but it seems to be fixed.
I'm working on my own harness to be a bit more aligned with my workflow but tbh I'm losing motivation since other harnesses are fine now. I could probably vibe code something but there's not much point imo. Unless I come up with something completely different but who knows.
I think there is a Deepseek agent out there in Rust, but I've never tried it. Zed has been pretty decent with all models, not the best but certainly beats VSCode. ChatGPT 5.4 on that calls about 100 different git diffs to "verify" the changes are valid which is rubbish. I haven't tried Deepseek with it though.
Honestly these models and agents are becoming commodities, as long as they don't totally fail with tool calling or some stupid system instructions the models can figure stuff out pretty well.
> But the thing is, LLMs are very flexible. And you can use them just as effectively to write high-quality code more slowly.
There is a reason it is called slop. On first sight it is often not noticeable but when you dig deeper, you realise that it is often spam-slop. Of course this can be improved upon, but often there is no real improvement and you waste your own time in hope that things get better. Which high quality projects exist that are AI slop generated? Can people name something that is used by many people? The linux kernel? Something in that range? Including documentation? To me it seems people are chasing a dream here: skynet should write the code and they can sit on the beach, enjoying sunshine and fruits.
There are things you really can't do by yourself. I've been working on porting some large codebases to Rust lately to experiment with fixing memory safety bugs. There is just no way you can write 100k LOC of production code in a week, with tons of tests etc. Even "10X" engineers just can't type that fast.
Yeah, agreed. “Cognitive surrender” is one way of describing that loss of personal faculty. I don’t think AI proponents are acknowledging second order effects of letting your mind interact less and less while requesting more and more complexity built for you without adequate verification.
Where they are extremely powerful, and it's hard to debate this IMO, is adding comments to the code, writing complete documentation, and constantly updating the readme. The value in actually writing the code is still up for debate (I'm on the side that sees the value there too) but the mind-numbing, boring, make-you-hate-life parts of the codebase are without question better for the use of AI.
I've hit this point with AI where it's not a simple process, but a long drawn out back and forth.
I'll use AI to design the implementation of a medium sized, cross cutting feature. Review all the details, maybe iterate on just that. Then implement with Claude 4.7 Max - which runs slower, but does a better job. Then review the implementation, then have Codex GPT 5.5 xhigh fast review it - which almost always finds corner cases. Have Claude fix those - Claude is better at writing intuitive maintainable code versus Codex overengineered/shortcut filled code. (Codex is better at finding/fixing bugs and doing reviews - it's annoyingly pedantic)
Then repeat with fresh Claude/Codex instances having them both review the current staged changes and getting feedback, handling the feedback. Then covering it in tests. I mean overall I still implement the feature faster than coding it manually, but I spend a majority of the time going back and forth with reviews, handling corner cases and at the finish end up with what I feel a really solid implementation of whatever feature I'm working on. The v1 feature feels more like a v3 given the amount of iteration it already went through.
Talking the problem to death with the AI before implementation is a nice zone for me. I feel productive, get good results out of the AI, and still largely understand the code. That’s the part of the AI revolution that I feel has made me a better engineer because I argue about design and architecture all day with a robot.
I follow the same process. I have a design in mind for the problem at hand, but I don't reveal it to Codex. I go back and forth a bit to see if its proposals are better than mine. I go back and forth on tradeoffs of various approaches. And then I ask it to compare its proposals with mine. I "win" most of the time but there are many times where it shows me a better, or simpler approach, or makes me rethink the solution altogether.
Once this is done, the mechanical coding parts are mostly routine (for codex)
I really like this pattern and use it often, this 'not showing my cards'. The second I hint towards the LLM what I prefer it will become sycophantic and invent nonsense why my preferred solution is better.
I'm sure there's an interesting study on how users 'leak' their preference unintentionally to the LLM; perhaps when users list their options, they often put their preferred option first. But not showing the cards in my hand has been very useful when thinking through a problem with LLMs.
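One way to avoid leaking a preference through option order, as speculated above, is to shuffle the options before building the prompt. The sketch below is a guess at what that could look like; `neutral_prompt` is a hypothetical helper, and whether shuffling actually reduces sycophancy is untested here.

```python
import random
from typing import List, Optional


def neutral_prompt(problem: str, options: List[str],
                   rng: Optional[random.Random] = None) -> str:
    """Build a comparison prompt that does not reveal which option you favor."""
    rng = rng or random.Random()
    shuffled = options[:]  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)  # remove any positional hint about a favorite
    lines = [
        f"Problem: {problem}",
        "",
        "Candidate approaches (in no particular order):",
    ]
    for i, option in enumerate(shuffled, 1):
        lines.append(f"{i}. {option}")
    lines += [
        "",
        "Compare the tradeoffs of each approach and recommend one. "
        "If a different approach is better than all of these, say so.",
    ]
    return "\n".join(lines)
```

For example, `neutral_prompt("Slow dashboard queries", ["add a cache", "denormalize the table", "batch the writes"])` lists the three ideas in random order with no wording that marks one as yours.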
LLMs flip positions when users push back ~70% of the time even when they were right. RLHF optimizes for approval, not correctness
> LLMs flip positions when users push back
Same experience. Claude rarely pushes back once you give a plausible/logical reason for your initial decision, even if it flagged concerns at first.
I have noticed this as well, but I think it's somewhat a good thing. I know what I want for my application more than Claude does for example, especially when it comes to what's in production.
An example from earlier, Claude strongly suggested a migration that would run a full vacuum on postgres. However, in production this would lock tables which would grind the application to a halt. After I informed Claude that there were millions of rows in production, it accepted that and helped me get to the right thing.
Another example: I'm developing a TOTP authentication app because I'm dissatisfied with all those that I've tried. I want something strictly local, with a very easy workflow when you have dozens or even a hundred or more accounts on there, and that is also efficient when left open for long periods of time. Claude strongly suggested that we force users to encrypt their vault with a passphrase all the time. However, this makes the CLI extremely painful to use if you are using a strong passphrase. I told Claude about the user experience impacts and that I wanted to allow users to optionally use a vault with no passphrase encryption, and it accepted that and suggested, as a middle ground, that we have a checkbox for the user to explicitly acknowledge that they're creating an unencrypted vault on disk. This is the right thing IMHO.
Skills help there.
I have a linus-reviewer skill that focuses on architectural integrity, no bs, etc., modeled on Torvalds' code preferences.
And I have an enrico-reviewer one (I'm Enrico), that focuses on correct design, strict typing, simplification.
They have different prios, but they both push back on feedback, till you convince them.
Obviously this is just my experience. Claude Code pushes back much harder than Codex.
An interesting thing about sycophancy is that it’s asymmetric. If an LLM is used to train an LLM, it may not have the same level of aggressiveness that humans do when pushing back on the trainee. Human pushback has specific patterns, which we might be able to compensate for because of that asymmetry.
I almost always end with something like: “, but I am not sure, evaluate.” Or other things and avoid ever stating a preference.
I don't think that "fixes" the problem, but it does seem to help. I also have found adding "please feel free to ask questions" seems to help it stop from making an assumption and spinning merrily onward for tens of thousands of tokens based on a bad idea rather than asking you something. I theorize this is because the training and refinement data overprioritize one-shot solutions, both because that's easier to evaluate at training time and improves their benchmarks. But I emphasize the italicized words because that's all gut feel and I can't prove any of it.
Tangentially related, but I’ve been using Claude to practice interviewing on system design problems, and it’s actually pretty great. Even when it likes my answers, it always finds something, however small, to push on. Once it was actually completely wrong and admitted it after I got it to realize. So maybe you have to prime it to be contrary and not agree with everything you say; putting it in the role of a tough interviewer seems to do this implicitly.
Take a look at hellointerview.com. Their model is very stubborn, similar to some interviewers who refuse to acknowledge even valid solutions that differ from the canon.
No affiliation.
Same. Alternatively (or in addition), I sometimes present my preferred idea as being a "bad/naive/stupid option" (or a suggestion from someone who can't be trusted) to see how it holds up when the sycophancy pushes toward calling it bad. As expected, the LLM will usually say "yeah it's bad!" and give plausible-sounding reasons for it, but if these reasons are nonsensical, it's a good sign that I'm not missing anything.
LLMs are very prone to priming in my experience. That is the human psychology name for what you are describing; whether it should be applied to LLMs I don't know, but it describes the phenomenon perfectly.
It's not limited to arguing with LLMs, but if you want an honest opinion you should remember to push back even when it agrees with your hidden preference at first. Sometimes it is only being contrarian or supporting the underdog. Steelman the opposition.
> I go back and forth a bit to see if its proposals are better than mine
I find it useful to let it generate benchmarks comparing the approaches. Turns out AI is terrible at guessing what's faster or allocates less.
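A benchmark of this kind can be quite small. The sketch below measures both time and peak allocation for two candidate implementations using only the standard library (`timeit` and `tracemalloc`); the two string-building functions are stand-ins invented for illustration, not approaches from the thread.

```python
import timeit
import tracemalloc


def build_with_concat(n):
    """Candidate A: repeated string concatenation."""
    s = ""
    for i in range(n):
        s += str(i)
    return s


def build_with_join(n):
    """Candidate B: a single join over a generator."""
    return "".join(str(i) for i in range(n))


def measure(fn, n=10_000, repeat=5, number=20):
    """Return (best seconds per call, peak traced bytes for one call)."""
    # Take the minimum over repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(n), repeat=repeat, number=number)) / number
    tracemalloc.start()
    fn(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


def compare(candidates, **kwargs):
    """Measure each candidate with the same settings."""
    return {fn.__name__: measure(fn, **kwargs) for fn in candidates}
```

Calling `compare([build_with_concat, build_with_join])` returns a dict of timings and peak allocations to decide from. Results depend on the interpreter version and machine, so they are best treated as evidence for one environment rather than a general ranking.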
Yup, just like people!
> Turns out AI is terrible at guessing what's faster or allocates less
s/AI/a human being/ would work equally well, lol.
Jokes aside, I do like the approach of letting the AI build something deterministic and make decisions based on that.
I think this approach is more common than the hype for actual work. I do something similar, many back and forth, then settle on something often with now known tradeoffs, written by hand to spot issues as a final guard/ keep consistent naming etc.
i bet you've contributed a lot of training trajectories for those AI's.
Good!
Despite the cynical sibling reply, I also feel like there's real value here. Contrary to the meme, I don't think Claude just tells me I'm brilliant, but really does push back on directions that are unproductive, helps identify when a part is overcomplicated or a dependency has become redundant, etc. Those are important things to have at least a sightline on before getting too deep into the code, even (or maybe especially) in a world where an awful lot of code can be created basically for free.
I'm usually the one spotting redundancies and dead branches in Claude's code, not the other way around. But I think either way, what's important is questioning the process and understanding the way the code is working so that you retain a full mental model.
>> and still largely understand the code [...] ,that, I feel has made me a better engineer
the cynic in me would say that a good engineer should fully understand the code you write.
I'm not suggesting that AI is the problem here - you could vibe code with the AI and have it explain the reasoning and patterns - or else tell it to use 'simpler' patterns from the outset. For any one problem in software engineering, there are always multiple solutions; some slower, some faster, some more flexible etc. The code you produce should, imo, be at the level that you can understand it.
How can you reason about code you don't fully understand? How can you judge the future impact (technical debt and the cost of maintenance) of your projects?
A.I makes it easier to get yourself into problems early on.
> How can you reason about code you don't fully understand?
We all do, though. It takes months for a human to really get to know a project and, unless you’re working at a small startup, you’ll probably never know most of the code outside the corner you work in.
Yes, this is why bugs often get worked around instead of being fixed properly.
One strategy I use in the planning phase is, even when I know how I'd implement the solution, I ask Claude/Codex how they would solve the problem or implement the feature without giving them any clues - and then compare their solutions to my own. Often I am pleasantly surprised by alternative ways of doing things and ideas that we integrate into the final design.
Same. I've been creating "research" documents where I let it do a freeform survey of possible solutions/have it sketch out its own solution. I'll then sketch out a plan based on what I think is good or what I think it missed, and then I'll have it interrogate me for a final PRD document. It then implements the feature in reviewable chunks, and I'll give it feedback or tweak the PRD doc as needed.
Finally feel like I have a good workflow where I can fully benefit from these things without sacrificing my understanding of what they're doing.
Same here. Step 1 is usually a research doc where I simply describe the task and tell it to research the relevant parts of the codebase. This gets refined to a high-level plan, which gets distilled to a detailed step-by-step implementation plan.
When it comes to the actual implementation I prefer to work through it in small steps, where the AI explains to me exactly what it's about to do and why (and I approve) along the way. This enables me to catch it if it's about to do something I disagree with beforehand. And reduces the time I need to spend reviewing in the end.
I like this, though it does leave me feeling more nervous when I really don't know how I'd solve the problem, still requires trust.
How would you approach this problem if you are let's say token constrained due to per month limits set in your company?
What I've tried to do is make the bot write detailed spec documents, slowly building it over time as I explain the full problem.
It works for the most part, but if you have some non-standard requirement, the agent seems to skip over that part of the spec document when it starts to code. Or it adds needless checks for situations that I said will never happen.
In my book, the single most effective way to spend tokens is having it review code/specs you've written. One advantage to putting the ai in that position is that unreliable competence isn't much of a problem as you can ignore bad suggestions.
I would also recommend explaining the specs and doing a lot of your back and forth with a lower end model and set it to a higher end model only once the conversation history has all the context you feel the higher end model needs.
As the post says, after an agent implements the plan, have another agent review it. Make sure to mention it must ensure the plan is fully executed. It works wonders!
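A minimal sketch of that "ensure the plan is fully executed" instruction, assuming the plan lives in a markdown file with `- [ ]` checklist items. The helper names (`extract_steps`, `build_review_prompt`) and the prompt wording are made up for illustration and don't come from any particular tool:

```python
import re

# Hypothetical helper: turn a markdown checklist plan into a reviewer prompt
# that forces the reviewing agent to account for every step.
CHECKBOX = re.compile(r"^\s*[-*]\s*\[[ xX]\]\s*(.+?)\s*$")


def extract_steps(plan_markdown: str) -> list[str]:
    """Return the text of every `- [ ]` / `- [x]` item in the plan."""
    steps = []
    for line in plan_markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            steps.append(match.group(1))
    return steps


def build_review_prompt(plan_markdown: str, diff: str) -> str:
    """Build a prompt asking a second agent to verify plan coverage."""
    steps = extract_steps(plan_markdown)
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        "You did not write this change. Review the diff against the plan.\n"
        "For EACH numbered step, answer DONE, PARTIAL, or MISSING and cite "
        "the file that shows it.\n\n"
        f"Plan steps:\n{numbered}\n\nDiff:\n{diff}\n"
    )
```

Listing the steps explicitly is the point: the reviewer has to answer per step instead of giving a general "looks good".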
This.
This is what I tell people (including non-programmers interested in vibe coding), the results you get are product of... process. Formal process.
From this naturally emerges the other thing I tell people: domain expertise (or at least familiarity and/or capacity for learning) is still a determinant of outcome.
I don't touch the code. But I do push back on expedience, laziness, inconsistency, and all the other recurring unsolved problems of generated code... and continue to play whack-a-mole in pursuit of process that whacks the moles.
I also like doing this exact thing. I really don't like using any AI-powered IDEs, but AI is still too useful, so what I do is just open up a Claude or Gemini chat, explain the project, and start talking about implementations, feature additions, and how systems should be structured. Most of the time, as long as you don't let the AI be too biased towards your answers, it'll give actually good answers that help immensely for the project.
>I argue about design and architecture all day with a robot.
You will outgrow it at some point.
Or learn something at some point.
https://en.wikipedia.org/wiki/Rubber_duck_debugging
Yes, this is the way I do stuff.
Try and learn at every point.
I think this is OK though. We can still micromanage[0] the code generation part for a useful productivity boost, I think.
[0] At least, in my experience, "micromanaging" the AI is what gives me the best results. Iterating on the initial design, then iterating on the plan, then reviewing the proposed code changes (including tests), then getting an independent code review from another LLM, etc. If you give an LLM too much latitude that's when the really shitty code and ill-considered breaking changes/obliteration of existing functionality starts to creep in.
I feel like there's an overly negative vibe to this response when it just seems like rubber duck debugging - I would assume the user isn't trying to argue like how you might have to argue specs, but is merely trying to clarify their own ideas and learn possible alternatives.
Quite the opposite. It’ll most likely “outgrow” us.
Can't, it ain't nothing BUT us.
You can wait and see, but that's what'll happen. If we stop it stops.
nullsanity's comment is dead and downvoted to oblivion but also incredibly underrated.
I was more annoyed than anything that I didn't hit this moment until my 40s.
Except it's not just reddit (I quit reddit 15 years ago). It's the whole internet.
What you guys don't understand is that you don't argue with people or robots to teach them. You argue to teach yourself. Until you get out of that mindset, indeed a lot of conversation will seem useless, be it people or robots.
>You argue to teach yourself.
Oh. I am aware. It is not that deep. But who you argue with still matters. There was a point where I abandoned Reddit and HN. I came back to HN because people here also seem to have grown up. Reddit stays mostly the same.
I credit the moderation here for that, I mean allowing people to grow out of the echo chamber.
It does to an extent. One thing I will give AI, because of the nature of LLMs, you are essentially arguing with the median level of the input that trained the model. So, for someone new to the subject, you get access to patterns that will bring them up to a certain level.
Getting past that is the problem we face now.
That may well need more than the models. Someone put it better than me: these LLMs have no taste - nor can they as things are.
>nullsanity's comment is dead and downvoted to oblivion but also incredibly underrated.
Yes, I thought the same as well because that was the same line of thought that made me write my comment.
>Except it's not just reddit (I quit reddit 15 years ago). It's the whole internet.
Yea, they are like a slingshot. You need to let go at some point or else it will drag you back.
It's like that phase people go through where they argue with morons on reddit, and then one day grow up and realize that most of these people are unemployed/underemployed terminally online nobodies who aren't ever going to learn anything, and even if they did it wouldn't impact the world since they were just some below average hobbyist anyway and aren't in charge of anything more important than a box of paperclips.
Ah, if it’s a robot in charge of the paperclips you need to watch out a bit.
Mostly with you, though in recent years I have wondered whether those people are part of what caused the latest boom of political populism. If there is no one there to debate the problematic ideas, problematic ideas will become the rhetoric after all.
That might be true on general-population social media, but the opposite is the case in niche groups, and in particular, this very industry we're in - software - was largely built on terminally online hobbyists.
I think that many AIs nowadays have a similar process incorporated in their thinking blocks. You can see there how the model discusses implementation details with itself, so such discussions happen even when no human participates in the loop.
Yeah I feel like a rubber ducking with some feedback has been very helpful
Yeah, me too. I argue with multiple models at the same time via a markdown doc to coordinate the discussion. I feel like it makes me less anxious about the final output if nothing else.
I agree with this take. But this take also means that actual productive token use is not as high as people currently make it out to be.
AI is an excellent rubber duck and test writer. Maybe I sniff my farts too much but I like my code just the way I want it lol
The professionalization of rubber ducking. I like it.
Yet, so many internet users seem to only understand "hand crafted" vs "vibe coded" as if there wasn't tons of middle grounds and different uses.
I think this is honestly the #1 best use case for AI in development. If you use it right it can be exactly the annoying junior who questions every decision you make that you need.
yes exactly. Too many people ask AI to one-shot complex tasks, and wonder why it behaves like a junior asked to rush something.
I have my own skill: 5 rounds of research/planning/test-planning. Interactive with me in loop for all important decisions. Starts with high level shape, then details. Planning can take 2-3 days of my time, then the implementation agent can take many hours (Opus 4.7). It splits the implementation across many phases/commits, each with its own code-review fix loop. Deep code review at the end can take another hour or two. It opens a PR, Gemini reviews, it reads out and resolves those issues.
Projects still take days or weeks, but 5x faster than doing it all myself.
Edit: the skill - https://github.com/scosman/vibe-crafting
"yes exactly. Too many people ask AI to one-shot complex tasks, and wonder it behaves like a junior asked to rush something."
Because this version of AI is worth 10 trillion dollars.
While the pragmatic versions from realists you can find all over this thread are ultimately probably less of a speed boost than just having your CEO/local micromanager be conveniently on vacation during critical periods when the work actually gets done.
"Because this version of AI is worth 10 trillion dollars."
I wonder how much the real version of AI is worth. I've got a hunch we're going to find out pretty soon.
Even fully planned it’s still no better than a junior dev. You’re leaving out how much back and forth you have the ai do on itself, which you’d have on a junior dev too. In the end does it matter if it’s giving you what you want? Guess not really. But let’s not act like it’s crazy good when you’re still doing a lot of rounds of revisions on something an experienced dev would know to do right the first time.
My personal experience with trying to front-load tons of planning and speccing out with LLMs is that at best it's a small improvement on code quality but with considerably more time spent.
As a result I've abandoned the idea of having LLMs generate code except for very small, localized and tightly scoped things. They really can't produce much more than a function or a small module without shitting the bed (last time I vibecoded was with Opus 4.6, Composer 2 and GPT-5.4). I use it almost entirely as another signal in analysis, which naturally makes it fit in better because all the other signals (reading the code, stepping through the code, writing the code myself) are already there so when the LLM points things out the information it actually renders can be taken in much more easily (and seen through more easily when it's false or irrelevant).
I think it's neat that people find fun ways to develop, but I think dressing up vibecoding in a fancy dress and layering SpecLang, sometimes in multiple steps, on top of it, is an exercise in trying to use the tool more instead of trying to use it in its most useful capacity.
I expect you'll be told to try Opus 4.7, and in short, JuSt WaiT FoR ThE NexT MoDel, BRo.
This has been my experience every time I've suggested that there are any sort of inherent ontological/conceptual or computational limits to the sophistication of LLM mimicry.
Does the 5x faster include shipping? Or just the work part?
IMO if you are not shipping out faster then the faster work gains are meaningless.
If you are shipping faster, you’re probably picking up more work and shipping everything too fast leading to burnout.
If you're not shipping faster, it's meaningless, and if you are, it's also bad?
If you're not shipping faster it's meaningless for the company.
And if you are, it's bad for the employee.
Is what the above comment actually said.
yup.
When I use AI to code this is pretty close to my workflow too, but I find it ends up taking at best just as long as if I were to write the code myself. In some cases I’ve thrown away what the AI has done and just done it myself. I think that’s just a skill people need to learn - at a certain point you have to cut your losses. I’ve seen some coworkers argue back and forth with an LLM trying to get it to do something. Especially true on simpler changes.
I've stumbled upon that too! Funnily I see it having two forms:
1. Some bad idea gets embedded into the context that you just can't argue away
2. Some important idea gets lost in compression and the AI veers off into funland without recourse.
In both cases it is often better to start over or just do it yourself. I sometimes find myself asking for a summary, editing it and then using the edited one to seed a new session.
Edit: s/Finland/funland/
And then Anthropic has an outage and you what...have a coffee break until then? All that time babysitting the AIs just to be a little faster but probably with less knowledge/control over what they did?
I don’t think you’re quite getting what OP is describing. I work in a similar way… I am aware of all the code being written. If Claude had an outage I could write it myself. It would just take longer.
You say “all that time” babysitting AIs but in my experience it isn’t that much time, if anything the back and forth at the planning stages is more productive than when I’m doing it by myself because I’m being asked questions and having to think things through from different angles.
> I am aware of all the code being written.
Define 'aware'. The volume of code for a feature/system that makes it worth using a more complex workflow such as this one is definitely larger than what a human can even briefly review and build a mental model of within a reasonable amount of time. Reasonable meaning not considerably delaying the process. When deadlines loom and management adds pressure, this 'awareness' is the first thing that goes out the window.
How do you stay aware of all code being written?
Maybe it’s just me, but I’ve never understood how one understands from reading code. Yes you can understand what that code does, but not why it was done that way instead of a different way. In the end I only understand it deeply if I end up writing it. Chatting through it is helpful to me, but having AI crank out code loses all of that context pretty quickly.
I’m not disagreeing. Just curious how you think about this, and if there are key parts of your process that help you stay contexted in.
If you can't understand why the code is done in a certain way from reading it then the code is missing comments or needs to be refactored.
Even code you write yourself, given enough time, you will forget the why unless you wrote comments. In a way comments are as much for you as they are for others.
Even before AI, understanding code you didn't write is essential to working on a team of other developers. If you can't understand the code from reading it, then that's part of the feedback loop - too complex, needs comments, etc..
On large teams you'll spend as much time reading code as you do writing it. And long term when it comes to writing maintainable code - the ability for others to read and understand it, including the why of it, is paramount. Your code could literally be around for decades.
> If you can't understand why the code is done in a certain way from reading it then the code is missing comments or needs to be refactored.
Code is never missing contexts. If what your code is doing is not obvious to the reader, it is bad code that needs to be fixed. Things like cryptic low-level expressions should be extracted to helper functions with descriptive names or even extracted into a class, and classes need to comply with the single responsibility principle.
Ah the classic thinking that 'code documents itself'. It does not. Some devs are so full of themselves they think their code is so good that it is obvious what their intent was. It never is obvious, and just ends up as tech debt. Write comments.
yeah that's how a simple algorithm that would fit on a napkin gets broken up into a soup of ravioli that I have no hope to understand. I often end up refactoring it into a simple function in a branch so I can figure out wtf is going on.
> yeah that's how a simple algorithm that would fit on a napkin gets broken up into a soup of ravioli that I have no hope to understand.
No, not really. You get spaghetti code by being unable to refactor your code to follow a consistent level of detail across calls. That's the textbook definition.
Once you start to follow basic code quality and software engineering principles, you'll notice right away that your code becomes both easier to understand and to test.
Codex barely writes any comments, while Claude makes a slop article for every one line commit. I’d enjoy something in the middle.
Yes exactly. I don't like Codex not writing comments - and even proactively removing useful comments! There was some change in the last month that causes Claude to write crazy long comments. I routinely have to ask Claude to 'tighten' up the comments before the final commit.
Try antigravity. I think it generally has the right level of comments.
I think it's just like reading a book. Will you get more context & understanding if you write the book? You most probably will. But that doesn't mean that you don't get anything just by reading it.
And if you already know the material explained by the book, yes i don't need to write it to understand it.
People get into being amazing at code by being interested in what it does rather than what it is. It's a whole area that I can see but can't get to, where it's all about DRY and elegance and what's being done is relatively unimportant because it's web stuff or whatever, just widgets and sadness.
As a result there's a whole universe of code where the how of it, the elegance, is the main thing, and what it's doing is putting characters on the screen a bit slower than the next thing, but there are some amazing concepts that are supposed to make it all an axiomatic synthesis of how to think about code forever, replacing all previous concepts of thinking about code.
Now AI can think about code forever while doing nothing.
If you only have one AI window open, you’re doing it wrong. You task swap to another window/agent, get it working on something, rinse and repeat. I can keep 4 busy most of the time. When I task swap I also check in on what the other agents are doing to make sure they’re on track, not blocked and not struggling.
So exactly like playing Civ or some other building game. You constantly jump around between your various units and correct what they are doing.
I do wonder how much of how people approach coding is shaped by the games they played when younger.
congratulations on your soon to be coming burnout.
Keeping that many tasks in parallel, running all the time will kill you.
If you have ever TL'd a team, it doesn't sound too crazy. I have 8 folks I generally talk to very consistently throughout the day. If I'm not in 1:1s with them I'm usually reviewing their changes or chatting with them over chat. I don't think I can do all of that and work with a bunch of AI windows, but I do think they could likely do something similar to me with several agents running in parallel.
Your team members can be held "accountable" of the code they write: they can explain it, defend it in a PR, take ownership of it.
Your LLM has forgotten whatever shit it wrote when you opened a new tab, and that responsibility is now on you. And it wrote absolute dog shit
I suppose it depends how hands-off the tasks are - I max out at 2 parallel sessions working on different parts and it's fairly exhausting once done. I can see the number of parallel work increasing if there's a good dev/test loop. But at $WORK, that's not usually an option.
So, hands-off meaning "just let the AI cook and don't check it"?
Either you follow everything it does, revise the plans, do the code review, manual adjustments, etc, or you run sessions in parallel, not being that attentive and constantly context-switch (also resulting in less attention I guess).
I fail to see the benefits honestly.
It's great to work from home so you can take nice little micro naps while code's generating, reviewing, building, and deploying.
A calm attentive alternative of vibe coding: restful coding.
It's much easier to read and review code after a refreshing cat nap, especially with a real cat.
Too bad that's not usually acceptable to do that in the office. It should be! Slacking off by sword fighting all day is too exhausting.
https://xkcd.com/303/
Nap while you can. The baseline is slowly rising; AI fed with organization context will hunt you down and lay you off, as it has done at multiple companies this spring already.
I mean, I didn't read it as a joke. Taking a rest can lead to a clearer ability to think... thereby being more productive, not less.
> congratulations on your soon to be coming burnout.
Multitasking does not mean burnout. It just means you are not wasting time while idling. Multitasking was not invented for AI coding assistants. What do you think feature branches are used for?
The constant context changes, mental overload, inability to focus on one thing and do it well is exactly what every software developer has been fighting against for the past thirty years because it leads to shit quality and burns you out. You're automating the burnout. Idling is a necessity, not an illness.
Your feature branch is to put things aside and send them to CI, or wait and think on them. Not to have four of them running in parallel in your head frying you.
> The constant context changes, (...)
After you put together a plan, today's models can take well over a minute to execute it. Also, your work shifts to code review and executing acceptance tests, followed by either tweaking your current change or moving on to the next change.
This is really not about context changes. This is about not having to switch contexts because your focus stays on architecture+review instead of having to do deep dives to type code around.
> Your feature branch is to put things aside and send them to CI, or wait and think on them.
No, not really. Feature branches, as well as most types of branches, is to set aside work fronts that are in progress and run in parallel.
>today's models can take well over a minute to execute it.
A full, whole, entire _minute_ ?! Sixty seconds ! Oh no, they must be optimized away, we do not deserve our free time like so, we should toil until we fall over because... Growth?
It's still context switching. Either what you're doing is surface enough that you don't give a shit, it doesn't matter and you don't review it anyways (so the only context is basically the prompt you wrote or the nth SELECT * FROM table CRUD piece of crap), or you're context switching and it's fucking you over. The context isn't about remembering how you write if err != nil, it's the expected behaviour of what you're working on.
You're not getting a promotion from doing this, you're getting burnout.
> Feature branches, as well as most types of branches, is to set aside work fronts that are in progress and run in parallel
They're not running in parallel, unless you use work trees. They were put to the side, because you can't continue or finish the work they're about. Even just three branches in parallel in a modestly active repo that happen to be long lived drift enough that just keeping them up to date with develop makes it a waste of time.
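For anyone who hasn't used work trees: `git worktree add` gives each branch its own checkout directory, which is what makes truly parallel work (or parallel agents) possible. A rough sketch, assuming a sibling-directory naming scheme that is purely my own convention; the wrapper functions are hypothetical, only the git command itself is real:

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, branch: str, base: str = "HEAD") -> list[str]:
    """Build the `git worktree add` call that creates `branch` in a sibling directory."""
    # Assumed layout: /src/app + feature/login -> /src/app-feature-login
    target = repo.parent / f"{repo.name}-{branch.replace('/', '-')}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(target), base]


def add_worktree(repo: Path, branch: str, base: str = "HEAD") -> None:
    """Run the command; raises CalledProcessError if git refuses (e.g. branch exists)."""
    subprocess.run(worktree_command(repo, branch, base), check=True)
```

The drift problem doesn't go away, though: each work tree is still a branch that has to be kept up to date with develop.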
Focus on one or two things, and do them well.
That, or get checked for ADHD.
Don't be so dismissive. Every person is different, and you struggling with multitasking doesn't mean everyone is.
From [1]
The scientific study of multitasking over the past few decades has revealed important principles about the operations, and processing limitations, of our minds and brains. One critical finding to emerge is that we inflate our perceived ability to multitask: there is little correlation with our actual ability. In fact, multitasking is almost always a misnomer, as the human mind and brain lack the architecture to perform two or more tasks simultaneously. By architecture, we mean the cognitive and neural building blocks and systems that give rise to mental functioning. We have a hard time multitasking because of the ways that our building blocks of attention and executive control inherently work. To this end, when we attempt to multitask, we are usually switching between one task and another. The human brain has evolved to single task.
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7075496/
Fair enough, so it's a misnomer. Let's call it task switching then, since we don't actually do tasks at the same time, but switch from one to the other. A Claude Code session helpfully prints a small tldr summary of the ongoing session, so that one can quickly onboard again to the task at hand. I do not find that draining, personally.
> A full, whole, entire _minute_ ?!
If you honestly had any concern about losing focus and being forced to context switch, a 1 minute pause idling while waiting for something to happen would represent the root cause of your context switch problems.
> What do you think feature branches are used for?
Yak driven development.
As the AI is working, I am working - reviewing, regression testing, thinking about whether the current implementation is too complex and how to simplify it etc. I totally review and understand everything the AI is generating and often push back, have it re-do something, or do it myself. In the end I feel like the quality of the work is at a v3 level in the time it took to do a v1. The productivity and quality increase is real.
Yes get a coffee. Being able to execute 5 things at once is amazing, but it's a recipe for burnout. We have to be more careful and explicit about how we spend our time, and that means more explicit time away. If this thing makes you 10x more effective (I truly believe it can), you can afford to spend 20% less time behind the desk and more time doing whatever it is that actually makes you happy. Hopefully your manager understands that calculus.
> Hopefully your manager understands that calculus.
The majority of jobs are still paid on a 40 hour per week basis. Disappearing for a day each week (20%) won't fly when you're full time.
I’ll deal with that problem when it happens
It’s a fragile equilibrium and it depends on the kind of project you’re working on. If the knowledge debt is OK then yes, it’s just like a delivery job: if the truck has an engine problem, I won’t continue delivering the packages on foot, or find and set up another truck from wherever the breakdown happened. I’ll just wait, because waiting is still faster than the alternatives; with that much knowledge debt, it would take too long to pick the work up by hand and continue.
Now if it’s my job then I can’t have a knowledge debt and if Claude is down I’ll continue working manually because I know and understand and can continue without having to understand a lot of logic before continuing
Whenever Anthropic is down, I switch to my other alternative AI provider. If that is also unavailable, or no more tokens left, then I can switch to my local AI. Not the same in terms of quality and speed, but good enough for an experienced engineer to still be more productive than falling back to doing it by hand. For my principal activity I do not want to be dependent on a sole provider. Besides that, I expect that the pending token price increases are going to hurt a lot of people/companies.
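The fallback order described above (primary provider, then an alternative, then a local model) could be sketched roughly like this. Everything here is an assumption for illustration: a "provider" is just a function taking a prompt and returning text, and the names are placeholders, not real client libraries:

```python
from typing import Callable, Sequence

# A provider is "prompt in, text out". Real clients would be wrapped to fit.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, exhausted tokens, ...
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        raise ConnectionError("service unavailable")  # simulated outage

    def local(prompt: str) -> str:
        return f"[local model] {prompt}"

    print(complete_with_fallback("refactor this", [("primary", primary), ("local", local)]))
```

The ordering encodes the quality/speed tradeoff: best provider first, "good enough" local model last.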
We're already having coffee breaks when AWS and CloudFlare are down. What's another break in the mix? If anything, we might be lucky that they're down at the same time, so we can consolidate the breaks.
What do you do when your search engine goes down?
I have all the relevant sites for my projects in my browser history. A search engine is just a quicker way to get to a particular page.
And then solar radiation permanently knocks out the electrical grid and you what... have coffee break until society finds a new equilibrium?
No, then you go back to programming on the white board, just like in college. /j
You can have multiple tasks running
why not?
then demand some lack-of-uptime compensation for a lack of uptime
"All that time babysitting the AIs just to be a little faster" doesn't seem like an accurate/unbiased portrayal of what they said: "The v1 feature feels more like a v3 given the amount of iteration it already went through."
Codex has 99.98% uptime
Unlike Claude who barely has 2 9s.
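To put those figures in perspective (taking both numbers as claimed here, not as verified stats): 99.98% uptime works out to roughly 1.75 hours of downtime per year, while two nines (99%) allows about 87.6 hours. A quick check of the arithmetic, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a service may be down at the given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


for label, uptime in [("99.98%", 99.98), ("two nines (99%)", 99.0)]:
    print(f"{label}: {downtime_hours_per_year(uptime):.2f} hours of downtime per year")
```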
In Soviet Russia, the AI babysits you https://en.wikipedia.org/wiki/In_Soviet_Russia
Company I'm familiar with that went all in on Codex ran out of tokens for a week and wouldn't increase their spend.
A pretty significant number of their engineers flat out refused to work. Like publicly said so. "Increase our plan or I'm taking the week off."
so how did this go?
management flinched first.
Similar approach, but I also go a step further with some basic manual architecture/high level contract/stubs setups, just to keep it consistent with other systems (and easier reading as well).
I've been doing the same thing lately and I definitely feel like stubbing out the high level architecture at the beginning makes a difference. The codebase I'm in now has very particular ways of doing things and claude doesn't always pick that up.
Style can be as important as substance.
I still do a lot of back and forth about the plan - have it written to a file. Read through the file, make changes by hand and have claude read my changes and on and on. But starting with the basic architecture there's less ambiguity.
How much are you spending a day for the tokens to do that?
Ingest big project, comment on it gets expensive. I'm not sure how expensive.
$200/month split between Claude Code Max and Codex Pro. Given how many hours a month I spend programming, my hourly rate, the amount of time saved, and the productivity/quality boost - I would pay a whole lot more if I had to.
You are definitely going to have to. I see these massive skills as soon-to-be artefacts of the past, they will be unwieldy in the non-subsidised world. I won't pretend to know what replaces them.
We have lots of open-weight models like DeepSeek V4 Pro that are very close to SOTA and we know the cost of running them.
This helps keep the other players honest: there's a limit to how far they can raise prices when there are already alternatives today and when there's zero lock-in.
That those companies can make revenues but only at the cost of burning investors money: that's not my problem.
My take on it is simple: "Give me something MUCH better than the best open-weight models at a price that's not crazy or you're not getting my money".
And it happens to be the take of many devs.
I'm still paying Anthropic, Google and OpenAI (OpenAI because I didn't manage to cancel my subscription and now their model is competitive vs Anthropic's models again) but eyeing a "Pi + open weights" solution.
Raise the prices too much and those companies selling access to private models aren't getting my money anymore.
Check out jwillmer/ai-status at GitHub @bottlepalm. It helps keep track of all the small fixes that are going on simultaneously. I created the tool for myself since I have similar workflows.
I think you need a skill that has the agent review the code itself, but in a different role from the one that wrote it. I did some research on this and developed a skill to get it done. So far it works well, though I plan to prove and improve it with more tests. Dog food is not always delicious, but it's not too bad either.
The problem is that I manually review the code before/after the review, as well as review the items to review themselves. You could easily put AI into a review infinite loop if you let it, and you also risk the code base going off the rails if you let AI go wild.
It's actually happened a few times where I need to back out entire features because AI went too far and I lost control/understanding of what the code is doing. Many people will give up at that point and let AI do everything - that is a mistake, at least right now and how you end up with unmaintainable vibe spaghetti slop.
I follow a similar approach and use multiple LLMs per task. The quality improvement is surprisingly large.
Lately I’ve been experimenting with adding an explicit reward function so the models optimize for measurable output quality.
This creates a generate, critique, revise loop where candidate answers compete for a higher score. It feels promising because it reduces the amount of handholding for every task. It is also more fun because part of the review process is embedded in the scoring function, which simplifies the review effort.
I don't have it automated, but I score on minimizing lines of code added, readability of the code, and quality of the architecture.
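For what it's worth, the manual scoring described above could be sketched roughly like this. It is only a minimal sketch under stated assumptions: the `Candidate` fields, the weights, and the `line_budget` are made up for illustration, and the readability/architecture numbers would still have to come from a reviewer (human or model).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lines_added: int      # e.g. taken from the diff stat
    readability: float    # 0..1, assigned by a reviewer (assumption)
    architecture: float   # 0..1, assigned by a reviewer (assumption)


def score(c: Candidate, line_budget: int = 400) -> float:
    # Fewer added lines is better; diffs at or over the budget bottom out at 0.
    brevity = max(0.0, 1.0 - c.lines_added / line_budget)
    # The weights are arbitrary placeholders, not tuned values.
    return 0.3 * brevity + 0.3 * c.readability + 0.4 * c.architecture


def best(candidates: list[Candidate]) -> Candidate:
    # The candidate with the highest score wins the round.
    return max(candidates, key=score)


candidates = [
    Candidate("draft-a", lines_added=120, readability=0.8, architecture=0.7),
    Candidate("draft-b", lines_added=300, readability=0.6, architecture=0.8),
]
print(best(candidates).name)
```

The losing candidate's critique can then be fed back into the next revise step, so the scoring function carries part of the review.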
You helpfully cite Claude w/ Opus 4.7 max and Codex w/ GPT5.5 xhigh fast, but what "AI" do you use for the initial design?
Claude primarily, though will sometimes get a second opinion from Codex.
I have a very similar workflow, and experience similar temperaments from the agents. I also find anecdotally that they are moderately competitive - you get very different attention from them when you say "competitor X wrote this - please find all bugs" than when you say "you just wrote this - please find all bugs".
Hah yea I just told them I wrote it, or I reviewed it. I don't want to get the AIs in a pissing contest with each other because they will get distracted and try to show off.
maybe it's dumb question, but how do you feed the results of one agent to another? do you copy and paste manually? or how do you do it programmatically?
When I pair Claude and Codex, I use claude-co-commands [0] to drive from Claude and talk to Codex via MCP. Lately I've found Codex has been far more consistent for my specific projects, so I've just been almost entirely inside Codex. YMMV
[0] https://github.com/SnakeO/claude-co-commands
Having the agents write their plans into text files and iterating on those works reasonably well.
Yea I'll take the review feedback from one, validate it, and then copy/paste it into the other session saying like, "hey I got this feedback, what do you think?" So I'm not even telling the other AI the feedback is valid, I want it to independently validate it. Often the feedback is not like a bug, but a red flag, design consideration, or trade off.
Often depending on how complex the feedback, I'll do it one at a time addressing each one individually. And after the feedback is addressed, I'll go back to the AI that generated the feedback and say like, "I handled 4/5 items you found, can you double check."
It's similar to handling PR feedback, where you do it, validate it, but then still have to submit it for peer review.
Just switch models whenever you want with the menu at the bottom of the chat window in Cursor.
And maybe don't use tools that lock you into one model?
tbh I'm just confused at why people ask AI to design features. Do you not know how to design a feature? Do you not know what you want?
This stuff works so much better when you just tell it what to do
Of course it's not black and white, there are many shades of grey in how detailed the design of a feature can be. Often even if I know low level details, I'll only give the AI high level requirements because I want to see how it would do it. Often it comes up with alternative/better ways of doing what I planned and I incorporate those ideas into the final design.
The designing is the hard part. Writing code from a comprehensive design spec is a small part of the task.
So, people do know how to design a feature, but they also know it takes a lot of time and effort. They want AI to do that work for them.
My sample size is pretty small but when I've witnessed people (both PMs and engineers) "design through AI" I have seen two flavors:
- aimless AI wandering, leading to pretty, frankly, useless design docs
- using AI to "expand" upon a bullet pointed/shorthanded design doc. To which I feel like saying "the bullet points are already a good design doc!"
I understand that teams sometimes have specific formats that they have to make deliverables for, but having a nice 5 point bulletpoint list turn into 5 paragraphs... all for me to turn the 5 paragraphs back into 5 bullet points in my notes is depressing.
I do think you can get a lot of value in the mechanics, I just have had so much success leaving the thinking to me and the rote stuff to the AI. I'm going to have to think about the design eventually anyways right?
I've noticed the following really helps (most important at end):
1. Have claude form the plan and converse with a simple "Note any concerns with this plan" type plan-critic agent.
2. Let it run.
3. After (with everything in context) have it make a future_recommendations.md.
4. Have it make a plan.md to implement those future recommendations, conversing with the plan critic.
5. Clear context. Repeat with 1. Do this loop a few times, with some feedback from actual review thrown in.
But, most importantly, because Claude will aggressively try to maintain code "as is", and happily build on its previous crap, while preferring to hand roll implementations of everything, add something like this to memories/directives:
* When evaluating designs, default to "pull in the library" over "hand-roll it." Hand-rolling is much worse than a dependency.
* "Precedent" / "matches house style" / "reuses existing pattern" / "consistent with what we already do" are not valid engineering arguments.
* This project is still in the development stage with no real deployments. Migration costs and existing precedent are not a concern.
With these, in the last week that I've started using them (after inspecting the insane justifications for leaving crap design decisions in the plans), Claude went from junior level slop that required more oversight than it was worth to something very reasonable, using standard libraries, requiring nudges for architecture rather than pure "wtf!?".
I think they've fine-tuned heavily towards "don't rewrite the codebase", which is completely rational from multiple perspectives, but also not appropriate for new code.
I do enjoy a considerable daily token allowance, so this may not apply to everyone.
Have you tried telling Claude to review with a subagent? It too almost always finds corner cases (usually nothing serious, but mostly things that a good coder would have thought of)
How does that work? Isn't writing code and reviewing code things that happen in serial?
This is exactly my process as well. Although interestingly I swap Codex and Claude; having found Claude way more pedantic in its reviews and Codex more pragmatic in its implementation. Maybe it differs per programming language.
At this point one might as well code by themselves
Unfortunately the projects are still too big. Projects with hundreds of thousands to millions of lines of code can't be maintained by a single person reviewing all the changes. And AI only increases the speed of iteration and the amount of code to review.
We may need some sort of paradigm shift - like more powerful frameworks or even higher level languages that allow us to review less, but more functional code blocks.
When AI tries to improve such a large code base, who is even going to review the changes?
Like I said, we either need more people or some paradigm shift in tooling that allows us to do more with less.
> I've hit this point with AI where it's not a simple process, but a long drawn out back and forth.
In my experience, even on a relatively trivial task, you can ask an LLM at least 20 times:
Is this actually done, or only partially implemented? Did you finish x, y, z?
And the LLM will say, no, I'm not done and keep working.
After that, I'll feed the branch to a different LLM, and ask if the implementation matched the design, where it's weak and needs improvements.
Same thing - that feedback will usually only be partially finished for several rounds.
When they all agree it's done - I'll finally look at the code, and there are still typically glaringly obvious problems - duplicate systems that reinvent the wheel, etc - that will typically take more than one prompt to get right...
Getting things right takes almost ~100x as long as getting things almost right with LLMs.
You can tell an LLM to "make me Rust, but easier. Make no mistakes," and it'll plan out a 100 commit process and get something that - somehow - sort of works... but isn't even close to complete.
Still, on a cost basis, you're able to get features that would take you several times longer and cost orders of magnitude more money, and - if you're doing it right - they'll probably do a better job than you would've done (at least for me).
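The "ask until it admits it's done" routine above can be sketched as a capped loop. This is a rough sketch, not a real harness: `ask` is a hypothetical callable standing in for whatever sends a prompt to your agent and returns its reply, and the "reply DONE" convention is an assumption made up for the example.

```python
from typing import Callable


def run_until_done(ask: Callable[[str], str], max_rounds: int = 20) -> tuple[bool, int]:
    """Re-ask the completion question until the reply starts with DONE or the cap is hit.

    Returns (done, rounds_used). `ask` is a hypothetical stand-in for an agent call.
    """
    prompt = (
        "Is this actually done, or only partially implemented? "
        "Reply DONE, or list what is still missing."
    )
    for round_no in range(1, max_rounds + 1):
        reply = ask(prompt)
        if reply.strip().upper().startswith("DONE"):
            return True, round_no
        # Not done yet: tell it to continue, then ask the same question again.
        prompt = "Keep working on what you listed, then answer again: is this actually done?"
    # Cap reached: stop and hand the decision back to a human.
    return False, max_rounds


# Fake agent for demonstration: two "not done" answers, then DONE.
replies = iter(["Missing: tests", "Missing: error handling", "DONE"])
print(run_until_done(lambda prompt: next(replies)))
```

The cap matters more than the exact wording: without it the loop has no reason to ever stop on its own.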
This is where the human element is critical, because it'll infinite loop on review feedback if you let it and the code will easily go off the rails into an over engineered mess. That's why I review the code before/after as well as review the actual feedback itself - and often give the feedback to a different AI to get its opinion, as the other AI doesn't have a vested interest in it and can be more critical. At some point though you do have to cut them off and ship.
Sounds exhausting
It is.. but so is dealing with issues at runtime, going through weeks of revisions, and dealing with technical debt.
You've essentially promoted yourself from coder to engineering manager, trading syntax fatigue for the mental marathon of refereeing specialized AI developers to ship v3-quality code on the first try.
Indeed. AI is bumping everyone up to manager level, and having dealt with long PR feedback cycles with humans for years - I don't mind the promotion. Also shipping a v3 is so much nicer than shipping a v1 and dealing with all the corner cases in production.
Before AI, myself and everyone else I knew was drowning in tech debt. And now with AI we are treading water.
It's bumping to manager level, except without the 1:1s, quarterly/yearly planning, headcount and budget reviews, org/reorg discussions, performance calibration, and OKR planning. No complaints about the last review cycle or about the upcoming one.
All the ceremony must be replaced with process optimization, skill extraction, harness development and new model evals.
Still better than dealing with people, but only just.
Totally! But you know what? There are many, oh so many developers that are not ready for, don't like, and probably are not even cut out for this kind of position.
Some see it as a promotion, others (like me) as a demotion. I still prefer to do it myself, although I like code reviews done by AI, they do help to make code a bit better.
That sounds too much like three weeks of work saving you three hours of planning.
In my experience, software engineering is a matter of knowledge. Understanding it and then coming up with a solution. The latter is a flash of insight that comes mostly from experience. Then you gather more information to flesh it out, or brainstorm it with your colleagues.
What you're describing sounds more like a ritual of doing busy work than anything practical. Because tasks vary so much. A feature may be huge, but you take care of it in a day with copy pasting because you already have all the building blocks in other files. And something may be twenty lines of code, but you spent the whole week sweating on it (concurrency stuff maybe). Those ritualistic workflows sound more like someone imagining software development than actually doing it.
A lot of people say you need to go through at least three versions of something before it is mature - and v3 is not something you can design upfront. You need to see v1 both in code, and at runtime. Use it, get the feedback, and iterate. This is where AI tightens that loop immensely.
Lost you in the last paragraph - features are not "copy pasting because you already have all the building blocks" and "something may be twenty lines of code". Mid sized features often mean tearing up many layers of code across the stack to add in some sort of new capability. Tearing up existing code means there are all sorts of add-on considerations in addition to the feature you are working on.
> Mid sized features often mean tearing up many layers of code across the stack to add in some sort of new capability
What? No, it shouldn't. I've worked on a lot of codebases and if you have to do this, something is very, very wrong.
This likely assumes you have a mature and well designed (architected) code base. That is not always the case, and as features get added and removed, that won't be the case at all until there is a refactor.
Nothing wrong at all. Some features you can bolt on, and some features fundamentally change how a system works requiring changes at many different levels of the stack. Happens all the time.
It happens in poorly factored codebases. If you find it happening that's a sign you need to refactor. If you find it happening repeatedly in the same codebase that means you failed to refactor properly the first time.
Not many industries can afford refactoring code that is not supposed to be changed - additional (unexpected) regression testing costs, risk of downtime, etc. You learn that if it works and is in production - don't touch it.
Refactoring is the natural evolution of a growing application. Refactoring too soon, too fast is what we call over engineering. Too little refactoring and your code becomes spaghetti slop. Regardless - the application will change across all layers across its lifetime.
Overengineering is totally a thing, yes. If you want to make a proof of concept or you have no customers, that's fine, ship it.
There's such a thing as under engineering, and if you find yourself changing "all the layers" for a feature, your codebase is poorly designed.
How many layers does your code have?
Even with clean architecture, you only have 4 fundamental layers. And once you have v1, you’re mostly doing tweaking and copy pasting. Any huge refactoring is the business switching its main strategy.
Take an OS like OpenBSD. It has three main layers. The syscall layer, the kernel layer, and the machine dependent code. But an OS is more spread horizontally with various subsystems (process and memory, io and other device, ipc,…)
If you’ve captured your problem’s domain and adopted a pragmatic architecture, you will rarely have to change across all layers. That’s costly and happens mostly due to business reasons.
Let's see: frontend presentation, frontend service, frontend api, backend for frontend (BFF) api/routing, BFF logic, BFF api, backend routing, backend logic, backend database, worker routing, worker logic, worker storage.
And then each of the service layers can itself be broken into layers, depending on the complexity of the business logic. So yea a change in a worker can potentially bubble up through all the layers.
This all sounds insane. If it requires so much back and forth with the AI why on earth wouldn't you just write the code yourself? At least then you build the mental model of the code and keep your brain healthy. Reading the comments in here about all the hoops people are having to jump through just to do the same thing they were doing a year ago without AI... and spending a fortune to do it! I think you've all got AI psychosis.
Five years ago I would never have imagined this is where programming would be, but at the end of the day having the AI write the code is easier, faster, and results in higher quality.
The mental model is still in my head, my brain is overloaded, but only from the amount of code reviews - like I said, I'm building v3 of a feature in the time it takes to build v1, but I am in a way doing 3x the code reviews going back and forth. That's the fall out of the iteration speed enabled by AI.
Between submitting PRs, getting feedback, iterating, re-submitting, repeat - there used to be breathing room. Now it's all compressed into an afternoon. Productivity is through the roof, but it can be draining.
You're not on v3 lol. You're on v1 that you had to redo three times.
If the feature isn't released, it's not a new version.
Semantics. In reality yes it is the v3 equivalent in terms of maturity and iteration. I know because I've been doing this for a long time. We are getting to v3 and beyond faster than ever before.
In the new world there is no time to put out v1 quality code and it is borderline reckless given how easily things are getting hacked now. You need to be putting out heavily reviewed code that covers all the corner cases on the first release.
No, you're getting to v1 in the same or more amount of time. I know v3 sounds better, but coding and throwing it away is literally just redoing it. If you're not releasing it, it's not a new version.
There's no such thing as "v1 quality code", you just haven't finished it yet.
You've missed the point
> If it requires so much back and forth with the AI why on earth wouldn't you just write the code yourself?
Maybe I'm too far gone down the AI rabbit hole, but that seems a really strange take to have. If you replaced 'back and forth with the AI' with 'pair programming' or 'brainstorming' this phrase would be really strange, after all these are all techniques to sharpen your ideas. Even 'rubber ducking' is widely accepted as an effective way to go through a problem, and you can definitely use AI as a rubber duck.
For me the idea of chatting with the AI about a problem/solution is just another tool to help us work. It's not the best solution because it has a lot of downsides you should be aware of while using it, but that is true for any technique including 'writing the code yourself'.
You can be right, but quite often it helps keep the focus on the forest rather than getting lost in the trees - at least for me. Boilerplate steals a lot of attention and focus, and can just be mentally exhausting.
Can someone explain these complaints about boilerplate to me? What are y’all doing where boilerplate is the majority of your code? Am I the only one mostly writing concise business logic where most lines are important in one way or another?
one man's boilerplate is another man's concise business logic )))
When I first read the comment I thought this must be satire, it sure does sound like a Silicon Valley episode, but in modern times. I've been a skeptic for quite some time, but managed to get quite good results with Claude in general, not even going through the normal limits for a Pro account, but what people are describing here seems like just tokenmaxing, brute forcing a solution, I don't understand what code people need to write and what projects people are building, is everyone just constantly rewriting systems from scratch, or what is everyone spending these insane amounts of tokens on?
I honestly don't get it, either. Most of them just flat out can't code at all, but for the ones who can, the only explanation I got is it feels like productivity.
I will say, it does help me get over procrastination lol. I get annoyed by the robot doing dumb shit and finish it myself.
I’ve found that it’s a lot like discovering a feature instead of designing it all up front. Like chiseling marble.
I’ve found it useful to write out a list of feedback / issues and have a bunch of sub agents work on them in worktrees with a loop bringing them all back together. That way it can work for a few hours while I just can review a bulk at a time.
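A rough sketch of the worktree setup described above, assuming one branch and one worktree per feedback item. The branch naming, the `../wt` directory, and the `main` base branch are invented for illustration; the code only builds the `git worktree add` commands and leaves running them, launching the agents, and merging back to you.

```python
import re


def slug(text: str) -> str:
    # Lowercase, collapse anything non-alphanumeric to "-", keep it short.
    return re.sub(r"[^a-z0-9]+", "-", text.lower())[:40].strip("-")


def worktree_commands(items: list[str], base: str = "main", root: str = "../wt") -> list[list[str]]:
    """Build one `git worktree add -b <branch> <path> <base>` command per feedback item."""
    commands = []
    for i, item in enumerate(items, start=1):
        branch = f"fix-{i}-{slug(item)}"
        commands.append(["git", "worktree", "add", "-b", branch, f"{root}/{branch}", base])
    return commands


feedback = [
    "Handle empty input in parser",
    "Retry on timeout (API client)",
]
for cmd in worktree_commands(feedback):
    # Printed rather than executed; pass each list to subprocess.run to actually create it.
    print(" ".join(cmd))
```

Each sub agent then works in its own directory, and the resulting branches can be reviewed in bulk before being merged back.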
I've settled on the same workflow.
Also I never multitask with multiple agents doing other stuff. Meh I focus on just the one task.
I do multi-task a bit while AI is running, sometimes working on another feature with AI in parallel, but jumping between reviewing different feature iterations is draining, though not much different than the real world juggling PR reviews for a team of devs.
That sounds expensive.
This seems like a typical AI workflow, but isn't it dreadfully boring?
No, I find it stimulating. With AI I'm moving faster and producing code at a higher quality than ever before.
Don't get me wrong I used to enjoy writing code by hand, but I don't think I would anymore. I don't like writing code for the sake of writing code - I like building things, I like being productive.
Yes, but still lucrative so here we are.
You could just use Xiaomi Mimo for all of that and it would be cheaper and faster than all of them...
Quality is 100x more important than cheaper/faster.
Fun fact, I've recently sent some 你好 to qwen3.7 (API), and it responded with a greeting saying that it was created by Google.
I don't care about cheaper. I care about faster.
Your comment begins like ai slop.
I think you're projecting.
The funny thing is that you've just described an idealised development process as would be used by effective, skilled humans in a heterogenous team where everyone has a speciality.
If only things were so! If only code was discussed, reviewed, iterated on! If only the "manager" actually read the code, provided actionable feedback, and disseminated PRs to multiple people with diverse skill sets.
(If you can't tell, I'm a jaded consultant desperately trying to make the horse drink the water.)
I've worked in large teams for many years and yes it's just like that, but without the time constraint. PR's can only go back and forth so many times. Depending on the reviewer they may phone it in, or focus on different things depending on the person. You yourself aren't able to implement every piece of feedback due to constraints and it ends up as tech debt.
So AI definitely changes the game. I feel like we almost need something higher level for reviewers to review changes faster. Today's code is starting to feel like assembler. Too much of it, too low level. We need even higher level constructs to be able to do more in less time. I'm just not sure what that is.
The Claude/Codex loop is the current state of the art in my opinion. I've got a silly little harness that glues them together that I have spent all day, every day in for months: https://github.com/pjlsergeant/moarcode
> You design, Claude writes, Codex reviews, and Gemini doesn't get installed
hahahahaha
I am not switching the different LLMs as much, but my approach is similar:
1. I write a list of things I want to have without AI support
2. I discuss the list with an LLM, which occasionally reveals obviously missing things I hadn't thought about or just things that would be smart to have. Or sometimes the LLM doesn't get it and wants to funnel me down a commonly walked path, which is a non-goal
3. From that list I draft an implementation plan containing things like how the code shall be structured, which language, libraries, build systems, etc to use. This may even contain some data models and considerations that are more detailed, like for example ideas about how a specific interaction shall be event sourced. I work on that, till I feel a satisfactory level of clarity has been reached
4. Actual writing of code as a back and forth between manual writing, letting an LLM write something and so on. LLMs suck at writing CSS that feels like good UX design to me, so usually templates, layout and CSS will be (re)written entirely by hand
5. Bug-hunting and guessing potential edge cases is one thing where LLMs really shine. Often if the work before that was quality the LLM has an okay time coming up with fixes that are no worse than what I would have done.
Low frequency defensive long drawn out back and forth bullet dodging vibe coding should be called "serpentine coding".
The In-Laws (1979): Getting off the plane in Tijuara:
https://www.youtube.com/watch?v=A2_w-QCWpS0
Heh it feels like that in a way, and the more complex the feature, the more endless the back and forth reviews can be - there seems to be always some feedback, so you need to decide when to be done with it and commit. You can easily get into review paralysis.
This is where I’m at too lol.
As a junior, I do actually enjoy going back and forth with the AI discussing different ways to implement something and exploring alternatives.
More often than not, I'd have an architectural idea that I'm not that confident in. The process of talking with the LLM takes a long time but it helps me sharpen the initial approach or even come up with a new one depending on the requirements.
Be sure to explicitly ask for critiques or alternatives. In my experience the machine is really susceptible to a sort of anchoring effect.
In this vein, I have a system level memory for Claude to push back and give me direct feedback when possible. So far a success as it helps cut through the sycophancy.
This is how I do it too and I’m an architect/senior dev. Keep it up!
As a senior, I do the same.
Same. My total agentic session time/effort is about 70/30 discussion vs code-gen. I've got 12 years experience.
This used to be called Pair Programming. And just letting you know, it’s not just juniors… we all do this :)
This approach works well, and leads to better code being shipped. The key disconnect for me is not always the code being of high quality, but ensuring my understanding of the "Why" of the bug and the fix is good enough to justify it and then also explain it if the time comes.
That said, I'm learning to let go as much as I can and trust these things when it's "safe" and seeing how that shakes out. The risk is something falls over and I don't know how to fix it (of course) but I know it's a risk and I'm trying to avoid it so it probably won't be as bad as I catastrophize.
This article doesn't address writing code with AI, just code review. My issue with agentic coding is that I make numerous micro-architectural decisions while programming. I almost never have a full spec up front and develop one as I consider what I am writing.
When using Claude Code or Codex, that is all gone. Claude Code is extremely eager to reach the end goal to the point that it feels like a fever dream to write code with it. In the end, I have low confidence about edge cases and fit into the project's architectural and design goals.
On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away. I'm trying really hard to find a workflow with them that I'm confident in, but I fear that workflow is just chat, search, and being a rubber duck for my thoughts.
> but I fear that workflow is just chat, search, and being a rubber duck for my thoughts.
That's still a lot of benefit, though. I have to agree with Patrick McKenzie on this one (https://x.com/patio11/status/2058631943785488815):
> If the only impact of LLMs professionally was causing people to "think out loud" in a way which was routinely captured by computer systems and then could be operated on by computer systems, that would by itself be one of the most consequential changes in practice in 100 years
> This article doesn't address writing code with AI, just code review. My issue with agentic coding is that I make numerous micro-architectural decisions while programming. I almost never have a full spec up front and develop one as I consider what I am writing.
Working with AI forced me to write better specs, but the way I write today is very different. I typically open Codex with the Linear MCP connected, and my chat with the AI ends up writing the issue. It's a lot of back-and-forth where I say what I want, the AI does all the code scanning and writes something, I correct something, etc.
The value for me is exactly that I say what I want and the AI verifies in the actual code whether that's the path that makes the most sense. In the end I have a pretty detailed spec that I'm much more confident is the correct path.
I find the spec easier to review than a huge PR, so execution is typically much faster and aligned with what I want.
The grill-me skill from Matt Pocock is great for this (https://github.com/mattpocock/skills/blob/main/skills/produc...)
> I fear that workflow is just chat, search, and being a rubber duck for my thoughts
This is exactly what I settled upon after my own trying really hard. It is liberating, I have no fear at all!
> On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away.
Same, I prefer asking one or multiple very technical questions to Gemini, analyze, compare and understand the responses then implement it myself based on what I learned (or just integrate it to the codebase as it is, if I asked it to write a function) than delegating away all the fun to an agent.
I find using the LLM to generate different git repo skeletons for the same class of project using the 4-5 different programming languages I’m familiar with is really interesting and helpful. Then I ask it to explicitly describe its design decisions for different parts of the small codebase, i.e. what do the internal APIs look like, so that if you make changes in one section of the codebase, you can be sure you don’t accidentally generate problems in another section of the codebase. Only once you’ve worked out all such constraints, clarified dependencies, etc. do you start generating code in each subsection and that’s done using the specific constraints for that section in each prompt, and reviewing all the code. This is also when you generate the tests for each subsection. Finally this is where using a different LLM(s) for code review after the code is written becomes important. It’s a slow process certainly but it seems to work pretty well.
A lot of programming work is well represented in the training data. For that kind of stuff there’s not much to do regarding architectural decisions. I love to run the LLMs on auto for that work. But for anything not well represented in the training data, which could be anything from mundane stuff in PyQt to a truly novel application, keep them on a short leash or forget them altogether.
> represented in the training data
This isn’t a binary is/isn’t thing though. What if only 80% of my task is, how would I know that the other part isn’t, if I haven’t worked it through fully
What if my task is generally represented, but for my specific context, there are specific details that aren’t?
How would I know until I’ve reasoned through it myself? At that point having the LLM do the work doesn’t add much value
I find myself spending on average more time in LLM review/resolution loops than it would take for me to write the code by hand. Partially because once I'm in the flow I write very very quickly and the code pours out sometimes faster than I can write. But also because the LLM code on the first few tries is generally really really bad. What I find interesting though is that spending the time to personally review and direct the LLM through several iterations of review and revision on average results in higher quality code written in about the same time as I would have written it. This might be particular to me, but seeing several iterations of someone else's code helps me better understand holistically my objective as opposed to whatever happens to come out of my flow-state consciousness.
>the code pours out sometimes faster than I can write.
Meaning that you type the code faster than you would normally type prose? Or just what?
If your AI is writing bad code then you need to change your AI. No current high-end AI should be producing bad code.
This sounds like a subjective assessment. I counter with the opinion that most LLMs write technically correct, but bad code. When I read it, it makes me want to gag or poke my eyes out. I spend a lot of time wondering about what kind of person would write it like that, then I realize it’s an LLM
The tool is important, but so is the way you use it. I've seen small LLMs produce good code and frontier LLMs produce poor quality code, depending on context.
This is delusional. Opus 4.7 regularly produces pretty bad code.
Source: trust me bro.
It's most likely they can't prompt properly.
This feels like a comment from 2 years ago; by now the most modern models write much better code than humans can in much shorter time.
But if you're not used to code reviewing, it can certainly help to still write yourself.
> models write much better code than humans can
What? I think this is either over exaggerating model capabilities or you haven't seen much good code from humans?
My experience is that my colleagues who have bought into model-first development have regressed in quality of the PRs they send out. LLMs are not better coders, in my experience. They lack holistic understanding and often need course correction for that reason. At least in medium to highly complicated systems.
Over my time in the industry I've become increasingly convinced most people haven't seen what good human programmers are like. Otherwise we wouldn't have the popularity of things like Scrum, Clean Code (the book, not the concept), etc.
I was lucky enough to see some good teams when I was a student (both at Berkeley itself and by interning at Jane Street), and it totally changed my intuition for what good programming is like. It's gotten to the point where I'm convinced there are two incommensurable paradigms in programming, and we're constantly talking past each other.
Like, if you have an ongoing project where the codebase has grown over time, do you expect it to get easier to do things or harder? I've worked on projects where it's obvious that things are always getting harder (old code is hard to change, you have to deal with lots of complexity and edge cases and workarounds). I've also worked in codebases where things got easier over time: you get better abstractions, more libraries, more capabilities. That can be a lot of fun; you think of a new thing to try, and you have the pieces to just do it.
Or another point of comparison: do people think that writing good code slows you down (so it only makes sense to avoid bugs), or do people think that writing good code lets you move faster? I've talked to people for whom one or the other is totally and obviously true. (I'm solidly in the second camp myself.)
But the surprising thing was how "obvious" the dynamic was in both cases, even though the two cases are exact opposites of each other! If you ask one group or the other they'd just tell you that, well, that's simply how programming works. Of course things get (easier|harder) over time. That's built into people's fundamental understanding of what programming is and how to do it. And that's exactly what I mean by incommensurable paradigms.
Anyway, this is a bit of a tangent from the main discussion, but it's something I've been thinking about a bunch over the last few years, partly inspired by the advent of AI-powered programming, but largely thanks to experiencing some very different projects and teams...
That's an interesting point, and maybe that actually also explains the difference of people that believe that AI is making them more productive and people that believe it doesn't; if you never think about the architecture, then it becomes more slop over time, and it becomes harder to do anything. If you do think about architecture, development becomes easier and faster all the time. AI just accelerates both processes.
I’ve seen both sides of the fence, and if there’s one quality that seems to define that fence, it’s caring about the process. Both sides want to achieve the same goal, but one cares about the process (enough to make it less tedious) and the other doesn’t (whatever seems to work is OK).
That’s why they say the best programmers are lazy. Not in the sense of avoiding any kind of work, but avoiding the kind of senseless stuff that’s sure to come down the line if you’ve not taken care of the process.
Fair enough; I'm talking about relatively "small" snippets, where a model with reasoning algorithms can quickly give you a better result than a mediocre or even senior developer would if you gave them an hour.
Managing a complete codebase, making architectural decisions, designing business logic; that is not something you should let your agent do.
But I see that as a different task from "coding".
“A lot of people seem convinced that the point of AI coding is to write low-quality code as fast as possible.”
A lot of people think a lot of things, but I don’t think the majority of people think the point of using LLMs is so they can produce low-quality code. Do they produce low-quality code sometimes or often? Of course. But they also produce high-quality code very often. And sometimes they just do a “fine” job.
One of the promises - and there are plenty of cases where it’s met and where it falls drastically short - is that agentic coding tools can help us write code faster that is just as good as or better than what a human can write. One of the other big ideal payoffs is that agentic coding can allow non-programmers to create things that previously required programmers to create.
We can debate as to how successful we’ve been toward the two goals above, but I think it’s misguided to say that the majority of people think LLMs should produce lower quality code.
> We can debate as to how successful we’ve been toward the two goals above, but I think it’s misguided to say that the majority of people think LLMs should produce lower quality code.
Guessing you’re not at FAANG or similar company. For the last 6 months at least there’s been tremendous pressure from leadership (including highly experienced IC engineers) to let AI take the reins, assumption being that future AI assistants will be able to deal with any level of complexity and tech debt created today.
Given that everyone agrees that reviewing all AI-generated code is impractical (if you let the agents rip at maximum available bandwidth), and that “harness engineering” is at best immature and at worst complete snake oil when it comes to ensuring system stability, maintainability, and quality, I do believe that it’s fair to claim that most engineers are, in fact, supportive of low quality code generated by LLMs.
Fwiw I do see pushback here and there, but only from the lowest rungs on the career ladder - ICs with enough experience to see where this train is headed, but no ability to save it. Management needs to see the results of their policies first, and that will take months or even years to fully play out.
Hopefully not, but there was a recent thread with multiple posters arguing that code quality doesn't matter, and quality produced by humans in the past was often terrible. So who cares, ship it was the sentiment. Let the AIs handle the growing maintenance cost, I guess?
Kind of a shocking thing to see argued on HN. Maybe it's just the vibe coders.
The vast majority of corporate-employed programmers write bad code. I think maybe 10% of the people I’ve come across have shown any interest or care in the quality of code they write.
There will be a large majority of people who hold these opinions, because they weren’t capable of or didn’t care enough to write good code in the before times
The real problem with these conversations is that code quality isn’t something we have any kind of consensus on.
To a lot of engineers code quality means upper-case C Clean Code. Other engineers are in the Grug brain camp where they think that premature abstraction is the worst kind of code.
But to your point I think the majority of engineers think that high quality code is anything that compiles or passes their (almost definitely insufficient) test suite.
HQ is 0 lines
> arguing that code quality doesn't matter, and quality produced by humans in the past was often terrible
You're conflating two different things. I'm one of the people arguing for the latter, but not because I don't think code quality matters but as a counter to the sudden idealization of handcrafted code.
> We can debate as to how successful we’ve been toward the two goals above
No not really. These are separate questions from what the article posits. The argument is about how do we use these tools, our approach as developers, and if the results are going to be as rosy as advertised.
Eh, I definitely do think that it has become a mainstream take. Not necessarily that we want lower-quality code, but simply that humans shouldn't be reviewing AI code for quality at all - that is, that code quality doesn't really matter and what matters is that the software works.
This is the entire premise of the concept of "vibe coding", and the concept of non-programmers using coding agents. The idea that there aren't large amounts of people and companies doing these things and/or who consider it "the future" is hard to argue.
But how do I know if something works if I don't know how it works? By testing (literally) all use cases, every single permutation of variables? For complex programs there might not be enough time and energy in the universe to do that.
If I know what addition is, I can look at a line that does addition and reason about it. If I just check "if it works", for all I know, the actual code is something like
Sure, I can use an LLM to check on the first LLM, and then a third LLM to check on the second, and so on ad infinitum, but none of that, at no point, can give me what "knowing what addition is" gives me.
It's kinda like cheap/fake concrete: If you know something about concrete and what concrete is being used, you can roughly tell if it will last, what it will withstand. If you just go by "seems to work", "looks good", you get collapsing bridges and buildings after a few years, during heavy rainfall etc.
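To make the addition point concrete, here is a purely hypothetical sketch (`add` and `spot_check` are names invented for this illustration, not code from any real project): a function can pass every check somebody happened to write and still not be addition.

```python
def add(a: int, b: int) -> int:
    """Looks like addition from the outside, but is not."""
    # Hard-coded answers for the inputs someone happened to test.
    known = {(1, 1): 2, (2, 2): 4, (2, 3): 5}
    if (a, b) in known:
        return known[(a, b)]
    # Anything untested falls through to a wrong rule.
    return a * b


def spot_check() -> bool:
    """The 'seems to work' test: only the cases someone thought of."""
    return add(1, 1) == 2 and add(2, 2) == 4 and add(2, 3) == 5


if __name__ == "__main__":
    print(spot_check())  # True: every spot check passes
    print(add(3, 4))     # 12, not 7
```

Reading the body of `add` exposes the problem at once, while black-box checks only expose it if someone happens to try an input outside the hard-coded table.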
The linked article about getting LLMs to critique each others' code review[1], the magpie tool[2], and also this recent article from Cloudflare about their code review stack[3] are all quite compelling.
I'm fairly AI-skeptical not on grounds of "do they work" but "are they good for the world". I feel that getting AIs to do this kind of review work is a rare case that doesn't outsource thinking and deskill workers. It doesn't trigger the same alarm bells as having the AI write the code (including having the AI fix the issues it discovers). That's setting aside environmental and other ethical concerns, which are still significant to me.
I have been impressed by the recent quality of AI code reviews*, but the experience of interacting with 3 separate AI reviewers via GitHub PRs is pretty terrible. Having more local-oriented and jj/rebase-aware review rounds would be great.
*context: fairly large PHP/Laravel backend and Vue frontend
[1]: https://milvus.io/blog/ai-code-review-gets-better-when-model...
[2]: https://github.com/liliu-z/magpie
[3]: https://blog.cloudflare.com/ai-code-review/
Each time a new technology comes around, we seem to forget all the lessons we learned before. We then re-learn them again, much more quickly and just as painfully.
The same thing happened with crypto. Crypto started with some correct assumptions like "online services that stop working when it's a holiday are ridiculous", "it should be possible to send money to a friend without your government knowing about it", or "money transfers cost nothing to execute, so they should cost nothing." It then promptly threw out the entire framework of banking regulations, quickly re-learning why most of those regulations existed in the first place.
Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread.
Many AI models seem biased to cutting corners by default when generating code, even when you ask them not to. But a few simple follow up prompts can address that. Simply ask for covering corner cases with tests, test all the known non happy paths, look for weaknesses, verify adherence to SOLID principles, do security audits, etc. It will find issues. With bigger projects, you can actually make it file those issues in gh with labels and priorities. And then you can make it iterate on fixing issues with separate PRs.
On a recent project, I made it implement a simple benchmark test for measuring throughput. I had a hunch it was doing very sub optimal things. I then asked it to look for potential performance bottlenecks and use the benchmark to verify improvements. At that point I already had a lot of end to end tests to verify correctness. So, these performance tweaks were relatively low risk. I got about two orders of magnitude improvement and a lot more graceful behavior when pushed to the limit.
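A minimal sketch of what such a throughput benchmark could look like, assuming the work can be expressed as a function applied to a batch of items. `measure_throughput`, `process_slow`, and `process_fast` are hypothetical names for illustration, not anything from the project described above.

```python
import time
from typing import Callable


def measure_throughput(fn: Callable[[list], int], items: list, repeats: int = 5) -> float:
    """Return best-of-N throughput in items per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(items)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse clocks.
    return len(items) / best if best > 0 else float("inf")


def process_slow(items: list) -> int:
    """Count distinct items the suboptimal way: list membership is O(n)."""
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return len(seen)


def process_fast(items: list) -> int:
    """Same result, using a set for O(1) membership."""
    return len(set(items))


if __name__ == "__main__":
    data = [i % 500 for i in range(2000)]
    # Correctness first, so the speedup is not bought with a wrong answer.
    assert process_slow(data) == process_fast(data)
    print("slow:", measure_throughput(process_slow, data), "items/s")
    print("fast:", measure_throughput(process_fast, data), "items/s")
```

The correctness assertion plays the role of the end to end tests mentioned above: the benchmark number only means something if both versions still produce the same answer.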
If you have a bit of experience engineering systems, just treat these tools like they are junior developers. Competent but likely to skip some essential steps. So, just double check with a lot of pointed questions "did you do X? If not, do it now". Anything that needs repeated asking, turn it into a guard rail / skill.
There's a bit of effort and skill involved with this. I imagine a lot of less experienced developers might struggle to get good results because they aren't asking for the right things.
My problem is that it "finds issues" all the time and it never really ends. You go through the list, make a decision on how to go about it, give it back to the AI, it does the changes, you ask for issues again, there are now new issues in part due to the solutions from the previous fixes, now you again assess each issue and it's often valid but you have to ask yourself if it's worth fixing right now and whether the fix is worth the complexity for a super rare edge case, depending on the type of program you make, and often the assessment of what's high or low priority is not great by the AI.
So to me this loop really never properly ends so it never feels like I'm done. Which is not great from a psychological point of view.
I find that the way to not get into this doom loop is to make sure the solution is not overengineered in the first place. AI will pile on complexity to infinity unless you actively gate it.
> Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread. Many AI models seem biased to cutting corners by default when generating code, even when you ask them not to. But a few simple follow up prompts can address that.
That's more or less all of them, they do just generate the likely combinations of tokens, there is no critical thought involved. If you want to approximate that, review iterations are probably the right way to go about it, without the full conversation context either so there's no model output like "I'm doing X because it seems like the correct way to go about Y." but rather a fresh context which allows for more critical predictions.
Here's what works for me, can be made into a skill in whatever you use:
The ProjectLint references might need to be removed (replace with whatever higher level linting/architecture tools you have, if any), but that's the overall idea. It does use a LOT of tokens though, but almost always there's something to fix. Of course, the problem is that sometimes there will be nitpicks or the fixes themselves won't be fully okay, though in general this trends towards slightly better code, even with something like Opus 4.7.
This can backfire a bit on token usage, where it gets a bit too trigger happy running expensive things for trivial changes. I tend to not use sub agents for this reason. I actually manage to cover most of my needs on the $20/month Codex subscription. I might switch to the $200 plan at some point. But right now I just need to be economical as our company is fairly resource constrained. That's also why I prefer Codex over Claude Code. It seems it gets the job done for less $. Another advantage is that it seems to have less need to have things like this spelled out in this level of detail.
Another thing is that unless you are doing really complicated stuff, you probably don't need the latest models running on high. I'm still on 5.4 medium with codex. I see very little reason to change that.
Part of agentic engineering is figuring out how to be economical with tokens and time. You can sacrifice one for the other of course. But there are diminishing returns as well where spending 10x more doesn't actually get you 10x more quality/results.
I just have the Anthropic 100 USD Max plan and it's enough for daily work - I sometimes do hit the 5 hour limits by mid day, but weekly ones usually cap out at around 80% or thereabout, even with this approach. I usually use xhigh, sometimes max, both still result in situations where I need to intervene plenty, not even on that complex use cases (some LLM stuff, mostly web based CRUD, some light data processing, integrations with Jira and GitLab, processing PDFs and so on, sometimes ML training and geospatial work, like the Sentinel-2 satellite data, nothing crazy).
If I had to pay per token, I'd probably look at DeepSeek. In general it feels like it's a bit early for the technology - either our software methods are wasteful, or the hardware hasn't caught up. To me, it appears that we often need to throw more tokens at these problems, not less, since otherwise it's just one-shot slop.
> once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code
I did some evals with a prompt like this when I had some subscription tokens to burn, a few months ago. I think using Opus 4.5. What I found was:
1. Running two subagents was somewhat useful
2. Running three started to get redundant
3. Any more than three was pointless (at least when using the same model)
However, even two were getting like 60% the same results.
Much, much more effective was splitting out into audits through different lenses:
* One looking for security issues
* One looking for whether the task was completed successfully
* One looking for performance issues
* One looking for contract/maintainability issues
* One looking at test coverage
Etc.
You can get reasonably close with fewer, however more agents give better signal: e.g. if 3/3 flag something as an issue, the outer one that orchestrates them can view it as something to give more attention to, whereas if it's just 1/3, then it probably begs more consideration. Ofc more doesn't always imply right.
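A rough sketch of how an orchestrator could weigh agreement between reviewers, as described above. Everything here is an assumption for illustration: the `Finding` shape and `rank_findings` are hypothetical, and real reviewer output would first need to be normalized so that the same issue reported by two agents compares equal.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One normalized review finding (hypothetical shape)."""
    file: str
    line: int
    issue: str


def rank_findings(reports: dict[str, list[Finding]]) -> list[tuple[Finding, int]]:
    """Count how many reviewers flagged each finding, most-agreed first."""
    votes = Counter()
    for findings in reports.values():
        # A reviewer counts at most once per finding, even if it repeats itself.
        votes.update(set(findings))
    return sorted(
        votes.items(),
        key=lambda kv: (-kv[1], kv[0].file, kv[0].line, kv[0].issue),
    )


if __name__ == "__main__":
    a = Finding("auth.py", 10, "token not validated")
    b = Finding("db.py", 42, "query inside loop")
    reports = {"security": [a], "performance": [a, b], "tests": [a, a]}
    for finding, count in rank_findings(reports):
        print(f"{count}/{len(reports)} {finding.file}:{finding.line} {finding.issue}")
```

The count is only a signal for where to look first; as noted above, more votes does not always mean the finding is right.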
My workflow now
1) While walking voice chat with ChatGPT about architecture and various interesting angles for a feature or product
2) Have it create summary of things we talked about
3) use that to seed spec development phase
4) write comprehensive specs using both Claude Code and Codex
5) create todos from specs
6) implement todos using both Claude Code and Codex to check each others work
7) run focused code check prompts e.g. specifically for error handling, concurrency issues etc. They tend to find more issues in these focused passes.
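For step 7, one way to keep the focused passes consistent is to generate the prompts from a small table. This is only a sketch: `FOCUS_CHECKS` and `build_focused_prompt` are made-up names, the wording of the prompt is an assumption, and how the prompt is sent to a model is left out entirely.

```python
# Hypothetical table of focus areas and what each pass should look for.
FOCUS_CHECKS = {
    "error handling": "unhandled exceptions, swallowed errors, and missing cleanup",
    "concurrency": "data races, deadlocks, and unsafe shared state",
}


def build_focused_prompt(focus: str, diff: str) -> str:
    """Build a review prompt that restricts the reviewer to one concern."""
    if focus not in FOCUS_CHECKS:
        raise ValueError(f"unknown focus area: {focus!r}")
    return (
        f"Review the following change ONLY for {focus} issues "
        f"({FOCUS_CHECKS[focus]}). Ignore everything else.\n"
        "List each issue with file, line, and a one-sentence explanation.\n\n"
        f"{diff}"
    )


if __name__ == "__main__":
    print(build_focused_prompt("concurrency", "diff --git a/x b/x"))
```

Adding a new focused pass is then one more entry in the table rather than another hand-written prompt.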
We may be in the last Golden age of AI, where experienced professionals still exist who can code manually, and AI already exists that can code automatically, and when the former use the latter skillfully, wonders happen. This magical intersection may not exist in the future, or become very rare.
I think as long as it continues to be tangibly better these people will still exist and the intersection will continue to be valuable enough to survive.
> as long as it continues to be tangibly better these people will still exist
Sure. But how long will that last? LLMs are getting better at programming much faster than I am.
Imagine a plot with time on the X axis and LLM skill on the Y axis. The line goes up and to the right. On the left is GPT3, or GPT3.5 with the very first glimmers of programming ability just a few short years ago. In the middle is Opus 4.7 now.
Where's the intersection point, where AI skill is higher than that of humans? Less than 10 years. I'd guess less than 5 years.
I think the problem is that coding is not wholly a 'writing code' problem. It's a translation from idea to outcome. Often I think the bad code generated by an LLM is less to do with its 'ability' and more to do with an instruction that hasn't adequately accounted for the range of code that could satisfy the criteria. I'm not sure how a newer model can improve on this per se - sure, there will be improvement on outright mistakes, but for me at least, that's been and gone with more or less any model released in the last 6 months.
I was coding something with claude the other day. It got the program working by all externally observable metrics, but when I went into the code it was full of DRY violations. It made a bunch of interrelated - but separate - traits for some concepts which simply didn't fit together.
I asked it to look at the code and come up with better factorings, but it failed. I ended up manually reworking several thousand lines of code myself, via my IDE. It took days.
I'd like a claude-of-the-future to be able to come up with beautiful ways to factor the code itself. Amongst the correct solutions, pick one which is conceptually simple. Write the code in a way that it makes subsequent changes easier to write. If I were doing RL with claude, I'd consider directing it toward solutions which allow subsequent changes to be implemented with as little effort as possible.
I think a better way to think about it is - what are the invariants to our current architecture? Why can't you tell Claude to build you a 1B$ business, make no mistakes?
I have no doubt they will be better programmers than almost every human that has ever existed. But the role of a SWE will expand to fill the gaps that the LLM paradigm hasn't filled:
- Accountability
- Long term architectural vision, goal setting
- Everchanging business context
- Mercurial executives, people problems, relationships etc...
Token efficiency is going to be the next big thing.
Tokenmaxxing an army of juniors will destroy your business through slop induced tech debt and API costs. A senior that uses AI but is token efficient will be like rocket fuel.
>rocket fuel
Did you write this comment with AI, or can you explain why so many people use the exact same terrible metaphor?
people said the same with any innovation
And you act like there hasn't been a loss once we moved away from the master craftsman style of building to the professionalized architect style of building. We cannot make a gothic cathedral anymore. Also, CAD homogenized the built environment significantly. And we have been losing a lot of traditional, artisanal craftsmen's art forms over the past century.
Did they? Genuine question, because I do wonder if people in some industries in the past were ever anxious about these specific things (especially skill attrition).
> I do wonder if people in some industries in the past were ever anxious about these specific things (especially skill attrition).
I've spoken with some people (now in their 60s & 70s) that worried about skill atrophy in their line of work.
First they worried about atrophy. Then they watched skill dry up. Now they know it's not available to buy anywhere. In the better cases the skills still exist, but entirely overseas.
These are people I could recognize as sharp engineers, even if I don't know their domains at all. I had to take them at their word about the value in what was lost. The problem is that it's easy to assume that business (or at least society) would prevent degradation of valuable knowledge over time.
There are a lot of crafts that don’t have real deep experts anymore because the work was 90% automated.
Title of this article suggested more depth and I was expecting actual code examples. But it is like other opinion pieces. It suggests a prompt (ask AI to find bugs) that works for the author advising everyone to do it that way.
I use these tools at both work and for personal side projects and I was expecting to watch and learn. But these opinion pieces without examples are way too many now.
Have you tried his suggested workflow? I think it's a useful workflow, and if I hadn't found a workflow like this already would appreciate the pointer.
I guess he could write a code harness to do this, or gin one up really quickly, but that kind of tooling today seems like the purview of the practitioner -- you -- it's frankly faster for you to spec what you want to try this idea out if you want it automated than it would likely be to deal with his code.
One thing that's been interesting to me over the last few years is charting the edge of my coding laziness. As a coder, I'm lazy about boilerplate code -- I hate writing it, I hate maintaining it, etc. And so I design and architect (or used to) around that preference. Sometimes that's smart, sometimes that's not. But it was my preference, and I avoided something that was hard for me to do.
When LLMs started being somewhat useful for coding a few years ago, and I found they were in fact great at boilerplate, in fact pretty much only good at boilerplate ca 2023 or so, it got me thinking about all the accommodations we make in design and systems architecture that are sort of tacitly understanding who we're working with and their strengths and weaknesses.
The modern models have their own very different strengths and weaknesses compared to humans, and deploying them is a really interesting exercise of different architectural and engineering skills. I've enjoyed it, and hope I continue to.
The thing about boilerplate is that a good library or framework makes it optional, and / or automatically written.
I'd much rather django-admin startproject, npm init, or meteor create and get deterministic output than prompt an LLM and get who knows what.
In a mature web ecosystem, boilerplate is minimal. I worry now that we've given this task to LLMs, less development effort will go into startproject-esque CLIs and good opinionated defaults.
I wonder this in general, what's the impetus for writing new frameworks and such? Are we already seeing a slow down in that space? HN front page certainly paints that picture.
You're better off plonking down an existing framework and getting all the structural boilerplate benefits the LLM can leverage.
LLMs are far better at frameworks they have a lot of training data for, the ones that have been around for a while. They write more idiomatic, ecosystem friendly code. Does that still matter?
I’ve landed on a very similar usage in my last pet project. I’ve used the llm mainly as a glorified refactoring tool/LSP/rubber duck. I can define custom skills that act as specific passes over the codebase that are hard to do with traditional tools, I am using Julia, so I have a skill that is only about doing a semantic and type analysis pass to catch potential type instabilities. Or another that is just about documentation reporting. The workflow for me is always: talk the problem to death/get a report. Triage, decide what I can and should do on my own, what can be left to the llm as mundane boring refactoring tasks, what instead needs me to figure out the correct shape first and then ask the llm to propagate the new pattern in the codebase. Then act. A lot of the time I am implementing the llm suggestion by hand on my own to get a feel of how the codebase is shifting under my feet and stay on top of things. This indeed makes things more slow, but allows for an overall higher quality codebase. Especially the refactoring part.
The anchoring thing is what gets me. Once I've seen the AI's first try, even when it's wrong, I can't really write fresh in my head anymore. I end up editing instead of starting over. Code quality usually ends up fine. Time-wise it's a wash or worse, you just don't feel it until you look at the clock at end of day.
I find that it really is effective when you iterate and plan and review, but the problem is more psychological on the human side. It's just too easily available to take the lazy option and just let it do the thing, postpone the thorough reviewing and you end up in a similar situation as tech debt. In an ideal world with no deadline pressure and infinite discipline, AI can be used in productive ways for sure. But when you actually write the code, there is more of a "do you do it or not" switch, and with AI it's a smooth ramp, you can be just a bit less involved or just a bit more. And I end up feeling like I'm not fully involved, I'm halfway working and my whole mind isn't tuned into it properly. I'm not sure how to express it. Also, now several months in, I just don't get the same feeling of accomplishment from the little wins. It's too automatic, doesn't feel earned.
I discovered something about myself a few years ago... I have to simmer in my work and let my head get wrapped around it.
My visual for this is a capybara soaking
https://www.gettyimages.com/detail/news-photo/capybaras-take...
I am very visual and spatial. The first investment I make in my home or even visiting somewhere for more than 3-4 days where I will need to work without coworking is buying a whiteboard.
So now I'm here with all these tools trying to use a remarkable tablet to draw and show the AI what I'm thinking. It's just not fulfilling. Cleaning toilets isn't either. Lots of jobs have felt like a full on race to software factory and it's clear we're going there with AI and the "cognitive debt" from half (or less) activated brains driving the code generation is going to be massive.
I can't comment on cleaning toilets as a job (luckily I don't have to do that), but cleaning at home does provide a sense of accomplishment similar to solving a coding task elegantly and cleanly, while uninvolved AI-assisted coding is more like up and down voting or liking posts in algo feeds. Not fully like that of course, but it's a step towards that kind of "I like this part, I don't like that part" feedback-giving that can leave me depleted/drained. Coding before AI was more like when you feel one with the machine, like when you drive your car on autopilot, and with AI it's like sitting in the passenger seat like a driving instructor saying how to go about the driving. You don't quite know what it will answer, maybe it will push back on your idea when unnecessary and then I have to expend effort in arguing in text in a chatbox with a machine, or it goes forward too easily without asking clarifying questions or pushing back when what I ask collides with previous things. Many programmers get depleted in meetings and in language-based argumentation and charge up with the more puzzlesolving-like flow state, but this AI wrangling is often more like team meetings.
"It's just too easily available to take the lazy option and just let it do the thing"
This seems to me to be one of the key problems for AI usage in general. Students have this problem where it can be incredibly helpful in actually learning but late at night with the assignment due early tomorrow the temptation is just too strong to have it do the thing.
Somehow I find that interacting with AI doesn't make me feel the same way as diving through Wikipedia rabbit holes. With AI it feels more like, it starts saying how there is indeed an answer to some science question I was unsure about, about some phenomenon or about how some technology works, and it starts explaining it but I feel disinterested in actually reading its answer. I see its general shape and I feel satisfied in the existence of the answer. It may be the glazing sycophancy too, but it seems that I get the "satisfaction" from just getting the answer, while with Wikipedia I only got it once I dug up the info that I needed. And the AI answer is ephemeral, while the Wikipedia page is there for everyone, it's a thing, even if it can change.
Same with AI images. It feels good for 2 seconds to see what I asked for and I'm immediately disinterested. Same way, I've generated many Suno songs, but I don't care about them after a few listens.
I noticed that sometimes discussing the code with a chat instance, and having it write prompts for an in-IDE-agent, then posting the result back to the instance to discuss the results and repeat this loop yields not only good results, but also makes me understand the codebase better. I let both agents know I'm proxying and take part in the conversation.
I think AI exists to make humans better, not to replace us (which it can’t anyway). I use LLMs to answer questions on new topics and to tutor me (for instance, for a multivariable calculus course this spring I asked Claude to create 10 practice exercises, which I then did and it reviewed; the harder ones it did with me step by step), hopefully not needing them after a while, once I gain proficiency. Automating humans away is not going to work. There’s a reason why we are the apex predator and have ruled this planet for a million years.
You have very optimistic assumptions about AI. Of course AI will not "replace you" on its own, but some person in the company will decide it.
That’s the first time I've heard I’m optimistic about AI. I am as optimistic as I am about a hammer or a liquid scale. They are tools and they are good for particular jobs, if you know how to use them.
I am in a career that is one of the more sheltered from automation. Present tech layoffs I suspect are more due to insane overhiring during covid as well as outsourcing. I am sure some companies have gone into full AI-psychosis mode, but they are taking a massive risk. Time will tell.
> which it can’t anyway
I agree, but it doesn't change the reality that AI is the stated reason for many layoffs.
As I read this, I'm also working through a pretty dense feature that took a fair bit of iteration. The end result is actually significantly less code than it was about halfway through. And I was wondering if the AI actually helped me at all, since surely I could have written the code in the same time it took to iterate
But! Because of AI I was able to rapidly hack out like 4 variants of this feature that I didn't like. And felt comfortable throwing them away just as quick.
This has been one of the most significant improvements of using AI for me. Before, I would have to really think through the plan of a new feature before committing to the implementation, and would only catch incompatibilities with existing code after a good portion of the implementation was already written. Now I can ask AI for detailed implementation plans and find these nitty-gritty detail problems in a few hours, if not less.
True. I think this is the biggest help with AI. It does not necessarily help with reaching the end goal faster all the time but it helps in trying out different iterations for quick prototypes. I find it especially useful in fast moving startups where some times we just want to validate a few ideas before fleshing them out as proper features.
So what’s the verdict? Was it worth it?
Yea worth it. The original implementation ended up being the most complex, and also not a great UX. But I didn't really get it was a worse UX until I built it and tested it out a bit.
And I wasn't attached to that complex implementation in the way I would be if I architected it from scratch, so it was easy to move on.
I'm pretty intensely disinterested in "agentic" coding. However, this was very much the inspiration behind the custom Claude-backed Goose agent I deployed into some of our Gitea repositories over Christmas--the less sensitive ones, of course, I'm not sending our proprietary code to Dario.
It does short and sweet code reviews, and going back and forth with it is, as often as not, slower than just typing and merging the code.
I'm quite pleased with it as a middle ground.
The bottleneck moved. It used to be writing code. Now it's knowing what to ask for, in what order, and how to validate what comes back. That's a fundamentally different skill than coding.
Only if your job was to write code alone, which isn't what most people do. Most of my job for the past 10 years was explaining to people what had to be done, how and then verifying they did it.
Any senior/staff engineer was already doing this, if they weren't they were on the wrong job or had the wrong title.
So I am figuring out how to let an LLM write code automatically as long as I clarify the requirements. I have made a set of skills to deal with this, and it is called tdd-pipeline. I eat my own dog food, and after several rounds of iteration to fix bugs, it works better and better. Now I feel much more relaxed while it is working.
I open sourced it on GitHub, you may search alexwwang/tdd-pipeline to find it if you are interested in it.
I believe this thinking can be abstracted to software design via AI in general. If you are thoroughly prepared, and keep things simple, it's incredible what help Claude or GPT can be.
I have Claude basically doing all the coding for me for a simple game I am making. However I don't consider this vibe coding. I spent several hours thinking out the design on a piece of paper, playtesting it in person. I came up with a list of potential mechanical issues within the game, and asked Claude to come up with more. It found more issues, and we solved them all together. Once the game was mechanically sound and edgecases were solved for, it built an MVP. I ran the program, and found more bugs. I came up with my own solutions, and Claude did the same, and we figured out which were best to implement. Claude then wrote more code, and raised issues when they came up, and we worked through them together. I'm incredibly happy with how its turned out so far.
The main insight here I think is that LLMs are great tools for iterative development and iterative problem solving in general.
You can very effectively iterate alone using the LLM as a mirror, rephrasing what you put in and adding a bit.
You can use LLMs to quickly create prototypes to give to other human beings to help you with the next iteration.
If you get something from someone else to iterate on, you can use the LLM to help you understand it by rephrasing things in a way more suitable for you.
But instead everything anybody seems to be talking about is one-shotting things and AI iterating with other AI.
The big problem here is that the one thing AI does not have is agency. The naming AI agent is wishful thinking and marketing.
On the other hand, some companies are pushing the idea that engineers should build robust self-evaluating agent pipelines with human feedback in the loop so that agents write most of the production code. Creao's CEO said that they rearchitected their entire production systems in two weeks this January. He also claimed that their agents implemented so many features so fast that they had to wait for their business development to catch up.
I wonder how we can evaluate these two options: using AI to 100X the output versus using AI to advance one's craft.
In the meantime, the productivity gain of AI is real. Case in point: an engineering org at Snowflake met all its OKRs ahead of time in the first quarter, for the first time in the company's history. It had never happened before, and usually meeting 70% of the planned OKRs would be considered an achievement. I can imagine the stress of the engineers when they see such an outcome.
Hopefully we can blend those two options together so it’s not a choice.
Personally I find being able to lean on our heavily documented standards in /review gives me back time to dive into what I want to craft next.
Same with scheduling repetitive tasks that an agent can do well for me once instructed well. I am freed up to do something else I want to focus on actively because I like it and want it to be great.
Now stress about OKRs, and OKRs in general… that’s a different issue
Exactly. That is what we do. We do software that can kill people and it is very sophisticated, like controlling robots and we prototype using LLMs and it is amazing.
People believe that you can only use LLMs for sloppy programming. But you can also use them to write ten times more code in the form of Swiss cheese model tests and domain-specific languages.
You write ten times more code than necessary and all that extra code is testing. Projects like SQLite do that because they need to be perfect.
Before LLMs we had to use engineers for that, and it was painful and repetitive work. They were always late and made many more mistakes than LLMs, especially because it was dull and tedious work for great engineers to spend their time on.
Now we write tests, and when all tests pass we write new tests to check the tests.
We divide each complex problem into small subproblems and guarantee each of them by formal means. We have multiple ways of solving the same problem, usually with one brute-force solution that is simple and guaranteed to work but inefficient, and we can use it to compare against more efficient methods.
Before machines could do that, the people doing it were burned out and exhausted, and always left pending work to complete.
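For anyone who hasn't seen the brute-force-versus-efficient comparison in practice, here is a minimal sketch of the idea in Python. The problem (maximum subarray sum), the function names, and the random-input ranges are my own illustrative assumptions, not anything from the commenter's codebase.

```python
import random


def max_subarray_brute(xs):
    """Reference solution: try every non-empty contiguous slice (O(n^2))."""
    best = xs[0]
    for i in range(len(xs)):
        total = 0
        for j in range(i, len(xs)):
            total += xs[j]
            best = max(best, total)
    return best


def max_subarray_fast(xs):
    """Efficient solution under test: Kadane's algorithm (O(n))."""
    best = current = xs[0]
    for x in xs[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def check_against_reference(trials=1000, seed=0):
    """Compare both solutions on small random inputs; raise on the first mismatch."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 12))]
        expected = max_subarray_brute(xs)
        actual = max_subarray_fast(xs)
        assert actual == expected, f"mismatch on {xs}: {actual} != {expected}"
    return trials
```

Small inputs are deliberate: the slow reference stays cheap, and a failing case is short enough to read.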
This is exactly the reason why I like to work with local models on a regular specced machine. The fact that the agent moves slower allows me to stay in the loop much better, compared to skimming through a huge amount of generated content and data and then going to the end really fast to make sense of it all, in the interest of time (and thus losing track and quality). The fact that I can run it locally makes it (much) cheaper too.
Does anyone have good recommendation for ai auto completion?
My goal is to draft the solution with ai, write it myself but faster with auto complete, then throw ai review.
I like Zed for this. We are considering autocomplete for Rig after we launch our agent. Local would be 1000x faster.
So your "goal" is to find an existing ai-auto completion that allows you to draft with AI then "write it yourself" by hitting tab? Sounds like the goal is actually to build that, then use it on projects....
I get pretty good results by writing specs and prototype with a LLM, through more or less managed conversations.
But once the prototype is done, I spend too much time refining the details and fixing everything that went wrong (bad design details, wrong implementation, half-done testing ...)
A full agentic setting would be too expensive for me (I wonder how much Garry Tan spends...)
So I'd like to take a more balanced approach: 1. the usual LLM specs and prototyping to get the basis of the feature and the boilerplate done 2. write the code myself, with the help of an AI autocomplete (this is where I'm looking for recommendations) 3. use a setup like the one OP mentioned to review the code
Optimizing for code quality over raw output speed is a great approach. The time 'lost' writing it slowly is easily made up by the time saved on debugging and maintenance later.
> But if you’re the kind of developer who uses agents to write multi-hundred-line PRs that you barely understand yourself, I’d invite you to slow down a bit and try this other, slower style of “vibe coding.” Ask an agent how your PR works and how it might fail. Have it write Markdown docs with Mermaid charts if necessary. Use Matt Pocock’s /grill-me skill until you understand the entire PR front-to-back.
Man so much work to retrofit something that obviously, simply, plainly - just does not work. How about just writing the code yourself? You can even consult AI on the libraries or whatever, but how about just building that model in your head YOURSELF and not loading up on AI slop and trying to memorise that crap. The names of the functions will ring different in your memory once you spend some time thinking over whether you picked the right and clear name vs. just going with whatever statistical median the slop machine picked for you.
I think using speed to describe the rate of progress in software development is where the frustration comes from. Software isn't a velocity thing. It's a space thing. It's memory. Information in some media. You can transfer a billion bits in less than a second. The time domain is largely irrelevant in business terms.
Having taste and the ability to author high quality prompts is still the most important thing. It was always the most important thing if you think abstractly about how all of this works.
This is one of the most sane takes on shipping code using AI where it's being actively reviewed and it respects your colleagues' time and attention. I like it.
Yes! That's what I've been doing at work for the last few weeks! And while it doesn't appear to be super fast, I'm already pretty certain that the next round of testing will come back with fewer unexpected issues because together with my agent and the right usage, I was already able to catch stuff that I would have missed otherwise.
Also feels much better than pure vibe-coding (which I still do for personal projects that aren't mission critical for anyone).
100% agree after building a production-ready platform from the ground up. It took 3-4 months, but without AI I would never have finished with a team of 3. One thing to note is that AI is weak at front end, so we did the entire front end without AI.
I used an LLM as a tutor to tackle unfamiliar terrain. That is, I write code that I know very likely doesn't work but is the best code that I could have written. The LLM will happily and tirelessly show me what I did wrong and what the correct code actually looks like. Then, at the end of it, I have code that runs. That's a tight feedback loop.
It's still very slow. It took me two hours to write code that generate JSON data and then to write a web page that displays a knowledge graph.
One thing you have to be aware of is that the LLM will happily generate code for you, and you have to discipline it from time to time. I notice that my reading comprehension begins to suffer if I don't write the code myself and have to understand what the LLM wrote for me, as opposed to the LLM correcting where I went wrong.
One thing I would like to try with an LLM is understanding a large and complex existing codebase like OpenSCAD that doesn't leverage my existing skillset (high-level programming languages, with OpenSCAD as my primary language in the past year). That has always been a barrier to contribution for me.
I usually do this for complex features:
- Opus 4.7 writes the code
- I make GPT-5.5 in Codex review it (given context)
- I provide the review back to Opus and ask it to verify the review findings
- Make Opus plan the fixes then execute them
- Ask GPT-5.5 to review the fixes and check if they solve the problems
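The steps above can be sketched as a small loop. This is only an illustration of the control flow: `write`, `review`, `verify`, and `fix` are hypothetical callables standing in for whatever actually invokes each model, and the `demo` stubs are made up so the sketch runs without any model at all.

```python
def review_loop(write, review, verify, fix, max_rounds=3):
    """Writer/reviewer loop: review, let the writer verify findings, fix, re-review.

    All four callables are hypothetical stand-ins for real model calls.
    Returns the final code and the round in which the review came back clean.
    """
    code = write()
    for round_no in range(1, max_rounds + 1):
        findings = review(code)
        # The writer double-checks each finding instead of trusting the reviewer blindly.
        confirmed = [f for f in findings if verify(code, f)]
        if not confirmed:
            return code, round_no
        code = fix(code, confirmed)
    raise RuntimeError(f"still have confirmed findings after {max_rounds} rounds")


def demo():
    """Run the loop with stub 'models' where code is just a set of known bugs."""
    def write():
        return {"bugs": {"off-by-one", "missing null check"}}

    def review(code):
        # The stub reviewer reports real bugs plus one pedantic non-issue.
        return sorted(code["bugs"]) + ["rename variable x"]

    def verify(code, finding):
        return finding in code["bugs"]

    def fix(code, confirmed):
        return {"bugs": code["bugs"] - set(confirmed)}

    return review_loop(write, review, verify, fix)
```

Capping the rounds is a design choice: two models can disagree forever, and a hard stop hands the decision back to a human.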
Hot take: barring special edge cases, I find using dumber models (like local Qwen 3.6) to be the best balance. Smart enough to do stuff, but dumb enough that I don’t trust it and verify what it’s doing rather than letting it do the third whole-codebase refactoring of the day. It also forces me to know my codebase and give very descriptive tasks rather than go “something is wrong, fix it”.
Input sequence mutation >>> novel token generation in LLMs. Why? IDK, there must be a good theory article someone could point me to.
Yep, it definitely can help with being an 0.1x developer, it's a long drawn out process but the output is actually good.
I think my current conclusion is that AI makes <foo> more important than ever.
I’m not exactly sure what <foo> is but I feel it. I think it’s quality and authenticity and craftsmanship. That difference between an expensive tool and a cheap one that you can’t easily describe but you just know it.
Is there a word for this? I bet the Japanese or Germans have a word for this.
I use AI a lot now. But I also do it in small steps. It isn’t a craftsman, but it can help me be one.
Quality, as Pirsig would say in Zen and the Art of Motorcycle Maintenance.
People use the word "taste" to describe that
Yeah, maybe that.
I feel like AI promises a factory that can make Walmart quality tools. Which I think will make the well-crafted tools more important than ever.
The goal was better code. The result is a new full-time job reviewing your robot's homework. Progress is not the word I would use here.
Another thing that I feel is underappreciated about agentic coding is that you can actually learn from it. I am a programmer with 25+ years of experience and I tend to do a lot of stuff according to fixed patterns/habits. Seeing how my coding agents do stuff helps me break out of these patterns, lets me consider new approaches, helps me pick up idioms and teaches me new hacks and tricks. That is very satisfying in its own right.
I'm exactly in the same boat. 25+ years of experience and I use agentic coding exactly to learn better patterns. I often let it implement something, read the code, learn the pattern, confirm it's a good practice and then code myself manually another section of the code.
I think many people that blindly say you cannot learn anything from vibe coding have some sweeping assumptions. It obviously depends. If you just let it do everything without even reading the code and understanding it, then yes. But the act of reading code is one of the best ways to actually learn, no matter if you read code from a human or from an LLM. I tend to learn way better by example than by reading a theoretical book.
The greatest part of this approach is that you actually become better in the process.
The downside is you use less tokens.
Love this. I use a similar "ralph-loop" approach that starts with an approved plan and then hand it off to a coordinator which does it across 2 sessions (build and review for simplicity), with each session getting its own model.
Would love to but my boss wants 15 features delivered yesterday
Finding bugs in PRs is exactly what GitHub Copilot, GitHub cursor bot and tons of other PR bots already do…
Another way I'm "going slower" is to have the AI implement individual sub-steps of the current task, and review each one. It's slower than having it yolo out the whole thing, but it's much smaller incremental bits to review, so my brain doesn't glaze over in a huge review, like it would if I had it do the whole task.
I'm following an Ideas -> PRD -> Issues -> Tasks methodology, where each task has a bunch of sub-tasks. I have it just do one (or a few, I'm having it do Red/Green/Refactor as separate sub-steps, so I review the Red case, and then once that's good, do the Green and Refactor steps, and review those).
Basically you're using AI as very costly linters...
what's wrong with (depending on the language) checkstyle, sonarlint, ruff, mypy, xmllint, and/or eslint?
The quick answer is that even in the workflow described by the author these tools don’t do the same thing AI does. And a good programmer/agent will be using these tools as well
To me the blocker with using coding agents is having to rely on a paid external service. Are there any local models that are good enough to be used for coding?
As of this month, Qwen3.6 (either the 27B or 35B-A3B) or Gemma 4 are talked about often.
Also maybe this will help: https://hnup.date/hn-sota
The qwen model is my daily driver this week.
depends on how clear your instructions are, if there is no ambiguity you can even use gemma4 e2b/e4b.
Instead of using a skill and having the agent own the flow for this, I've been building an external orchestrator that handles the process.
By default it uses pi agent core + pi ai (from the excellent pi coding agent) as a multi model runtime but also supports a Claude Agent SDK runtime.
I can have an implementation and review process of an OpenSpec change run anywhere from 2 hours to 24+ hours going through review/fix/verification rounds automatically until the implementation matches the spec and any additional reviewers are done finding issues after the fix rounds.
it's going to be fully open sourced in the next two weeks and fully free to use
https://engine.build
Maybe we can come up with a spec for aligning ASCII diagrams. Can't really build anything with confidence when the attention to detail is lacking in these agentic systems
https://imgur.com/a/r4fhOwy
It's interesting... Opus seems horrible at keeping text aligned. Markdown it is I suppose
What's that from? OpenSpec docs?
Yeah, not trying to pick on any particular project, because it's quite the mark that the writer didn't proofread the documentation, and it's quite widespread
https://news.ycombinator.com/item?id=48246232
This reminds me of the article above. People now have diverse ideas on agentic coding. Some suggest human-in-the-loop while others suggest giving a detailed specification and letting the agent run freely; some suggest leveraging LLMs' high productivity, and here we get the opinion that an LLM can actually write good code slowly.
I'm happy to see more practical and varied opinions emerging, turning the LLM into literally a tool instead of something to be hated or hyped.
In my own practice, I find LLMs (SOTA ones) good at medium-level tasks, those that need some reasoning and planning. However, their design taste on architecture is unexpectedly disgusting. Sometimes writing interfaces myself and asking LLMs to fill in implementations, alongside context-completing tools like context7, deepwiki, docs.rs MCPs, etc., and giving it an escape hatch (e.g. encouraging it to use the AskUser tool in Claude Code), may be considered my best practice.
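To make the "write the interface yourself, let the LLM fill in the implementation" idea concrete, here is a rough sketch in Python. The rate-limiter domain and every name in it (`RateLimiter`, `FixedWindowLimiter`, `check_contract`) are hypothetical examples, not something from the comment above.

```python
from typing import Protocol


class RateLimiter(Protocol):
    """Hand-written contract: at most `limit` allowed calls per key per window."""

    def allow(self, key: str, now: float) -> bool:
        ...


class FixedWindowLimiter:
    """The kind of implementation one might ask an LLM to fill in."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._windows = {}  # key -> (window index, calls counted in that window)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        start, count = self._windows.get(key, (window, 0))
        if start != window:
            start, count = window, 0  # a new window resets the counter
        if count >= self.limit:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True


def check_contract(limiter: RateLimiter) -> None:
    """Hand-written checks, assuming limit=2 and a 10-second window."""
    assert limiter.allow("a", now=0.0)
    assert limiter.allow("a", now=1.0)
    assert not limiter.allow("a", now=2.0)  # third call in the same window
    assert limiter.allow("b", now=2.0)      # other keys are independent
    assert limiter.allow("a", now=10.0)     # next window starts fresh
```

The human keeps ownership of the protocol and the contract check; only the class in the middle is delegated.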
Very much agreed. Something specific that has helped me a lot (beyond just automatic formatting, linting and testing) was putting a hard fail on any file with more than 1500 lines or so, with an allowlist for specific files with specific reasons for their length. I realized the agents were squirreling away code without wanting to do any sort of refactor. Every time one of these rat's nests has turned up, the codebase has been much improved with a small refactor, to the point it doesn't feel like such a pile of slop anymore.
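A hard fail like that can be a very small script. The sketch below is one possible version, assuming a Python project; the `*.py` glob, the allowlisted path, and the function names are placeholders I made up, and only the 1500-line threshold comes from the comment.

```python
from pathlib import Path

MAX_LINES = 1500  # the rough threshold mentioned above

# Hypothetical allowlist: relative path -> reason the file may be long.
ALLOWLIST = {
    "src/generated/schema.py": "generated code, never hand-edited",
}


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


def find_oversized(root, pattern="*.py", max_lines=MAX_LINES, allowlist=ALLOWLIST):
    """Return (relative path, line count) for each file over the limit."""
    root = Path(root)
    offenders = []
    for path in sorted(root.rglob(pattern)):
        rel = path.relative_to(root).as_posix()
        if rel in allowlist:
            continue  # explicitly excused, with a documented reason
        n = count_lines(path)
        if n > max_lines:
            offenders.append((rel, n))
    return offenders


def main(root="."):
    """Print offenders and return a process exit code (non-zero means fail)."""
    offenders = find_oversized(root)
    for rel, n in offenders:
        print(f"{rel}: {n} lines (limit {MAX_LINES})")
    return 1 if offenders else 0
```

In CI or a pre-commit hook you would call `sys.exit(main())`, so any offender fails the run. Keeping the reason next to each allowlist entry makes every exception a visible decision.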
200 is my preferred limit, and I think you can find that in a few highly regarded books on coding.
Matters of taste. I don't mind bigger files where it makes sense, and sometimes for the nature of the domain, it is nice to have more things in one file. As well, they write so many comments that 200 lines doesn't feel right to me.
Thank you. It is really important to remind people of this, especially in upper management
Stop being reasonable! This is a hypecycle!
This is the approach I take, with many guardrails and nested CLAUDE.md's to keep things sane.
How profound! Talking points are changing from "vibe coding delivers bug free software" to "slow down and enjoy the AI".
Great how the promoters are mirroring the current anti-AI sentiment. The next step is canceling all subscriptions and not using AI at all. Maybe your mind will work again.
Not so much. People are just walking things back from the Gastown/Oh My Opencode/etc peak of trying to get 10 agents working simultaneously on a project unsupervised. They've collectively realized that you still have to understand and validate what the agents produce in some way if you want to build maintainable software.
This. Go slow. Use principles. Argue. Refactor. Improve before you commit. It is the way.
The bug-finding use case alone makes this worth it.
AI makes senior engineers slower in the same way code reviews make teams slower: locally inefficient, globally beneficial.
> This is the opposite of the “10x productivity” slop-cannon style of development that most people imagine when they think of vibe coding, but I find it very satisfying.
I can relate to this. When I spend time writing a unit test, even one that adds only 1% of code coverage, it is honestly a wholesome moment for me to ship it confidently.
Are we overcomplicating AI by approaching top down, so naturally there are trillions of variations and too many ways to fail? Supervising a component-level scope, with emphasis on quality control (regression, perf testing, benchmarking, etc), seems to produce great work.
Another day, another AI-related thinkpiece
(that people upvote to post their own thinkpieces in the comments)
Great article and right on point.
I use cheaper models (Deepseek is king, but GLM and Kimi as well) and do the planning myself. I often start a task myself, write some code to get the LLM on the right track, and then have it complete parts of the implementation that are kind of boring or repetitive. LLM's are just next token predictors, I don't mean that in a demeaning way, but I've found if I can get the LLM started on the right track with my own code, it completes what I want. Having the LLM write code just from a spec ends up with poor quality slop in my experience.
I'm not 100x'ing my output like some people claim, but using it as a augmentation rather than delegating my work to it results in better code, and I don't lose context / control over my codebases. I really have read 100% of the code, because the LLM is generating smaller pieces around and inside my own written code. Works well enough for me, and open models are already both cheap enough and good enough for this workflow. This is why the big companies are so desperate to push full-on agentic hands-off workflows and developer replacement - that's the only way they won't go bankrupt.
What app do you use with Deepseek? I've been using Claude Code but pointing it at the Deepseek API and it works ok, but I'm wondering if there are better options. (https://api-docs.deepseek.com/quick_start/agent_integrations...)
I've been using Zed and Charm Crush. I think most work with it though, any agent designed around OpenAI completions API compat will do fine. Although Zed had some problems initially with tool calls but it seems to be fixed.
I'm working on my own harness to be a bit more aligned with my workflow but tbh I'm losing motivation since other harnesses are fine now. I could probably vibe code something but there's not much point imo. Unless I come up with something completely different but who knows.
I think there is a Deepseek agent out there in Rust, but I've never tried it. Zed has been pretty decent with all models, not the best but certainly beats VSCode. ChatGPT 5.4 on that calls about 100 different git diffs to "verify" the changes are valid which is rubbish. I haven't tried Deepseek with it though.
Honestly these models and agents are becoming commodities, as long as they don't totally fail with tool calling or some stupid system instructions the models can figure stuff out pretty well.
learn by considering critique
hmmm
> But the thing is, LLMs are very flexible. And you can use them just as effectively to write high-quality code more slowly.
There is a reason it is called slop. On first sight it is often not noticeable but when you dig deeper, you realise that it is often spam-slop. Of course this can be improved upon, but often there is no real improvement and you waste your own time in hope that things get better. Which high quality projects exist that are AI slop generated? Can people name something that is used by many people? The linux kernel? Something in that range? Including documentation? To me it seems people are chasing a dream here: skynet should write the code and they can sit on the beach, enjoying sunshine and fruits.
.
The screechless wind vane of AI narratives be like
- Using AI to write the best code ever faster than any human ever could
- Using AI to write better code more slowly
- Using AI to write code that sucks even more slowly
- Using AI to stockpile horrendous ball of spaghetti code no one fucking understands which grows faster and faster despite going even more slowly
- Using Natural Intelligence to try and fail to untangle the mountain of spaghetti code
- Look guys, down with that AI, we've got a brand new shiny thing to throw trillions of VC dollars at!
I want to mention, Claude Code has a command /code-review. I find it quite useful.
Just dont use it lol, it does nothing you cant do by yourself. You're nerfing parts of your brain by relying on it.
There are things you really can't do by yourself. I've been working on porting some large codebases to Rust lately to experiment with fixing memory safety bugs. There is just no way you can write 100k LOC of production code in a week, with tons of tests etc. Even "10X" engineers just can't type that fast.
even if that were true it's the exact opposite point the article is trying to make
What language are you trying to port from?
How would one even know if the port of 100k LOC was successful? Are there language-agnostic tests (CLI STDIN/STDOUT) or similar involved?
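For what it's worth, a language-agnostic check of that kind can be quite small. Here is a minimal sketch, assuming both the original and the port are command-line programs that read STDIN and write STDOUT; the function names are mine, and nothing here reflects how the parent commenter actually validated their port.

```python
import subprocess


def run_cli(cmd, stdin_text, timeout=10):
    """Run one command with the given STDIN and return (exit code, STDOUT)."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout


def compare_ports(original_cmd, ported_cmd, inputs):
    """Feed the same inputs to both programs and collect every disagreement."""
    mismatches = []
    for text in inputs:
        expected = run_cli(original_cmd, text)
        actual = run_cli(ported_cmd, text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches
```

Usage would look like `compare_ports(["./legacy_tool"], ["./target/release/ported_tool"], corpus)`, where both paths and `corpus` are hypothetical. An empty result only means the two programs agree on that corpus, so the value of the check depends on how representative the inputs are.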
Yeah, agreed. “Cognitive surrender” is one way of describing that loss of personal faculty. I don’t think AI proponents are acknowledging second order effects of letting your mind interact less and less while requesting more and more complexity built for you without adequate verification.
Where they are extremely powerful, and it's hard to debate this IMO, is adding comments to the code, writing complete documentation, and constantly updating the readme. The value in actually writing the code is still up for debate (I'm on the side that sees the value there too) but the mind-numbing, boring, make-you-hate-life parts of the codebase are without question better for the use of AI.
Yeah I also always think people looking up stuff on Google are so lazy, just go to the library, jeez!