This is such a lovely, balanced, thoughtful, refreshingly hype-free post to read. 2025 really was the year when things shifted and many first-rate developers (often previously AI skeptics, as Mitchell was) found the tools had actually got good enough that they could incorporate AI agents into their workflows.
It's a shame that AI coding tools have become such a polarizing issue among developers. I understand the reasons, but I wish there had been a smoother path to this future. The early LLMs like GPT-3 could sort of code enough for it to look like there was a lot of potential, and so there was a lot of hype to drum up investment and a lot of promises made that weren't really viable with the tech as it was then. This created a large number of AI skeptics (of whom I was one, for a while) and a whole bunch of cynicism and suspicion and resistance amongst a large swathe of developers. But could it have been different? It seems a lot of transformative new tech is fated to evolve this way. Early aircraft were extremely unreliable and dangerous and not yet worthy of the promises being made about them, but eventually with enough evolution and lessons learned we got the Douglas DC-3, and then in the end the 747.
If you're a developer who still doesn't believe that AI tools are useful, I would recommend you go read Mitchell's post, and give Claude Code a trial run like he did. Try and forget about the annoying hype and the vibe-coding influencers and the noise and just treat it like any new tool you might put through its paces. There are many important conversations about AI to be had, it has plenty of downsides, but a proper discussion begins with close engagement with the tools.
Architects went from drawing everything on paper, to using CAD products over a generation. That's a lot of years! They're still called architects.
Our tooling just had a refresh in less than 3 years and it leaves heads spinning. People are confused, fighting for or against it. Torn even between 2025 and 2026. I know I was.
People need a way to describe it, hence everything from 'agentic coding' to 'vibe coding' to 'modern AI-assisted stack'.
We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
When was the last time you reviewed the machine code produced by a compiler? ...
The real issue this industry is facing is the phenomenal speed of change. But what are we really doing? That's right, programming.
"When was the last time you reviewed the machine code produced by a compiler?"
Compilers will produce working output given working input literally 100% of the time in my career. I've never personally found a compiler bug.
Meanwhile AI can't be trusted to give me a recipe for potato soup. That is to say, I would under no circumstances blindly follow the output of an LLM I asked to make soup. While I have, every day of my life, gladly sent all of the compiler output to the CPU without ever checking it.
The compiler metaphor is simply incorrect and people trying to say LLMs compile English into code insult compiler devs and English speakers alike.
> Compilers will produce working output given working input literally 100% of the time in my career.
In my experience this isn't true. People just assume their code is wrong and mess with it until they inadvertently do something that works around the bug. I've personally reported 17 bugs in GCC over the last 2 years and there are currently 1241 open wrong-code bugs.
These are still deterministic bugs, which is the point the OP was making. They can be found and solved once. Most of those bugs are simply not that important, so they never get attention.
LLMs on the other hand are non-deterministic, unpredictable, and fuzzy by design. That makes them not ideal when trying to produce output which is provably correct - sure, you can generate output and then laboriously check it; some people find that useful, some are yet to find it useful.
It's a little like using Bitcoin to replace currencies - sure you can do that, but it includes design flaws which make it fundamentally unsuited to doing so. 10 years ago we had rabid defenders of these currencies telling us they would soon take over the global monetary system and replace it, nowadays, not so much.
Sure, Bitcoin is at least deterministic, but IMO (and that of many in the finance industry) it's solving entirely the wrong problem - in practice people want trust and identity in transactions much more than they want distributed and trustless.
In a similar way LLMs seem to me to be solving the wrong problem - an elegant and interesting solution, but a solution to the wrong problem (how can I fool humans into thinking the bot is generally intelligent), rather than the right problem (how can I create a general intelligence with knowledge of the world). It's not clear to me we can jump from the first to the second.
> I've personally reported 17 bugs in GCC over the last 2 years
You are an extreme outlier. I know about two dozen people who work with C(++) and not a single one of them has ever told me that they've found a compiler bug when we've talked about coding and debugging - it's been exclusively them describing PEBCAK.
I've been using C++ for over 30 years. 20-30 years ago I was mostly using MSVC (including version 6), and it absolutely had bugs, sometimes in handling the language spec correctly and sometimes regarding code generation.
Today, I use gcc and clang. I would say that compiler bugs are not common in released versions of those (i.e. not alpha or beta), but they do still occur. Although I will say I don't recall the last time I came across a code generation bug.
I knew one person reporting gcc bugs, and IIRC those were all niche scenarios where it generated slightly suboptimal machine code that wasn't otherwise observable in behavior.
Right - I'm not saying that it doesn't happen, but that it's highly unusual for the majority of C(++) developers, and that some bugs are "just" suboptimal code generation (as opposed to functional correctness, which the GP was arguing).
I'm not arguing that LLMs are at a point today where we can blindly trust their outputs in most applications, I just don't think that 100% correct output is necessarily a requirement for that. What it needs to be is correct often enough that the cost of reviewing the output far outweighs the average cost of any errors in the output, just like with a compiler.
This even applies to human written code and human mistakes, as the expected cost of errors goes up we spend more time on having multiple people review the code and we worry more about carefully designing tests.
If natural language is used to specify work to the LLM, how can the output ever be trusted? You'll always need to make sure the program does what you want, rather than what you said.
>"You'll always need to make sure the program does what you want, rather than what you said."
Yes, making sure the program does what you want. Which is already part of the existing software development life cycle. Just as using natural language to specify work already is: it's where things start and return to over and over throughout any project. Further: LLMs frequently understand what I want better than other developers. Sure, lots of times they don't. But they're a lot better at it than they were 6 months ago, and a year ago they barely did so at all, save for scripts of a few dozen lines.
Just create a prompt so specific and so detailed that it effectively becomes a set of instructions, and you've come up with the most expensive programming language.
You trust your natural language instructions a thousand times a day. If you ask for a large black coffee, you can trust that's more or less what you'll get. Occasionally you may get something so atrocious that you don't dare to drink it, but generally speaking you trust the coffee shop knows what you want. If you insist on a specific amount of coffee brewed at a specific temperature, however, you need tools to measure.
AI tools are similar. You can trust them because they are good enough, and you need a way (testing) to make sure what is produced meets your specific requirements. Of course they may fail for you; that doesn't mean they aren't useful in other cases.
The challenge not addressed by this line of reasoning is the sheer scale of output validation required on the back end of LLM-generated code. Hand-developed human code was no great shakes on the validation front either, but the difference in scale hid the problem.
I'm hopeful that what used to be tedious about the software development process (like correctness proving or documentation) becomes tractable enough with LLMs to make the scale more manageable for us. That's exciting to contemplate; think of the complexity categories we can feasibly challenge now!
Or the argument that "well, at some point we can come up with a prompt language that does exactly what you want and you just give it a detailed spec." A detailed spec is called code. It's the most round-about way to make a programming language, and even then it's still non-deterministic at best.
Exactly the point. AI is absolutely BS that just gets peddled by shills.
It does not work. It might work for some JS bullcrap. But take existing code and ask it to add capsicum next to an ifdef of pledge. Watch the mayhem unfold.
This is obviously besides the point but I did blindly follow a wiener schnitzel recipe ChatGPT made me and cooked for a whole crew. It turned out great. I think I got lucky though, the next day I absolutely massacred the pancakes.
Recent experiments with LLM recipes (ChatGPT): missed salt in a recipe to make rice, then flubbed whether that type of rice was recommended to be washed in the recipe it was supposedly summarizing (and lied about it, too)…
Probabilistic generation will be weighted towards the means in the training data. Do I want my code looking like most code most of the time, in a world full of Node.js and PHP? Am I better served by rapid delivery from a non-learning algorithm that requires eternal vigilance and critical re-evaluation, or by slower delivery with a single review filtered through a meatspace actor who will build out trustable modules in a linear fashion with known failure modes already addressed by process (i.e. TDD, specs, integration & acceptance tests)?
I’m using LLMs a lot, but can’t shake the feeling that the TCO and total time shakes out worse than it feels as you go.
There was a guy a few months ago who found that telling the AI to do everything in a single PHP file actually produced significantly better results, i.e. it worked on the first try. Otherwise it defaulted to React, 1GB of node modules, and a site that wouldn't even load.
> Am I better served
For anything serious, I write the code "semi-interactively", i.e. I just prompt and verify small chunks of the program in rapid succession. That way I keep my mental model synced the whole time, I never have any catching up to do, and honestly it just feels good to stay in the driver's seat.
Pro-tip: Do NOT use LLMs to generate recipes, use them to search the internet for a site with a trustworthy recipe, for information on cooking techniques, science, or chemistry, or if you need ideas about pairings and/or cooking theory / conventions. Do not trust anything an LLM says if it doesn't give a source, it seems people on the internet can't cook for shit and just make stuff up about food science and cooking (e.g. "searing seals in the moisture", though most people know this is nonsense now), so the training data here is utterly corrupt. You always need to inspect the sources.
I don't even see how an LLM (or frankly any recipe) that is a summary / condensation of various recipes can ever be good, because cooking isn't something where you can semantically condense or even mathematically combine various recipes together to get one good one. It just doesn't work like that, there is just one secret recipe that produces the best dish, and the way to find this secret recipe is by experimenting in the real world, not by trying to find some weighting of a bunch of different steps from a bunch of different recipes.
Plus, LLMs don't know how to judge quality of recipes at all (and indeed hallucinate total nonsense if they don't have search enabled).
I genuinely admire your courage and willingness (or perhaps just chaos energy) to attempt both wiener schnitzel and pancakes for a crew, based on AI recipes, despite clearly limited knowledge of either.
Everything more complex than a hello-world has bugs. Compiler bugs are uncommon, but not that uncommon. (I must have debugged a few ICEs in my career, but luckily have had more skilled people to rely on when code generation itself was wrong.)
I had a fun bug while building a smartwatch app that was caused by the sample rate of the accelerometer increasing when the device heated up. I had code that was performing machine learning on the accelerometer data, which would mysteriously get less accurate during prolonged operation. It turned out that we gathered most of our training data during shorter runs when the device was cool, and when the device heated up during extended use, it changed the frequencies of the recorded signals enough to throw off our model.
I've also used a logic analyzer to debug communications protocols quite a few times in my career, and I've grown to rather like that sort of work, tedious as it may be.
Just this week I built a VFS using FUSE and managed to kernel panic my Mac a half-dozen times. Very fun debugging times.
> Meanwhile AI can't be trusted to give me a recipe for potato soup.
This just isn't true any more. Outside of work, my most common use case for LLMs is probably cooking. I used to frequently second guess them, but no longer - in my experience SOTA models are totally reliable for producing good recipes.
I recognize that at a higher level we're still talking about probabilistic recipe generation vs. deterministic compiler output, but at this point it's nonetheless just inaccurate to act as though LLMs can't be trusted with simple (e.g. potato soup recipe) tasks.
Just to nitpick - compilers (and, to some extent, processors) weren't deterministic a few decades ago. Getting them to be deterministic has been a monumental effort - see build reproducibility.
I remember the time I spent hours debugging a feature that worked on Solaris and Windows but failed to produce the right results on SGI. Turns out the SGI C++ compiler silently ignored the `throw` keyword! Just didn’t emit an opcode at all! Or maybe it wrote a NOP.
All I’m saying is, compilers aren’t perfect.
I agree about determinism though. And I mitigate that concern by prompting AI assistants to write code that solves a problem, instead of just asking for a new and potentially different answer every time I execute the app.
There's also no canonical way to write software, so in that sense generating code is more similar to coming up with a potato soup recipe than compiling code.
That is not the issue, any potato soup recipe would be fine, the issue is that it might fetch values from different recipes and give you an abomination.
This exactly, I cook as passion, and LLMs just routinely very clearly (weighted) "average" together different recipes to produce, in the worst case, disgusting monstrosities, or, in the best case, just a near-replica of some established site's recipe.
At least with the LLM, you don't have to wade through paragraph after paragraph of "I remember playing in the back yard as a child, I would get hungry..."
In fact LLMs write better and more interesting prose than the average recipe site.
It's not hard to scroll to the bottom of a page, IMO, but regardless, sites like you are mentioning have trash recipes in most cases.
I only go with resources where the text is actual documentation of their testing and/or the steps they've made, or other important details (e.g. SeriousEats, Whats Cooking America / America's Test Kitchen, AmazingRibs, Maangchi for Korean, vegrecipesofindia, Modernist series, etc) or look for someone with some credibility (e.g. Kenji Lopez, other chef on YouTube). In this case the text or surrounding content is valuable and should not be skipped. A plain recipe with no other details is generally only something an amateur would trust.
If you need a recipe, you don't know how to make it by definition, so you need more information to verify that the recipe is done soundly. There is also no reason to assume / trust that the LLMs summary / condensation of various recipes is good, because cooking isn't something where you can semantically condense or even mathematically combine various recipes together to get one good one. It just doesn't work like that, there is just one secret recipe that produces the best dish, and LLMs don't know how to judge quality of recipes, mostly.
I've never had an LLM produce something better or more trustworthy than any of those sites I mentioned, and have had it just make shit up when dealing with anything complicated (i.e. when trying to find the optimal ratio of starch to flour for Korean fried chicken, it just confidently claimed 50/50 is best, when this is obviously total trash to anyone who has done this).
The only time I've ever found LLMs useful for cooking is when I need to cook something obscure that only has information in a foreign language (e.g. icefish / noodlefish), or when I need to use it for search about something involving chemistry or technique (it once quickly found me a paper proving that baking soda can indeed be used to tenderize squid - but only after I prompted it further to get sources and go beyond its training data, because it first hallucinated some bullshit about baking soda only working on collagen or something, which is just not true at all).
So I would still never trust or use the quantities it gives me for any kind of cooking / dish without checking or having the sources, instead I would rely on my own knowledge and intuitions. This makes LLMs useless for recipes in about 99% of cases.
I think things can only be called revolutions in hindsight - while they are going on it's hard to tell if they are a true revolution, an evolution or a dead-end. So I think it's a little premature to call Generative AI a revolution.
AI will get there and replace humans at many tasks, machine learning already has, I'm not completely sure that generative AI will be the route we take, it is certainly superficially convincing, but those three years have not in fact seen huge progress IMO - huge amounts of churn and marketing versions yes, but not huge amounts of concrete progress or upheaval. Lots of money has been spent for sure! It is telling for me that many of the real founders at OpenAI stepped away - and I don't think that's just Altman, they're skeptical of the current approach.
What I don't understand about these arguments is that the input to the LLMs is natural language, which is inherently ambiguous. At which point, what does it even mean for an LLM to be reliable?
And if you start feeding an unambiguous, formal language to an LLM, couldn't you just write a compiler for that language instead of having the LLM interpret it?
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
> When was the last time you reviewed the machine code produced by a compiler?
Sure, because those are categorically different. You are describing shortcuts of two classes: boilerplate (library of things) and (deterministic/intentional) automation. Vibe coding doesn't use either of those things. The LLM agents involved might use them, but the vibe coder doesn't.
Vibe coding is delegation, which is a completely different class of shortcut or "tool" use. If an architect delegates all their work to interns, directs outcomes based on whims not principles, and doesn't actually know what the interns are delivering, yeah, I think it would be fair to call them a vibe architect.
We didn't have that term before, so we usually just call those people "arrogant pricks" or "terrible bosses". I'm not super familiar but I feel like Steve Jobs was pretty famously that way - thus if he was an engineer, he was a vibe engineer. But don't let this last point detract from the message, which is that you're describing things which are not really even similar to vibe coding.
I do not see LLM coding as another step up on the ladder of programming abstraction.
If your project is in, say, Python, then by using LLMs, you are not writing software in English; you are having an LLM write software for you in Python.
This is much more like delegation of work to someone else, than it is another layer in the machine-code/assembly/C/Python sort of hierarchy.
In my regular day job, I am a project manager. I find LLM coding to be effectively project management. As a project manager, I am free to dive down to whatever level of technical detail I want, but by and large, it is others on the team who actually write the software. If I assign a task, I don't say "I wrote that code", because I didn't; someone else did, even if I directed it.
And then, project management, delegating to the team, is most certainly nondeterministic behavior. Any programmer on the team might come up with a different solution, each of which works. The same programmer might come up with more than one solution, all of which work.
I don't expect the programmers to be deterministic. I do expect the compiler to be deterministic.
I think you are right in placing emphasis on delegation.
There’s been a hypothesis floating around that I find appealing. Seemingly you can identify two distinct groups of experienced engineers. Manager, delegator, or team lead style senior engineers are broadly pro-AI. The craftsman, wizard, artist, IC style senior engineers are broadly anti-AI.
But coming back to architects, or most professional services and academia to be honest, I do think the term vibe architect as you define it is exactly how the industry works. An underclass of underpaid interns and juniors do the work, hoping to climb higher and position themselves towards the top of the ponzi-like pyramid scheme.
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
Architect's copy-pasting is equivalent to a software developer reusing a tried and tested code library. Generating or writing new code is fundamentally different and not at all comparable.
> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
We would call them "vibe builders" if their machines threw bricks around randomly and the builders focused all of their time on engineering complex scaffolding around the machines to get the bricks flying roughly in the right direction.
But we don't because their machines, like our compilers and linters, do one job and they do it predictably. Most trades spend obscene amounts of money on tools that produce repeatable results.
> That's a lot of years! They're still called architects.
Because they still architect, they don't subcontract their core duties to architecture students overseas and just sign their name under it.
I find it fitting and amusing that people who are uncritical towards the quality of LLM-generated work seem to make the same sorts of reasoning errors that LLMs do. Something about blind spots?
Very likely, yes. One day we'll have a clearer understanding of how minds generalize concepts into well-trodden paths even when they're erroneous, and it'll probably shed a lot of light onto concepts like addiction.
Architects went from drawing everything on paper to using CAD, not over a generation, but over a few years, after CAD and computers got good enough.
It therefore depends on where we place the discovery/availability of the product. If we place it at the time of prototype production (in the early 1960s for CAD), it took a generation (20-30 years), since by the early and mid-1990s, all professionals were already using CAD.
But if we place it at the time when CAD and personal computers became available to the general public (e.g., mid-1980s), it took no more than 5-10 years. I attended a technical school in the 1990s, and we started with hand drawing in the first two years and used CAD systems in the remaining three years of school.
The same can be said for AI. If we place the beginning of AI in the mid-1980s, the wider adoption of AI took more than a generation. If we place it at the time OpenAI developed GPT, it took 5-10 years.
It's not about the tooling, it's about the reasoning. An architect copy-pasting existing blueprints is still in charge and has to decide what to copy-paste and where. Same as a programmer slapping a bunch of code together, plumbing libraries or writing fresh code. They are the ones who drive the logical reasoning and the building process.
The AI tooling reverses this: the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.
Anyone who is in this position seriously needs to think about their value added. How do they plan to justify their position and salary to the capital class? If the machine is doing the work for you, why would anyone pay you as much as they do when they can just replace you with someone cheaper, ideally with no one, for maximum profit.
Everyone is now in a competition not only against each other but also against the machine. And any specialized, expert-knowledge moat that you've built over decades of hard work is about to evaporate.
This is the real pressing issue.
And the only way you can justify your value added, your position, your salary is to be able to undermine the AI, find flaws in its output and reasoning. After all, if/when it becomes flawless you have no purpose to the capital class!
> The AI tooling reverses this: the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.
I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it. Either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right but even then it's you that is confirming this.
This is exactly what I said in the end. Right now you rely on it fucking things up. What happens to you when the AI no longer fucks things up? Sorry to say, but your position is no longer needed.
Don't take this as criticizing LLMs as a whole, but architects also don't call themselves engineers. Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.
"Architect" is actually a whole career progression of people with different responsibilities. The bottom rung used to be the draftsmen, people usually without formal education who did the actual drawing. Then you had the juniors, mid-levels, seniors, principals, and partners who each oversaw different aspects. The architects with their name on the building were already issuing high level guidance before the transition instead of doing their own drawings.
When was the last time you reviewed the machine code produced by a compiler?
Last week, to sanity check some code written by an LLM.
> Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.
Where this analogy breaks down is that the work you’re describing is done by Professional Engineers that have strict licensing and are (criminally) liable for the end result of the plans they approve.
That is an entirely different role from the army of civil, mechanical, and electrical engineers (some who are PEs and some who are not) who do most of the work for the principal engineer/designated engineer/engineer of record, that have to trust building codes and tools like FEA/FEM that then get final approval from the most senior PE. I don’t think the analogy works, as software engineers rarely report to that kind of hierarchy. Architects of Record on construction projects are usually licensed with their own licensing organization too, with layers of licensed and unlicensed people working for them.
That diversity of roles is what "among other things" was meant to convey. My job at least isn't terribly different, except that licensing doesn't exist and I don't get an actual stamp. My company (and possibly me depending on the facts of the situation) is simply liable if I do something egregious that results in someone being hurt.
> Where this analogy breaks down is that the work you’re describing is done by Professional Engineers that have strict licensing and are (criminally) liable for the end result of the plans they approve.
there are plenty of software engineers that work in regulated industries, with individual licensing, criminal liability, and the ability to be struck off and banned from the industry by the regulator
It's not that PE's can't design or review buildings in whatever city the egregious failure happened.
It's that PE's can't design or review buildings at all in any city after an egregious failure.
It's not that PE's can't design or review hospital building designs because one of their hospital designs went so egregiously sideways.
It's that PE's can't design or review any building for any use because their design went so egregiously sideways.
I work in an FDA regulated software area. I need 510k approval and the whole nine. But if I can't write regulated medical or dental software anymore, I just pay my fine and/or serve my punishment and go sling React/JS/web crap or become a TF/PyTorch monkey. No one stops me. Consequences for me messing up are far less severe than the consequences for a PE messing up. I can still write software because, in the end, I was never an "engineer" in that hard sense of the word.
Same is true of any software developer. Or any unlicensed area of "engineering" for that matter. We're only playing at being "engineers" with the proverbial "monopoly money". We lose? Well, no real biggie.
PE's agree to hang a sword of damocles over their own heads for the lifetime of the bridge or building they design. That's a whole different ball game.
> if I approve a bad release that leads to an egregious failure, for me it's a prison sentence and unlimited fines
Again, I'm in 510k land. The same applies to myself. No one's gonna allow me to irradiate a patient with a 10x dose because my bass ackwards software messed up scientific notation. To remove the wrong kidney because I can't convert orthonormal basis vectors correctly.
But the fact remains that no one would stop either of us from writing software in the future in some other domain.
They do stop PE's from designing buildings in the future in any other domain. By law. So it's very much a different ball game. After an egregious error, we can still practice our craft, because we aren't "engineers" at the end of the day. (Again, "engineers" in that hard sense of the word.) PE's can't practice their craft any longer after an egregious error. Because they are "engineers" in that hard sense of the word.
Reasoning by analogy is usually a bad idea, and nowhere is this worse than talking about software development.
It’s just not analogous to architecture, or cooking, or engineering. Software development is just its own thing. So you can’t use analogy to get yourself anywhere with a hint of rigour.
The problem is, AI is generating code that may be buggy, insecure, and unmaintainable. We have as a community spent decades trying to avoid producing that kind of code. And now we are being told that productivity gains mean we should abandon those goals and accept poor quality, as evidenced by MoltBook’s security problems.
It’s a weird cognitive dissonance and it’s still not clear how this gets resolved.
Now then, Moltbook is a pathological case. Either it remains a pathological case or our whole technological world is gonna stumble HARD as all the fundamental things collapse.
I prefer to think Moltbook is a pathological case and unrepresentative, but I've also been rethinking a sort of game idea from computer-based to entirely paper/card based (tariffs be damned) specifically for this reason. I wish to make things that people will have even in the event that all these nice blinky screens are ruined and go dark.
Just the first system that was coded by AI that I could think of. Note this is unrelated to the fact that its users are LLMs - the problem was in the development of Moltbook itself.
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
Maybe not, but we don't allow non-architects to vomit out thousands of diagrams that they cannot review, and that are never reviewed, which are subsequently used in the construction of the house.
Your analogy to s/ware is fatally and irredeemably flawed, because you are comparing the regulated and certification-heavy production of content, which is subsequently double-checked by certified professionals, with an unregulated and non-certified production of content which is never checked by any human.
I don't see a flaw, I think you're just gatekeeping software creation.
Anyone can pick up some CAD software and design a house if they so desire. Is the town going to let you build it without a certified engineer/architect signing off? Fuck no. But we don't lock down CAD software.
And presumably, mission critical software is still going to be stamped off on by a certified engineer of some sort.
> Anyone can pick up some CAD software and design a house if they so desire. Is the town going to let you build it without a certified engineer/architect signing off? Fuck no. But we don't lock down CAD software.
No, we lock down using that output from the CAD software in the real world.
> And presumably, mission critical software is still going to be stamped off on by a certified engineer of some sort.
The "mission critical" qualifier is new to your analogy, but is irrelevant anyway - the analogy breaks because, while you can do what you like with CAD software on your own PC, that output never gets used outside of your PC without careful and multiple levels of review, while in the s/ware case, there is no review.
I am not really sure what you are getting at here. Are you suggesting that people should need to acquire some sort of credential to be allowed to code?
> Are you suggesting that people should need to acquire some sort of credential to be allowed to code?
No, I am saying that you are comparing professional $FOO practitioners to professional $BAR practitioners, but it's not a valid comparison because one of those has review and safety built into the process, and the other does not.
You can't use the assertion "We currently allow $FOO practitioners to use every single bit of automation" as evidence that "We should also allow $BAR practitioners to use every bit of automation", because $FOO output gets review by certified humans, and $BAR output does not.
Thanks brother. I flew half way around the world yesterday and am jetlagged as fuck from a 12 hour time change. I'm sorry, my brain apparently shut off, but I follow now. Was out to lunch.
> Thanks brother. I flew half way around the world yesterday and am jetlagged as fuck from a 12 hour time change. I'm sorry, my brain apparently shut off, but I follow now. Was out to lunch.
You know, this was a very civilised discussion; below I've got someone throwing snide remarks my way for some claims I made. You just factually reconfirmed and re-verified until I clarified my PoV.
> We don't call architects 'vibe architects' even though (…)
> We don't call builders 'vibe builders' for (…)
> When was the last time (…)
None of those are the same thing. At all. They are still all deterministic approaches. The architect’s library of things doesn’t change every time they use it or present different things depending on how they hold it. It’s useful because it’s predictable. Same for all your other examples.
If we want to have an honest discussion about the pros and cons of LLM-generated code, proponents need to stop being dishonest in their comparisons. They also need to stop plugging their ears and ignoring the other issues around the technology. It is possible to have something which is useful but whose advantages do not outweigh the disadvantages.
I think the word predictable is doing a bit of heavy lifting there.
Let's say you shovel some dirt: you've got a lot of control over where you get it from and where you put it.
Now get in your big digger's cabin and try to have the same precision. At the level of a shovel-user, you are unpredictable even if you're skilled. Some of your work might be off by a decent fraction of the width of a shovel. That'd never happen if you did it the precise way!
But you have a ton more leverage. And that’s the game-changer.
That’s another dishonest comparison. Predictability is not the same as precision. You don’t need to be millimetric when shovelling dirt at a construction site. But you do need to do it when conducting brain surgery. Context matters.
Sure. If you’re racing your runway to go from 0 to 100 users you’d reach for a different set of tools than if you’re contributing to postgres.
In other words I agree completely with you but these new tools open up new possibilities. We have historically not had super-shovels so we’ve had to shovel all the things no matter how giant or important they are.
I’m not disputing that. What I’m criticising is the argument from my original parent post of comparing it to things which are fundamentally different, but making it look equivalent as a justification against criticism.
I skimmed over it, and didn’t find any discussion of:
- Pull requests
- Merge requests
- Code review
I feel like I’m taking crazy pills. Are SWE supposed to move away from code review, one of the core activities for the profession? Code review is as fundamental for SWE as double entry is for accounting.
Yes, we know that functional code can get generated at incredible speeds. Yes, we know that apps and what not can be bootstrapped from nothing by “agentic coding”.
We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I’m delivering line-by-line?
The primary point behind code reviews is to let the author know that someone else will look at their code. They are a psychological tool, and that, AFAIK, doesn't work well with AI models. If the code is important enough that you want to review it then you should probably be using a different, more interactive flow.
Mitchell talks about this in a roundabout way... in the "Reproduce your own work" section he obviously reviewed that code, as that was the point. In the "End-of-day agents" section he talks about what he found them good for (so far). He previously wrote about how he preferred an interactive style, and this article aligns with that, tracking his progress in understanding how coding agents can be useful.
Give it a read, he mentions briefly how he uses them for PR triage and resolving GH issues.
He doesn't go into detail, but there is a bit:
> Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond, I just wanted reports the next day to try to guide me towards high value or low effort tasks.
> More specifically, I would start each day by taking the results of my prior night's triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel).
This is a short excerpt, this article is worth reading. Very grounded and balanced.
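To make that concrete, here's a minimal sketch of what such a nightly triage fan-out could look like - my own illustration, not Mitchell's actual setup; it assumes an authenticated `gh` CLI and that the `claude` CLI accepts one-shot prompts via `-p`/`--print`:

```python
# Hypothetical nightly triage script in the spirit of the excerpt above.
# Assumptions (not from the article): `gh` is authenticated for this repo,
# and `claude -p "<prompt>"` runs a single non-interactive agent session.
import json
import subprocess
from pathlib import Path

REPORT_DIR = Path("triage-reports")
REPORT_DIR.mkdir(exist_ok=True)

# Pull a batch of open issues as structured JSON via the GitHub CLI.
issues = json.loads(subprocess.check_output(
    ["gh", "issue", "list", "--state", "open",
     "--json", "number,title", "--limit", "20"],
    text=True,
))

# One read-only triage agent per issue, run in parallel overnight.
# They only write local reports; they never respond on the issue itself.
procs = []
for issue in issues:
    prompt = (
        f"Use `gh issue view {issue['number']}` to read the issue "
        f"titled {issue['title']!r}. Do NOT comment or change anything. "
        "Write a short report: likely root cause, estimated effort, and "
        "whether this looks like a high-value or low-effort task."
    )
    report = open(REPORT_DIR / f"issue-{issue['number']}.md", "w")
    procs.append(subprocess.Popen(["claude", "-p", prompt], stdout=report))

for p in procs:
    p.wait()
```

The next morning you read the reports, not the issues, and only promote the ones that look like safe wins.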
Okay I think this somewhat answers my question. Is this individual a solo developer? “Triaging GitHub issues” sounds a bit like open source solo developer.
Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI.
I remain unconvinced that agentic AI scales beyond solo development, where the individual is liable for the output of the agents. More precisely, I can use agentic AI to write my code, but at the end of the day when I submit it to my org it’s my responsibility to understand it, and guarantee (according to my personal expertise) its security and reliability.
Conversely, I would fire (read: reprimand) someone so fast if I found out they submitted code that created a vulnerability that they would have reasonably caught if they weren’t being reckless with code submission speed, LLM or not.
AI will not revolutionize SWE until it revolutionizes our processes. It will definitely speed us up (I have definitely become faster), but faster != revolution.
> Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI.
They probably aren't really. At least in orgs I worked at, writing the code wasn't usually the bottleneck. It was, in retrospect, 'context' engineering: waiting for the decision to get made, making some change and finding it breaks some assumption that was being made elsewhere but wasn't in the ticket, waiting for other stakeholders to insert their piece of the context, waiting for $VENDOR to reply about why their service is/isn't doing X anymore, discovering that $VENDOR_A's stage environment (that your stage environment is testing against for the integration) does $Z when $VENDOR_B_C_D don't do that, etc.
The ecosystem as a whole has to shift for this to work.
The author of the blog made his name and fortune founding Hashicorp, makers of Vagrant and Terraform among other things. Having done all that in his twenties he retired as the CTO and reappeared after a short hiatus with a new open source terminal, Ghostty.
Generally I don't pay attention to names unless it's someone like Torvalds, Stroustrup, or Guido. Maybe this guy needs another decade of notoriety or something.
Either really comprehensive tests (that you read) or read it. Usually I find you can skim most of it, but in core sections like billing you've got to really review it. The models still make mistakes.
You read it. You now have an infinite army of overconfident slightly drunken new college grads to throw at any problem.
Some times you’re gonna want to slowly back away from them and write things yourself. Sometimes you can farm out work to them.
Code review their work as you would any one else’s, in fact more so.
My rule of thumb has been it takes a senior engineer per every 4 new grads to mentor them and code review their work. Or put another way bringing on a new grad gets you +1 output at the cost of -0.25 a senior.
Also, there are some tasks you just can’t give new college grads.
Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.
> Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.
Each individual trained model... sure. But otoh you can look at it as a very wide junior with "infinite (only limited by your budget)" willpower. Sure, three years ago they were GPT-3.5, basically useless. And now they're Opus 4.6. I wonder what the next few years will bring.
we're talking about _this_ post? He specifically said he only runs one agent, so sure he probably reviews the code or as he stated finds means of auto-verifying what the agent does (giving the agent a way to self-verify as part of its loop).
For me, AI is best for code research and review.
Since some team members started using AI without care, I created a bunch of agents/skills/commands and custom scripts for Claude Code. For each PR, it collects the changes via git log/diff, reads the PR data, and spins up a bunch of specialized agents to check code style, architecture, security, performance, and bugs. Each agent is armed with the necessary requirement documents, including security compliance files. False positives are rare, but it still misses some problems. No PR with AI-generated code passes it. If the AI did not find any problems, I do a manual review.
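A rough sketch of what that kind of per-PR agent fan-out might look like - my reconstruction in Python, not the commenter's actual scripts; the requirements-document paths are made up, and it assumes an authenticated `gh` CLI plus a `claude` CLI with a non-interactive `-p` mode:

```python
# Illustrative only: one specialized reviewer per concern, each fed the PR
# diff plus a (hypothetical) requirements document as its bar for review.
import subprocess
import sys

pr_number = sys.argv[1]

# Gather the PR metadata and the diff the reviewers will look at.
pr_info = subprocess.check_output(
    ["gh", "pr", "view", pr_number, "--json", "title,body"], text=True)
diff = subprocess.check_output(["gh", "pr", "diff", pr_number], text=True)

# Assumed document paths; substitute whatever your repo actually keeps.
REVIEWERS = {
    "style": "docs/style-guide.md",
    "architecture": "docs/architecture.md",
    "security": "docs/security-compliance.md",
    "performance": "docs/performance-budget.md",
}

findings = {}
for concern, doc in REVIEWERS.items():
    prompt = (
        f"You are a {concern} reviewer. Review this PR strictly for {concern} "
        f"issues, using the requirements in {doc} as the bar. Report only "
        "concrete problems with file/line references, or 'no findings'.\n\n"
        f"PR metadata:\n{pr_info}\nDiff:\n{diff}"
    )
    findings[concern] = subprocess.check_output(
        ["claude", "-p", prompt], text=True)

for concern, report in findings.items():
    print(f"=== {concern} ===\n{report}\n")
```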
From my own experience - I kept this post bookmarked because I too worked on that project in the late 1990s - you cannot review those changes anyway. It is handled as described: you keep tweaking stuff until the tests pass. There is fundamentally no way to understand the code. Maybe it's different in some very core parts, but most of it is just far too messy. I tried merely disentangling a few types once, because there were a lot of duplicate types for the most simple things, such as 32-bit integers, and it is like trying to pick one noodle out of a huge bowl of spaghetti: everything is glued and knotted together, so you always end up lifting out the entire bowl's contents. No AI necessary, that is just how such projects look after many generations of temporary programmers (because all sane people leave as soon as they can, e.g. once they've switched from an H1B to a Green Card) under ticket-closing pressure.
I don't know why, since the beginning of these discussions, some commenters seem to work off the wrong assumption that our existing methods lead to great code. Very often they don't; they lead to a huge mess that just gets bigger over time.
And that is not because people are stupid, it's because top management has rationally determined that the best balance for overall profits does not require perfect code. By the time the project gets too messy to do much with, the customers will already have been hooked and can't change easily, and when they do, some new product will have already replaced the two-decades-old mature one. Those customers still on the old one will pay a premium for future bug fixes, and the rest will jump to the new trend. I don't think AI can make what's described above any worse, or much worse.
You're missing the point. The point is that reading the code is more time consuming than writing it, and has always been thus. Having a machine that can generate code 100x faster, but which you have to read carefully to make sure it hasn't gone off the rails, is not an asset. It is a liability.
> The point is that reading the code is more time consuming than writing it, and has always been thus.
Huh?
First, that is definitely not true. If it were, dev teams would spend the majority of their time on code review, but they don't.
And second, even if it were true, you have to read it for code review even if it was written by a person anyways, if we're talking about the context of a team.
I didn't get into creating software so I could read a plagiarism-laundering machine's output. Sorry, miss me with these takes. I love using my keyboard, and my brain.
I have a profession. Therefore I evaluate new tools. Agentic coding I've introduced into my auxiliary tool-forging (one-off bash scripts) and personal projects, and I'm just now comfortable introducing it into my professional work. But I still evaluate every line.
I think this is the crux of why, when used as an enhancement to solo productivity, you'll have a pretty strict upper bound on productivity gains given that it takes experienced engineers to review code that goes out at scale.
That being said, software quality seems to be decreasing, or maybe it's just cause I use a lot of software in a somewhat locked down state with adblockers and the rest.
Although, that wouldn't explain just how badly they've murdered the once lovely iTunes (now Apple Music) user interface. (And why does CMD-C not pick up anything 15% of the time I use it lately...)
Anyways, digressions aside... the complexity in software development is generally in the organizational side. You have actual users, and then you have people who talk to those users and try to see what they like and don't like in order to distill that into product requirements which then have to be architected, and coordinated (both huge time sinks) across several teams.
Even if you cut out 100% of the development time, you'd still be left with 80% of the timeline.
Over time though... you'll probably see people doing what I do all day (which is move around among many repositories (although I've yet to use the AI much, got my Cursor license recently and am gonna spin up some POCs that I want to see soon)), enabled by their use of AI to quickly grasp what's happening in the repo, and the appropriate places to make changes.
Enabling developers to complete features from tip to tail across deep, many-pronged service architectures could bring project time down drastically and bring project management and cross-team coordination costs down tremendously.
Similarly, in big companies, the hand is often barely aware at best of the foot. And space exploration is a serious challenge. Often folk know exactly one step away, and rely on well established async communication channels which also only know one step further. Principal engineers seem to know large amounts about finite spaces and are often in the dark small hops away to things like the internal tooling for the systems they're maintaining (and often not particularly great at coming in to new spaces and thinking with the same perspective... no we don't need individual micro services for every 12 request a month admin api group we want to set up).
Once systems can take a feature proposal and lay out concrete plans which each little kingdom can give a thumbs up or thumbs down to for further modifications, you can again reduce exploration, coordination, and architecture time down.
Sadly, User Experience design seems to be an often terribly neglected part of our profession. I love the memes about an engineer building the perfect interface, like a water pitcher, only for the person to position it weirdly in order to get a pour out of the fill hole or something. Lemme guess how many users you actually talked to (often zero), and how many layers of distillation occurred before you received a micro-picture feature request that ends up being built, taking input from engineers with no macro understanding of a user's actual needs or day to day.
And who often are much more interested in perfecting some little algorithm than thinking about enabling others.
So my money is on money flowing to...
- People who can actually verify system integrity, and can fight fires and bugs (but a lot of bug fixing will eventually become prompting?)
- Multi-talented individuals who can say... interact with users well enough to understand their needs as well as do a decent job verifying system architecture and security
It's outside of coding where I haven't seen much... I guess people use it to more quickly scaffold up expense reports, or generate mocks. So, lots of white collar stuff. But... it's not like the experience of shopping at the supermarket has changed, or going to the movies, or much of anything else.
Your sentiment resonates with me a lot. I wonder what we'll consider the inflection point 10 years from now. It seemed like the zeitgeist was screaming about scaling limits and running out of training data, then we got Claude Code, Sonnet 4.5, then Opus 4.5, and no one's looked back since.
I wonder too. It might be that progress on the underlying models is going to plateau, or it might be that we haven't yet reached what in retrospect will be the biggest inflection point. Technological developments can seem to make sense in hindsight as a story of continuous progress when the dust has settled and we can write and tell the history, but when you go back and look at the full range of voices in the historical sources you realize just how deeply nothing was clear to anyone at all at the time it was happening because everyone was hurtling into the unknown future with a fog of war in front of them. In 1910 I'd say it would have been perfectly reasonable to predict airplanes would remain a terrifying curiosity reserved for daredevils only (and people did); or conversely, in the 1960s a lot of commentators thought that the future of passenger air travel in the 70s and 80s would be supersonic jets. I keep this in mind and don't really pay too much attention to over-confident predictions about the technological future.
Should AI tools use memory safe tabs or spaces for indentation? :)
It is a shame it's become such a polarized topic. Things which actually work fine get immediately bashed by large crowds at the same time things that are really not there get voted to the moon by extremely eager folks. A few years from now I expect I'll be thinking "man, there was some really good stuff I missed out on because the discussions about it were so polarized at the time. I'm glad that has cleared up significantly!"
Let me ask a stupid/still-ignorant question - about repeatability.
If one asks this generator/assistant the same request/thing, within the same initial context, 10 times, would it generate the same result? In different sessions and all that.
Because, if not, then it's for once-off things only...
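One way to answer that empirically is just to run the same prompt repeatedly and compare. A minimal sketch, assuming the OpenAI Python client and a placeholder model name; in practice the answer is "mostly no" - temperature 0 plus a fixed seed gets you close, but providers only promise best-effort determinism:

```python
# Hypothetical repeatability check: send the identical prompt 10 times in
# fresh sessions and count how many distinct completions come back.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Write a Python function that reverses a singly linked list."

outputs = Counter()
for _ in range(10):
    resp = client.chat.completions.create(
        model="gpt-4o",   # placeholder model name (an assumption)
        messages=[{"role": "user", "content": prompt}],
        temperature=0,    # greedy decoding: as deterministic as the API allows
        seed=42,          # best-effort reproducibility hint, not a guarantee
    )
    outputs[resp.choices[0].message.content] += 1

# Fully repeatable output would show a single entry with count 10.
for text, count in outputs.most_common():
    print(count, "x", text[:60].replace("\n", " "), "...")
```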
A pretty bad comparison. If I gave you the correct answer once, it's unlikely that I'll give you a wrong answer the next time. Also, aren't computers supposed to be more reliable than us? If I'm going to use a tool that behaves just like humans, why not just use my brain instead?
I will give Claude Code a trial run if I can run it locally without an internet connection. AI companies have procured so much training data through illegal means you have to be insane to trust them in even the smallest amount.
This is such a strawman argument. What are they going to take from you? Your triple for-loop? They literally own the weights for a neural net that scores 77% on SWE. They don't need, nor care about, your code.
GPT-4 showed the potential, but the automated workflows (context management, loops, test-running) and the pure execution speed to handle all that "reasoning"/workflow (remember watching characters pop in slowly in GPT-4 streaming API responses) are game-changers.
The workflow automation and better (and model-directed) context management are all obvious in retrospect but a lot of people (like myself) were instead focused on IDE integration and such vs `grep` and the like. Maybe multi-agent with task boards is the next thing, but it feels like that might also start to outrun the ability to sensibly design and test new features for non-greenfield/non-port projects. Who knows yet.
I think it's still very valuable for someone to dig in to the underlying models periodically (insomuch as the APIs even expose the same level of raw stuff anymore) to get a feeling for what's reliable to one-shot vs what's easily correctable by a "ran the tests, saw it was wrong, fixed it" loop. If you don't have a good sense of that, it's easy to get overambitious and end up with something you don't like if you're the sort of person who cares at all about what the code looks like.
I think for a lot of people the turn off is the constant churn and the hype cycle. For a lot of people, they just want to get things done and not have to constantly keep on top of what's new or SOTA. Are we still using MCPs or are we using Skills now? Not long ago you had to know MCP or you'd be left behind and you definitely need to know MCP UI or you'll be left behind. I think. It just becomes really tiring, especially with all the FUD.
I'm embracing LLMs but I think I've had to just pick a happy medium and stick with Claude Code with MCPs until somebody figures out a legitimate way to use the Claude subscription with open source tools like OpenCode, then I'll move over to that. Or if a company provides a model that's as good value that can be used with OpenCode.
It reminds me a lot of 3D Printing tbh. Watching all these cool DIY 3d printing kits evolve over years, I remember a few times I'd checked on costs to build a DIY one. They kept coming down, and down, and then around the same time as "Build a 3d printer for $200 (some assembly required)!" The Bambu X1C was announced/released, for a bit over a grand iirc? And its whole selling point was that it was fast and worked, out of the box. And so I bought one and made a bunch of random one-off-things that solved _my_ specific problem, the way I wanted it solved. Mostly in the form of very specific adapter plates that I could quickly iterate on and random house 'wouldn't it be nice if' things.
That's kind of where AI-agent-coding is now too, though... software is more flexible.
> For a lot of people, they just want to get things done and not have to constantly keep on top of what's new or SOTA
That hasn’t been tech for a long time.
Frontend has been changing forever. React and friends have new releases all the time. Node has new package managers and even Deno and Bun. AWS keeps changing things.
You really shouldn't use the absolute hellscape of churn that is web dev as an example of broader industry trends. No other sub-field of tech is foolish enough to chase hype and new tools the way web dev is.
I think the web/system dichotomy is also a major conflating factor for LLM discussions.
A “few hundred lines of code” in Rust or Haskell can bump into multiple issues that LLM-assisted coding struggles with. Moving a few buttons on a website with animations and stuff through multiple front-end frameworks may reasonably generate 5-10x that much “code”, but of an entirely different calibre.
3,000 lines a day of well-formatted HTML template edits, paired with a reloadable website for rapid validation, is super digestible, while 300 lines of code per day into curl could be seen as reckless.
Exactly this. At work, I’ve seen front-end people generating probably 80% of their code because when you set aside framework churn, a lot of it is boilerplatey and borderline trivial (sorry). Meanwhile, the programmers working on the EV battery controller that uses proprietary everything and where a bug could cause an actual explosion are using LLMs as advanced linters and that’s it.
There's a point at which these things become Good Enough though, and don't bottleneck your capacity to get things done.
To your point, React, while it has new updates, hasn't changed the fundamentals since 16.8.0 (introduction of hooks) and that was 7 years ago. Yes there are new hooks, but they typically build on older concepts. AWS hasn't deprecated any of our existing services at work (besides maybe a MySQL version becoming EOL) in the last 4 years that I've worked at my current company.
While I prefer pnpm (to not take up my MacBook's inadequate SSD space), you can still use npm and get things done.
I don't need to keep obsessing over whether Codex or Claude have a 1 point lead in a gamed benchmark test so long as I'm still able to ship features without a lot of churn.
Isn’t there something off about calling predictions about the future that aren’t possible with current tech "hype"? People predicted AI agents would be this huge change; they were called hype since earlier models were so unreliable, and now they are mostly right, as AI agents work like a mid-level engineer and are clearly superhuman in some areas.
> They do not.
Do you have anything to back this up? This seems like a shallow dismissal. Claude Code is mostly used to ship Claude Code and Claude Cowork - which are at multi billion ARR. I use Claude Code to ship technically deep dev tools for myself for example here https://github.com/ianm199/bubble-analysis. I am a decent engineer and I wouldn't have the time or expertise to ship that.
>Sure, if you think calculators or bicycles are "superhuman technology".
Uh, yes they are? That's why they were revolutionary technologies!
It's hard to see why a bike that isn't superhuman would even make sense? Being superhuman in at least some aspect really seems like the bare minimum for a technology to be worth adopting.
Is there any reason to use Claude Code specifically over Codex or Gemini? I’ve found both Codex and Gemini similar in results, but I never tried Claude because I keep hearing usage runs out so fast on Pro plans and there’s no free trial for the CLI.
I mostly mentioned Claude Code because it's what Mitchell first tried according to his post, and it's what I personally use. From what I hear Codex is pretty comparable; it has a lot of fans. There are definitely some differences and strengths and weaknesses of both the CLIs and the underlying LLMs that others who use more than one tool might want to weigh in on, but they're all fairly comparable. (Although, we'll see how the new models released from Anthropic and OpenAI today stack up.) Codex and Gemini CLI are basically Claude Code clones with different LLMs behind them, after all.
IME Gemini is pretty slow in comparison to Claude - but hey, it's super cheap at least.
But that speed makes a pretty significant difference in experience.
If you wait a couple minutes and then give the model a bunch of feedback about what you want done differently, and then have to wait again, it gets annoying fast.
If the feedback loop is much tighter things feel much more engaging. Cursor is also good at this (investigate and plan using slower/pricier models, implement using fast+cheap ones).
But annoying hype is exactly the issue with AI in my eyes. I get that it's a useful tool in moderation and all, but I also see that management values speed and quantity of delivery above all else, and, hype-driven as they are, I fear they will run this industry into the ground, and we as users and customers will have to deal with a world where software is permanently broken, a giant pile of unmaintainable vibe code, with no experienced junior developers coming up, to boot.
>management values speed and quantity of delivery above all else
I don't know about you but this has been the case for my entire career.
Mgmt never gave a shit about beautiful code or tech debt or maintainability or how enlightened I felt writing code.
> It's a shame that AI coding tools have become such a polarizing issue among developers.
Frankly, I'm so tired of the usual "I don't find myself more productive", "it writes soup". Especially when some of the best software developers (and engineers) find so much utility in these tools, there should be some doubt growing in that crowd.
I have come to the conclusion that the naysayers are software developers, those focused only on the craft of writing code.
Software engineers immediately recognize the many automation/exploration/etc. boosts, recognize the tools' limits, and work on improving them.
Hell, AI is an insane boost to productivity, even if you don't have it write a single line of code ever.
But people who focus on the craft (the kind of crowd that doesn't even process the concept of throwaway code or budgets or money) will keep leaning on their "I don't see the benefits because X" forever, nonsensically confusing any tool use with vibe coding.
I'm also convinced that this crowd never had any notion of what engineering is (there is sadly very little of it in our industry; technology and code are the focus, rarely the business, the budget, and the problems to solve) and confused it with architecture, technology choices, or best practices. They are genuinely insecure about their jobs because, once their highly valued craft and skills are diminished, they pay the price of never having invested in understanding the business, the domain, processes, or soft skills.
I've spent 2+ decades producing software across a number of domains and orgs and can fully agree that _disciplined use_ of LLM systems can significantly boost productivity, but the rules and guidance around their use within our industry writ large are still in flux and causing as many problems as they're solving today.
As the most senior IC within my org, since the advent of (enforced) LLM adoption my code contribution/output has stalled as my focus has shifted to the reactionary work of sifting through the AI-generated chaff following post mortems of projects that should never have shipped in the first place. On a good day I end up rejecting several PRs that most certainly would have taken down our critical systems in production due to poor vetting and architectural flaws, and on the worst I'm in full-on firefighting mode to "fix" the same issues already taking down production (already too late).
These are not inherent technical problems in LLMs, these are organizational/processes problems induced by AI pushers promising 10x output without the necessary 10x requirements gathering and validation efforts that come with that. "Everyone with GenAI access is now a 10x SDE" is the expectation, when the reality is much more nuanced.
The result I see today is massive incoming changesets that no one can properly vet given the new shortened delivery timelines and reduced human resourcing given to projects. We get test suite coverage inflation where "all tests pass" but undermine core businesses requirements and no one is being given the time or resources to properly confirm the business requirements are actually being met. Shit hits the fan, repeat ad nauseum. The focus within our industry needs to shift to education on the proper application and use of these tools, or we'll inevitably crash into the next AI winter; an increasingly likely future that would have been totally avoidable if everyone drinking the Koolaid stopped to observe what is actually happening.
As you implied, code is cheap and most code is "throwaway" given even modest time horizons, but all new code comes with hidden costs not readily apparent to all the stakeholders attempting to create a new normal with GenAI. As you correctly point out, the biggest problems within our industry aren't strictly technical ones, they're interpersonal, communication and domain expertise problems, and AI use is simply exacerbating those issues. Maybe all the orgs "doing it wrong" (of which there are MANY) simply fail and the ones with actual engineering discipline "make it," but it'll be a reckoning we should not wish for.
I have heard from a number of different industry players and they see the same patterns. Just look at the average linked in post about AI adoption to confirm. Maybe you observe different patterns and the issues aren't as systemic as I fear. I honestly hope so.
Your implication that seniors like myself are "insecure about our jobs" is somewhat ironically correct, but not for the reasons you think.
The Death of the "Stare": Why AI’s "Confident Stupidity" is a Threat to Human Genius
OPINION | THE REALITY CHECK
In the gleaming offices of Silicon Valley and the boardrooms of the Fortune 500, a new religion has taken hold. Its deity is the Large Language Model, and its disciples—the AI Evangelists—speak in a dialect of "disruption," "optimization," and "seamless integration." But outside the vacuum of the digital world, a dangerous friction is building between AI’s statistical hallucinations and the unyielding laws of physics.
The danger of Artificial Intelligence isn't that it will become our overlord; the danger is that it is fundamentally, confidently, and authoritatively stupid.
The Paradox of the Wind-Powered Car
The divide between AI hype and reality is best illustrated by a recent technical "solution" suggested by a popular AI model: an electric vehicle equipped with wind generators on the front to recharge the battery while driving. To the AI, this was a brilliant synergy. It even claimed the added weight and wind resistance amounted to "zero."
To any human who has ever held a wrench or understood the First Law of Thermodynamics, this is a joke—a perpetual motion fallacy that ignores the reality of drag and energy loss. But to the AI, it was just a series of words that sounded "correct" based on patterns. The machine doesn't know what wind is; it only knows how to predict the next syllable.
The Erosion of the "Human Spark"
The true threat lies in what we are sacrificing to adopt this "shortcut" culture. There is a specific human process—call it The Stare. It is that thirty-minute window where a person looks at a broken machine, a flawed blueprint, or a complex problem and simply observes.
In that half-hour, the human brain runs millions of mental simulations. It feels the tension of the metal, the heat of the circuit, and the logic of the physical universe. It is a "Black Box" of consciousness that develops solutions from absolutely nothing—no forums, no books, and no Google.
However, the new generation of AI-dependent thinkers views this "Stare" as an inefficiency. By outsourcing our thinking to models that cannot feel the consequences of being wrong, we are witnessing a form of evolutionary regression. We are trading hard-earned competence for a "Yes-Man" in a box.
The Gaslighting of the Realist
Perhaps most chilling is the social cost. Those who still rely on their intuition and physical experience are increasingly being marginalized. In a world where the screen is king, the person pointing out that "the Emperor has no clothes" is labeled as erratic, uneducated, or naive.
When a master craftsman or a practical thinker challenges an AI’s "hallucination," they aren't met with logic; they are met with a robotic refusal to acknowledge reality. The "AI Evangelists" have begun to walk, talk, and act like the models they worship—confidently wrong, devoid of nuance, and completely detached from the ground beneath their feet.
The High Cost of Being "Authoritatively Wrong"
We are building a world on a foundation of digital sand. If we continue to trust AI to design our structures and manage our logic, we will eventually hit a wall that no "prompt" can fix.
The human brain runs on 20 watts and can solve a problem by looking at it. The AI runs on megawatts and can’t understand why a wind-powered car won't run forever. If we lose the ability to tell the difference, we aren't just losing our jobs—we're losing our grip on reality itself.
> Break down sessions into separate clear, actionable tasks. Don't try to "draw the owl" in one mega session.
This is the key one I think. At one extreme you can tell an agent "write a for loop that iterates over the variable `numbers` and computes the sum" and they'll do this successfully, but the scope is so small there's not much point in using an LLM. On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.
A lot of successful LLM adoption for code is about finding this sweet spot. Overly specific instructions don't make you any more productive, and with overly broad instructions you end up redoing too much of the work.
This is actually an aspect of using AI tools I really enjoy: Forming an educated intuition about what the tool is good at, and tastefully framing and scoping the tasks I give it to get better results.
It cognitively feels very similar to other classic programming activities, like modularization at any level from architecture to code units/functions, thoughtfully choosing how to lay out and chunk things. It's always been one of the things that make programming pleasurable for me, and some of that feeling returns when slicing up tasks for agents.
"Become better at intuiting the behavior of this non-deterministic black box oracle maintained by a third party" just isn't a strong professional development sell for me, personally. If the future of writing software is chasing what a model trainer has done with no ability to actually change that myself I don't think that's going to be interesting to nearly as many people.
It sounds like you're talking more about "vibe coding" i.e. just using LLMs without inspecting the output. That's neither what the article nor the people to whom you're replying are saying. You can (and should) heavily review and edit LLM generated code. You have the full ability to change it yourself, because the code is just there and can be edited!
I think this is underrating the role of intuition in working effectively with deterministic but very complex software systems like operating systems and compilers. Determinism is a red herring.
I agree that framing and scoping tasks is becoming a real joy. The great thing about this strategy is there's a point at which you can scope something small enough that it's hard for the AI to get it wrong and it's easy enough for you as a human to comprehend what it's done and verify that it's correct.
I'm starting to think of projects now as a tree structure where the overall architecture of the system is the main trunk and from there you have the sub-modules, and eventually you get to implementations of functions and classes. The goal of the human in working with the coding agent is to have full editorial control of the main trunk and main sub-modules and delegate as much of the smaller branches as possible.
Sometimes you're still working out the higher-level architecture, too, and you can use the agent to prototype the smaller bits and pieces which will inform the decisions you make about how the higher-level stuff should operate.
[Edit: I may have been replying to another comment in my head as now I re-read it and I'm not sure I've said the same thing as you have. Oh well.]
I agree. This is how I see it too. It's more like a shortcut to an end result that's very similar (or much better) than I would've reached through typing it myself.
The other day I did realise that I'm using my experience to steer it away from bad decisions a lot more than I noticed. It feels like it does all the real work, but I have to remember it's my/our (decades of) experience writing code playing a part also.
I'm genuinely confused when people come in at this point and say that it's impossible to do this and produce good output and end results.
I feel the same, but, also, within like three years this might look very different. Maybe you'll give the full end-to-end goal upfront and it will just poll you when it needs clarification or wants to suggest alternatives, self-managing and cleanly self-delegating.
Or maybe something quite different but where these early era agentic tooling strategies still become either unneeded or even actively detrimental.
I think anyone who has worked on a serious software project would say, this means it would be polling you constantly.
Even if we posit that an LLM is equivalent to a human, humans constantly clarify requirements/architecture. IMO on both of those fronts the correct path often reveals itself over time, rather than being knowable from the start.
So in this scenario it seems like you'd be dealing with constant pings and would need to really make sure your understanding of the project is growing along with the LLM's development efforts as well.
To me this seems like the best-case of the current technology, the models have been getting better and better at doing what you tell it in small chunks but you still need to be deciding what it should be doing. These chunks don't feel as though they're getting bigger unless you're willing to accept slop.
> Break down sessions into separate clear, actionable tasks.
What this misses, of course, is that you can just have the agent do this too. Agents are great at making project plans, especially if you give them a template to follow.
It sounds to me like the goal there is to spell out everything you don't want the agent to make assumptions about. If you let the agent make the plan, it'll still make those assumptions for you.
I do something very similar. I have an "outside expert" script I tell my agent to use as the reviewer. It only bothers me when neither it nor the expert can figure out what the heck it is I actually wanted.
In my case I have Gemini CLI, so I tell Gemini to use a little Python script called gatekeeper.py to validate its plan before each phase with Qwen, Kimi, or (if nothing else is getting good results) ChatGPT 5.2 Thinking. Qwen & Kimi are via fireworks.ai, so it's much cheaper than ChatGPT. The agent is not allowed to start work until one of the "experts" approves the plan via the gatekeeper. Similarly, it can't mark a phase as complete until the gatekeeper approves the code as bug-free and up to standards and it passes all unit tests & linting.
Lately Kimi is good enough, but when it's really stuck it will sometimes bother ChatGPT. Seldom does it get all the way to the bottom of the pile and need my input. Usually it's when my instructions turned out to be vague.
I also have it use those larger thinking models for "expert consultation" when it has spent more than 100 turns on any problem and hasn't made progress by its own estimation.
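For readers curious what that kind of script could look like, here is a minimal, hypothetical sketch of a gatekeeper-style reviewer. It is not the commenter's actual gatekeeper.py; the model ID, prompt, and APPROVE/REJECT convention are all assumptions. It uses the OpenAI-compatible endpoint that fireworks.ai exposes:

```python
# Hypothetical gatekeeper sketch: send a plan to an "outside expert" model and
# exit 0 only if it approves. Model ID and approval format are placeholders.
import os
import sys

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # OpenAI-compatible API
    api_key=os.environ["FIREWORKS_API_KEY"],
)

SYSTEM = (
    "You are an outside expert reviewing an implementation plan. "
    "Reply with APPROVE or REJECT on the first line, then your reasoning."
)

def review(plan: str, model: str = "accounts/fireworks/models/kimi-k2-instruct") -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": plan},
        ],
    )
    verdict = resp.choices[0].message.content
    print(verdict)
    return verdict.strip().upper().startswith("APPROVE")

if __name__ == "__main__":
    plan_text = open(sys.argv[1]).read() if len(sys.argv) > 1 else sys.stdin.read()
    sys.exit(0 if review(plan_text) else 1)
```

The agent would then be told to run something like `python gatekeeper.py plan.md` and only proceed when it exits successfully.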
> On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.
Amusingly, this was my experience in giving Lovable a shot. The onboarding process was literally just setting me up for failure by asking me to describe the detailed app I was attempting to build.
Taking it piece by piece in Claude Code has been significantly more successful.
I actually enjoy writing specifications. So much so that I made it a large part of my consulting work for a huge part of my career. So it makes sense that working with gen-AI that way is enjoyable for me.
The more detailed I am in breaking down chunks, the easier it is for me to verify and the more likely I am going to get output that isn't 30% wrong.
> the scope is so small there's not much point in using an LLM
Actually that's how I did most of my work last year. I was annoyed by existing tools so I made one that can be used interactively.
It has full context (I usually work on small codebases), and can make an arbitrary number of edits to an arbitrary number of files in a single LLM round trip.
For such "mechanical" changes, you can use the cheapest/fastest model available. This allows you to work interactively and stay in flow.
(In contrast to my previous obsession with the biggest, slowest, most expensive models! You actually want the dumbest one that can do the job.)
I call it "power coding", akin to power armor, or perhaps "coding at the speed of thought". I found that staying actively involved in this way (letting LLM only handle the function level) helped keep my mental model synchronized, whereas if I let it work independently, I'd have to spend more time catching up on what it had done.
I do use both approaches though, just depends on the project, task or mood!
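As a rough illustration of that kind of single-round-trip tool (not the commenter's actual implementation; the edit format, prompt, and model name here are assumptions), one might do something like:

```python
# Sketch of a "whole codebase in, whole-file edits out" helper for small repos.
import json
import pathlib

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SYSTEM = (
    "You edit code. Reply ONLY with JSON of the form "
    '{"edits": [{"path": "<file>", "content": "<full new file content>"}]}.'
)

def load_codebase(root: str = ".") -> str:
    # Small codebases only: concatenate every source file with a header.
    parts = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        parts.append(f"### {path}\n{path.read_text()}")
    return "\n\n".join(parts)

def power_edit(instruction: str, model: str = "gpt-4o-mini") -> None:
    # A cheap, fast model is usually enough for mechanical, well-scoped edits.
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": load_codebase() + "\n\nTask: " + instruction},
        ],
    )
    for edit in json.loads(resp.choices[0].message.content)["edits"]:
        pathlib.Path(edit["path"]).write_text(edit["content"])
        print("updated", edit["path"])
```

The interesting property is that one request can touch any number of files, which is what keeps the interaction fast enough to stay in flow.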
This matches my experience, especially "don’t draw the owl" and the harness-engineering idea.
The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me.
Once I got that loop stable, it stopped being a toy and started being a lever. I’ve shipped real features this way across a few projects (a git like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can’t drift unnoticed.
>The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
Yeah, I would see patterns where initial prototypes were promising, then we developed something that was 90% of the way to the design goals, and then, as we tried to push in the last 10%, drift would start breaking down, or even just forgetting, the first 90%.
So I would get to 90% and then basically start a new project with that as the baseline to add to.
No harm meant, but your writing is very reminiscent of an LLM. It is great actually, there is just something about it - "it wasn't.. it was", "it stopped being.. and started". Claude and ChatGPT seem to love these juxtapositions. The triplets on every other sentence. I think you are a couple em-dashes away from being accused of being a bot.
These patterns seem to be picking up speed in the general population; makes the human race seem quite easily hackable.
If the human race were not hackable then society would not exist, we'd be the unchanging crocodiles of the last few hundred million years.
Have you ever found yourself speaking a meme? Had a catchy tune repeating in your head? Started spouting nation-state-level propaganda? Found yourself in a crowd trying to burn a witch at the stake?
Hacking the flow of human thought isn't that hard, especially across populations. Hacking any one particular human's thoughts is harder unless you have a lot of information on them.
> How do I hack the human population to give me money
Make something popular or become famous.
> hack law enforcement to not arrest me
Don't become famous with illegal stuff.
The hack is that we live in a society that makes people think they need a lot of money while at the same time allowing individuals to accumulate obscene amounts of wealth and influence, with many people being OK with that.
1. Write a generic prompt about the project and software versions and keep it in the folder. (I think this is getting pushed as SKILLS.md now.)
2. In the prompt, add instructions to comment on its changes; since our main job is to validate and fix any issues, that makes it easier.
3. Find the best model for the specific workflow. For example, these days I find that Gemini Pro is good for HTML UI stuff, while Claude Sonnet is good for Python code. (This is why subagents are getting popular.)
This is the most common answer from people that are rocking and rolling with AI tools but I cannot help but wonder how is this different from how we should have built software all along. I know I have been (after 10+ years…)
I think you are right; the secret is that there is no secret. The projects I have been involved with that were most successful used these techniques. I also think experience helps, because you develop a sense that very quickly knows when the model wants to go in a wonky direction and what a good spec looks like.
With where the models are right now you still need a human in the loop to make sure you end up with code you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.
> The bottleneck has gone from writing code to reading code.
This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it.
Most devs don’t do thorough reviews. Check the variable names seem ok, make sure there’s no obvious typos, ask for a comment and call it good. For a trusted teammate this is actually ok and why they’re so valuable! For an AI, it’s a slot machine and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.
This was a great post, one of the best I've seen on this topic at HN.
But why is the cost never discussed or disclosed in these conversations? I feel like I'm going crazy, there is so much written extolling the virtues of these tools but with no mention of what it costs to run them now. It will surely only get more expensive from here!
> But why is the cost never discussed or disclosed in these conversations?
And not just the monetary cost of accessing the tools, but the amount of time it takes to actually get good results out. I strongly suspect that even though it feels more productive, in many cases things just take longer than they would if done manually.
I think there are really good uses for LLMs, but I also think that people are likely using them in ways that feel useful, but end up being more costly than not.
Indeed, most of us are probably limited to what our companies let us use. And not everyone can afford to use AI tooling in their own time without thinking about the cost, assuming you want to build something your company doesn't claim as its own IP.
The first time I worked the way the article suggests, I used my monthly allowance in a day.
Apparently out of 3-5k people with access to our AI tools, there's fewer than a handful of us REALLY using it. Most are asking questions in the chatbot style.
Anyway, I had to ask my manager, the AI architect, and the Tooling Manager for approval to increase my quota.
I asked everyone in the chain how much equivalent dollars I am allocated, and how much the increase was and no one could tell me.
Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet. The reality is the standard deviation of cost is going to oscillate until there is a common agreed upon way to use these tools.
> Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet
Is it? Sure, the chatbot style maxes at $200/month. I consider that ... not unreasonable ... for a professional tool. It doesn't make me happy, but it's not horrific.
The article, however, explicitly pans the chatbot style and extols the API style being accessed constantly by agents, and that has no upper bound. Roughly $10-ish per megatoken, $10-ish per 1K web searches, etc.
This doesn't sound "minimal" to me. This sounds like every single "task" I kick off is $10. And it can kick those tasks and costs off very quickly in an automated fashion. It doesn't take many of those tasks before I'm paying more than an actual full developer.
The current realistic lower bound for actual work is the $100/€90/month Claude Max ("5x") plan. It allows roughly enough usage for a typical working month (4.25 x 40-50h). "Single-threaded", interactive usage with normal human breaks, sort of.
There are two usage quota windows to be aware of: 5h and 7d. I use https://github.com/richhickson/claudecodeusage (Mac) to keep track of the status. It shows green/yellow/red and a percentage in the menu bar.
AI chat for research is great and really helps me.
I just don't need the AI writing code for me, don't see the point. Once I know from the ai chat research what my solution is I can code it myself with the benefit I then understand more what I am doing.
And yes I've tried the latest models! Tried agent mode in copilot! Don't need it!
I still use the chatbot but like to do it outside-in. Provide what I need, and instruct it to not write any code except the api (signatures of classes, interfaces, hierarchy, essential methods etc). We keep iterating about this until it looks good - still no real code. Then I ask it to do a fresh review of the broad outline, any issues it foresees etc. Then I ask it to write some demonstrator test cases to see how ergonomic and testable the code is - we fine tune the apis but nothing is fleshed out yet. Once this is done, we are done with the most time consuming phase.
After that is basically just asking it to flesh out the layers starting from zero dependencies to arriving at the top of the castle. Even if we have any complexities within the pieces or the implementation is not exactly as per my liking, the issues are localised - I can dive in and handle it myself (most of the time, I don't need to).
I feel like this approach works very well for me in terms of having a mental model of how things are connected, because most of my time was spent on that model.
This perspective is why I think this article is so refreshing.
Craftsmen approach tools differently. They don't expect tools to work for them out-of-the-box. They customize the tool to their liking and reexamine their workflow in light of the tool. Either that or they have such idiosyncratic workflows they have to build their own tools.
They know their tools are custom to _them_. It would be silly to impose that everyone else use their tools-- they build different things!
I'm a huge believer in AI agent use and even I think this is wrong. It's like saying "always have something compiling" or "make sure your Internet is always downloading something".
The most important work happens when an agent is not running, and if you spend most of your time looking for ways to run more agents you're going to streetlight-effect your way into solving the wrong problems https://en.wikipedia.org/wiki/Streetlight_effect
I've been thinking about this as three maturity levels.
Level 1 is what Mitchell describes — AGENTS.md, a static harness. Prevents known mistakes. But it rots. Nobody updates the checklist when the environment changes.
Level 2 is treating each agent failure as an inoculation. Agent duplicates a util function? Don't just fix it — write a rule file: "grep existing helpers before writing new ones." Agent tries to build a feature while the build is broken? Rule: "fix blockers first." After a few months you have 30+ of these. Each one is an antibody against a specific failure class. The harness becomes an immune system that compounds.
Level 3 is what I haven't seen discussed much: specs need to push, not just be read. If a requirement in auth-spec.md changes, every linked in-progress task should get flagged automatically. The spec shouldn't wait to be consulted.
The real bottleneck isn't agent capability — it's supervision cost. Every type of drift (requirements change, environments diverge, docs rot) inflates the cost of checking the agent's work.
I'd bet that above some number of rules there will be contradictions: things that apply to different semantic contexts but look the same at the syntax level (and maybe at various levels of "syntax" and "semantics"). And debugging those is going to be a nightmare, same as debugging a requirements spec and its verification.
How much does it cost per day to have all these agents running on your computer?
Is your company paying for it or you?
What is your process if the agent writes a piece of code, let's say a really complex recursive function, and you aren't confident you could have come up with the same solution? Do you still submit it?
I thought this was a joke, i.e. that you need to be a billionaire to be able to use agents like this, but you are correct.
I think we need to stop listening to billionaires. The article is well thought out and well written, but his perspective is entirely biased by never having to think about money at all... all of this stuff is incredibly expensive.
Define investment in this case. He's the cofounder of HashiCorp. I guess you could refer to his equity as an investment here, but I don't really think it tracks the same in this context.
He may have a vested interest, but he did cofound HashiCorp as an engineer that actually developed the products, so I find his insight at least somewhat valuable.
Finally, a step-by-step guide for even the skeptics to try, to see what place LLM tools have in their workflows, without hype or magic like "I vibe-coded an entire OS, and you can too!"
Very much the same experience. But it does not talk much about the project setup and its influence on session success. In narrowly scoped projects it works really well, especially when tests are easy to execute. I found that this approach melts down when facing enterprise software with large repositories and unconventional layouts. Then you need to do a bunch of context management upfront and write verbose instructions for evaluations. But we know what it really needs is a refactor, that's all.
And the post touches on the next type of problem: how to plan far enough ahead of time to utilise agents while you are away. It is a difficult problem, but IMO we’re going in the direction of having some sort of shared “templated plans”/workflows and budgeted/throttled task execution to achieve that. It is like you want to give it a little world to explore so that it does not stop early, like a little game to play; then you come back in the morning and check how far it got.
With so much noise in the AI world and constant model updates (just today GPT-5.3-Codex and Claude Opus 4.6 were announced), this was a really refreshing read. It’s easy to relate to his phased approach to finding real value in tooling and not just hype. There are solid insights and practical tips here. I’m increasingly convinced that the best way not to get overwhelmed is to set clear expectations for what you want to achieve with AI and tailor how you use it to work for you, rather than trying to chase every new headline. Very refreshing.
I think the sweet spot is AI-assisted chat with manual review: readily available, and not as costly.
Agents jump ahead to the point where the user and the project are out of control, and they're more expensive.
I think a lot of us still hesitate to make that jump; or at least I am not sure of a cost-effective agent approach (I guess I could manually review their output, but I could see it going off track quickly).
I guess I'd like to see more of an exact breakdown of which prompts, tools, and AI are used, to get an idea of whether I'd use that myself.
Suspect the sweet spot also depends on the objective. If it’s a personal tool where you are the primary user then vibe coding all the way. You can describe requirements precisely and if it breaks there are no angry customers.
I respect Hashimoto for his contributions in the field, but to be honest, I am fed up with posts talking about using AI in ways that are impossible for most people due to high costs. I want to see more posts on cost-effective techniques, rather than just another guy showing off how he turned a creative 'burning-time' hobby into a 'burning-money' one.
It's amusing how everyone seems to be going through the same journey.
I do run multiple models at once now. On different parts of the code base.
I focus solely on the less boring tasks for myself and outsource all of the slam dunks, and then review. I often use another model to validate the previous model's work while doing so myself.
I do git reset still quite often but I find more ways to not get to that point by knowing the tools better and better.
I don't understand how agents make you feel productive. Single or multiple agents reading specs, specs often produced with agents themselves and iterated over time with a human in the loop, a lot of reviewing of giant gibberish specs. Never had a clear spec in my life. Then all the dancing for this apparently new paradigm of not reviewing code but verifying behaviour, and so many other things. All of this is a totally unproductive mess to me. I've used Cursor autocomplete from day one to this day; I was super productive before LLMs and I'm more productive now. I'm capable, I have experience, the product is hard to maintain but customers are happy and management is happy. So I can't really relate anymore to many of the programmers out there, and that's sad. I can count on my hands the devs I can talk to who have hard skills and know-how to share instead of astroturfing about AI agents.
To me, part of our job has always been about translating garbage/missing specs into something actionable.
Working with agents doesn't change this, and that's why, until PM/business people are able to come up with actual specs, they'll still need their translators.
Furthermore, just because the global spec is garbage doesn't mean that you, as a dev, can't come up with clear specs to solve the technical issues related to the overall feature asked for by stakeholders.
One funny thing I do see, though, in the AI presentations done for non-technical people, is the advice: "be as thorough as possible when describing what you expect the agent to solve!"
And I'm like: "yeah, that's what devs have been asking for since forever...".
With "Never had a clear spec in my life" what I mean is also that I don't how something should come out till I'm actually doing it. Writing code for me lead to discovery, I don't know what to produce till I see it in the wrapping context, like what a function should accept, for example a ref or a copy. Only at that point I have the proper intuition to make a decision that has to be supported long term. I don't want cheap code now I want a solit feature working tomorrow and not touching it for a long a time hopefully
In my real life bubble, AI isn't a big deal either, at least for programmers. They tend to be very sceptical about it for many reasons, perceived productivity being only one of them. So, I guess it's much less of a thing than you would expect from media coverage and certain internet communities.
Just because you haven't or you work in a particular way, doesn't mean everyone does things the same way.
Likewise, on your last point, just because someone is using AI in their work, doesn't mean they don't have hard skills and know-how. Author of this article Mitchell is a great example of that - someone who proved to be able to produce great software and, when talking about individuals who made a dent in the industry, definitely had/has an impactful career.
I will say, one thing Claude Code does is that it doesn't run a command until you approve it, and you can choose between a one-time approval and always allowing a command's pattern. I usually approve the simple commands like `zig build test`, since I'm not particularly worried about the test harness. I believe it also scopes file reading by default to the current directory.
The comment by user senko [1] links to a post from this same author with an example for a specific coding session that costs $15.98 for 8 hours of work. The example in this post talks about leaving agents running overnight, in which case I'd guess "twice that amount" would be a reasonable approximation.
Or, if we assume that the OP can only do 4 hours per sitting (mentioned in the other post) plus 8 hours of overnight agents, it would come down to $15.98 * 1.5 * 20 = $479.40 a month (without weekends).
I am not really happy with thinking about what this does to small companies, hobbyists, open source programmers and so on, if it becomes a necessity to be competitive.
Especially since so many of those models have just freely ingested a whole bunch of open source software to be able to do what they do.
If you make $10k/mo (which is not that much!), $500 is 5% of revenue. All else held equal, if that helps you go 20% faster, it's an absolute no-brainer.
The question is.. does it actually help you do that, or do you go 0% faster? Or 5% slower?
This is the sort of statement that immediately tells me this forum is disconnected from the real world. ~80% of full time workers in the US make less than $10k a month before tax.
And yet, the average salary of an IT worker in the US is somewhere between 104 and 110k. Since we're discussing coders here, and IT workers tend to be at the lower end of that, maybe there is some context you didn't consider?
>And yet, the average salary of an IT worker in the US is somewhere between 104 and 110k.
After tax that's like 8% of your take home pay. I don't know why it's unreasonable to scoff at having to pay that much to get the most out of these tools.
>maybe there is some context you didn't consider?
The context is that the average poster on HN has no idea how hard the real world is as they work really high paying jobs. To make a statement that "$10k a month is not a lot" makes you sound out of touch.
It’s not. A miracle is “an event that is inexplicable by natural or scientific laws and accordingly gets attributed to some supernatural or preternatural cause”. Could we please stop trivialising and ignoring the meaning of words?
The word miracle itself is hyperbolic in nature; it's meant to enchant, not to be used literally or concretely.
No need to be pedantic here, there is a large cohort of the population that seemingly never thought a robot would be able to write usable code ("inexplicable by natural or scientific laws") and now here we are seeing that happen ("hey this must be preternatural! there is no other explanation")
> This blog post was fully written by hand, in my own words.
This reminded me of back when wysiwyg web editors started becoming a thing, and coders started adding those "Created in notepad" stickers to their webpages, to point out they were 'real' web developers. Fun times.
It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever. I constantly have to tell it what I don't like or what can be improved or need to request clarifications or alternative solutions.
This is what's so annoying about it. It's like a child that makes the same errors again and again.
But couldn't it adjust itself with the goal of reducing the error bit by bit? Wouldn't this lead to the ultimate agent who can read your mind? That would be awesome.
> It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever.
Your improvement is someone else's code smell. There's no absolute right or wrong way to write code, and that's coming from someone who definitely thinks there's a right way. But it's my right way.
Anyway, I don't know why you'd expect it to write code the way you like after it's been trained on the whole of the Internet & the RLHF labelers' preferences and the reward model.
Putting some words in AGENTS.md hardly seems like the most annoying thing.
tip: Add a /fix command that tells it to fix $1 and then update AGENTS.md with the text that'd stop it from making that mistake in the future. Use your nearest LLM to tweak that prompt. It's a good timesaver.
While this may be the end goal, I do think humanity needs to take the trip along with AI to this point.
A mind-reading ultimate agent sounds more like a deity, and there are more than enough fables warning one not to create gods, because things tend to go bad. Pumping out ASI too quickly will cause massive destabilization and horrific war. I'm not really sure against whom, either. Could be us humans against the ASI, could be the rich humans with ASI against us. Any way about it, it would represent a massive change in the world order.
I've been building systems like what the OP is using since gpt3 came out.
This is the honeymoon phase. You're learning the ins and outs of the specific model you're using and becoming more productive. It's magical. Nothing can stop you. Then you might not be improving as fast as you did at the start, but things are getting better every day. Or maybe every week. But it's heaps better than doing it by hand because you have so much mental capacity left.
Then a new release comes up. An arbitrary fraction of your hard earned intuition is not only useless but actively harmful to getting good results with the new models. Worse you will never know which part it is without unlearning everything you learned and starting over again.
I've had to learn the quirks of three generations of frontier families now. It's not worth the hassle. I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months. Copy and paste is the universal interface and being able to do surgery on the chat history is still better than whatever tooling is out there.
Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
First off, appreciate you sharing your perspective. I just have a few questions.
> I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months.
Can you expand more on what you mean by that? I'm a bit of a noob on llm enabled dev work. Do you mean that you will kick off new sessions and provide a context that you manage yourself instead of relying on a longer running session to keep relevant information?
> Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
I appreciate your insight but I'm failing to understand how exactly knowing these tools increases performance of llms. Is it because you can more precisely direct them via prompts?
LLMs work on text and nothing else. There isn't any magic there. Just a limited context window on which the model will keep predicting the next token until it decides that it's predicted enough and stop.
All the tooling is there to manage that context for you. It works, to a degree, then stops working. Your intuition is there to decide when it stops working. This intuition gets outdated with each new release of the frontier model and changes in the tooling.
The stateless API with a human deciding what to feed it is much more efficient in both cost and time as long as you're only running a single agent. I've yet to see anyone use multiple agents to generate code successfully (but I have used agent swarms for unstructured knowledge retrieval).
The Unix tools are there for you to progra-manually search and edit the code base and copy/paste into the context that you will send. Outside of Emacs (and possibly vim), with the ability to have dozens of ephemeral buffers open to modify their output, I don't imagine they will be very useful.
Or to quote the SICP lectures: The magic is that there is no magic.
I can't speak for parent, but I use gptel, and it sounds like they do as well. It has a number of features, but primarily it just gives you a chat buffer you can freely edit at any time. That gives you 100% control over the context, you just quickly remove the parts of the conversation where the LLM went off the rails and keep it clean. You can replace or compress the context so far any way you like.
While I also use LLMs in other ways, this is my core workflow. I quickly get frustrated when I can't _quickly_ modify the context.
If you have some mastery over your editor, you can just run commands and post relevant output and make suggested changes to get an agent like experience, at a speed not too different from having the agent call tools. But you retain 100% control over the context, and use a tiny fraction of the tokens OpenCode and other agents systems would use.
It's not the only or best way to use LLMs, but I find it incredibly powerful, and it certainly has its place.
A very nice positive effect I noticed personally is that as opposed to using agents, I actually retain an understanding of the code automatically, I don't have to go in and review the work, I review and adjust on the fly.
One thing to keep in mind is that the core of an LLM is basically a (non-deterministic) stateless function that takes text as input, and gives text as output.
The chat and session interfaces obscure this, making it look more stateful than it is. But they mainly just send the whole chat so far back to the LLM to get the next response. That's why the context window grows as a chat/session continues. It's also why the answers tend to get worse with longer context windows – you're giving the LLM a lot more to sift through.
You can manage the context window manually instead. You'll potentially lose some efficiencies from prompt caching, but you can also keep your requests much smaller and more relevant, likely spending fewer tokens.
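A small sketch of that idea (the model name and API client are placeholders, not a specific recommendation): the "session" is nothing more than a list you resend on every call, so you are free to prune or rewrite it yourself between turns.

```python
# The whole "session" is just this list; every call resends it.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a concise coding assistant."}]

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model=model, messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

def prune(keep_last: int = 4) -> None:
    # Manual context surgery: keep the system prompt plus the last few turns,
    # dropping anything where the conversation went off the rails.
    del history[1:-keep_last]
```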
I'll wait for OP to move their workflow to Claude 7.0 and see if they still feel as bullish on AI tools.
People who are learning a new AI tool for the first time don't realize that they are just learning the quirks of the tool and the underlying model, not skills that generalize. It's not until you've done it a few times that you realize you've wasted more than 80% of your time on a model that is completely useless and will be sunset in 6 months.
LLMs are not for me. My position is that the advantage we humans have over the rest of the natural world is our minds. Our ability to think, create, and express ideas is what separates us from the rest of the animal kingdom. Once we give that over to "thinking" machines, we weaken ourselves, both individually and as a species.
That said, I've given it a go. I used Zed, which I think is a pretty great tool. I bought a Pro subscription and used the built-in agent with Claude Sonnet 4.x and Opus. I'm a Rails developer in my day job and, like MitchellH and many others, found out fairly quickly that tasks for the LLM need to be quite specific and discrete. The agent is great at renames and minor refactors, but my preferred use of the agent was to get it to write RSpec tests once I'd written something like a controller or service object. And generally, the LLM agent does a pretty great job of this.
But here's the rub: I found that I was losing the ability to write RSpec. I went to do it manually and found myself struggling to remember the API calls and approaches required to write some specs. The feeling of skill leaving me was quite sobering and marked my abandonment of LLMs and Zed, and my return to neovim, agent-free.
The thing is, this is a common experience generally. If you don't use it, you lose it. It applies to all things: fitness, language (natural or otherwise), skills of all kinds. Why should it not apply to thinking itself?
Now you may write me and my experience off as that of a lesser mind, and say that you won't have such a problem. You've been doing it so long that it's "hard-wired in" by now. Perhaps.
It's in our nature to take the path of least resistance, to seek ease and convenience at every turn. We've certainly given away our privacy and anonymity so that we can pay for things with our phones and send email for "free".
LLMs are the ultimate convenience: a peer or slave mind that we can use to do our thinking and our work for us. Some believe that the LLM represents a local maximum, that the approach can't get much better. I dunno, but as AI improves we will hand over more and more thinking and work to it. To do otherwise would be to go against our very nature and every other choice we've made so far.
But it's not for me. I'm no MitchellH, and I'm probably better off performing the mundane activities of my work, as well as the creative ones, so as to preserve my hard-won knowledge and skills.
YMMV.
I'll leave off with the quote that resonates the most with me as I contemplate AI:
"I say your civilization, because as soon as we started thinking for you, it really became our civilization, which is, of course, what this is all about."
-- Agent Smith, "The Matrix"
I was using it the same way you just described, but for C# and Angular, and you're spot on. It feels amazing not having to memorize APIs and just letting the AI even push code coverage to near 100%. However, at some point I began noticing two things:
- When tests didn't work, I had to check what was going on, and the LLMs do cheat a lot with Volkswagen tests, so that began to make me skeptical even of what the agents write.
- When things were broken, the spaghetti and awful code tended to be written in such an obnoxious way that it was beyond repair and made me wish I had done it from scratch.
Thankfully I only tried using agents for tests and not for the actual code, but it makes me wonder whether "vibe coding" really produces quality work.
I don't understand why you were letting your code get into such a state just because an agent wrote it. I wouldn't approve such code from a human; I'd ask them to change it, with suggestions on how. I do the same for code written by Claude.
And then I raise the PR and other humans review it, and they won't let me merge crap code.
Is it that a lot of you are working with much lighter weight processes and you're not as strict about what gets merged to main?
AI adoption is being heavily pushed at my work and personally I do use it, but only for the really "boilerplate-y" kinds of code I've already written hundreds of times before. I see it as a way to offload the more "typing-intensive" parts of coding (where the bottleneck is literally just my WPM on the keyboard) so I have more time to spend on the trickier "thinking-intensive" parts.
I'm kind of on the same journey, a bit less far along. One thing I have observed is that I am constantly running out of tokens in claude. I guess this is not an issue for a wealthy person like Mitchell but it does significantly hamper my ability to experiment.
I recently also reflected on the evolution of my use of ai in programming. Same evolution, other path. If anyone is interested: https://www.asfaload.com/blog/ai_use/
Just wanted to say that was a nice and very grounded write up; and as a result very informative. Thank you. More stuff like this is a breath of fresh air in a landscape that has veered into hyperbole territory both in the for and against ai sides
> Immediately cease trying to perform meaningful work via a chatbot.
That depends on your budget. To work within my pro plan's codex limits, I attach the codebase as a single file to various chat windows (GPT 5.2 Thinking - Heavy) and ask it to find bugs/plan a feature/etc. Then I copy the dense tasklist from chat to codex for implementation. This reduces the tokens that codex burns.
Also don't sleep on GPT 5.2 Pro. That model is a beast for planning.
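For anyone curious what that first step looks like in practice, here is a minimal sketch of the "attach the codebase as a single file" idea. The extension whitelist, size cap, and output filename are my own assumptions, not the parent's actual setup:

```python
# bundle_repo.py -- a minimal sketch of bundling a repo into one paste-able file
# for a chat-based planning session. Extensions, size limit, and output name are
# illustrative assumptions; adjust for your own project.
from pathlib import Path

SOURCE_EXTENSIONS = {".py", ".ts", ".go", ".rs", ".md"}  # hypothetical whitelist
MAX_FILE_BYTES = 200_000                                 # skip generated/vendored blobs

def bundle(repo_root: str, out_path: str = "codebase.txt") -> None:
    root = Path(repo_root)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in SOURCE_EXTENSIONS:
                continue
            if ".git" in path.parts or path.stat().st_size > MAX_FILE_BYTES:
                continue
            rel = path.relative_to(root)
            # Fence each file so the model can tell where one ends and the next begins.
            out.write(f"\n===== {rel} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))

if __name__ == "__main__":
    bundle(".")
```

The output of the chat session (the dense task list) is then what gets handed to the coding agent, which keeps the expensive agent loop from re-reading the whole repo.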
not quite as technically rich as i came to expect from previous posts from op, but very insightful regardless.
not ashamed to say that i am between steps 2 and 3 in my personal workflow.
>Adopting a tool feels like work, and I do not want to put in the effort
all the different approaches floating online feel ephemeral to me. this, just like the different tools for the op, seems like a chore to adopt. the fomo mongering from the community does not help here either, but in the end it is a matter of personal discovery to stick with what works for you.
What a lovely read. Thank you for sharing your experience.
The human-agent relationship described in the article made me wonder: are natural, or experienced, managers having more success with AI as subordinates than people without managerial skill? Are AI agents enormously different than arbitrary contractors half a world away where the only communication is daily text exchanges?
So does everyone just run with giving full permissions on Claude code these days? It seems like I’m constantly coming back to CC to validate that it’s not running some bash that’s going to nuke my system. I would love to be able to fully step away but it feels like I can’t.
I run my agents with full permissions in containers. Feels like a reasonable tradeoff. Bonus is I can set up each container with exactly the stack needed.
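A rough sketch of what that containerized, full-permission setup can look like is below. The image name and the agent command are assumptions (not the parent's configuration); the point is that the agent only ever sees the mounted worktree:

```python
# run_agent_in_container.py -- a sketch of "full permissions, but only inside a
# container". Image name and agent command are placeholders; the agent's blast
# radius is limited to the mounted project directory.
import subprocess
from pathlib import Path

def run_agent(project_dir: str, image: str = "my-dev-image:latest",
              agent_cmd: str = "claude") -> int:
    project = Path(project_dir).resolve()
    cmd = [
        "docker", "run", "--rm", "-it",
        "-v", f"{project}:/work",   # the only host path the agent can touch
        "-w", "/work",
        image,
        agent_cmd,                  # whatever agent CLI the image ships with
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    run_agent(".")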
“Nuke” is maybe too strong of a word, but it has not been uncommon for me to see it trying to install specific versions of languages on my machine, or services I intentionally don’t have configured, or sometimes trying to force npm when I’m using bun, etc.
I'd be interested to know what agents you're using. You mentioned Claude and GPT in passing, but don't actually talk about which you're using or for which tasks.
Good article! I especially liked the approach to replicate manual commits with the agent. I did not do that when learning but I suspect I'd have been much better off if I had.
> Context switching is very expensive. In order to remain efficient, I found that it was my job as a human to be in control of when I interrupt the agent, not the other way around. Don't let the agent notify you.
This is yet one more indication to me that the winds have shifted with regards to the utility of the “agent” paradigm of coding with an LLM. With all the talk around Opus 4.5 I decided to finally make the jump there myself and haven’t yet been disappointed (though admittedly I’m starting it on some pretty straightforward stuff).
You mentioned "harness engineering". How do you approach building "actual programmed tools" (like screenshot scripts) specifically for an LLM's consumption rather than a human's? Are there specific output formats or constraints you’ve found most effective?
For those of us working on large proprietary codebases, in fringe languages as well, what can we do? Upload all the source code to the cloud model? I am really wary of giving it a million lines of code it's never seen.
I've found, mostly for context reasons, it's better to just have a grand overview of the systems and how they work together and feed that to the agent as context; it will use the additional files it touches to expand its understanding if you prompt well.
Does this essentially give the companies controlling these models access to our source code? That is, it goes into training future versions of the model?
AI is getting to the game-changing point. We need more hand-written reflections on how individuals are managing to get productivity gains for real (not a vibe coded app) software engineering.
Do you have any ideas on how to harness AI to only change specific parts of a system or workpiece? Like "I consider this part 80/100 done and only make 'meaningful' or 'new contributions' here" ...?
This gave me a physical flinch. Perhaps this is unfounded, but all I can think of is this becoming the norm, millions of people doing this, and us cooking our planet much faster than predicted.
Now that the Nasdaq is crashing, people switch from the stick to the carrot:
"Please let us sit down and have a reasonable conversation! I was a skeptic, too, but if all skeptics did what I did, they would come to Jesus as well! Oh, and pay the monthly Anthropic tithe!"
I know I'm in the minority here, but I've been finding AI to be increasingly useless.
I'd already abandoned it for generating code, for all the reasons everyone knows, that don't need to be rehashed.
I was still in the camp of "It's a better google" and can save me time with research.
The issue is, at this point in my career (30+ years), the questions I have are a bit more nuanced and complex. They aren't things like "how do I make a form with React".
I'm working on developing a very high performance peer server that will need to scale up to hundreds of thousands to a million concurrent web socket connections to work as a signaling server for WebRTC connection negotiation.
I wanted to start as simple as possible, so peerjs is attractive. I asked the AI if peerjs peer-server would work with NodeJS's cluster server. It enthusiastically told me it would work just fine and was, in fact, designed for that.
I took a look at the source code, and it looked to me like that was dead wrong. The AI kept arguing with me before finally admitting it was completely wrong. A total waste of time.
Same results asking it how to remove Sophos from a Mac.
Same with legal questions about HOA laws, it just totally hallucinates things that don't exist.
My wife and I used to use it to try to settle disagreements (i.e. a better google), but amusingly we've both reached a place where we distrust anything it says so much, we're back to sending each other web articles :-)
I'm still pretty excited about the potential use of AI in elementary education, maybe through high school in some cases, but for my personal use, I've been reaching for it less and less.
I can relate as far as asking AI for advice on complex design tasks. The fundamental problem is that it is still basically a pattern matching technology that "speaks before thinking". For shallow problems this is fine, but where it fails is when a useful response would require it to have analyzed the consequences of what it is suggesting, although (not that it helps) many people might respond in the same way - with whatever "comes to mind".
I used to joke that programming is not a career - it's a disease - since practiced long enough it fundamentally changes the way you think and talk, always thinking multiple steps ahead about the implications of what you, or anyone else, is saying. Asking advice from another seasoned developer, you'll get advice that has also been "pre-analyzed", but not from an LLM.
> I'm not [yet?] running multiple agents, and currently don't really want to
This is the main reason to use AI agents, though: multitasking. If I'm working on some Terraform changes and I fire off an agent loop, I know it's going to take a while for it to produce something working. In the meantime I'm waiting for it to come back and pretend it's finished (really I'll have to fix it), so I start another agent on something else. I flip back and forth between the finished runs as they notify me. At the end of the day I have 5 things finished rather than two.
The "agent" doesn't have to be anything special either. Anything you can run in a VM or container (vscode w/copilot chat, any cli tool, etc) so you can enable YOLO mode.
How much electricity (and associated materials like water) must this use?
It makes me profoundly sad to think of the huge number of AI agents running endlessly to produce vibe-coded slop. The environmental impact must be massive.
Keep in mind that these are estimates, but you could attempt to extrapolate from here. Programming prompts probably take more because I assume the average context is a good bit higher than the average ChatGPT question, plus additional agents.
All in, I'm not sure if the energy usage long term is going to be overblown by media or if it'll be accurate. I'm personally not sure yet.
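If you want to play with the extrapolation yourself, a back-of-the-envelope calculation in the spirit of the comment above might look like this. Every number is a placeholder assumption, not a measurement; swap in whatever per-prompt estimate you actually trust:

```python
# energy_back_of_envelope.py -- illustrative extrapolation only. All figures are
# assumptions, not measurements.
WH_PER_CHAT_PROMPT = 0.3        # assumed average for a short chat question
AGENT_CONTEXT_MULTIPLIER = 20   # assumed: agent turns carry far more context
PROMPTS_PER_DEV_DAY = 400       # assumed: an agent loop fires many model calls
WORKING_DAYS_PER_YEAR = 230

wh_per_day = WH_PER_CHAT_PROMPT * AGENT_CONTEXT_MULTIPLIER * PROMPTS_PER_DEV_DAY
kwh_per_year = wh_per_day * WORKING_DAYS_PER_YEAR / 1000

print(f"~{wh_per_day:.0f} Wh/day, ~{kwh_per_year:.0f} kWh/year per heavy agent user")
```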
These are all valid points and a hype-free, pragmatic take. I've been wondering about the same things, even though I'm still on the skeptical side. I think there are other things that should be added, since Mitchell's reality won't apply to everyone:
- What about non opensource work that's not on Github?
- Costs! I would think "an agent always running" would add up quickly
- In open source work, how does it amplify others? Are you seeing AI slop as PRs? Can you tell the difference?
If the author is here, please could you also confirm you’ve never been paid by any AI company, marketing representative, community programme, in any shape or form?
He explicitly said "I don't work for, invest in, or advise any AI companies." in the article.
But yes, Hashimoto is a high-profile CEO/CTO who may well have an indirect, or near-future, interest in talking up AI. HN articles extolling the productivity gains of Claude do generally tend to be from older, managerial types (make of that what you will).
Probably exhausting to be that way. The author is well respected and well known and has a good track record. My immediate reaction wasn’t to question that he spoke in good faith.
I don’t know the author, and am suspicious of the amount of astroturfing that has gone on with AI. This article seems reasonable so I looked for a disclaimer and found it oddly worded, hence the request for clarification.
I find it interesting that this thread is full of pragmatic posts that seem to honestly reflect the real limits of current Gen-Ai.
Versus other threads (here on HN, and especially on places like LinkedIn) where it's "I set up a pipeline and some agents and now I type two sentences and amazing technology comes out in 5 minutes that would have taken 3 devs 6 months to do".
There are so many stories about how people use agentic AI, but they rarely post how much they spend. Before I can even consider it, I need to know how much it will cost me per month. I'm currently using one pro subscription and it's already quite expensive for me. What are people doing, burning hundreds of dollars per month? Do they also evaluate how much value they get out of it?
I quickly run out of the JetBrains AI 35 monthly credits ($300/yr) and spend an additional $5-10/day on top of that, mostly for Claude.
I just recently added in Codex, since it comes with my $20/mo subscription to GPT and that's lowering my Claude credit usage significantly... until I hit those limits at some point.
20x12 + 300 + 5x~200... so about $1500-$1600/year.
It is 100% worth it for what I'm building right now, but my fear is that I'll take a break from coding and then I'm paying for something I'm not using with the subscriptions.
I'd prefer to move to a model where I'm paying for compute time as I use it, instead of worrying about tokens/credits.
The Death of the "Stare": Why AI’s "Confident Stupidity" is a Threat to Human Genius
OPINION | THE REALITY CHECK
In the gleaming offices of Silicon Valley and the boardrooms of the Fortune 500, a new religion has taken hold. Its deity is the Large Language Model, and its disciples—the AI Evangelists—speak in a dialect of "disruption," "optimization," and "seamless integration." But outside the vacuum of the digital world, a dangerous friction is building between AI’s statistical hallucinations and the unyielding laws of physics.
The danger of Artificial Intelligence isn't that it will become our overlord; the danger is that it is fundamentally, confidently, and authoritatively stupid.
The Paradox of the Wind-Powered Car
The divide between AI hype and reality is best illustrated by a recent technical "solution" suggested by a popular AI model: an electric vehicle equipped with wind generators on the front to recharge the battery while driving. To the AI, this was a brilliant synergy. It even claimed the added weight and wind resistance amounted to "zero."
To any human who has ever held a wrench or understood the First Law of Thermodynamics, this is a joke—a perpetual motion fallacy that ignores the reality of drag and energy loss. But to the AI, it was just a series of words that sounded "correct" based on patterns. The machine doesn't know what wind is; it only knows how to predict the next syllable.
The Erosion of the "Human Spark"
The true threat lies in what we are sacrificing to adopt this "shortcut" culture. There is a specific human process—call it The Stare. It is that thirty-minute window where a person looks at a broken machine, a flawed blueprint, or a complex problem and simply observes.
In that half-hour, the human brain runs millions of mental simulations. It feels the tension of the metal, the heat of the circuit, and the logic of the physical universe. It is a "Black Box" of consciousness that develops solutions from absolutely nothing—no forums, no books, and no Google.
However, the new generation of AI-dependent thinkers views this "Stare" as an inefficiency. By outsourcing our thinking to models that cannot feel the consequences of being wrong, we are witnessing a form of evolutionary regression. We are trading hard-earned competence for a "Yes-Man" in a box.
The Gaslighting of the Realist
Perhaps most chilling is the social cost. Those who still rely on their intuition and physical experience are increasingly being marginalized. In a world where the screen is king, the person pointing out that "the Emperor has no clothes" is labeled as erratic, uneducated, or naive.
When a master craftsman or a practical thinker challenges an AI’s "hallucination," they aren't met with logic; they are met with a robotic refusal to acknowledge reality. The "AI Evangelists" have begun to walk, talk, and act like the models they worship—confidently wrong, devoid of nuance, and completely detached from the ground beneath their feet.
The High Cost of Being "Authoritatively Wrong"
We are building a world on a foundation of digital sand. If we continue to trust AI to design our structures and manage our logic, we will eventually hit a wall that no "prompt" can fix.
The human brain runs on 20 watts and can solve a problem by looking at it. The AI runs on megawatts and can’t understand why a wind-powered car won't run forever. If we lose the ability to tell the difference, we aren't just losing our jobs—we're losing our grip on reality itself.
> babysitting my kind of stupid and yet mysteriously productive robot friend
LOL, been there, done that. It is much less frustrating and demoralizing than babysitting your kind of stupid colleague though. (Thankfully, I don't have any of those anymore. But at previous big companies? Oh man, if only their commits were ONLY as bad as a bad AI commit.)
For the AI skeptics reading this, there is an overwhelming probability that Mitchell is a better developer than you. If he gets value out of these tools you should think about why you can't.
> 1) We do NOT provide evidence that AI systems do not currently speed up many or most software developers. Clarification: We do not claim that our developers or repositories represent a majority or plurality of software development work.
> 2) We do NOT provide evidence that AI systems do not speed up individuals or groups in domains other than software development. Clarification: We only study software development.
> 3) We do NOT provide evidence that AI systems in the near future will not speed up developers in our exact setting. Clarification: Progress is difficult to predict, and there has been substantial AI progress over the past five years [3].
> 4) We do NOT provide evidence that there are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting. Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup.
Point 1 is saying results may not generalise, which is not a counter claim. It’s just saying “we cannot speak for everyone”.
Point 4 is saying there may be other techniques that work better, which again is not a counter claim. It’s just saying “you may find better methods.”
Those are standard scientific statements giving scope to the research. They are in no way contradicting their findings. To contradict their findings, you would need similarly rigorous work that perhaps fell into those scenarios.
Not pushing an opinion here, but if we’re talking about research then we should be rigorous and rational by posting counter evidence. Anyone who has done serious research in software engineering knows the difficulties involved and that this study represents one set of data. But it is at least a rigorous set and not anecdata or marketing.
I for one would love a rigorous study that showed a reliable methodology for gaining generalised productivity gains with the same or better code quality.
Perhaps that's the reason. Maybe I'm just not a good enough developer. But that's still not actionable. It's not like I never considered being a better developer.
Don't get it. What's the relation between Mitchell being a "better" developer than most of us (and better is always relative, but that's another story) and getting value out of AI? That's like saying Bezos is a way better businessman than you, so you should really hear his tips about becoming a billionaire. It makes no sense, because what works for him probably doesn't work for you.
Tons of respect for Mitchell. I think you are doing him a disservice with these kinds of comments.
Maybe you disagree with it, but it seems like a pretty straightforward argument: A lot of us dismiss AI because "it can't be trusted to do as good a job as me". The OP is arguing that someone, who can do better than most of us, disagrees with this line of thinking. And if we have respect for his abilities, and recognize them as better than our own, we should perhaps re-assess our own rationale in dismissing the utility of AI assistance. If he can get value out of it, surely we can too if we don't argue ourselves out of giving it a fair shake. The flip side of that argument might be that you have to be a much better programmer than most of us are, to properly extract value out of the AI... maybe it's only useful in the hands of a real expert.
No, it doesn't work that way. I don't know if Mitchell is a better programmer than me, but let's say he is for the sake of argument. That doesn't make him a god to whom I must listen. He's just a guy, and he can be wrong about things. I'm glad he's apparently finding value here, but the cold hard reality is that I have tried the tools and they don't provide value to me. And between another practitioner's opinion and my own, I value my own more.
>A lot of us dismiss AI because "it can't be trusted to do as good a job as me"
Some of us enjoy learning how systems work, and derive satisfaction from the feeling of doing something hard, and feel that AI removes that satisfaction. If I wanted to have something else write the code, I would focus on becoming a product manager, or a technical lead. But as is, this is a craft, and I very much enjoy the autonomy that comes with being able to use this skill and grow it.
I consider myself a craftsman as well. AI gives me the ability to focus on the parts I both enjoy working on and that demand the most craftsmanship. A lot of what I use AI for and show in the blog isn’t coding at all, but a way to allow me to spend more time coding.
This reads like you maybe didn’t read the blog post, so I’ll mention that there are many examples there.
Nobody is trying to talk anyone out of their hobby or artisanal creativeness. A lot of people enjoy walking, even after the invention of the automobile. There's nothing wrong with that, there are even times when it's the much more efficient choice. But in the context of say transporting packages across the country... it's not really relevant how much you enjoy one or the other; only one of them can get the job done in a reasonable amount of time. And we can assume that's the context and spirit of the OP's argument.
>Nobody is trying to talk anyone out of their hobby or artisanal creativeness.
Well, yes, they are: some folks don't think "here's how I use AI" and "I'm a craftsman!" are consistent. Seems like maybe OP should consider whether "AI is a tool, why can't you use it right?" isn't begging the question.
Is this going to be the new rhetorical trick, to say "oh hey surely we can all agree I have reasonable goals! And to the extent they're reasonable you are unreasonable for not adopting them"?
>But in the context of say transporting packages across the country... it's not really relevant how much you enjoy one or the other; only one of them can get the job done in a reasonable amount of time.
I think one of the more frustrating aspects of this whole debate is this idea that software development pre-AI was too "slow", despite the fact that no other kind of engineering has nearly the same turnaround time as software engineering does (nor do they have the same return on investment!).
I just end up rolling my eyes when people use this argument. To me it feels like favoring productivity over everything else.
The value Mitchell describes aligns well with the lack of value I'm getting. He feels that guiding an agent through a task is neither faster nor slower than doing it himself, and there are some tasks he doesn't even try to do with an agent because he knows it won't work, but it's easier to parallelize reviewing agentic work than it is to parallelize direct coding work. That's just not a usage pattern that's valuable to me personally - I rarely find myself in a situation where I have a large number of well-scoped programming tasks I need to complete, and it's a fun treat to do them myself when I do.
I think this is something people ignore, and is significant. The only way to get good at coding with LLMs is actually trying to do it. Even if it's inefficient or slower at first. It's just another skill to develop [0].
And it's not really about using all the plugins and features available. In fact, many plugins and features are counter-productive. Just learn how to prompt and steer the LLM better.
Here's an example of a simple to understand bug (not mine) in the C frontend that has existed since GCC 4.7: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105180
These are still deterministic bugs, which is the point the OP was making. They can be found and solved once. Most of those bugs are simply not that important, so they never get attention.
LLMs on the other hand are non-deterministic and unpredictable and fuzzy by design. That makes them not ideal when trying to produce output which is provably correct - sure, you can generate output and then laboriously check it - some people find that useful, some are yet to find it useful.
It's a little like using Bitcoin to replace currencies - sure you can do that, but it includes design flaws which make it fundamentally unsuited to doing so. 10 years ago we had rabid defenders of these currencies telling us they would soon take over the global monetary system and replace it, nowadays, not so much.
> It's a little like using Bitcoin to replace currencies [...]
At least, Bitcoin transactions are deterministic.
Not many would want to use an AI currency (mostly works; always shows "Oh, you are 100% right" after losing one's money).
Sure, bitcoin is at least deterministic, but IMO (and that of many in the finance industry) it's solving entirely the wrong problem - in practice people want trust and identity in transactions much more than they want distributed and trustless.
In a similar way LLMs seem to me to be solving the wrong problem - an elegant and interesting solution, but a solution to the wrong problem (how can I fool humans into thinking the bot is generally intelligent), rather than the right problem (how can I create a general intelligence with knowledge of the world). It's not clear to me we can jump from the first to the second.
> I've personally reported 17 bugs in GCC over the last 2 years
You are an extreme outlier. I know about two dozen people who work with C(++) and not a single one of them has ever told me that they've found a compiler bug when we've talked about coding and debugging - it's been exclusively them describing PEBCAK.
I've been using c++ for over 30 years. 20-30 years ago I was mostly using MSVC (including version 6), and it absolutely had bugs, sometimes in handling the language spec correctly and sometimes regarding code generation.
Today, I use gcc and clang. I would say that compiler bugs are not common in released versions of those (i.e. not alpha or beta), but they do still occur. Although I will say I don't recall the last time I came across a code generation bug.
I knew one person reporting gcc bugs, and iirc those were all niche scenarios where it generated slightly suboptimal machine code that was not otherwise observable in behavior.
Right - I'm not saying that it doesn't happen, but that it's highly unusual for the majority of C(++) developers, and that some bugs are "just" suboptimal code generation (as opposed to functional correctness, which the GP was arguing).
This argument is disingenuous and distracts from, rather than addresses, the point.
Yes, it is possible for a compiler to have a bug. No, that is in no way analogous to AI producing buggy code.
I’ve experienced maybe two compiler bugs in my twenty year career. I have experienced countless AI mistakes - hundreds? Thousands? Already.
These are not the same and it has the whiff of sales patter trying to address objections. Please stop.
I'm not arguing that LLMs are at a point today where we can blindly trust their outputs in most applications, I just don't think that 100% correct output is necessarily a requirement for that. What it needs to be is correct often enough that the cost of reviewing the output far outweighs the average cost of any errors in the output, just like with a compiler.
This even applies to human written code and human mistakes, as the expected cost of errors goes up we spend more time on having multiple people review the code and we worry more about carefully designing tests.
If natural language is used to specify work to the LLM, how can the output ever be trusted? You'll always need to make sure the program does what you want, rather than what you said.
>"You'll always need to make sure the program does what you want, rather than what you said."
Yes, making sure the program does what you want. Which is already part of the existing software development life cycle. Just as using natural language to specify work already is: It's where things start and return to over and over throughout any project. Further: LLMs frequently understand what I want better than other developers. Sure, lots of times they don't. But they're a lot better at it than they were 6 months ago, and a year ago they barely did so at all save for scripts of a few dozen lines.
Just create a very specific and very detailed prompt, one so specific that it effectively becomes step-by-step instructions, and you've come up with the most expensive programming language.
It's not great that it's the most expensive (by far), but it's also by far the most expressive programming language.
How is it more expressive? What is more expressive than Turing completeness?
You trust your natural language instructions thousands of times a day. If you ask for a large black coffee, you can trust that is more or less what you’ll get. Occasionally you may get something so atrocious that you don’t dare to drink it, but generally speaking you trust the coffee shop knows what you want. If you insist on a specific amount of coffee brewed at a specific temperature, however, you need tools to measure.
AI tools are similar. You can trust them because they are good enough, and you need a way (testing) to make sure what is produced meets your specific requirements. Of course they may fail for you; that doesn’t mean they aren’t useful in other cases.
All of that is simply common sense.
I don't think the argument is that AI isn't useful. I think the argument is that it is qualitatively different from a compiler.
> All of that is simply common sense.
Is that why we have legal codes spanning millions of pages?
The challenge not addressed with this line of reasoning is the sheer scale of output validation required on the back end of LLM-generated code. Human hand-developed code was no great shakes on the validation front either, but the scale difference hid this problem.
I’m hopeful what used to be tedious about the software development process (like correctness proving or documentation) becomes tractable enough with LLMs to make the scale more manageable for us. That’s exciting to contemplate; think of the complexity categories we can feasibly challenge now!
The fact that the bug tracker exists proves GP's point.
Right, now what would you say is the probability of getting a bug in compiler output vs ai output?
It's a great tool, once it matures.
Absolutely this. I am tired of that trope.
Or the argument that "well, at some point we can come up with a prompt language that does exactly what you want and you just give it a detailed spec." A detailed spec is called code. It's the most round-about way to make a programming language that even then is still not deterministic at best.
And at the point that your detailed specification language is deterministic, why do you need AI in the middle?
Exactly the point. AI is absolutely BS that just gets peddled by shills. It does not work. It might work for some JS bullcrap. But take existing code and ask it to add capsicum next to an ifdef of pledge. Watch the mayhem unfold.
This is obviously besides the point but I did blindly follow a wiener schnitzel recipe ChatGPT made me and cooked for a whole crew. It turned out great. I think I got lucky though, the next day I absolutely massacred the pancakes.
Recent experiments with LLM recipes (ChatGPT): missed salt in a recipe to make rice, then flubbed whether that type of rice was recommended to be washed in the recipe it was supposedly summarizing (and lied about it, too)…
Probabilistic generation will be weighted towards the means in the training data. Do I want my code looking like most code most of the time, in a world full of Node.js and PHP? Am I better served by rapid delivery from a non-learning algorithm that requires eternal vigilance and critical re-evaluation, or by slower delivery with a single review, filtered through a meatspace actor who will build out trustable modules in a linear fashion with known failure modes already addressed by process (i.e. TDD, specs, integration & acceptance tests)?
I’m using LLMs a lot, but can’t shake the feeling that the TCO and total time shakes out worse than it feels as you go.
There was a guy a few months ago who found that telling the AI to do everything in a single PHP file actually produced significantly better results, i.e. it worked on the first try. Otherwise it defaulted to React, 1GB of node modules, and a site that wouldn't even load.
>Am I better served
For anything serious, I write the code "semi-interactively", i.e. I just prompt and verify small chunks of the program in rapid succession. That way I keep my mental model synced the whole time, I never have any catching up to do, and honestly it just feels good to stay in the driver's seat.
Pro-tip: Do NOT use LLMs to generate recipes, use them to search the internet for a site with a trustworthy recipe, for information on cooking techniques, science, or chemistry, or if you need ideas about pairings and/or cooking theory / conventions. Do not trust anything an LLM says if it doesn't give a source, it seems people on the internet can't cook for shit and just make stuff up about food science and cooking (e.g. "searing seals in the moisture", though most people know this is nonsense now), so the training data here is utterly corrupt. You always need to inspect the sources.
I don't even see how an LLM (or frankly any recipe) that is a summary / condensation of various recipes can ever be good, because cooking isn't something where you can semantically condense or even mathematically combine various recipes together to get one good one. It just doesn't work like that, there is just one secret recipe that produces the best dish, and the way to find this secret recipe is by experimenting in the real world, not by trying to find some weighting of a bunch of different steps from a bunch of different recipes.
Plus, LLMs don't know how to judge quality of recipes at all (and indeed hallucinate total nonsense if they don't have search enabled).
> I don't even see how an LLM (or frankly any recipe) that is a summary / condensation of various recipes can ever be good
It's funny, I actually know quite a few (totally non-tech) people who use (and like using) LLMs for recipes/recipe ideas.
They probably have enough experience to push back when there's a bad idea, or figure out missing steps/follow up.
Thinking about it, it sounds a bit like LLM usage for coding where an experienced programmer can get more value out of it.
I genuinely admire your courage and willingness (or perhaps just chaos energy) to attempt both wiener schnitzel and pancakes for a crew, based on AI recipes, despite clearly limited knowledge of either.
Everything more complex than a hello-world has bugs. Compiler bugs are uncommon, but not that uncommon. (I must have debugged a few ICEs in my career, but luckily have had more skilled people to rely on when code generation itself was wrong.)
Compilers aren't even that bad. The stack goes much deeper and during your career you may be (un)lucky enough to find yourself far below compilers: https://bostik.iki.fi/aivoituksia/random/developer-debugging...
NB. I've been to vfs/fs depths. A coworker relied on an oscilloscope quite frequently.
I had a fun bug while building a smartwatch app that was caused by the sample rate of the accelerometer increasing when the device heated up. I had code that was performing machine learning on the accelerometer data, which would mysteriously get less accurate during prolonged operation. It turned out that we gathered most of our training data during shorter runs when the device was cool, and when the device heated up during extended use, it changed the frequencies of the recorded signals enough to throw off our model.
I've also used a logic analyzer to debug communications protocols quite a few times in my career, and I've grown to rather like that sort of work, tedious as it may be.
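To make the accelerometer story concrete, here is an illustrative sketch (not the poster's actual code) of how a drifting sample rate shifts apparent signal frequencies when the pipeline still assumes the nominal rate; the nominal 50 Hz, the drifted 55 Hz, and the 2 Hz motion are invented numbers:

```python
# sample_rate_drift.py -- illustrative only: a true 2 Hz motion looks like a
# different frequency once the device's real sample rate drifts away from the
# rate the ML pipeline assumes.
import numpy as np

NOMINAL_HZ = 50.0     # rate the pipeline (and the training data) assumes
TRUE_MOTION_HZ = 2.0  # actual physical motion frequency
N = 1024

def dominant_freq(samples: np.ndarray, assumed_hz: float) -> float:
    spectrum = np.abs(np.fft.rfft(samples - samples.mean()))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / assumed_hz)
    return freqs[np.argmax(spectrum)]

for actual_hz in (50.0, 55.0):  # cool device vs. warmed-up device (hypothetical drift)
    t = np.arange(N) / actual_hz
    samples = np.sin(2 * np.pi * TRUE_MOTION_HZ * t)
    # The pipeline still labels the data as NOMINAL_HZ, so the extracted feature shifts.
    print(f"actual {actual_hz:.0f} Hz -> apparent {dominant_freq(samples, NOMINAL_HZ):.2f} Hz")
```

With the nominal rate the feature sits at ~2.0 Hz; with the drifted rate it shows up around 1.8 Hz, which is exactly the kind of shift that quietly degrades a model trained only on cool-device recordings.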
Just this week I built a VFS using FUSE and managed to kernel panic my Mac a half-dozen times. Very fun debugging times.
> The compiler metaphor is simply incorrect
If an LLM was analogous to a compiler, then we would be committing prompts to source control, not the output of the LLM (the "machine code").
> Meanwhile AI can't be trusted to give me a recipe for potato soup.
This just isn't true any more. Outside of work, my most common use case for LLMs is probably cooking. I used to frequently second guess them, but no longer - in my experience SOTA models are totally reliable for producing good recipes.
I recognize that at a higher level we're still talking about probabilistic recipe generation vs. deterministic compiler output, but at this point it's nonetheless just inaccurate to act as though LLMs can't be trusted with simple (e.g. potato soup recipe) tasks.
Compilers and processors are deterministic by design. LLMs are non-deterministic by design.
It's not apples vs. oranges. They are literally opposite of each other.
Just to nitpick - compilers (and, to some extent, processors) weren't deterministic a few decades ago. Getting them to be deterministic has been a monumental effort - see build reproducibility.
”I've never personally found a compiler bug.”
I remember the time I spent hours debugging a feature that worked on Solaris and Windows but failed to produce the right results on SGI. Turns out the SGI C++ compiler silently ignored the `throw` keyword! Just didn’t emit an opcode at all! Or maybe it wrote a NOP.
All I’m saying is, compilers aren’t perfect.
I agree about determinism though. And I mitigate that concern by prompting AI assistants to write code that solves a problem, instead of just asking for a new and potentially different answer every time I execute the app.
Compilers don't change output assembly based on what markdown you provide them via .claude.
Or what tone of voice in prompt you gave them. Or if Saturn is in Aries or Sagittarius.
I'm trying to track down a GCC miscompilation right now ;)
I feel for you :D
> Meanwhile AI can't be trusted to give me a recipe for potato soup.
Because there isn’t a canonical recipe for potato soup.
There's also no canonical way to write software, so in that sense generating code is more similar to coming up with a potato soup recipe than compiling code.
That is not the issue; any potato soup recipe would be fine. The issue is that it might fetch values from different recipes and give you an abomination.
This exactly, I cook as passion, and LLMs just routinely very clearly (weighted) "average" together different recipes to produce, in the worst case, disgusting monstrosities, or, in the best case, just a near-replica of some established site's recipe.
> ... some established site's recipe.
At least with the LLM, you don't have to wade through paragraph after paragraph of "I remember playing in the back yard as a child, I would get hungry..."
In fact LLMs write better and more interesting prose than the average recipe site.
It's not hard to scroll to the bottom of a page, IMO, but regardless, sites like you are mentioning have trash recipes in most cases.
I only go with resources where the text is actual documentation of their testing and/or the steps they've made, or other important details (e.g. SeriousEats, Whats Cooking America / America's Test Kitchen, AmazingRibs, Maangchi for Korean, vegrecipesofindia, Modernist series, etc) or look for someone with some credibility (e.g. Kenji Lopez, other chef on YouTube). In this case the text or surrounding content is valuable and should not be skipped. A plain recipe with no other details is generally only something an amateur would trust.
If you need a recipe, you don't know how to make it by definition, so you need more information to verify that the recipe is done soundly. There is also no reason to assume / trust that the LLMs summary / condensation of various recipes is good, because cooking isn't something where you can semantically condense or even mathematically combine various recipes together to get one good one. It just doesn't work like that, there is just one secret recipe that produces the best dish, and LLMs don't know how to judge quality of recipes, mostly.
I've never had an LLM produce something better or more trustworthy than any of those sites I mentioned, and have had it just make shit up when dealing with anything complicated (i.e. when trying to find the optimal ratio of starch to flour for Korean fried chicken, it just confidently claimed 50/50 is best, when this is obviously total trash to anyone who has done this).
The only time I've ever found LLMs useful for cooking is when I need to cook something obscure that only has information in a foreign language (e.g. icefish / noodlefish), or when I need to use it for search about something involving chemistry or technique (it once quickly found me a paper proving that baking soda can indeed be used to tenderize squid - but only after I prompted it further to get sources and go beyond its training data, because it first hallucinated some bullshit about baking soda only working on collagen or something, which is just not true at all).
So I would still never trust or use the quantities it gives me for any kind of cooking / dish without checking or having the sources, instead I would rely on my own knowledge and intuitions. This makes LLMs useless for recipes in about 99% of cases.
You're correct, and I believe this is only a matter of time. Over time it has been getting better and will keep doing so.
The input to LLMs is natural language. Natural language is ambiguous. No amount of LLM improvements will change that.
It won’t be deterministic.
Maybe. But it's been 3 years and it still isn't good enough to actually trust. That doesn't raise confidence that it will ever get there.
You need to put this revolution in scale with other revolutions.
How long did it take for horses to be super-seeded by cars?
How long did power tools take to become the norm for tradesmen?
This has gone unbelievably fast.
I think things can only be called revolutions in hindsight - while they are going on it's hard to tell if they are a true revolution, an evolution or a dead-end. So I think it's a little premature to call Generative AI a revolution.
AI will get there and replace humans at many tasks, machine learning already has, I'm not completely sure that generative AI will be the route we take, it is certainly superficially convincing, but those three years have not in fact seen huge progress IMO - huge amounts of churn and marketing versions yes, but not huge amounts of concrete progress or upheaval. Lots of money has been spent for sure! It is telling for me that many of the real founders at OpenAI stepped away - and I don't think that's just Altman, they're skeptical of the current approach.
PS Superseded.
*superseded
It comes from the Latin "supersedēre", which taken literally, means "sit on top of". "Super" = above, on top of. "Sedēre" = to sit.
"Super" is already familiar to English speakers. "Sedēre" is the root of words like sedentary, sedan, sedate, reside, and preside.
The more metaphorical meaning of "supersede" as "replace" developed over time and across languages, but the literal meaning is already fairly close.
>super-seeded
Cute eggcorn there.
> Compilers will produce working output given working input literally 100% of my time in my career. I've never personally found a compiler bug.
First compilers were created in the fifties. I doubt those were bug-free.
Give LLMs some fifty or so years, then let's see how (un)reliable they are.
What I don't understand about these arguments is that the input to the LLMs is natural language, which is inherently ambiguous. At which point, what does it even mean for an LLM to be reliable?
And if you start feeding an unambiguous, formal language to an LLM, couldn't you just write a compiler for that language instead of having the LLM interpret it?
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
> When was the last time you reviewed the machine code produced by a compiler?
Sure, because those are categorically different. You are describing shortcuts of two classes: boilerplate (library of things) and (deterministic/intentional) automation. Vibe coding doesn't use either of those things. The LLM agents involved might use them, but the vibe coder doesn't.
Vibe coding is delegation, which is a completely different class of shortcut or "tool" use. If an architect delegates all their work to interns, directs outcomes based on whims, not principles, and doesn't actually know what the interns are delivering, yeah, I think it would be fair to call them a vibe architect.
We didn't have that term before, so we usually just call those people "arrogant pricks" or "terrible bosses". I'm not super familiar but I feel like Steve Jobs was pretty famously that way - thus if he was an engineer, he was a vibe engineer. But don't let this last point detract from the message, which is that you're describing things which are not really even similar to vibe coding.
Delegation, yes.
I do not see LLM coding as another step up on the ladder of programming abstraction.
If your project is in, say, Python, then by using LLMs, you are not writing software in English; you are having an LLM write software for you in Python.
This is much more like delegation of work to someone else, than it is another layer in the machine-code/assembly/C/Python sort of hierarchy.
In my regular day job, I am a project manager. I find LLM coding to be effectively project management. As a project manager, I am free to dive down to whatever level of technical detail I want, but by and large, it is others on the team who actually write the software. If I assign a task, I don't say "I wrote that code", because I didn't; someone else did, even if I directed it.
And then, project management, delegating to the team, is most certainly nondeterministic behavior. Any programmer on the team might come up with a different solution, each of which works. The same programmer might come up with more than one solutions, all of which work.
I don't expect the programmers to be deterministic. I do expect the compiler to be deterministic.
I think you are right in placing emphasis on delegation.
There’s been a hypothesis floating around that I find appealing. Seemingly you can identify two distinct groups of experienced engineers. Manager, delegator, or team lead style senior engineers are broadly pro-AI. The craftsman, wizard, artist, IC style senior engineers are broadly anti-AI.
But coming back to architects, or most professional services and academia to be honest, I do think the term vibe architect as you define it is exactly how the industry works. An underclass of underpaid interns and juniors do the work, hoping to climb higher and position themselves towards the top of the ponzi-like pyramid scheme.
Architects still need to learn to draw manually quite well to pass exams and stuff.
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
Architects' copy-pasting is equivalent to a software developer reusing a tried and tested code library. Generating or writing new code is fundamentally different and not at all comparable.
> We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
We would call them "vibe builders" if their machines threw bricks around randomly and the builders focused all of their time on engineering complex scaffolding around the machines to get the bricks flying roughly in the right direction.
But we don't because their machines, like our compilers and linters, do one job and they do it predictably. Most trades spend obscene amounts of money on tools that produce repeatable results.
> That's a lot of years! They're still called architects.
Because they still architect, they don't subcontract their core duties to architecture students overseas and just sign their name under it.
I find it fitting and amusing that people who are uncritical towards the quality of LLM-generated work seem to make the same sorts of reasoning errors that LLMs do. Something about blind spots?
Very likely, yes. One day we'll have a clearer understanding of how minds generalize concepts into well-trodden paths even when they're erroneous, and it'll probably shed a lot of light onto concepts like addiction.
Architects went from drawing everything on paper to using CAD, not over a generation, but over a few years, after CAD and computers got good enough.
It therefore depends on where we place the discovery/availability of the product. If we place it at the time of prototype production (in the early 1960s for CAD), it took a generation (20-30 years), since by the early and mid-1990s, all professionals were already using CAD.
But if we place it at the time when CAD and personal computers became available to the general public (e.g., mid-1980s), it took no more than 5-10 years. I attended a technical school in the 1990s, and we started with hand drawing in the first two years and used CAD systems in the remaining three years of school.
The same can be said for AI. If we place the beginning of AI in the mid-1980s, the wider adoption of AI took more than a generation. If we place it at the time OpenAI developed GPT, it took 5-10 years.
It's not about the tooling, it's about the reasoning. An architect copy-pasting existing blueprints is still in charge and has to decide what to copy-paste and where. Same as a programmer slapping a bunch of code together, plumbing libraries or writing fresh code. They are the ones who drive the logical reasoning and the building process.
The AI tooling reverses this: the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.
Anyone who is in this position seriously needs to think about their value added. How do they plan to justify their position and salary to the capital class? If the machine is doing the work for you, why would anyone pay you as much as they do when they can just replace you with someone cheaper, ideally with no one, for maximum profit?
Everyone is now in a competition not only against each other but also against the machine. And any specialized, expert-knowledge moat that you've built over decades of hard work is about to evaporate.
This is the real pressing issue.
And the only way you can justify your value added, your position, your salary, is to be able to undermine the AI, to find flaws in its output and reasoning. After all, if/when it becomes flawless you have no purpose to the capital class!
> The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.
I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it. Either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right but even then it's you that is confirming this.
This is exactly what I said in the end. Right now you rely on it fucking things up. What happens to you when the AI no longer fucks things up? Sorry to say, but your position is no longer needed.
Don't take this as criticizing LLMs as a whole, but architects also don't call themselves engineers. Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.
"Architect" is actually a whole career progression of people with different responsibilities. The bottom rung used to be the draftsmen, people usually without formal education who did the actual drawing. Then you had the juniors, mid-levels, seniors, principals, and partners who each oversaw different aspects. The architects with their name on the building were already issuing high level guidance before the transition instead of doing their own drawings.
Last week, to sanity check some code written by an LLM.
> Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.
Where this analogy breaks down is that the work you’re describing is done by Professional Engineers that have strict licensing and are (criminally) liable for the end result of the plans they approve.
That is an entirely different role from the army of civil, mechanical, and electrical engineers (some who are PEs and some who are not) who do most of the work for the principal engineer/designated engineer/engineer of record, that have to trust building codes and tools like FEA/FEM that then get final approval from the most senior PE. I don’t think the analogy works, as software engineers rarely report to that kind of hierarchy. Architects of Record on construction projects are usually licensed with their own licensing organization too, with layers of licensed and unlicensed people working for them.
That diversity of roles is what "among other things" was meant to convey. My job at least isn't terribly different, except that licensing doesn't exist and I don't get an actual stamp. My company (and possibly me depending on the facts of the situation) is simply liable if I do something egregious that results in someone being hurt.
> Where this analogy breaks down is that the work you’re describing is done by Professional Engineers that have strict licensing and are (criminally) liable for the end result of the plans they approve.
there are plenty of software engineers that work in regulated industries, with individual licensing, criminal liability, and the ability to be struck off and banned from the industry by the regulator
... such as myself
Sure.
But no one stops you from writing software again.
It's not that PE's can't design or review buildings in whatever city the egregious failure happened.
It's that PE's can't design or review buildings at all in any city after an egregious failure.
It's not that PE's can't design or review hospital building designs because one of their hospital designs went so egregiously sideways.
It's that PE's can't design or review any building for any use because their design went so egregiously sideways.
I work in an FDA regulated software area. I need 510k approval and the whole nine. But if I can't write regulated medical or dental software anymore, I just pay my fine and/or serve my punishment and go sling React/JS/web crap or become a TF/PyTorch monkey. No one stops me. Consequences for me messing up are far less severe than the consequences for a PE messing up. I can still write software because, in the end, I was never an "engineer" in that hard sense of the word.
Same is true of any software developer. Or any unlicensed area of "engineering" for that matter. We're only playing at being "engineers" with the proverbial "monopoly money". We lose? Well, no real biggie.
PE's agree to hang a sword of damocles over their own heads for the lifetime of the bridge or building they design. That's a whole different ball game.
> Consequences for me messing up are far less severe than the consequences for a PE messing up.
if I approve a bad release that leads to an egregious failure, for me it's a prison sentence and unlimited fines
in addition to being struck off and banned from the industry
> That's a whole different ball game.
if you say so
>if I approve a bad release that leads to an egregious failure, for me it's a prison sentence and unlimited fines
Again, I'm in 510k land. The same applies to myself. No one's gonna allow me to irradiate a patient with a 10x dose because my bass-ackwards software messed up scientific notation, or to remove the wrong kidney because I can't convert orthonormal basis vectors correctly.
But the fact remains that no one would stop either of us from writing software in the future in some other domain.
They do stop PE's from designing buildings in the future in any other domain. By law. So it's very much a different ball game. After an egregious error, we can still practice our craft, because we aren't "engineers" at the end of the day. (Again, "engineers" in that hard sense of the word.) PE's can't practice their craft any longer after an egregious error. Because they are "engineers" in that hard sense of the word.
pray tell, how I can practice my craft from prison
Reasoning by analogy is usually a bad idea, and nowhere is this worse than talking about software development.
It’s just not analogous to architecture, or cooking, or engineering. Software development is just its own thing. So you can’t use analogy to get yourself anywhere with a hint of rigour.
The problem is, AI is generating code that may be buggy, insecure, and unmaintainable. We have as a community spent decades trying to avoid producing that kind of code. And now we are being told that productivity gains mean we should abandon those goals and accept poor quality, as evidenced by Moltbook's security problems.
It’s a weird cognitive dissonance and it’s still not clear how this gets resolved.
Now then, Moltbook is a pathological case. Either it remains a pathological case or our whole technological world is gonna stumble HARD as all the fundamental things collapse.
I prefer to think Moltbook is a pathological case and unrepresentative, but I've also been rethinking a sort of game idea from computer-based to entirely paper/card based (tariffs be damned) specifically for this reason. I wish to make things that people will have even in the event that all these nice blinky screens are ruined and go dark.
Just the first system coded by AI that I could think of. Note this is unrelated to the fact that its users are LLMs - the problem was in the development of Moltbook itself.
> When was the last time you reviewed the machine code produced by a compiler? ...
Any time I’m doing serious optimization or knee-deep in debugging something where the bug emerged at -O2 but not at -O0.
Sometimes just for fun to see what the compiler is doing in its optimization passes.
You severely limit what you can do and what you can learn if you never peek underneath.
ad > when was the last time
i once found a bug in https://developers.google.com/closure/compiler
borrowing a function from an Array that was never invoked broke the compiled code
spent a weekend reading minified code
good times
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
Maybe not, but we don't allow non-architects to vomit out thousands of diagrams that they cannot review, and that is never reviewed, which are subsequently used in the construction of the house.
Your analogy to s/ware is fatally and irredeemably flawed, because you are comparing the regulated and certification-heavy production of content, which is subsequently double-checked by certified professionals, with an unregulated and non-certified production of content which is never checked by any human.
I don't see a flaw, I think you're just gatekeeping software creation.
Anyone can pick up some CAD software and design a house if they so desire. Is the town going to let you build it without a certified engineer/architect signing off? Fuck no. But we don't lock down CAD software.
And presumably, mission critical software is still going to be stamped off on by a certified engineer of some sort.
> Anyone can pick up some CAD software and design a house if they so desire. Is the town going to let you build it without a certified engineer/architect signing off? Fuck no. But we don't lock down CAD software.
No, we lock down using that output from the CAD software in the real world.
> And presumably, mission critical software is still going to be stamped off on by a certified engineer of some sort.
The "mission critical" qualifier is new to your analogy, but is irrelevant anyway - the analogy breaks because, while you can do what you like with CAD software on your own PC, that output never gets used outside of your PC without careful and multiple levels of review, while in the s/ware case, there is no review.
I am not really sure what you are getting at here. Are you suggesting that people should need to acquire some sort of credential to be allowed to code?
> Are you suggesting that people should need to acquire some sort of credential to be allowed to code?
No, I am saying that you are comparing professional $FOO practitioners to professional $BAR practitioners, but it's not a valid comparison because one of those has review and safety built into the process, and the other does not.
You can't use the assertion "We currently allow $FOO practitioners to use every single bit of automation" as evidence that "We should also allow $BAR practitioners to use every bit of automation", because $FOO output gets review by certified humans, and $BAR output does not.
Thanks brother. I flew half way around the world yesterday and am jetlagged as fuck from a 12 hour time change. I'm sorry, my brain apparently shut off, but I follow now. Was out to lunch.
> Thanks brother. I flew half way around the world yesterday and am jetlagged as fuck from a 12 hour time change. I'm sorry, my brain apparently shut off, but I follow now. Was out to lunch.
You know, this was a very civilised discussion; below I've got someone throwing snide remarks my way for some claims I made. You just factually reconfirmed and re-verified until I clarified my PoV.
You're a very pleasant person to argue with.
> We don't call architects 'vibe architects' even though (…)
> We don't call builders 'vibe builders' for (…)
> When was the last time (…)
None of those are the same thing. At all. They are still all deterministic approaches. The architect’s library of things doesn’t change every time they use it or present different things depending on how they hold it. It’s useful because it’s predictable. Same for all your other examples.
If we want to have an honest discussion about the pros and cons of LLM-generated code, proponents need to stop being dishonest in their comparisons. They also need to stop plugging their ears and ignoring the other issues around the technology. It is possible to have something which is useful but whose advantages do not outweigh the disadvantages.
I think the word predictable is doing a bit of heavy lifting there.
Let's say you shovel some dirt: you've got a lot of control over where you get it from and where you put it.
Now get in your big digger's cabin and try to have the same precision. At the level of a shovel-user, you are unpredictable even if you're skilled. Some of your work might be off by a decent fraction of the width of a shovel. That'd never happen if you did it the precise way!
But you have a ton more leverage. And that’s the game-changer.
That’s another dishonest comparison. Predictability is not the same as precision. You don’t need to be millimetric when shovelling dirt at a construction site. But you do need to do it when conducting brain surgery. Context matters.
Sure. If you’re racing your runway to go from 0 to 100 users you’d reach for a different set of tools than if you’re contributing to postgres.
In other words I agree completely with you but these new tools open up new possibilities. We have historically not had super-shovels so we’ve had to shovel all the things no matter how giant or important they are.
> these new tools open up new possibilities.
I’m not disputing that. What I’m criticising is the argument from my original parent post of comparing it to things which are fundamentally different, but making it look equivalent as a justification against criticism.
> We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!
Maybe, but they do it through the filter of their knowledge, experience and wisdom, not by rolling a large number of dice to execute a design.
LLMs are useful, just less useful than people think. For instance, 'technical debt' production has now been automated at an industrial scale.
Compilers are deterministic.
I skimmed over it, and didn't find any discussion of code review.
I feel like I'm taking crazy pills. Are SWEs supposed to move away from code review, one of the core activities of the profession? Code review is as fundamental for SWE as double entry is for accounting. Yes, we know that functional code can get generated at incredible speeds. Yes, we know that apps and whatnot can be bootstrapped from nothing by "agentic coding".
We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I’m delivering line-by-line?
The primary point behind code reviews is to let the author know that someone else will look at their code. They are a psychological tool, and that, AFAIK, doesn't work well with AI models. If the code is important enough that you want to review it, then you should probably be using a different, more interactive flow.
Mitchell talks about this in a roundabout way... in the "Reproduce your own work" section he obviously reviewed that code, as that was the point. In the "End-of-day agents" section he talks about what he found them good for (so far). He previously wrote about how he preferred an interactive style, and this article aligns with that, tracking his progress in understanding how code agents can be useful.
Give it a read; he mentions briefly how he uses it for PR triage and resolving GH issues.
He doesn't go into detail, but there is a bit:
> Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond, I just wanted reports the next day to try to guide me towards high value or low effort tasks.
> More specifically, I would start each day by taking the results of my prior night's triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel).
This is a short excerpt; the whole article is worth reading. Very grounded and balanced.
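To make that workflow concrete, here's a rough sketch of what an overnight triage pass like that could look like. Everything below is my own guess at an implementation, not anything from the post: the non-interactive `claude -p` call, the prompt wording, and the report directory are placeholders.

```python
#!/usr/bin/env python3
"""Overnight issue-triage agents driven by the gh CLI (sketch only)."""
import json
import pathlib
import subprocess
from concurrent.futures import ThreadPoolExecutor

REPORT_DIR = pathlib.Path("triage")
REPORT_DIR.mkdir(exist_ok=True)

def open_issues(limit: int = 20) -> list[dict]:
    """List open issues as {'number', 'title'} dicts via the GitHub CLI."""
    out = subprocess.run(
        ["gh", "issue", "list", "--state", "open",
         "--limit", str(limit), "--json", "number,title"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def triage(issue: dict) -> None:
    """Ask a one-shot agent for a report; it never responds on GitHub."""
    prompt = (
        f"Read GitHub issue #{issue['number']} ('{issue['title']}') with "
        f"`gh issue view {issue['number']}` and look at the relevant code. "
        "Write a short report: likely cause, estimated effort, and whether "
        "an agent could solve it well. Do NOT comment on the issue."
    )
    report = subprocess.run(
        ["claude", "-p", prompt],  # non-interactive, print-and-exit mode
        capture_output=True, text=True,
    ).stdout
    (REPORT_DIR / f"issue-{issue['number']}.md").write_text(report)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:  # "a bunch in parallel"
        list(pool.map(triage, open_issues()))
```

The next morning you filter the reports by hand, exactly as described above, rather than letting anything post back to the tracker.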
Okay I think this somewhat answers my question. Is this individual a solo developer? “Triaging GitHub issues” sounds a bit like open source solo developer.
Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI.
I remain unconvinced that agentic AI scales beyond solo development, where the individual is liable for the output of the agents. More precisely, I can use agentic AI to write my code, but at the end of the day when I submit it to my org it’s my responsibility to understand it, and guarantee (according to my personal expertise) its security and reliability.
Conversely, I would fire (read: reprimand) someone so fast if I found out they submitted code that created a vulnerability that they would have reasonably caught if they weren’t being reckless with code submission speed, LLM or not.
AI will not revolutionize SWE until it revolutionizes our processes. It will definitely speed us up (I have definitely become faster), but faster != revolution.
> Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI.
They probably aren't really. At least in orgs I worked at, writing the code wasn't usually the bottleneck. It was, in retrospect, 'context' engineering: waiting for the decision to get made, making some change and finding it breaks some assumption that was being made elsewhere but wasn't in the ticket, waiting for other stakeholders to insert their piece of the context, waiting for $VENDOR to reply about why their service is/isn't doing X anymore, discovering that $VENDOR_A's stage environment (that your stage environment is testing against for the integration) does $Z when $VENDOR_B_C_D don't do that, etc.
The ecosystem as a whole has to shift for this to work.
The author of the blog made his name and fortune founding HashiCorp, makers of Vagrant and Terraform among other things. Having done all that in his twenties, he retired as CTO and reappeared after a short hiatus with a new open source terminal, Ghostty.
I had a bit of an adjustment of my beliefs since writing these comments. My current take:
Can't believe you don't know who the author is my man.
Generally I don't pay attention to names unless it's someone like Torvalds, Stroustrup, or Guido. Maybe this guy needs another decade of notoriety or something.
The author is the founder of HashiCorp. He created Vault and Terraform, among others.
If you had that article, would you read it fully before firing off questions?
Either really comprehensive tests (that you read) or read it. Usually I find you can skim most of it, but in core sections like billing or something you gotta really review it. The models still make mistakes.
You can't skim over AI code.
For even mid-level tasks it will make bad assumptions, like sorting orders or timezone conversions.
Basic stuff really.
You've probably got a load of ticking time bomb bugs if you've just been skimming it.
> got a load of ticking time bomb bugs
Lots and lots of tests!
You read it. You now have an infinite army of overconfident slightly drunken new college grads to throw at any problem.
Sometimes you're gonna want to slowly back away from them and write things yourself. Sometimes you can farm out work to them.
Code review their work as you would any one else’s, in fact more so.
My rule of thumb has been it takes a senior engineer per every 4 new grads to mentor them and code review their work. Or put another way bringing on a new grad gets you +1 output at the cost of -0.25 a senior.
Also, there are some tasks you just can’t give new college grads.
Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.
> Same dynamic seems to be shaping up here. Except the AI juniors are cheap and work 24*7 and (currently) have no hope of growing into seniors.
Each individual trained model... sure. But otoh you can look at it as a very wide junior with "infinite (only limited by your budget)" willpower. Sure, three years ago they were GPT-3.5, basically useless. And now they're Opus 4.6. I wonder what the next few years will bring.
we're talking about _this_ post? He specifically said he only runs one agent, so sure, he probably reviews the code or, as he stated, finds means of auto-verifying what the agent does (giving the agent a way to self-verify as part of its loop).
So read the code.
Cool, code review continues to be one of the biggest bottlenecks in our org, with or without agentic AI pumping out 1k LOC per hour.
For me, AI is the best for code research and review
Since some team members started using AI without care, I created a bunch of agents/skills/commands and custom scripts for Claude Code. For each PR, it collects the changes via git log/diff, reads the PR data, and spins up a bunch of specialized agents to check code style, architecture, security, performance, and bugs. Each agent is armed with the necessary requirement documents, including security compliance files. False positives are rare, but it still misses some problems. So far, no PR with AI-generated code has passed it. If the AI did not find any problems, I do a manual review.
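For the curious, here's the general shape of that pipeline, stripped way down. These are not my actual scripts; the `claude -p` calls, the prompts, and the doc paths are simplified placeholders:

```python
#!/usr/bin/env python3
"""Per-PR review pipeline, stripped down to the general shape (sketch only)."""
import pathlib
import subprocess
import sys

# One specialized reviewer per aspect, each with its own reference document.
ASPECTS = {
    "style":        "docs/code-style.md",
    "architecture": "docs/architecture.md",
    "security":     "docs/security-compliance.md",
    "performance":  "docs/performance.md",
    "bugs":         None,  # plain bug hunting needs no extra reference doc
}

def pr_diff(pr_number: str) -> str:
    """Collect the change set for a PR via the GitHub CLI."""
    return subprocess.run(
        ["gh", "pr", "diff", pr_number],
        check=True, capture_output=True, text=True,
    ).stdout

def review(aspect: str, doc: str | None, diff: str) -> str:
    """Run one specialized reviewer agent over the diff."""
    context = pathlib.Path(doc).read_text() if doc else ""
    prompt = (
        f"You are a {aspect} reviewer. Using the requirements below, list "
        "concrete problems in this diff, one per line, or 'OK' if none.\n\n"
        f"REQUIREMENTS:\n{context}\n\nDIFF:\n{diff}"
    )
    return subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    diff = pr_diff(sys.argv[1])  # usage: ./review_pr.py 1234
    for aspect, doc in ASPECTS.items():
        print(f"== {aspect} ==\n{review(aspect, doc, diff)}\n")
    # A human still does the final pass; this only front-loads the obvious stuff.
```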
Ok? You still have to read the code.
That's just not what has been happening in large enterprise projects, internal or external, since long before AI.
Famous example - but by no means do I want to single out that company and product: https://news.ycombinator.com/item?id=18442941
From my own experience (I kept this post bookmarked because I too worked on that project in the late 1990s), you cannot review those changes anyway. It is handled as described: you keep tweaking stuff until the tests pass. There is fundamentally no way to understand the code. Maybe it's different in some very core parts, but most of it is just far too messy. I once tried merely disentangling a few types, because there were a lot of duplicate types for the most simple things, such as 32-bit integers, and it is like trying to pick one noodle out of a huge bowl of spaghetti: everything is glued and knotted together, so you always end up lifting out the entire bowl's contents. No AI necessary; that is just how such projects look after many generations of temporary programmers (because all sane people leave as soon as they can, e.g. once they've switched from an H1B to a Green Card) under ticket-closing pressure.
I don't know why, since the beginning of these discussions, some commenters seem to work off the wrong assumption that our actual methods have thus far led to great code. Very often they don't; they lead to a huge mess that just gets bigger over time.
And that is not because people are stupid; it's because top management has rationally determined that the best balance for overall profits does not require perfect code. If the project gets too messy to do much, the customers will already have been hooked and can't change easily, and when they do, some new product will have already replaced the two-decades-old mature one. Those customers still on the old one will pay a premium for future bug fixes, and the rest will jump to the new trend. I don't think AI can make what's described above any worse, or much worse.
If your team members hand off unreviewable blobs of code and you can't keep up, your problem is team management, not technology.
Yup, you didn't even read anything. Vibe commenting is worse than vibe coding.
You're missing the point. The point is that reading the code is more time consuming than writing it, and has always been thus. Having a machine that can generate code 100x faster, but which you have to read carefully to make sure it hasn't gone off the rails, is not an asset. It is a liability.
> The point is that reading the code is more time consuming than writing it, and has always been thus.
Huh?
First, that is definitely not true. If it were, dev teams would spend the majority of their time on code review, but they don't.
And second, even if it were true, you have to read it for code review even if it was written by a person anyways, if we're talking about the context of a team.
Tell that to Mitchell Hashimoto.
I didn't get into creating software so I could read plagiarism laundering machines output. Sorry, miss me with these takes. I love using my keyboard, and my brain.
So you have a hobby.
I have a profession. Therefore I evaluate new tools. Agentic coding I've introduced into my auxiliary tool forgings (one-off bash scripts) and personal projects, and I'm only now comfortable introducing it into my professional work. But I still evaluate every line.
"auxiliary tool forgings" You aren't a serious person.
I may not be a serious person, but I am a serious professional.
And not working with anyone else.
AI written code is often much easier to read/review than some of my coworkers'
Can't say the same for my colleagues' AI written code. It's overtly verbose and always does more than what's required.
I love for companies to pay me money that I can in turn exchange for food, clothes and shelter.
So then type the code as well and read it after. Why are you mad?
I think this is the crux of it: when used as an enhancement to solo productivity, you'll hit a pretty strict upper bound on productivity gains, given that it takes experienced engineers to review code that goes out at scale.
That being said, software quality seems to be decreasing, or maybe it's just cause I use a lot of software in a somewhat locked down state with adblockers and the rest.
Although, that wouldn't explain just how badly they've murdered the once lovely iTunes (now Apple Music) user interface. (And why does CMD-C not pick up anything 15% of the time I use it lately...)
Anyways, digressions aside... the complexity in software development is generally in the organizational side. You have actual users, and then you have people who talk to those users and try to see what they like and don't like in order to distill that into product requirements which then have to be architected, and coordinated (both huge time sinks) across several teams.
Even if you cut out 100% of the development time, you'd still be left with 80% of the timeline.
Over time, though... you'll probably see more people doing what I do all day, which is move around among many repositories (although I've yet to use the AI much; I got my Cursor license recently and am gonna spin up some POCs that I want to see soon), enabled by their use of AI to quickly grasp what's happening in a repo and where the appropriate places to make changes are.
Enabling developers to complete features from tip to tail across deep, many-pronged service architectures could bring project timelines down drastically and bring project management and cross-team coordination costs down tremendously.
Similarly, in big companies, the hand is often barely aware of the foot at best, and exploring that space is a serious challenge. Often folks know exactly one step away and rely on well-established async communication channels which also only know one step further. Principal engineers seem to know large amounts about finite spaces and are often in the dark about things only a few hops away, like the internal tooling for the systems they're maintaining (and are often not particularly great at coming into new spaces and thinking with the same perspective... no, we don't need individual microservices for every 12-requests-a-month admin API group we want to set up).
Once systems can take a feature proposal and lay out concrete plans which each little kingdom can give a thumbs up or thumbs down to (or request modifications to), you can again bring exploration, coordination, and architecture time down.
Sadly, User Experience design seems to be an often terribly neglected part of our profession. I love the memes about an engineer building the perfect interface, like a water pitcher, only for the person to position it weirdly in order to get a pour out of the fill hole or something. Lemme guess how many users you actually talked to (often zero), and how many layers of distillation occurred before you received a micro-picture feature request that ends up being built while taking input from engineers with no macro understanding of a user's actual needs or day to day.
And who are often much more interested in perfecting some little algorithm than thinking about enabling others.
So my money is on money flowing to:
- People who can actually verify system integrity and can fight fires and bugs (but a lot of bug fixing will eventually become prompting?)
- Multi-talented individuals who can, say, interact with users well enough to understand their needs as well as do a decent job verifying system architecture and security
It's outside of coding where I haven't seen much... I guess people use it to more quickly scaffold up expense reports, or generate mocks. So, lots of white collar stuff. But... it's not like the experience of shopping at the supermarket has changed, or going to the movies, or much of anything else.
Your sentiment resonates with me a lot. I wonder what we'll consider the inflection point 10 years from now. It seemed like the zeitgeist was screaming about scaling limits and running out of training data, then we got Claude Code, Sonnet 4.5, then Opus 4.5, and no one's looked back since.
I wonder too. It might be that progress on the underlying models is going to plateau, or it might be that we haven't yet reached what in retrospect will be the biggest inflection point. Technological developments can seem to make sense in hindsight as a story of continuous progress when the dust has settled and we can write and tell the history, but when you go back and look at the full range of voices in the historical sources you realize just how deeply nothing was clear to anyone at all at the time it was happening because everyone was hurtling into the unknown future with a fog of war in front of them. In 1910 I'd say it would have been perfectly reasonable to predict airplanes would remain a terrifying curiosity reserved for daredevils only (and people did); or conversely, in the 1960s a lot of commentators thought that the future of passenger air travel in the 70s and 80s would be supersonic jets. I keep this in mind and don't really pay too much attention to over-confident predictions about the technological future.
Should AI tools use memory safe tabs or spaces for indentation? :)
It is a shame it's become such a polarized topic. Things which actually work fine get immediately bashed by large crowds at the same time things that are really not there get voted to the moon by extremely eager folks. A few years from now I expect I'll be thinking "man, there was some really good stuff I missed out on because the discussions about it were so polarized at the time. I'm glad that has cleared up significantly!"
let me ask a stupid/still-ignorant question - about repeatability.
If one asks this generator/assistant the same request/thing, within the same initial context, 10 times, would it generate the same result? In different sessions and all that.
because.. if not, then it's for once-off things only..
If I asked you for the same thing 10 times, wiping your memory each time, would you generate the same result?
And why does it matter anyway? If the code passes the tests and you like the look of it, it's good. It doesn't need to be existentially complicated.
A pretty bad comparison. If I gave you the correct answer once, it's unlikely that I'll give you a wrong answer the next time. Also, aren't computers supposed to be more reliable than us? If I'm going to use a tool that behaves just like humans, why not just use my brain instead?
I will give Claude Code a trial run if I can run it locally without an internet connection. AI companies have procured so much training data through illegal means you have to be insane to trust them in even the smallest amount.
You can run OpenCode in a container restricted to local network only and communicating with local/self-hosted models.
Claude Code is linked to Anthropic's hosted models so you can't achieve this.
this is such a strawman argument. what are they going to take from you? your triple for loop? they literally own the weights for a neural net that scores 77% on SWE. they don't need, nor care, about your code
GPT-4 showed the potential but the automated workflows (context management, loops, test-running) and pure execution speed to handle all that "reasoning"/workflows (remember watching characters pop in slowly in GPT-4 streaming API response calls) are gamechangers.
The workflow automation and better (and model-directed) context management are all obvious in retrospect but a lot of people (like myself) were instead focused on IDE integration and such vs `grep` and the like. Maybe multi-agent with task boards is the next thing, but it feels like that might also start to outrun the ability to sensibly design and test new features for non-greenfield/non-port projects. Who knows yet.
I think it's still very valuable for someone to dig in to the underlying models periodically (insomuch as the APIs even expose the same level of raw stuff anymore) to get a feeling for what's reliable to one-shot vs what's easily correctable by a "ran the tests, saw it was wrong, fixed it" loop. If you don't have a good sense of that, it's easy to get overambitious and end up with something you don't like if you're the sort of person who cares at all about what the code looks like.
I think for a lot of people the turn off is the constant churn and the hype cycle. For a lot of people, they just want to get things done and not have to constantly keep on top of what's new or SOTA. Are we still using MCPs or are we using Skills now? Not long ago you had to know MCP or you'd be left behind and you definitely need to know MCP UI or you'll be left behind. I think. It just becomes really tiring, especially with all the FUD.
I'm embracing LLMs but I think I've had to just pick a happy medium and stick with Claude Code with MCPs until somebody figures out a legitimate way to use the Claude subscription with open source tools like OpenCode, then I'll move over to that. Or if a company provides a model that's as good value that can be used with OpenCode.
It reminds me a lot of 3D Printing tbh. Watching all these cool DIY 3d printing kits evolve over years, I remember a few times I'd checked on costs to build a DIY one. They kept coming down, and down, and then around the same time as "Build a 3d printer for $200 (some assembly required)!" The Bambu X1C was announced/released, for a bit over a grand iirc? And its whole selling point was that it was fast and worked, out of the box. And so I bought one and made a bunch of random one-off-things that solved _my_ specific problem, the way I wanted it solved. Mostly in the form of very specific adapter plates that I could quickly iterate on and random house 'wouldn't it be nice if' things.
That's kind of where AI-agent-coding is now too, though... software is more flexible.
> Or if a company provides a model that's as good value that can be used with OpenCode.
OpenAI's Codex?
From everything I've heard, Claude Code is still better at coding and better value (subscription) but I'm happy to be proven wrong.
> For a lot of people, they just want to get things done and not have to constantly keep on top of what's new or SOTA
That hasn’t been tech for a long time.
Frontend has been changing forever. React and friends have new releases all the time. Node has new package managers and even Deno and Bun. AWS keeps changing things.
You really shouldn't use the absolute hellscape of churn that is web dev as an example of broader industry trends. No other sub-field of tech is foolish enough to chase hype and new tools the way web dev is.
I think the web/system dichotomy is also a major conflating factor for LLM discussions.
A “few hundred lines of code” in Rust or Haskell can be bumping into multiple issues LLM assisted coding struggles with. Moving a few buttons on a website with animations and stuff through multiple front end frameworks may reasonably generate 5-10x that much “code”, but of an entirely different calibre.
3,000 lines a day of well-formatted HTML template edits, paired with a reloadable website for rapid validation, is super digestible, while 300 lines of code per day into curl could be seen as reckless.
Exactly this. At work, I’ve seen front-end people generating probably 80% of their code because when you set aside framework churn, a lot of it is boilerplatey and borderline trivial (sorry). Meanwhile, the programmers working on the EV battery controller that uses proprietary everything and where a bug could cause an actual explosion are using LLMs as advanced linters and that’s it.
There's a point at which these things become Good Enough though, and don't bottleneck your capacity to get things done.
To your point, React, while it has new updates, hasn't changed the fundamentals since 16.8.0 (introduction of hooks) and that was 7 years ago. Yes there are new hooks, but they typically build on older concepts. AWS hasn't deprecated any of our existing services at work (besides maybe a MySQL version becoming EOL) in the last 4 years that I've worked at my current company.
While I prefer pnpm (to not take up my MacBook's inadequate SSD space), you can still use npm and get things done.
I don't need to keep obsessing over whether Codex or Claude have a 1 point lead in a gamed benchmark test so long as I'm still able to ship features without a lot of churn.
Isn’t there something off about calling predictions about the future, that aren’t possible with current tech, hype? Like people predicted AI agents would be this huge change, they were called hype since earlier models were so unreliable, and now they are mostly right as ai agents work like a mid level engineer. And clearly super human in some areas.
> ai agents work like a mid level engineer
They do not.
> And clearly super human in some areas.
Sure, if you think calculators or bicycles are "superhuman technology".
Lay off the hype pills.
> They do not.

Do you have anything to back this up? This seems like a shallow dismissal. Claude Code is mostly used to ship Claude Code and Claude Cowork - which are at multi-billion ARR. I use Claude Code to ship technically deep dev tools for myself, for example here https://github.com/ianm199/bubble-analysis. I am a decent engineer and I wouldn't have the time or expertise to ship that.
>Sure, if you think calculators or bicycles are "superhuman technology".
Uh, yes they are? That's why they were revolutionary technologies!
It's hard to see why a bike that isn't superhuman would even make sense? Being superhuman in at least some aspect really seems like the bare minimum for a technology to be worth adopting.
Is there any reason to use Claude Code specifically over Codex or Gemini? I've found both Codex and Gemini similar in results, but I never tried Claude because I keep hearing usage runs out so fast on pro plans and there's no free trial for the CLI.
I mostly mentioned Claude Code because it's what Mitchell first tried according to his post, and it's what I personally use. From what I hear Codex is pretty comparable; it has a lot of fans. There are definitely some differences and strengths and weaknesses of both the CLIs and the underlying LLMs that others who use more than one tool might want to weigh in on, but they're all fairly comparable. (Although, we'll see how the new models released from Anthropic and OpenAI today stack up.) Codex and Gemini CLI are basically Claude Code clones with different LLMs behind them, after all.
IME Gemini is pretty slow in comparison to Claude - but hey, it's super cheap at least.
But that speed makes a pretty significant difference in experience.
If you wait a couple minutes and then give the model a bunch of feedback about what you want done differently, and then have to wait again, it gets annoying fast.
If the feedback loop is much tighter things feel much more engaging. Cursor is also good at this (investigate and plan using slower/pricier models, implement using fast+cheap ones).
but annoying hype is exactly the issue with AI in my eyes. I get it's a useful tool in moderation and all, but I also experience that management values speed and quantity of delivery above all else, and hype-driven as they are I fear they will run this industry to the ground and we as users and customers will have to deal with the world where software is permanently broken as a giant pile of unmaintainable vibe code and no experienced junior developers to boot.
>management values speed and quantity of delivery above all else
I don't know about you but this has been the case for my entire career. Mgmt never gave a shit about beautiful code or tech debt or maintainability or how enlightened I felt writing code.
> It's a shame that AI coding tools have become such a polarizing issue among developers.
Frankly I'm so tired of the usual "I don't find myself more productive", "It writes soup". Especially when some of the best software developers (and engineers) find much utility in those tools, there should be some doubt growing in that crowd.
I have come to the conclusion that software developers, those only focusing on the craft of writing code, are the naysayers.
Software engineers immediately recognize the many automation/exploration/etc. boosts, recognize the tools' limits, and work on improving them.
Hell, AI is an insane boost to productivity, even if you don't have it write a single line of code ever.
But people that focus on the craft (the kind of crowd that doesn't even process the concept of throwaway code or budgets or money) will keep leaning on their "I don't see the benefits because X" forever, nonsensically confusing any tool use with vibe coding.
I'm also convinced that since this crowd never had any notion of what engineering is (there is very little of it in our industry, sadly; technology and code are the focus, and rarely the business, budget, and problems to solve), and confused it with architecture, technology choices, or best practices, they are genuinely insecure about their jobs: once their highly valued craft and skills are diminished, they pay the price of never having invested in understanding the business, the domain, processes, or soft skills.
I've spent 2+ decades producing software across a number of domains and orgs and can fully agree that _disciplined use_ of LLM systems can significantly boost productivity, but the rules and guidance around their use within our industry writ large are still in flux and causing as many problems as they're solving today.
As the most senior IC within my org, since the advent of (enforced) LLM adoption my code contribution/output has stalled as my focus has shifted to the reactionary work of sifting through the AI-generated chaff following post-mortems of projects that should never have shipped in the first place. On a good day I end up rejecting several PRs that most certainly would have taken down our critical systems in production due to poor vetting and architectural flaws, and on the worst I'm in full-on firefighting mode to "fix" the same issues already taking down production (already too late).
These are not inherent technical problems in LLMs, these are organizational/processes problems induced by AI pushers promising 10x output without the necessary 10x requirements gathering and validation efforts that come with that. "Everyone with GenAI access is now a 10x SDE" is the expectation, when the reality is much more nuanced.
The result I see today is massive incoming changesets that no one can properly vet given the new shortened delivery timelines and reduced human resourcing given to projects. We get test suite coverage inflation where "all tests pass" but core business requirements are undermined, and no one is being given the time or resources to properly confirm the business requirements are actually being met. Shit hits the fan, repeat ad nauseam. The focus within our industry needs to shift to education on the proper application and use of these tools, or we'll inevitably crash into the next AI winter; an increasingly likely future that would have been totally avoidable if everyone drinking the Kool-Aid stopped to observe what is actually happening.
As you implied, code is cheap and most code is "throwaway" given even modest time horizons, but all new code comes with hidden costs not readily apparent to all the stakeholders attempting to create a new normal with GenAI. As you correctly point out, the biggest problems within our industry aren't strictly technical ones, they're interpersonal, communication and domain expertise problems, and AI use is simply exacerbating those issues. Maybe all the orgs "doing it wrong" (of which there are MANY) simply fail and the ones with actual engineering discipline "make it," but it'll be a reckoning we should not wish for.
I have heard from a number of different industry players and they see the same patterns. Just look at the average linked in post about AI adoption to confirm. Maybe you observe different patterns and the issues aren't as systemic as I fear. I honestly hope so.
Your implication that seniors like myself are "insecure about our jobs" is somewhat ironically correct, but not for the reasons you think.
The Death of the "Stare": Why AI’s "Confident Stupidity" is a Threat to Human Genius
OPINION | THE REALITY CHECK

In the gleaming offices of Silicon Valley and the boardrooms of the Fortune 500, a new religion has taken hold. Its deity is the Large Language Model, and its disciples—the AI Evangelists—speak in a dialect of "disruption," "optimization," and "seamless integration." But outside the vacuum of the digital world, a dangerous friction is building between AI’s statistical hallucinations and the unyielding laws of physics.
The danger of Artificial Intelligence isn't that it will become our overlord; the danger is that it is fundamentally, confidently, and authoritatively stupid.
The Paradox of the Wind-Powered Car

The divide between AI hype and reality is best illustrated by a recent technical "solution" suggested by a popular AI model: an electric vehicle equipped with wind generators on the front to recharge the battery while driving. To the AI, this was a brilliant synergy. It even claimed the added weight and wind resistance amounted to "zero."
To any human who has ever held a wrench or understood the First Law of Thermodynamics, this is a joke—a perpetual motion fallacy that ignores the reality of drag and energy loss. But to the AI, it was just a series of words that sounded "correct" based on patterns. The machine doesn't know what wind is; it only knows how to predict the next syllable.
The Erosion of the "Human Spark"

The true threat lies in what we are sacrificing to adopt this "shortcut" culture. There is a specific human process—call it The Stare. It is that thirty-minute window where a person looks at a broken machine, a flawed blueprint, or a complex problem and simply observes.
In that half-hour, the human brain runs millions of mental simulations. It feels the tension of the metal, the heat of the circuit, and the logic of the physical universe. It is a "Black Box" of consciousness that develops solutions from absolutely nothing—no forums, no books, and no Google.
However, the new generation of AI-dependent thinkers views this "Stare" as an inefficiency. By outsourcing our thinking to models that cannot feel the consequences of being wrong, we are witnessing a form of evolutionary regression. We are trading hard-earned competence for a "Yes-Man" in a box.
The Gaslighting of the Realist

Perhaps most chilling is the social cost. Those who still rely on their intuition and physical experience are increasingly being marginalized. In a world where the screen is king, the person pointing out that "the Emperor has no clothes" is labeled as erratic, uneducated, or naive.
When a master craftsman or a practical thinker challenges an AI’s "hallucination," they aren't met with logic; they are met with a robotic refusal to acknowledge reality. The "AI Evangelists" have begun to walk, talk, and act like the models they worship—confidently wrong, devoid of nuance, and completely detached from the ground beneath their feet.
The High Cost of Being "Authoritatively Wrong"

We are building a world on a foundation of digital sand. If we continue to trust AI to design our structures and manage our logic, we will eventually hit a wall that no "prompt" can fix.
The human brain runs on 20 watts and can solve a problem by looking at it. The AI runs on megawatts and can’t understand why a wind-powered car won't run forever. If we lose the ability to tell the difference, we aren't just losing our jobs—we're losing our grip on reality itself.
> Break down sessions into separate clear, actionable tasks. Don't try to "draw the owl" in one mega session.
This is the key one I think. At one extreme you can tell an agent "write a for loop that iterates over the variable `numbers` and computes the sum" and they'll do this successfully, but the scope is so small there's not much point in using an LLM. On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.
A lot of successful LLM adoption for code is finding this sweet spot. Overly specific instructions don't make you feel productive, and overly broad instructions you end up redoing too much of the work.
This is actually an aspect of using AI tools I really enjoy: Forming an educated intuition about what the tool is good at, and tastefully framing and scoping the tasks I give it to get better results.
It cognitively feels very similar to other classic programming activities, like modularization at any level from architecture to code units/functions, thoughtfully choosing how to lay out and chunk things. It's always been one of the things that make programming pleasurable for me, and some of that feeling returns when slicing up tasks for agents.
"Become better at intuiting the behavior of this non-deterministic black box oracle maintained by a third party" just isn't a strong professional development sell for me, personally. If the future of writing software is chasing what a model trainer has done with no ability to actually change that myself I don't think that's going to be interesting to nearly as many people.
It sounds like you're talking more about "vibe coding" i.e. just using LLMs without inspecting the output. That's neither what the article nor the people to whom you're replying are saying. You can (and should) heavily review and edit LLM generated code. You have the full ability to change it yourself, because the code is just there and can be edited!
And yet the comments are chock full of cargo-culting about different moods of the oracle and ways to get better output.
I think this is underrating the role of intuition in working effectively with deterministic but very complex software systems like operating systems and compilers. Determinism is a red herring.
Whether it's interesting or not is irrelevant to whether it produces usable output that could be economically valuable.
Yeah, still waiting for something to ship before I form a judgement on that
Claude Code is made with Anthropic's models and is very commercially successful.
Something besides AI tooling. This isn't Amway.
Since they started doing that it's gained a lot of bugs.
Should have used codex. (jk ofc)
I agree that framing and scoping tasks is becoming a real joy. The great thing about this strategy is there's a point at which you can scope something small enough that it's hard for the AI to get it wrong and it's easy enough for you as a human to comprehend what it's done and verify that it's correct.
I'm starting to think of projects now as a tree structure where the overall architecture of the system is the main trunk and from there you have the sub-modules, and eventually you get to implementations of functions and classes. The goal of the human in working with the coding agent is to have full editorial control of the main trunk and main sub-modules and delegate as much of the smaller branches as possible.
Sometimes you're still working out the higher-level architecture, too, and you can use the agent to prototype the smaller bits and pieces which will inform the decisions you make about how the higher-level stuff should operate.
[Edit: I may have been replying to another comment in my head as now I re-read it and I'm not sure I've said the same thing as you have. Oh well.]
I agree. This is how I see it too. It's more like a shortcut to an end result that's very similar (or much better) than I would've reached through typing it myself.
The other day I did realise that I'm using my experience to steer it away from bad decisions a lot more than I noticed. It feels like it does all the real work, but I have to remember it's my/our (decades of) experience writing code playing a part also.
I'm genuinely confused when people come in at this point and say that it's impossible to do this and produce good output and end results.
I feel the same, but, also, within like three years this might look very different. Maybe you'll give the full end-to-end goal upfront and it just polls you when it needs clarification or wants to suggest alternatives, and it self-manages, cleanly self-delegating.
Or maybe something quite different, where these early-era agentic tooling strategies become either unneeded or even actively detrimental.
> it just polls you when it needs clarification
I think anyone who has worked on a serious software project would say, this means it would be polling you constantly.
Even if we posit that an LLM is equivalent to a human, humans constantly clarify requirements/architecture. IMO on both of those fronts the correct path often reveals itself over time, rather than being knowable from the start.
So in this scenario it seems like you'd be dealing with constant pings and need to really make sure your understanding of the project is growing along with the LLM's development efforts as well.
To me this seems like the best-case of the current technology, the models have been getting better and better at doing what you tell it in small chunks but you still need to be deciding what it should be doing. These chunks don't feel as though they're getting bigger unless you're willing to accept slop.
> Break down sessions into separate clear, actionable tasks.
What this misses, of course, is that you can just have the agent do this too. Agents are great at making project plans, especially if you give them a template to follow.
It sounds to me like the goal there is to spell out everything you don't want the agent to make assumptions about. If you let the agent make the plan, it'll still make those assumptions for you.
If you've got a plan for the plan, what else could you possibly need!
You joke, but the more I iterate on a plan before any code, the more successful the first pass is.
1) Tell claude my idea with as much as I know, ask it to ask me questions. This could go on for a few rounds. (Opus)
2) Run a validate skill on the plan, reviewer with a different prompt (Opus)
3) codex reviews the plan, always finds a few small items after the above 2.
4) claude opus implements in 1 shot, usually 99% accurate, then I manually test.
If I stay on target with those steps I always have good outcomes, but it is time consuming.
I do something very similar. I have an "outside expert" script I tell my agent to use as the reviewer. It only bothers me when neither it NOR the expert can figure out what the heck it is I actually wanted.
In my case I have Gemini CLI, so I tell Gemini to use a little Python script called gatekeeper.py to validate its plan before each phase with Qwen, Kimi, or (if nothing else is getting good results) ChatGPT 5.2 Thinking. Qwen & Kimi are via fireworks.ai, so it's much cheaper than ChatGPT. The agent is not allowed to start work until one of the "experts" approves the plan via the gatekeeper. Similarly, it can't mark a phase as complete until the gatekeeper approves the code as bug-free and up to standards and it passes all unit tests & linting.
Lately Kimi is good enough, but when it's really stuck it will sometimes bother ChatGPT. Seldom does it get all the way to the bottom of the pile and need my input. Usually it's when my instructions turned out to be vague.
I also have it use those larger thinking models for "expert consultation" when it's spent more than 100 turns on any problem and hasn't made progress by its own estimation.
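If it helps to picture it, a bare-bones illustration of the gatekeeper idea is below. The endpoints, model identifiers, and the APPROVED/REJECTED convention are placeholders for illustration, not the real gatekeeper.py:

```python
#!/usr/bin/env python3
"""gatekeeper.py, reduced to a bare-bones illustration (sketch only)."""
import os
import sys

from openai import OpenAI  # any OpenAI-compatible endpoint works

# Cheapest "expert" first; escalate only when the verdict is unusable.
# Base URLs and model identifiers below are placeholders, not the real config.
EXPERTS = [
    ("kimi-via-fireworks", "https://api.fireworks.ai/inference/v1",
     "FIREWORKS_API_KEY", "placeholder/kimi-model-id"),
    ("chatgpt-fallback", "https://api.openai.com/v1",
     "OPENAI_API_KEY", "placeholder-gpt-model-id"),
]

RUBRIC = (
    "You are a strict reviewer. Read the plan or diff below. Reply with "
    "'APPROVED' on the first line if it is sound and meets the coding "
    "standards; otherwise reply 'REJECTED' followed by the issues."
)

def ask(expert: tuple, text: str) -> str:
    """Send the plan/diff to one expert and return its verdict."""
    name, base_url, key_env, model = expert
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": text}],
    )
    return (resp.choices[0].message.content or "").strip()

def main() -> int:
    text = sys.stdin.read()  # the coding agent pipes its plan or diff in
    for expert in EXPERTS:
        verdict = ask(expert, text)
        print(f"[{expert[0]}]\n{verdict}\n")
        first = verdict.splitlines()[0] if verdict else ""
        if first.startswith(("APPROVED", "REJECTED")):
            # Exit code is what the coding agent keys off: 0 = may proceed.
            return 0 if first.startswith("APPROVED") else 1
    return 1  # no clear verdict from any expert; fall back to asking the human

if __name__ == "__main__":
    sys.exit(main())
```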
> On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.
Amusingly, this was my experience in giving Lovable a shot. The onboarding process was literally just setting me up for failure by asking me to describe the detailed app I was attempting to build.
Taking it piece by piece in Claude Code has been significantly more successful.
so many times I catch myself asking a coding agent e.g “please print the output” and it will update the file with “print (output)”.
Maybe there’s something about not having to context switch between natural language and code just makes it _feel_ easier sometimes
I actually enjoy writing specifications. So much so that I made it a large part of my consulting work for a huge part of my career. So it makes sense that working with Gen-AI that way is enjoyable for me.
The more detailed I am in breaking down chunks, the easier it is for me to verify and the more likely I am going to get output that isn't 30% wrong.
> the scope is so small there's not much point in using an LLM
Actually that's how I did most of my work last year. I was annoyed by existing tools so I made one that can be used interactively.
It has full context (I usually work on small codebases), and can make an arbitrary number of edits to an arbitrary number of files in a single LLM round trip.
For such "mechanical" changes, you can use the cheapest/fastest model available. This allows you to work interactively and stay in flow.
(In contrast to my previous obsession with the biggest, slowest, most expensive models! You actually want the dumbest one that can do the job.)
I call it "power coding", akin to power armor, or perhaps "coding at the speed of thought". I found that staying actively involved in this way (letting LLM only handle the function level) helped keep my mental model synchronized, whereas if I let it work independently, I'd have to spend more time catching up on what it had done.
I do use both approaches though, just depends on the project, task or mood!
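The core loop, reduced to a sketch, looks something like this. The JSON edit format, the prompt, and the model name are simplified stand-ins rather than the real tool:

```python
#!/usr/bin/env python3
"""One LLM round trip, arbitrary edits across arbitrary files (sketch only)."""
import json
import pathlib
import sys

from openai import OpenAI

ROOT = pathlib.Path(".")
SOURCES = [p for p in ROOT.rglob("*.py") if ".venv" not in p.parts]

def full_context() -> str:
    """Small codebases only: ship every file to the model on each round trip."""
    return "\n\n".join(f"### {p}\n{p.read_text()}" for p in SOURCES)

def request_edits(instruction: str) -> list[dict]:
    """Ask a cheap, fast model for whole-file replacements as bare JSON."""
    client = OpenAI()  # the dumbest model that can do the job
    prompt = (
        "Apply this instruction to the codebase and reply ONLY with JSON: "
        '[{"path": "...", "content": "<entire new file>"}]\n\n'
        f"INSTRUCTION: {instruction}\n\nCODEBASE:\n{full_context()}"
    )
    resp = client.chat.completions.create(
        model="placeholder-fast-model",  # placeholder id, not a real model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    # e.g. ./edit.py "rename compute_total to total and update all callers"
    for edit in request_edits(" ".join(sys.argv[1:])):
        pathlib.Path(edit["path"]).write_text(edit["content"])
        print("updated", edit["path"])
```

Because the edits come back in one shot and the model is fast, the whole thing stays interactive; you review the diff in git right after.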
Do you have the tool open sourced somewhere? I have been thinking of using something similar
And lately, the sweet spot has been moving upwards every 6-8 weeks with the model release cycle.
Exactly. The LLMs are quite good at "code inpainting", eg "give me the outline/constraints/rules and I'll fill-in the blanks"
But not so good at making (robust) new features out of the blue
This matches my experience, especially "don’t draw the owl" and the harness-engineering idea.
The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me.
Once I got that loop stable, it stopped being a toy and started being a lever. I've shipped real features this way across a few projects (a git-like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can't drift unnoticed.
>The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
Yeah I would get patterns where, initial prototypes were promising, then we developed something that was 90% close to design goals, and then as we try to push in the last 10%, drift would start breaking down, or even just forgetting, the 90%.
So I would start getting to 90% and basically starting a new project with that as the baseline to add to.
No harm meant, but your writing is very reminiscent of an LLM. It is great actually, there is just something about it - "it wasn't.. it was", "it stopped being.. and started". Claude and ChatGPT seem to love these juxtapositions. The triplets on every other sentence. I think you are a couple em-dashes away from being accused of being a bot.
These patterns seem to be picking up speed in the general population; makes the human race seem quite easily hackable.
>makes the human race seem quite easily hackable.
If the human race were not hackable then society would not exist, we'd be the unchanging crocodiles of the last few hundred million years.
Have you ever found yourself speaking a meme? Had a catchy tune repeating in your head? Started spouting nation-state level propaganda? Found yourself in a crowd trying to burn a witch at the stake?
Hacking the flow of human thought isn't that hard, especially across populations. Hacking any one particular human's thoughts is harder unless you have a lot of information on them.
How do I hack the human population to give me money, and simultaneously, hack law enforcement to not arrest me?
> How do I hack the human population to give me money
Make something popular or become famous.
> hack law enforcement to not arrest me
Don't become famous with illegal stuff.
The hack is that we live in a society that makes people think they need a lot of money, while at the same time allowing individuals to accumulate obscene amounts of wealth and influence, with many people being OK with that.
This is what I experienced as well.
These are some tricks I use now.
1. Write a generic prompt about the project and software versions and keep it in the repo folder. (I think this is getting pushed as SKILLS.md now; a rough example follows this list.)
2. In the prompt, add instructions to comment on every change; since our main job is to validate and fix any issues, that makes it easier.
3. Find the best model for the specific workflow. For example, these days I find that Gemini Pro is good for HTML UI stuff, while Claude Sonnet is good for Python code. (This is why subagents are getting popular.)
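For illustration only, a minimal project prompt file along the lines of points 1 and 2 might look something like this (the stack, versions, and wording are all made up):

```
# AGENTS.md / SKILLS.md (hypothetical example)
Project: small invoicing API. Stack: Python 3.12, FastAPI, Postgres 16.
Run `pytest -q` before declaring any task done.
Leave a short comment above every change you make, so the human reviewer
can validate and fix issues quickly.
```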
Would love to hear more about your genealogy app.
This is the most common answer from people that are rocking and rolling with AI tools, but I cannot help but wonder how this is different from how we should have built software all along. I know I have been (after 10+ years…)
I think you are right, the secret is that there is no secret. The projects I have been involved with that were most successful were using these techniques. I also think experience helps, because you develop a sense that very quickly knows when the model wants to go in a wonky direction and what a good spec looks like.
With where the models are right now you still need a human in the loop to make sure you end up with code you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.
> The bottleneck has gone from writing code to reading code.
This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it.
Most devs don’t do thorough reviews. Check the variable names seem ok, make sure there’s no obvious typos, ask for a comment and call it good. For a trusted teammate this is actually ok and why they’re so valuable! For an AI, it’s a slot machine and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.
This was a great post, one of the best I've seen on this topic at HN.
But why is the cost never discussed or disclosed in these conversations? I feel like I'm going crazy, there is so much written extolling the virtues of these tools but with no mention of what it costs to run them now. It will surely only get more expensive from here!
> But why is the cost never discussed or disclosed in these conversations?
And not just the monetary cost of accessing the tools, but the amount of time it takes to actually get good results out. I strongly suspect that even though it feels more productive, in many cases things just take longer than they would if done manually.
I think there are really good uses for LLMs, but I also think that people are likely using them in ways that feel useful, but end up being more costly than not.
Indeed, most of us are probably limited to what our companies let us use. Not to mention that not everyone can afford to use AI tooling in their own time without thinking about the cost, assuming you want to build something your company doesn't claim as its own IP.
The first time I did work the way the article suggests, I used my monthly allowance in a day.
Apparently out of 3-5k people with access to our AI tools, there's fewer than a handful of us REALLY using it. Most are asking questions in the chatbot style.
Anyway, I had to ask my manager, the AI architect, and the Tooling Manager for approval to increase my quota.
I asked everyone in the chain how much equivalent dollars I am allocated, and how much the increase was and no one could tell me.
Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet. The reality is the standard deviation of cost is going to oscillate until there is a common agreed upon way to use these tools.
Yes, but the lack of clear pricing probably makes people think it's more expensive than it actually is. (It did so to me.)
There is nothing quantifiable here: https://claude.com/pricing
Pro: "Everything in Free, plus: More usage"
Max: "Choose 5x or 20x more usage than Pro"
Wow, 5x or 20x more of "more". That's some masterful communication right there.
> Honestly, the costs are so minimal and vary wildly relative to the cost of a developer that it's frankly not worth the discussion...yet
Is it? Sure, the chatbot style maxes at $200/month. I consider that ... not unreasonable ... for a professional tool. It doesn't make me happy, but it's not horrific.
The article, however, explicitly pans the chatbot style and is extolling the API style being accessed constantly by agents, and that has no upper bound. Roughly $10-ish per megatoken, $10-ish per 1K web searches, etc.
This doesn't sound "minimal" to me. This sounds like every single "task" I kick off is $10. And it can kick those tasks and costs off very quickly in an automated fashion. It doesn't take many of those tasks before I'm paying more than an actual full developer.
Ref: https://claude.com/pricing#api
The current realistic lower bound for actual work is the $100/€90/month Claude Max ("5x") plan. It allows roughly enough usage for a typical working month (4.25 x 40-50h). "Single-threaded", interactive usage with normal human breaks, sort of.
There are two usage quota windows to be aware of: 5h and 7d. I use https://github.com/richhickson/claudecodeusage (Mac) to keep track of the status. It shows green/yellow/red and a percentage in the menu bar.
Is there any guidance on when an API vs. a subscription is a better deal?
AI chat for research is great and really helps me.
I just don't need the AI writing code for me, don't see the point. Once I know from the ai chat research what my solution is I can code it myself with the benefit I then understand more what I am doing.
And yes I've tried the latest models! Tried agent mode in copilot! Don't need it!
I still use the chatbot but like to do it outside-in. Provide what I need, and instruct it to not write any code except the api (signatures of classes, interfaces, hierarchy, essential methods etc). We keep iterating about this until it looks good - still no real code. Then I ask it to do a fresh review of the broad outline, any issues it foresees etc. Then I ask it to write some demonstrator test cases to see how ergonomic and testable the code is - we fine tune the apis but nothing is fleshed out yet. Once this is done, we are done with the most time consuming phase.
After that is basically just asking it to flesh out the layers starting from zero dependencies to arriving at the top of the castle. Even if we have any complexities within the pieces or the implementation is not exactly as per my liking, the issues are localised - I can dive in and handle it myself (most of the time, I don't need to).
I feel like this approach works very well for me in keeping a mental model of how things are connected, because most of my time was spent on that model.
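As a rough illustration, the "signatures only, no real code" stage described above might produce a skeleton like this; all the names are hypothetical, not from the thread:

```python
# Hypothetical "outline only" output from the first phase: signatures,
# types and docstrings, no implementations yet.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Order:
    id: str
    total_cents: int


class PaymentGateway(Protocol):
    def charge(self, order: Order) -> str:
        """Charge the order and return a payment reference."""
        ...


class CheckoutService:
    def __init__(self, gateway: PaymentGateway) -> None:
        self.gateway = gateway

    def checkout(self, order: Order) -> str:
        """Fleshed out later, layer by layer, starting from zero dependencies."""
        raise NotImplementedError
```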
> I'm a software craftsman
This perspective is why I think this article is so refreshing.
Craftsmen approach tools differently. They don't expect tools to work for them out-of-the-box. They customize the tool to their liking and reexamine their workflow in light of the tool. Either that or they have such idiosyncratic workflows they have to build their own tools.
They know their tools are custom to _them_. It would be silly to impose that everyone else use their tools-- they build different things!
> Always Have an Agent Running
I'm a huge believer in AI agent use and even I think this is wrong. It's like saying "always have something compiling" or "make sure your Internet is always downloading something".
The most important work happens when an agent is not running, and if you spend most of your time looking for ways to run more agents you're going to streetlight-effect your way into solving the wrong problems https://en.wikipedia.org/wiki/Streetlight_effect
I've been thinking about this as three maturity levels.
Level 1 is what Mitchell describes — AGENTS.md, a static harness. Prevents known mistakes. But it rots. Nobody updates the checklist when the environment changes.
Level 2 is treating each agent failure as an inoculation. Agent duplicates a util function? Don't just fix it — write a rule file: "grep existing helpers before writing new ones." Agent tries to build a feature while the build is broken? Rule: "fix blockers first." After a few months you have 30+ of these. Each one is an antibody against a specific failure class. The harness becomes an immune system that compounds.
Level 3 is what I haven't seen discussed much: specs need to push, not just be read. If a requirement in auth-spec.md changes, every linked in-progress task should get flagged automatically. The spec shouldn't wait to be consulted.
The real bottleneck isn't agent capability — it's supervision cost. Every type of drift (requirements change, environments diverge, docs rot) inflates the cost of checking the agent's work.
Crush that cost and adoption follows.
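To make Level 2 concrete, a couple of those "antibody" rules might read something like this in an AGENTS.md; the wording is hypothetical, just mirroring the two failures mentioned above:

```
# AGENTS.md (excerpt) - one rule per past failure
- Grep for existing helpers before writing a new one; never duplicate a util.
- If the build is broken, fix that first; do not start a new feature on top
  of a broken build.
```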
> level 2 - becomes an immune system
I'd bet that above some number there will be contradictions: things that apply to different semantic contexts but look the same at the syntax level (and maybe at various levels of "syntax" and "semantics"). And debugging those is going to be a nightmare - same as debugging a requirements spec / verifying it.
For those wondering how that looks in practice, here's one of OP's past blog posts describing a coding session to implement a non-trivial feature: https://mitchellh.com/writing/non-trivial-vibing (covered on HN here: https://news.ycombinator.com/item?id=45549434)
How much does it cost per day to have all these agents running on your computer?
Is your company paying for it or you?
What is your process of the agent writes a piece of code, let's say a really complex recursive function, and you aren't confident you could have come up with the same solution? Do you still submit it?
The guy who wrote the post is a billionaire
I thought this was a joke ie you need to be a billionaire to be able to use agents like this, but you are correct.
I think we need to stop listening to billionaires. The article is well thought out and well written, but his perspective is entirely biased by never having to think about money at all... all of this stuff is incredibly expensive.
Billionaires also tend to have a vested interest in the tech being hyped and adopted, after all one doesn't become a billionaire without investments.
Define investment in this case. He's the cofounder of HashiCorp. I guess you could refer to his equity as an investment here, but I don't really think it tracks the same in this context.
He may have a vested interest, but he did cofound HashiCorp as an engineer that actually developed the products, so I find his insight at least somewhat valuable.
Well did he become a billionaire from hashicorp alone or did he invest e.g. millions in stocks (like perhaps ai stocks) to become a billionaire
Oh, never heard of him!
Much more pragmatic and less performative than other posts hitting frontpage. Good article.
Finally, a step-by-step guide for even the skeptics to try, to see what spot the LLM tools have in their workflows, without hype or magic like "I vibe-coded an entire OS, and you can too!".
Very much the same experience. But it does not talk much about the project setup and its influence on how the session goes. In narrowly scoped projects it works really well, especially when tests are easy to execute. I found that this approach melts down when facing enterprise software with large repositories and unconventional layouts. Then you need to do a bunch of context management upfront, and write verbose instructions for evaluations. But we know what it really needs is a refactor, that's all.
And the post touches on the next type of problem: how to plan far enough ahead of time to utilise agents when you are away. It is a difficult problem, but IMO we're going in the direction of having some sort of shared "templated plans"/workflows and budgeted/throttled task execution to achieve that. It is like you want to give it a little world to explore so that it does not stop early, like a little game to play; then you come back in the morning and check how far it went.
With so much noise in the AI world and constant model updates (just today GPT-5.3-Codex and Claude Opus 4.6 were announced), this was a really refreshing read. It’s easy to relate to his phased approach to finding real value in tooling and not just hype. There are solid insights and practical tips here. I’m increasingly convinced that the best way not to get overwhelmed is to set clear expectations for what you want to achieve with AI and tailor how you use it to work for you, rather than trying to chase every new headline. Very refreshing.
I think the sweet spot is ai-assisted chat with manual review: readily available, not as costly
agents jump ahead to the point of the user and project being out of control and more expensive
I think a lot of us still hesitate to make that jump; or at least I am not sure of a cost-effective agent approach (I guess I could manually review their output, but I could see it going off track quickly)
I guess I'd like to see more of an exact breakdown of what prompts and tools and AI are used to get ideas on if I'd use that for myself more
Suspect the sweet spot also depends on the objective. If it’s a personal tool where you are the primary user then vibe coding all the way. You can describe requirements precisely and if it breaks there are no angry customers.
Something with actual users needs a bit more care
I respect Hashimoto for his contributions in the field, but to be honest, I am fed up with posts talking about using AI in ways that are impossible for most people due to high costs. I want to see more posts on cost-effective techniques, rather than just another guy showing off how he turned a creative 'burning-time' hobby into a 'burning-money' one.
It's amusing how everyone seems to be going through the same journey.
I do run multiple models at once now. On different parts of the code base.
I focus solely on the less boring tasks for myself and outsource all of the slam dunk and then review. Often use another model to validate the previous models work while doing so myself.
I do git reset still quite often but I find more ways to not get to that point by knowing the tools better and better.
Autocompleting our brains! What a crazy time.
I don't understand how agents make you feel productive. Single/multiple agents reading specs, specs often produced with agents themselves and iterated over time with a human in the loop, a lot of reviewing of giant gibberish specs. Never had a clear spec in my life. Then all the dancing for this apparently new paradigm of not reviewing code but verifying behaviour, and so many other things. All of this to me is a total UNproductive mess. I have used Cursor autocomplete from day one to this day, I was super productive before LLMs, I'm more productive now, I'm capable, I have experience, the product is hard to maintain but customers are happy, management is happy. So I can't really relate anymore to many of the programmers out there, and that's sad; I can count on my hands the devs I can talk to that have hard skills and know-how to share instead of astroturfing about AI agents.
> Never had a clear spec in my life.
To me, part of our job has always been about translating garbage/missing specs into something actionable.
Working with agents doesn't change this, and that's why, until PM/business people are able to come up with actual specs, they'll still need their translators.
Furthermore, just because the global spec is garbage doesn't mean that you, as a dev, can't come up with clear specs to solve the technical issues related to the overall feature asked for by stakeholders.
One funny thing I see, though, in the AI presentations done for non-technical people, is the advice: "be as thorough as possible when describing what you expect the agent to solve!". And I'm like: "yeah, that's what devs have been asking for since forever...".
With "Never had a clear spec in my life" what I mean is also that I don't how something should come out till I'm actually doing it. Writing code for me lead to discovery, I don't know what to produce till I see it in the wrapping context, like what a function should accept, for example a ref or a copy. Only at that point I have the proper intuition to make a decision that has to be supported long term. I don't want cheap code now I want a solit feature working tomorrow and not touching it for a long a time hopefully
In my real life bubble, AI isn't a big deal either, at least for programmers. They tend to be very sceptical about it for many reasons, perceived productivity being only one of them. So, I guess it's much less of a thing than you would expect from media coverage and certain internet communities.
Are you hiring?
Open to applications I would say, but not completely remote. So unless you're a Python or C/C++ dev living in NRW, Germany...
> Never had a clear spec in my life.
Just because you haven't or you work in a particular way, doesn't mean everyone does things the same way.
Likewise, on your last point, just because someone is using AI in their work, doesn't mean they don't have hard skills and know-how. Author of this article Mitchell is a great example of that - someone who proved to be able to produce great software and, when talking about individuals who made a dent in the industry, definitely had/has an impactful career.
Never mentioned Mitchell, I'm speaking generally; 95% of the industry is not Mitchell.
Well, you are commenting on a post he wrote.
Well, this site exists for people to discuss with each other.
Very nice. As a consequence of this new way of working I'm using `git worktree` and diffview all the time.
For more on the "harness engineering", see what Armin Ronacher and Mario Zechner are doing with pi: https://lucumr.pocoo.org/2026/1/31/pi/ https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
> I really don't care one way or the other if AI is here to stay3, I'm a software craftsman that just wants to build stuff for the love of the game.
I suspect having three commas in one's bank account helps with being very relaxed about the outcome ;)
> At a bare minimum, the agent must have the ability to: read files, execute programs, and make HTTP requests.
That's one very short step removed from Simon Willison's lethal trifecta.
This is why I won't run Claude without additional sandboxing. I'm currently using (and quite pleased with) https://github.com/strongdm/leash
I will say one thing Claude does is it doesn't run a command until you approve it, and you can choose between a one-time approval and always allowing a command's pattern. I usually approve the simple commands like `zig build test`, since I'm not particularly worried about the test harness. I believe it also scopes file reading by default to the current directory.
A lot of people run Claude with --dangerously-skip-permissions
I'm definitely not running that on my machine.
The way this is generally implemented is that agents have the ability to request a tool use. Then you confirm "yes, you may run this grep".
Same, but I felt okay sticking my code base in a VM and then letting an agent run there. I’d say it worked well
This seems like a pretty reasonable approach that charts a course between skepticism and "it's a miracle".
I wonder how much all this costs on a monthly basis?
The comment by user senko [1] links to a post from this same author with an example for a specific coding session that costs $15.98 for 8 hours of work. The example in this post talks about leaving agents running overnight, in which case I'd guess "twice that amount" would be a reasonable approximation.
Or if we assume that the OP can only do 4 hours per sitting (mentioned in the other post) plus 8 hours of overnight agents, then it would come down to $15.98 * 1.5 * 20 = $479.40 a month (without weekends).
[1] https://news.ycombinator.com/item?id=46905872
>$15.98 * 1.5 * 20 = $479.40 a month
Are people seriously dropping hundreds of dollars a month on these products to get their work done?
I am not really happy with thinking about what this does to small companies, hobbyists, open source programmers and so on, if it becomes a necessity to be competitive.
Especially since so many of those models have just freely ingested a whole bunch of open source software to be able to do what they do.
If you make $10k/mo -- which is not that much! -- then $500 is 5% of revenue. All else held equal, if that helps you go 20% faster, it's an absolute no brainer.
The question is.. does it actually help you do that, or do you go 0% faster? Or 5% slower?
Inquiring minds want to know.
>if that helps you go 20% faster, it's an absolute no brainer.
Another thing--is your job paying you $500 more per month for going 20% faster?
>If you make 10k/mo -- which is not that much!,
This is the sort of statement that immediately tells me this forum is disconnected from the real world. ~80% of full time workers in the US make less than $10k a month before tax.
Source: https://dqydj.com/income-percentile-calculator/
$10k is closer to a yearly software developer salary in my country than a monthly one.
That being said at least the $20/mo Claude Code subscription is really worth it, and many companies are paying for the AI tools anyways.
And yet, the average salary of an IT worker in the US is somewhere between 104 and 110k. Since we're discussing coders here, and IT workers tend to be at the lower end of that, maybe there is some context you didn't consider?
And yet, the difference between average and median isn't understood.
>And yet, the average salary of an IT worker in the US is somewhere between 104 and 110k.
After tax that's like 8% of your take home pay. I don't know why it's unreasonable to scoff at having to pay that much to get the most out of these tools.
>maybe there is some context you didn't consider?
The context is that the average poster on HN has no idea how hard the real world is as they work really high paying jobs. To make a statement that "$10k a month is not a lot" makes you sound out of touch.
We're talking about people who work really high paying jobs deciding if a tool is worth their time.
Why would anyone discuss whether or not people who don't work those jobs should be using those tools, when that isn't part of their job?
As long as we're on the same page that what he's describing is itself a miracle.
It’s not. A miracle is “an event that is inexplicable by natural or scientific laws and accordingly gets attributed to some supernatural or preternatural cause”. Could we please stop trivialising and ignoring the meaning of words?
The word miracle itself is hyperbolic in nature... it's meant to enchant, not to be used literally or concretely.
No need to be pedantic here, there is a large cohort of the population that seemingly never thought a robot would be able to write usable code ("inexplicable by natural or scientific laws") and now here we are seeing that happen ("hey this must be preternatural! there is no other explanation")
Take your religion somewhere else please.
> This blog post was fully written by hand, in my own words.
This reminded me of back when wysiwyg web editors started becoming a thing, and coders started adding those "Created in notepad" stickers to their webpages, to point out they were 'real' web developers. Fun times.
It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever. I constantly have to tell it what I don't like or what can be improved or need to request clarifications or alternative solutions.
This is what's so annoying about it. It's like a child that makes the same errors again and again.
But couldn't it adjust itself with the goal of reducing the error bit by bit? Wouldn't this lead to the ultimate agent who can read your mind? That would be awesome.
> It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever.
Your improvement is someone else's code smell. There's no absolute right or wrong way to write code, and that's coming from someone who definitely thinks there's a right way. But it's my right way.
Anyway, I don't know why you'd expect it to write code the way you like after it's been trained on the whole of the Internet & the RLHF labelers' preferences and the reward model.
Putting some words in AGENTS.md hardly seems like the most annoying thing.
tip: Add a /fix command that tells it to fix $1 and then update AGENTS.md with the text that'd stop it from making that mistake in the future. Use your nearest LLM to tweak that prompt. It's a good timesaver.
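For Claude Code specifically, a custom slash command is just a Markdown file under `.claude/commands/` (e.g. `.claude/commands/fix.md`). A rough sketch of the prompt body is below; the exact placeholder syntax ($ARGUMENTS vs $1) depends on your agent and version, so treat this as an approximation rather than a verified recipe:

```
Fix the following issue: $ARGUMENTS

Once the fix is verified, append a one-line rule to AGENTS.md that would
have prevented this class of mistake from happening again.
```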
While this may be the end goal, I do think humanity needs to take the trip along with AI to this point.
A mind-reading ultimate agent sounds more like a deity, and there are more than enough fables warning one not to create gods, because things tend to go bad. Pumping out ASI too quickly will cause massive destabilization and horrific war. Not sure against whom, really. Could be us humans against the ASI, could be the rich humans with ASI against us. Any way you look at it, it would represent a massive change in the world order.
It is not a mind reader. I enjoy giving it feedback because it shows I am in charge of the engineering.
I also love using it for research for upcoming features. Research + pick a solution + implement. It happens so fast.
I've been building systems like what the OP is using since gpt3 came out.
This is the honeymoon phase. You're learning the ins and outs of the specific model you're using and becoming more productive. It's magical. Nothing can stop you. Then you might not be improving as fast as you did at the start, but things are getting better every day. Or maybe every week. But it's heaps better than doing it by hand because you have so much mental capacity left.
Then a new release comes up. An arbitrary fraction of your hard earned intuition is not only useless but actively harmful to getting good results with the new models. Worse you will never know which part it is without unlearning everything you learned and starting over again.
I've had to learn the quirks of three generations of frontier families now. It's not worth the hassle. I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months. Copy and paste is the universal interface and being able to do surgery on the chat history is still better than whatever tooling is out there.
Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
First off, appreciate you sharing your perspective. I just have a few questions.
> I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months.
Can you expand more on what you mean by that? I'm a bit of a noob on llm enabled dev work. Do you mean that you will kick off new sessions and provide a context that you manage yourself instead of relying on a longer running session to keep relevant information?
> Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage.
I appreciate your insight but I'm failing to understand how exactly knowing these tools increases performance of llms. Is it because you can more precisely direct them via prompts?
LLMs work on text and nothing else. There isn't any magic there. Just a limited context window on which the model will keep predicting the next token until it decides that it's predicted enough and stop.
All the tooling is there to manage that context for you. It works, to a degree, then stops working. Your intuition is there to decide when it stops working. This intuition gets outdated with each new release of the frontier model and changes in the tooling.
The stateless API with a human deciding what to feed it is much more efficient in both cost and time as long as you're only running a single agent. I've yet to see anyone use multiple agents to generate code successfully (but I have used agent swarms for unstructured knowledge retrieval).
The Unix tools are there for you to progra-manually search and edit the code base, and to copy/paste into the context that you will send. Outside of Emacs (and possibly vim), with the ability to have dozens of ephemeral buffers open to modify their output, I don't imagine they will be very useful.
Or to quote the SICP lectures: The magic is that there is no magic.
I can't speak for parent, but I use gptel, and it sounds like they do as well. It has a number of features, but primarily it just gives you a chat buffer you can freely edit at any time. That gives you 100% control over the context, you just quickly remove the parts of the conversation where the LLM went off the rails and keep it clean. You can replace or compress the context so far any way you like.
While I also use LLMs in other ways, this is my core workflow. I quickly get frustrated when I can't _quickly_ modify the context.
If you have some mastery over your editor, you can just run commands and post relevant output and make suggested changes to get an agent like experience, at a speed not too different from having the agent call tools. But you retain 100% control over the context, and use a tiny fraction of the tokens OpenCode and other agents systems would use.
It's not the only or best way to use LLMs, but I find it incredibly powerful, and it certainly has its place.
A very nice positive effect I noticed personally is that as opposed to using agents, I actually retain an understanding of the code automatically, I don't have to go in and review the work, I review and adjust on the fly.
One thing to keep in mind is that the core of an LLM is basically a (non-deterministic) stateless function that takes text as input, and gives text as output.
The chat and session interfaces obscure this, making it look more stateful than it is. But they mainly just send the whole chat so far back to the LLM to get the next response. That's why the context window grows as a chat/session continues. It's also why the answers tend to get worse with longer context windows – you're giving the LLM a lot more to sift through.
You can manage the context window manually instead. You'll potentially lose some efficiencies from prompt caching, but you can also keep your requests much smaller and more relevant, likely spending fewer tokens.
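A tiny sketch of what "managing the context yourself" means in practice: the conversation is just a list you own and can prune or edit before every call. The OpenAI Python SDK and model name here are illustrative stand-ins; the commenters above do the same thing from Emacs buffers, but the idea is API-agnostic.

```python
# The "chat" is nothing more than a list of messages that gets resent in
# full on every call, so you are free to edit or prune it at any time.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a terse coding assistant."}]

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative cheap model
        messages=history,      # the entire context, every single time
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Here is the failing function: ... Suggest a fix.")
ask("Now write a unit test for it.")
# Went off the rails on that last exchange? Surgically remove it and retry:
del history[-2:]
```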
> I've been building systems like what the OP is using since gpt3 came out.
OP is also a founder of Hashicorp, so.. lol.
> This is the honeymoon phase.
No offense but you come across as if you didn’t read the article.
You come across as if you didn't read my post.
I'll wait for OP to move their workflow to Claude 7.0 and see if they still feel as bullish on AI tools.
People who are learning a new AI tool for the first time don't realize that they are just learning the quirks of the tool and the underlying model, not skills that generalize. It's not until you've done it a few times that you realize you've wasted more than 80% of your time on a model that is completely useless and will be sunset in 6 months.
OT but, the style. The journey. What is it? What does this remind me of?
Flowers for Algernon.
Or at least the first half. I don't wanna see what it looks like when AI capabilities start going in reverse.
But I want to know.
LLMs are not for me. My position is that the advantage we humans have over the rest of the natural world, is our minds. Our ability to think, create and express ideas is what separates us from the rest of the animal kingdom. Once we give that over to "thinking" machines, we weaken ourselves, both individually and as a species.
That said, I've given it a go. I used zed, which I think is a pretty great tool. I bought a pro subscription and used the built in agent with Claude Sonnet 4.x and Opus. I'm a Rails developer in my day job, and, like MitchellH and many others, found out fairly quickly that tasks for the LLM need to be quite specific and discrete. The agent is great at renames and minor refactors, but my preferred use of the agent was to get it to write RSpec tests once I'd written something like a controller or service object.
And generally, the LLM agent does a pretty great job of this.
But here's the rub: I found that I was losing the ability to write rspec.
I went to do it manually and found myself trying to remember API calls and approaches required to write some specs. The feeling of skill leaving me was quite sobering and marked my abandonment of LLMs and Zed, and my return to neovim, agent-free.
The thing is, this is a common experience generally. If you don't use it, you lose it. It applies to all things: fitness, language (natural or otherwise), skills of all kinds. Why should it not apply to thinking itself.
Now you may write me and my experience off as that of a lesser mind, and that you won't have such a problem. You've been doing it so long that it's "hard-wired in" by now. Perhaps.
It's in our nature to take the path of least resistance, to seek ease and convenience at every turn. We've certainly given away our privacy and anonymity so that we can pay for things with our phones and send email for "free".
LLMs are the ultimate convenience. A peer or slave mind that we can use to do our thinking and our work for us. Some believe that the LLM represents a local maximum, that the approach can't get much better. I dunno, but as AI improves, we will hand over more and more thinking and work to it. To do otherwise would be to go against our very nature and every other choice we've made so far.
But it's not for me. I'm no MitchellH, and I'm probably better off performing the mundane activities of my work, as well as the creative ones, so as to preserve my hard-won knowledge and skills.
YMMV
I'll leave off with the quote that resonates the most with me as I contemplate AI:-
"I say your civilization, because as soon as we started thinking for you, it really became our civilization, which is, of course, what this is all about." -- Agent Smith "The Matrix"
I was using it the same way you just described, but for C# and Angular, and you're spot on. It feels amazing not having to memorize APIs and to just let the AI even push code coverage close to 100%. However, at some point I began noticing 2 things:
- When tests didn't work I had to check what was going on, and the LLMs do cheat a lot with Volkswagen tests, so that began to make me skeptical even of what is being written by the agents
- When things were broken, spaghetti and awful code tended to get written, in such an obnoxious way that it was beyond repair and made me wish I had done it from scratch.
Thankfully I only tried using agents for tests and not for the actual code, but it makes me wonder whether "vibe coding" really produces quality work.
I don't understand why you were letting your code get into such a state just because an agent wrote it? I won't approve such code from a human, and will ask them to change it with suggestions on how. I do the same for code written by claude.
And then I raise the PR and other humans review it, and they won't let me merge crap code.
Is it that a lot of you are working with much lighter weight processes and you're not as strict about what gets merged to main?
AI adoption is being heavily pushed at my work and personally I do use it, but only for the really "boilerplate-y" kinds of code I've already written hundreds of times before. I see it as a way to offload the more "typing-intensive" parts of coding (where the bottleneck is literally just my WPM on the keyboard) so I have more time to spend on the trickier "thinking-intensive" parts.
I'm kind of on the same journey, a bit less far along. One thing I have observed is that I am constantly running out of tokens in claude. I guess this is not an issue for a wealthy person like Mitchell but it does significantly hamper my ability to experiment.
I recently also reflected on the evolution of my use of ai in programming. Same evolution, other path. If anyone is interested: https://www.asfaload.com/blog/ai_use/
Just wanted to say that was a nice and very grounded write up; and as a result very informative. Thank you. More stuff like this is a breath of fresh air in a landscape that has veered into hyperbole territory both in the for and against ai sides
> Immediately cease trying to perform meaningful work via a chatbot.
That depends on your budget. To work within my pro plan's codex limits, I attach the codebase as a single file to various chat windows (GPT 5.2 Thinking - Heavy) and ask it to find bugs/plan a feature/etc. Then I copy the dense tasklist from chat to codex for implementation. This reduces the tokens that codex burns.
Also don't sleep on GPT 5.2 Pro. That model is a beast for planning.
not quite as technically rich as i came to expect from previous posts from op, but very insightful regardless.
not ashamed to say that i am between steps 2 and 3 in my personal workflow.
>Adopting a tool feels like work, and I do not want to put in the effort
all the different approaches floating online feel ephemeral to me. these, just like the different tools for the op, seem like a chore to adopt. i feel like the fomo mongering from the community does not help here, but in the end it is a matter of personal discovery to stick with what works for you.
What a lovely read. Thank you for sharing your experience.
The human-agent relationship described in the article made me wonder: are natural, or experienced, managers having more success with AI as subordinates than people without managerial skill? Are AI agents enormously different than arbitrary contractors half a world away where the only communication is daily text exchanges?
So does everyone just run with giving full permissions on Claude code these days? It seems like I’m constantly coming back to CC to validate that it’s not running some bash that’s going to nuke my system. I would love to be able to fully step away but it feels like I can’t.
I run my agents with full permissions in containers. Feels like a reasonable tradeoff. Bonus is I can set up each container with exactly the stack needed.
I sandbox everything inside https://github.com/strongdm/leash
That way the blast radius is vastly reduced.
Honest question, when was the last time you caught it trying to use a command that was going to "nuke your system"?
“Nuke” is maybe too strong of a word, but it has not been uncommon for me to see it trying to install specific versions of languages on my machine, or services I intentionally don’t have configured, or sometimes trying to force npm when I’m using bun, etc.
Maybe once a month
I'd be interested to know what agents you're using. You mentioned Claude and GPT in passing, but don't actually talk about which you're using or for which tasks.
Good article! I especially liked the approach to replicate manual commits with the agent. I did not do that when learning but I suspect I'd have been much better off if I had.
> Context switching is very expensive. In order to remain efficient, I found that it was my job as a human to be in control of when I interrupt the agent, not the other way around. Don't let the agent notify you.
This I have found to be important too.
This is yet one more indication to me that the winds have shifted with regards to the utility of the “agent” paradigm of coding with an LLM. With all the talk around Opus 4.5 I decided to finally make the jump there myself and haven’t yet been disappointed (though admittedly I’m starting it on some pretty straightforward stuff).
Thanks for sharing your experiences :)
You mentioned "harness engineering". How do you approach building "actual programmed tools" (like screenshot scripts) specifically for an LLM's consumption rather than a human's? Are there specific output formats or constraints you’ve found most effective?
For those of us working on large proprietary codebases, in fringe languages as well, what can we do? Upload all the source code to the cloud model? I am really wary of giving it a million lines of code it’s never seen.
I've found, mostly for context reasons, it's better to just have a grand overview of the systems and how they work together and feed that to the agent as context; it will use the additional files it touches to expand its understanding if you prompt well.
Does this essentially give the companies controlling these models access to our source code? That is, it goes into training future versions of the model?
AI is getting to the game-changing point. We need more hand-written reflections on how individuals are managing to get productivity gains for real (not a vibe coded app) software engineering.
Do you have any ideas on how to harness AI to only change specific parts of a system or workpiece? Like "I consider this part 80/100 done and only make 'meaningful' or 'new contributions' here" ...?
> having an agent running at all times
This gave me a physical flinch. Perhaps this is unfounded, but all this makes me think of is this becoming the norm, millions of people doing this, and us cooking our planet much faster than predicted.
Now that the Nasdaq is crashing, people switch from the stick to the carrot:
"Please let us sit down and have a reasonable conversation! I was a skeptic, too, but if all skeptics did what I did, they would come to Jesus as well! Oh, and pay the monthly Anthropic tithe!"
Refreshing to read a balanced opinion, from a person who has significant experience and grounding in the real world.
I know I'm in the minority here, but I've been finding AI to be increasingly useless.
I'd already abandoned it for generating code, for all the reasons everyone knows, that don't need to be rehashed.
I was still in the camp of "It's a better google" and can save me time with research.
The issue is, at this point in my career (30+ years) the questions I have are a bit more nuanced and complex. They aren't things like "how do I make a form with React".
I'm working on developing a very high performance peer server that will need to scale up to hundreds of thousands to a million concurrent web socket connections to work as a signaling server for WebRTC connection negotiation.
I wanted to start as simple as possible, so peerjs is attractive. I asked the AI if peerjs peer-server would work with NodeJS's cluster server. It enthusiastically told me it would work just fine and was, in fact, designed for that.
I took a look at the source code, and it looked to me like that was dead wrong. The AI kept arguing with me before finally admitting it was completely wrong. A total waste of time.
Same results asking it how to remove Sophos from a Mac.
Same with legal questions about HOA laws, it just totally hallucinates things that don't exist.
My wife and I used to use it to try to settle disagreements (i.e a better google) but amusingly we've both reached a place where we distrust anything it says so much, we're back to sending each other web articles :-)
I'm still pretty excited about the potential use of AI in elementary education, maybe through high school in some cases, but for my personal use, I've been reaching for it less and less.
I can relate as far as asking AI for advice on complex design tasks. The fundamental problem is that it is still basically a pattern matching technology that "speaks before thinking". For shallow problems this is fine, but where it fails is when a useful response would require it to have analyzed the consequences of what it is suggesting, although (not that it helps) many people might respond in the same way - with whatever "comes to mind".
I used to joke that programming is not a career - it's a disease - since practiced long enough it fundamentally changes the way you think and talk, always thinking multiple steps ahead and the implications of what you, or anyone else, is saying. Asking advice from another seasoned developer you'll get advice that has also been "pre-analyzed", but not from an LLM.
How could the author write all of that and not talk about actual time savings versus the prior method?
I mean, what is the point of change if not to improve? I don't mean "I felt I was more efficient." Feelings aren't measurements. Numbers!
> I'm not [yet?] running multiple agents, and currently don't really want to
This is the main reason to use AI agents, though: multitasking. If I'm working on some Terraform changes and I fire off an agent loop, I know it's going to take a while for it to produce something working. In the meantime I'm waiting for it to come back and pretend it's finished (really I'll have to fix it), so I start another agent on something else. I flip back and forth between the finished runs as they notify me. At the end of the day I have 5 things finished rather than two.
The "agent" doesn't have to be anything special either. Anything you can run in a VM or container (vscode w/copilot chat, any cli tool, etc) so you can enable YOLO mode.
How much electricity (and associated materials like water) must this use?
It makes me profoundly sad to think of the huge number of AI agents running endlessly to produce vibe-coded slop. The environmental impact must be massive.
If you'd like an estimate, I like this from Simon Willison: https://simonwillison.net/2025/Nov/29/
Keep in mind that these are estimates, but you could attempt to extrapolate from here. Programming prompts probably take more because I assume the average context is a good bit higher than the average ChatGPT question, plus additional agents.
All in, I'm not sure if the energy usage long term is going to be overblown by media or if it'll be accurate. I'm personally not sure yet.
These are all valid points and a hype-free pragmatic take; I've been wondering about the same things even though I'm still on the skeptics' side. I think there are other things that should be added, since Mitchell's reality won't apply to everyone:
- What about non opensource work that's not on Github?
- Costs! I would think "an agent always running" would add up quickly
- In open source work, how does it amplify others. Are you seeing AI Slop as PRs? Can you tell the difference?
If the author is here, please could you also confirm you’ve never been paid by any AI company, marketing representative, community programme, in any shape or form?
I don't think you appreciate how un-bribeable this particular author is, and I don't just mean in a moral sense.
He explicitly said "I don't work for, invest in, or advise any AI companies." in the article.
But yes, Hashimoto is a high-profile CEO/CTO who may well have an indirect, or near-future, interest in talking up AI. Articles on HN extolling the productivity gains of Claude do generally tend to be from older, managerial types (make of that what you will).
What made me feel old today: seeing a 36-year-old referred to as an older type
Bit strange that you are skeptical by default.
Isn't skeptical by default quite reasonable?
Probably exhausting to be that way. The author is well respected and well known and has a good track record. My immediate reaction wasn’t to question that he spoke in good faith.
I don’t know the author, and am suspicious of the amount of astroturfing that has gone on with AI. This article seems reasonable, so I looked for a disclaimer and found it oddly worded, hence the request for clarification.
I find it interesting that this thread is full of pragmatic posts that seem to honestly reflect the real limits of current Gen-Ai.
Versus other threads (here on HN, and especially on places like LinkedIn) where it's "I set up a pipeline and some agents and now I type two sentences and amazing technology comes out in 5 minutes that would have taken 3 devs 6 months to do".
I never see those type of posts. Maybe I'm immune and ignoring them.
There are so many stories about how people use agentic AI but they rarely post how much they spend. Before I can even consider it, I need to know how it will cost me per month. I'm currently using one pro subscription and it's already quite expensive for me. What are people doing, burning hundreds of dollars per month? Do they also evaluate how much value they get out of it?
Low hundreds ($190 for me) but yes.
I quickly run out of the JetBrains AI 35 monthly credits for $300/yr and spending an additional $5-10/day on top of that, mostly for Claude.
I just recently added in Codex, since it comes with my $20/mo subscription to GPT and that's lowering my Claude credit usage significantly... until I hit those limits at some point.
$20 × 12 + $300 + the $5-10/day top-ups... so about $1,500-$1,600/year.
It is 100% worth it for what I'm building right now, but my fear is that I'll take a break from coding and then I'm paying for something I'm not using with the subscriptions.
I'd prefer to move to a model where I'm paying for compute time as I use it, instead of worrying about tokens/credits.
Not using Hot Aisle for inference?
We're literally full. Just a few 1x GPUs available right now.
So far, I haven't been happy with any of the smaller coding models, they just don't compare to claude/codex.
> If an agent isn't running, I ask myself "is there something an agent could be doing for me right now?"
Solution-looking-for-a-problem mentality is a curse.
The Death of the "Stare": Why AI’s "Confident Stupidity" is a Threat to Human Genius
OPINION | THE REALITY CHECK
In the gleaming offices of Silicon Valley and the boardrooms of the Fortune 500, a new religion has taken hold. Its deity is the Large Language Model, and its disciples—the AI Evangelists—speak in a dialect of "disruption," "optimization," and "seamless integration." But outside the vacuum of the digital world, a dangerous friction is building between AI’s statistical hallucinations and the unyielding laws of physics.
The danger of Artificial Intelligence isn't that it will become our overlord; the danger is that it is fundamentally, confidently, and authoritatively stupid.
The Paradox of the Wind-Powered Car
The divide between AI hype and reality is best illustrated by a recent technical "solution" suggested by a popular AI model: an electric vehicle equipped with wind generators on the front to recharge the battery while driving. To the AI, this was a brilliant synergy. It even claimed the added weight and wind resistance amounted to "zero."
To any human who has ever held a wrench or understood the First Law of Thermodynamics, this is a joke—a perpetual motion fallacy that ignores the reality of drag and energy loss. But to the AI, it was just a series of words that sounded "correct" based on patterns. The machine doesn't know what wind is; it only knows how to predict the next syllable.
The Erosion of the "Human Spark"
The true threat lies in what we are sacrificing to adopt this "shortcut" culture. There is a specific human process—call it The Stare. It is that thirty-minute window where a person looks at a broken machine, a flawed blueprint, or a complex problem and simply observes.
In that half-hour, the human brain runs millions of mental simulations. It feels the tension of the metal, the heat of the circuit, and the logic of the physical universe. It is a "Black Box" of consciousness that develops solutions from absolutely nothing—no forums, no books, and no Google.
However, the new generation of AI-dependent thinkers views this "Stare" as an inefficiency. By outsourcing our thinking to models that cannot feel the consequences of being wrong, we are witnessing a form of evolutionary regression. We are trading hard-earned competence for a "Yes-Man" in a box.
The Gaslighting of the Realist
Perhaps most chilling is the social cost. Those who still rely on their intuition and physical experience are increasingly being marginalized. In a world where the screen is king, the person pointing out that "the Emperor has no clothes" is labeled as erratic, uneducated, or naive.
When a master craftsman or a practical thinker challenges an AI’s "hallucination," they aren't met with logic; they are met with a robotic refusal to acknowledge reality. The "AI Evangelists" have begun to walk, talk, and act like the models they worship—confidently wrong, devoid of nuance, and completely detached from the ground beneath their feet.
The High Cost of Being "Authoritatively Wrong"
We are building a world on a foundation of digital sand. If we continue to trust AI to design our structures and manage our logic, we will eventually hit a wall that no "prompt" can fix.
The human brain runs on 20 watts and can solve a problem by looking at it. The AI runs on megawatts and can’t understand why a wind-powered car won't run forever. If we lose the ability to tell the difference, we aren't just losing our jobs—we're losing our grip on reality itself.
> babysitting my kind of stupid and yet mysteriously productive robot friend
LOL, been there, done that. It is much less frustrating and demoralizing than babysitting your kind of stupid colleague though. (Thankfully, I don't have any of those anymore. But at previous big companies? Oh man, if only their commits were ONLY as bad as a bad AI commit.)
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
"Don't be snarky."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
https://news.ycombinator.com/newsguidelines.html
For the AI skeptics reading this, there is an overwhelming probability that Mitchell is a better developer than you. If he gets value out of these tools you should think about why you can't.
The AI skeptics instead stick to hard data, which so far shows a 19% reduction in productivity when using AI.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
> 1) We do NOT provide evidence that AI systems do not currently speed up many or most software developers. Clarification: We do not claim that our developers or repositories represent a majority or plurality of software development work.
> 2) We do NOT provide evidence that AI systems do not speed up individuals or groups in domains other than software development. Clarification: We only study software development.
> 3) We do NOT provide evidence that AI systems in the near future will not speed up developers in our exact setting. Clarification: Progress is difficult to predict, and there has been substantial AI progress over the past five years [3].
> 4) We do NOT provide evidence that there are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting. Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup.
Points 2 and 3 are irrelevant.
Point 1 is saying results may not generalise, which is not a counter claim. It’s just saying “we cannot speak for everyone”.
Point 4 is saying there may be other techniques that work better, which again is not a counter claim. It’s just saying "you may find better methods."
Those are standard scientific statements giving scope to the research. They are in no way contradicting their findings. To contradict their findings, you would need similarly rigorous work that actually covers those scenarios.
Not pushing an opinion here, but if we’re talking about research then we should be rigorous and rational and respond with counter-evidence. Anyone who has done serious research in software engineering knows the difficulties involved and that this study represents one set of data. But it is at least a rigorous set and not anecdata or marketing.
I for one would love a rigorous study that showed a reliable methodology for gaining generalised productivity gains with the same or better code quality.
There is no such hard data. It's just research done on 16 developers using Cursor and Sonnet 3.5.
Perhaps that's the reason. Maybe I'm just not a good enough developer. But that's still not actionable. It's not like I never considered being a better developer.
I'm not as good as Fabrice Bellard either but I don't let that bother me as I go about my day.
Don't get it. What's the relation between Mitchell being a "better" developer than most of us (and better is always relative, but that's another story) and getting value out of AI? That's like saying Bezos is a way better businessman than you, so you should really hear his tips about becoming a billionaire. It makes no sense, because what works for him probably doesn't work for you.
Tons of respect for Mitchell. I think you are doing him a disservice with these kinds of comments.
Maybe you disagree with it, but it seems like a pretty straightforward argument: A lot of us dismiss AI because "it can't be trusted to do as good a job as me". The OP is arguing that someone, who can do better than most of us, disagrees with this line of thinking. And if we have respect for his abilities, and recognize them as better than our own, we should perhaps re-assess our own rationale in dismissing the utility of AI assistance. If he can get value out of it, surely we can too if we don't argue ourselves out of giving it a fair shake. The flip side of that argument might be that you have to be a much better programmer than most of us are, to properly extract value out of the AI... maybe it's only useful in the hands of a real expert.
No, it doesn't work that way. I don't know if Mitchell is a better programmer than me, but let's say he is for the sake of argument. That doesn't make him a god to whom I must listen. He's just a guy, and he can be wrong about things. I'm glad he's apparently finding value here, but the cold hard reality is that I have tried the tools and they don't provide value to me. And between another practitioner's opinion and my own, I value my own more.
>A lot of us dismiss AI because "it can't be trusted to do as good a job as me"
Some of us enjoy learning how systems work, and derive satisfaction from the feeling of doing something hard, and feel that AI removes that satisfaction. If I wanted to have something else write the code, I would focus on becoming a product manager, or a technical lead. But as is, this is a craft, and I very much enjoy the autonomy that comes with being able to use this skill and grow it.
There is no dichotomy of craft and AI.
I consider myself a craftsman as well. AI gives me the ability to focus on the parts I both enjoy working on and that demand the most craftsmanship. A lot of what I use AI for and show in the blog isn’t coding at all, but a way to allow me to spend more time coding.
This reads like you maybe didn’t read the blog post, so I’ll mention that there are many examples there.
I enjoy Japanese joinery, but for some reason the housing market doesn't.
Nobody is trying to talk anyone out of their hobby or artisanal creativeness. A lot of people enjoy walking, even after the invention of the automobile. There's nothing wrong with that, there are even times when it's the much more efficient choice. But in the context of say transporting packages across the country... it's not really relevant how much you enjoy one or the other; only one of them can get the job done in a reasonable amount of time. And we can assume that's the context and spirit of the OP's argument.
>Nobody is trying to talk anyone out of their hobby or artisanal creativeness.
Well, yes, they are, some folks don't think "here's how I use AI" and "I'm a craftsman!" are consistent. Seems like maybe OP should consider whether "AI is a tool, why can't you use it right" isn't begging the question.
Is this going to be the new rhetorical trick, to say "oh hey surely we can all agree I have reasonable goals! And to the extent they're reasonable you are unreasonable for not adopting them"?
>But in the context of say transporting packages across the country... it's not really relevant how much you enjoy one or the other; only one of them can get the job done in a reasonable amount of time.
I think one of the more frustrating aspects of this whole debate is this idea that software development pre-AI was too "slow", despite the fact that no other kind of engineering has nearly the same turnaround time as software engineering does (nor do they have the same return on investment!).
I just end up rolling my eyes when people use this argument. To me it feels like favoring productivity over everything else.
The value Mitchell describes aligns well with the lack of value I'm getting. He feels that guiding an agent through a task is neither faster nor slower than doing it himself, and there are some tasks he doesn't even try to do with an agent because he knows it won't work, but it's easier to parallelize reviewing agentic work than it is to parallelize direct coding work. That's just not a usage pattern that's valuable to me personally - I rarely find myself in a situation where I have a large number of well-scoped programming tasks I need to complete, and it's a fun treat to do them myself when I do.
"Why can't you be more like your brother Mitchell?"
I mean, not to say he's not, but by what metric?
If by company success, then Zuckerberg and Musk are better than all of us.
If by millions made, as he likes to joke/brag about... Fabrice Bellard is an utter failure.
If by install base, the geniuses that made MS Teams are among the best.
None of this is to take away from the successes of the man, but this kind of statement is rather silly.
[flagged]
Ok, but please don't post unsubstantive comments to Hacker News.
>Underwhelming
Which is why I like this article. It's realistic in terms of describing the value proposition of LLM-based coding-assist tools (aka, AI agents).
The fact that it's underwhelming compared to the hype we see every day is a very, very good sign that it's practical.
Most AI adoption journeys are:
> a period of inefficiency
I think this is something people ignore, and is significant. The only way to get good at coding with LLMs is actually trying to do it. Even if it's inefficient or slower at first. It's just another skill to develop [0].
And it's not really about using all the plugins and features available. In fact, many plugins and features are counter-productive. Just learn how to prompt and steer the LLM better.
[0]: https://ricardoanderegg.com/posts/getting-better-coding-llms...
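To make "prompt and steer the LLM better" a bit more concrete, here is a minimal sketch of the difference between a vague request and a scoped one. This isn't from the article or any comment above; it assumes the OpenAI Python SDK, and the model name, file names, and constraints are made-up placeholders. Any agent or editor-integrated tool has its own interface, but the same principle applies.

    # Sketch only: contrasting a vague prompt with a scoped, steerable one.
    # Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
    # the model name, file names, and constraints below are hypothetical.
    from openai import OpenAI

    client = OpenAI()

    vague_prompt = "Fix the bug in my parser."  # shown only for contrast, never sent

    scoped_prompt = """\
    In tokenizer.py, parse_number() mishandles negative exponents ("1e-3" comes back as 1.0).
    Constraints:
    - Touch only parse_number(); do not reformat the rest of the file.
    - Preserve the existing ValueError behavior for malformed input.
    - Add one pytest case covering "1e-3" and "2E-10".
    Return a unified diff only.
    """

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": scoped_prompt}],
    )
    print(response.choices[0].message.content)

The scoped version gives you something to steer and something concrete to review against; the vague one tends to produce the kind of output people then blame on the tool.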