"In the METR study, developers predicted AI would make them 24% faster before starting. After finishing 19% slower, they still believed they'd been 20% faster."
I hadn't heard of this study before. Seems like it's been mentioned on HN before but not got much traction.
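To make the gap in those quoted figures concrete, here is a minimal sketch of the arithmetic. It assumes each percentage is a change in task completion time relative to a no-AI baseline, and the 100-minute baseline is purely hypothetical (it is not a figure from the study).

```python
def minutes_with_ai(baseline_minutes: float, pct_change_in_time: int) -> float:
    """Task time after a percentage change in completion time (negative = faster)."""
    return baseline_minutes * (100 + pct_change_in_time) / 100


# Hypothetical task that takes 100 minutes without AI (illustrative only).
BASELINE_MINUTES = 100

predicted = minutes_with_ai(BASELINE_MINUTES, -24)  # forecast before starting
actual = minutes_with_ai(BASELINE_MINUTES, +19)     # measured result
believed = minutes_with_ai(BASELINE_MINUTES, -20)   # self-report afterwards

# How far the self-report is from the measurement, in minutes.
perception_gap = actual - believed

print(f"predicted {predicted:.0f} min, actual {actual:.0f} min, "
      f"believed {believed:.0f} min, gap {perception_gap:.0f} min")
```

Under those assumptions the developers' own estimate is off by more than a third of the baseline task time, which is the surprising part of the quote.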
I see it brought up almost every week! It's a firm favorite of the "LLMs don't actually help write code" contingent, probably because there are very few other credible studies they can point to in support of their position.
Most people who cite it clearly didn't read as far as the table where METR themselves say:
> We do not provide evidence that:
> 1) AI systems do not currently speed up many or most software developers. Clarification: We do not claim that our developers or repositories represent a majority or plurality of software development work
> 2) AI systems do not speed up individuals or groups in domains other than software development. Clarification: We only study software development
> 3) AI systems in the near future will not speed up developers in our exact setting. Clarification: Progress is difficult to predict, and there has been substantial AI progress over the past five years [3]
> 4) There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting. Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup
Weird, you shouldn't really need to list the things your study doesn't prove! I guess they anticipated that the study might be misrepresented and wanted to get ahead of that.
Their study still shows something interesting, and quite surprising. But if you choose to extrapolate from this specific setting and say coding assistants don't work in general then that's not scientific and you need to be careful.
I think the study should probably decrease your prior that AI assistants actually speed up development, even if developers using AI tell you otherwise. The fact it feels faster when it is slower is super interesting.
The lesson I took from the study is that developers are terrible at estimating their own productivity based on a new tool.
Being armed with that knowledge is useful when thinking about my own productivity, as I know that there's a risk of me over-estimating the impact of this stuff.
But then I look at https://github.com/simonw which currently lists 530 commits over 46 repositories for the month of December, which is the month I started using Opus 4.5 in Claude Code. That looks pretty credible to me!
In the 80s, when the mouse was just becoming common, there was a study comparing programming using a mouse vs. just a keyboard. Programmers thought they were faster using a keyboard, but they were actually faster using a mouse.
That's certainly an impressive month! However, it's conceivable that you are an outlier (in the best possible way!)
I liked the way they did that study and I would be interested to see an updated version with new tools.
I'm not particularly sceptical myself and my guess is that using Opus 4.5 would probably have produced a different result to the one in the original study.
I'm definitely an outlier - I've been pushing the boundaries of these tools for three years now and this month I've been deliberately throwing some absurdly ambitious problems at Opus 4.5 (like this one: https://static.simonwillison.net/static/2025/claude-code-mic...) to see how far it can go.
Very interesting example. It's an insanely complex task even with a reference implementation in another language.
It's surprising that it manages the majority of the test cases but not all of them. That's not a very human-like result. I would expect humans to be bimodal with some people getting stuck earlier and the rest completing everything. Fractal intelligence strikes again I guess?
Do you think the way you specified the task at such a high level made it easier for Claude? I would have probably tried to be much more specific for example by translating on a file by file or function by function basis. But I've no idea if this is a good approach. I'm really tempted to try this now! Very inspiring.
In this case ripping off the MicroQuickJS test suite was the big unlock.
I have a WebAssembly runtime demo I need tonics where I used the WebAssembly specification itself, which it turns out has a comprehensive test suite built in as well.
The lesson I learned is that agentic coding uses intermittent reinforcement to mimic a slot machine.
It (along with the hundreds of billions in investments hinging on it), explains the legions of people online who passionately defend their "system". Every gambler has a "system" and they usually earnestly believe it is helping them.
Some people even write popular (and profitable!) blogs about playing slot machines where they share their tips and tricks.
Plenty of people have been (too) quick to dismiss that study as not generally applicable because it was about highly experienced OSS devs rather than your average corporation programmer drone.
The issue I have with the paper is that it seems (based on my skimming) that they did not pick developers who were already versed with AI tooling. So they're comparing (experienced dev working in the way they're comfortable) vs (experienced dev working with new tool for the first time and not having passed the productivity slump from onboarding).
Longitudinal studies are definitely needed, but of course at the time the research for this paper was done there weren't any programmers experienced with AI assist out there yet.
The thing I find interesting is that there are trillions of dollars in valuations hinging upon this question and yet the appetite to spend a little bit of money to repeat this study and then release the results publicly is apparently very low.
It reminds me of global warming, where on one side of the debate there were some scientists with very little money running experiments, and on the other side there were some ridiculously wealthy corporations publicly poking holes in those experiments while secretly knowing since the 1960s that they were valid.
That's interesting context for sure, but the fact these were experienced developers makes it all the more surprising that they didn't realise the LLM slowed them down.
Measuring programming productivity in general is notoriously difficult, subjectively measuring your own programming productivity is even worse. A magic LoC machine saying brrrrrt gives an overoptimistic sense of getting things done.
A lot of the time, AI allows you to exercise basic competence at tasks for which you'd otherwise be incompetent. I think this is why it feels so powerful. You can jump into more or less any task below a certain level of complexity. (eg: you're not going to write an operating system with an LLM but you can set up and configure Wordpress if you'd never done it before.)
I think for users this _feels_ incredibly powerful, however this also has its own pitfalls: Any topic which you're incompetent at is one which you're also unequipped to successfully review.
I think there are some other productivity pitfalls for LLMs:
- Employees use it to give their boss emails / summaries / etc in the language and style their boss wants. This makes their boss happy, but doesn't actually modify productivity whatsoever since the exercise was a waste of time in the first place.
- Employees send more emails, and summarize more emails. They look busier, but they're not actually writing the emails or really reading them. The email volume has increased, however the emails themselves were probably a waste of time in the first place.
- There is more work to review all around and much of it is of poor quality.
I think these issues play a smaller part than some of the general issues raised (eg: poor quality code / lack of code reviews / etc.) but are still worth noting.
It's like Excel: It's really powerful to enable someone who actually knows what needs done to build a little tool that does that thing. It often doesn't have to be professional-quality, let alone perfect. It just has to be better than doing the same thing manually. There are massive productivity gains to be had there... for people with that kind of problem.
This is completely orthogonal to productivity gains for full time professional developers.
If we take out most frontend work and the easy backend/Ops tasks where writing the code/config is 99% of the work, I think my overall productivity with the latest generation (basically Opus 4.5) improved by 15-20%. I am also _very_ sure that with the previous generation (Sonnet 4, Sonnet 4.5, Codex 5.1), my team's overall velocity decreased, even taking into account the frontend and the "easy" tasks. The number of production bugs we had to deal with this year is crazy. Too much code is generated, and the other senior on my team and I just can't carefully review everything; we have to trust sometimes (especially data structures).
The worst part is reading a PR and catching a reintroduced bug that was fixed a few commits ago. The first time, I almost lost my cool at work and said a negative thing to a coworker.
This would be my advice to juniors (and I mean basically: devs who don't yet understand the underlying business/architecture): use the AI to explain how stuff works and maybe to generate basic functions, but write the code logic/algorithms yourself until you are sure you understand what you're doing and why. Work and reflect on the data structures by yourself, even if they were generated by the AI, and ask for alternatives. Always ask for alternatives; it helps understanding.
You might not see huge productivity gains from AI, but you will improve first, and then productivity will improve very fast, from your brain first, then from AI.
> The worst part is reading a PR and catching a reintroduced bug that was fixed a few commits ago. The first time, I almost lost my cool at work and said a negative thing to a coworker.
Losing your cool is never a good idea, but this is absolutely a time when you should give negative feedback to that coworker.
Feedback is what reviews are for; in this case, this aspect of the feedback should neither be positive nor neutral.
Just to add to your advice to juniors working with AI:
* Force the AI to write tests for everything. Ensure those tests function. Writing boring unit tests used to be arduous. Now the machine can do it for you. There's no excuse for a code regression making its way into a PR because you actually ran the tests before you did the commit, right? Right? RIGHT?
* Force the AI to write documentation and properly comment code, then (this is the tricky part) you actually read what it said it was doing and ensure that this is what you wanted it to do before you commit.
Just doing these two things will vastly improve the quality and prevent most of the dumb regressions that are common with AI generated code. Even if you're too busy/lazy to read every line of code the AI outputs just ensuring that it passes the tests and that the comments/docs describe the behavior you asked for will get you 90% of the way there.
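As a rough sketch of the first point, this is what pinning a previously fixed bug with a regression test might look like. `parse_price` and the bug it once had are hypothetical, made up purely for illustration; the tests are written as plain functions so they can be collected by a runner such as pytest.

```python
def parse_price(text: str) -> float:
    """Hypothetical helper: parse a price string such as "$1,234.50"."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)


def test_thousands_separator_regression():
    # Pins a (hypothetical) bug that was fixed earlier: prices containing a
    # thousands separator used to fail. If generated code reintroduces the
    # bug, this test fails before the change reaches a PR.
    assert parse_price("$1,234.50") == 1234.5


def test_plain_number():
    # Baseline behaviour that must keep working after any rewrite.
    assert parse_price("7") == 7.0
```

The value of the first test is that it encodes a past failure, so a reintroduced bug shows up in the test run rather than in code review.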
Sometimes the AI is all too good at writing tests.
I agree with the idea, and I do it too, but you need to make sure the tests don't just validate the incorrect behavior, and that the code isn't updated to pass the tests in a way that actually "misses the point".
I've had this happen to me on one or two tests every time.
For some reason Gemini seems to be worse at it than Claude lately. Since mostly moving to 3 I've had it go back and change the tests rather than fixing the bug on what seems to be a regular basis. It's like it's gotten smart enough to "cheat" more. You really do still have to pay attention that the tests are valid.
Even more important, those tests need to be useful. Often unit tests are simply testing the code works as written which is generally doing more harm than good.
To give some further advice to juniors: if somebody is telling you writing unit tests is boring, they haven’t learned how to write good tests. There appears to be a large intersection between devs who think testing is a dull task and devs who see a self proclaimed speed up from AI. I don’t think this is a coincidence.
Writing useful tests is just as important as writing app code, and should be reviewed with equal scrutiny.
>> Kernighan's Law - Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
Now the question is: is AI providing solutions smarter than the developer using it might have produced?
And perhaps more importantly: how much time does it take for the AI to write the code and a human to debug it, even if both produce equally smart solutions?
AI almost always reduces the time from "I need to implement this feature" to "there is some code that implements this feature".
However in my experience, the issue with AI is the potential hidden cost down the road. We either have to:
1. Code review the AI generated code line by line to ensure it's exactly what you'd have produced yourself when it is generated or
2. Pay an unknown amount of tech debt down the road when it inevitably wasn't what you'd have done yourself and it isn't extensible, scalable, well written code.
Exactly. Optimizations in one area will simply move the bottleneck so in order to truly recognize gains you have to optimize the entire software pipeline.
Exactly right. It turns out that writing code is hardly ever the real bottleneck. People should spend some time learning the basics of queueing theory.
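A minimal sketch of the bottleneck argument, under the simplifying assumption that a delivery pipeline's sustained throughput is capped by its slowest stage. The stage names and weekly rates are hypothetical numbers chosen only to show the effect.

```python
def pipeline_throughput(stage_rates: dict) -> float:
    """Sustained throughput is capped by the slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates: dict) -> str:
    """Name of the stage that limits the whole pipeline."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical capacities, in changes per week.
before = {"coding": 10, "review": 4, "qa": 6}
# Same team after AI doubles only the coding stage.
after = {**before, "coding": 20}

# Work that piles up in front of review each week.
backlog_growth_before = before["coding"] - before["review"]
backlog_growth_after = after["coding"] - after["review"]

print(bottleneck(after), pipeline_throughput(before), pipeline_throughput(after))
print(backlog_growth_before, backlog_growth_after)
```

In this toy model, doubling coding speed leaves delivered throughput unchanged and only makes the review queue grow faster, which is the queueing-theory point being made.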
RE 2: It's not that far down the road either. Lazily reviewed or unreviewed LLM code rapidly turns your codebase into an absolute mess that LLMs can't maintain either. Very quickly you find yourself with lots of redundant code and duplicated logic, random unused code that's called by other unused code that gets called inside a branch that only tests will trigger, stuff like that. Eventually LLMs start fixing the code that isn't used and then confidently report that they solved the problem, filling up a context window with redundant nonsense every prompt, so they can't get anywhere. Yolo AI coding is like the payday loan of tech debt.
This can happen sooner than you think too. I asked for what I thought was a simple feature and the AI wrote and rewrote a number of times trying to get it right, and eventually (not making this up) it told me the file was corrupt and could I please restore it from backup. This happened within about 20-30 minutes of asking for the change.
Have you considered having AI code review the AI code before giving them off to a human? I've been experimenting with having claude work on some code and commit it, and then having codex review the changes in the most recent git commit, then eyeballing the recommendations and either having codex work the changes, or giving them back to claude. That has seemed to be quite effective so far.
If anyone ever wonders why they don't see productivity improvements, they really need to read The Mythical Man-Month.
A garage duo can out-compete a corporation because there is less overhead. But a garage duo can't possibly match the sheer amount of work a corporation puts out.
In my view, the reasons why LLMs may be less effective in a corporate environment are quite different from the human factors in The Mythical Man-Month.
I think that the reason LLMs don't work as well in a corporate environment with large codebases and complex business logic, but do work well in greenfield projects, is linked to the amount of context the agents can maintain.
Many types of corporate overhead can be reduced using an LLM. Especially following "well meant but inefficient" process around JIRA tickets, testing evidence, code review, documentation etc.
I've found that something very similar to those "inefficient" processes works incredibly well when applied to LLMs. All of those processes are designed to allow for seamless handoff to different people who may not be familiar with the project or code which is exactly what an LLM behaves like when you clear its context.
Methods for reducing overhead have been available throughout the history of our industry. Unfortunately, they almost always involve productive tools that would in some way reduce the head count required for large projects.
The way this works is that you eventually have to work with languages like Lisp, Perl, and Prolog, and then someone comes up with a theory that programming must be optimised mostly for beginners and that power tooling must be avoided. Now you are forced to use verbose languages, where writing, maintaining and troubleshooting take a lot of people.
The thing is this time around, we have a way to make code by asking an AI tool questions. So you get the same effect but now with languages like JS and Python.
If people need AI assistance to handle all their "boilerplate" all the time, the much larger problem is needing so much damn boilerplate written all the time.
The job of anyone developing an application framework, whether that's off the shelf or in-house, is to reduce the amount of boilerplate any individual developer needs to write to an absolute bare minimum. The ultimate win isn't to get "AI to write all your boilerplate." It's to not need to write boilerplate at all.
I think coding agents require fundamentally different development practices in order to produce efficiency improvements. And just like any new tool, they benefit from wisdom in how they are applied, which we are just starting to develop as an industry. I expect that over time we will grow to understand and also expand the circumstances in which they are a net benefit, while also appreciating where they are a hindrance, leading to an overall efficiency increase as we avoid the productivity hit resulting from their misapplication.
Sounds like an AI-slop-ish article. A whole section about "Why most enterprises don't" with many words but no actual data or analysis. Just assumptions based on an orthogonal report.
AI won't give you much productivity if the problem you're challenged with is the human problem. That could happen both to startups and enterprises.
This article simply reinforces existing (and outdated) biases.
Complex legacy refactoring + Systems with poor documentation or unusual patterns + Architectural decisions requiring deep context: These go hand in hand. LLMs are really good at pulling these older systems apart, documenting, then refactoring them, tests and all. Exacerbated by poor documentation of domain expectations. Get your experts in a room weekly and record their rambling ideas and history of the system. Synthesize with an LLM against existing codebase. You'll get to 80% system comprehension in a matter of months.
Novel problem-solving with high stakes: This is the true bottleneck, and where engineers can shine. Risk assessment and recombination of ideas, with rapid prototyping.
Coz I have always done coding this way with humans I started out using LLMs to do simple bits of refactoring where tests could be used to validate that the output still worked.
I did not get the impression from this that LLMs were great coders. They would frequently miss stuff, make mistakes and often just ignore the instructions I gave them.
Sometimes they would get it right, but not often enough. The agentic coding loop still slowed me down overall. Perhaps if I were more junior it would have been a net boost.
In my experience, it’s basically impossible to accurately measure productivity of knowledge work. Whenever I see a stat associated with productivity gain/loss I get skeptical.
If you go the pure subjective route, I’ve found that people conflate “speed” or “productivity” with “ease.”
A key point missing from a lot of the AI debate is how much work is useless. From as simple as a feature that’s never turned on to a more extreme version of a job that doesn’t need to exist.
We have a lot of useless work being done, and AI is absolutely going to be a 10x speed up for this kind of work.
Actually though. We had one device that was over 10 years older without any MDM etc. and it outperformed a new laptop building the same product because of the corporate anti virus crap.
In programming we've often embraced spending time to learn new tools. The AI tools are just another set of tools, and they're rapidly changing as well.
I've been experimenting seriously with the tools for ~3 years now, and I'm still learning a lot about their use. Just this past weekend I started using a whole new workflow, and it one-shotted building a PWA that implements a fully-featured calorie tracking app (with social features, pre-populating foods from online databases, weight tracking and graphing, avatars, it's on par with many I've used in the past that cost $30+/year).
Someone just starting out at chat.openai.com isn't going to get close to this. You absolutely have to spend time learning the tooling for it to be at all effective.
I've worked at a number of non-tech companies the past few years. They bought every SaaS product, Palantir, Databricks, multi-cloud, their dev teams adopted every pattern popularized by big tech and the results were always mixed. Any gains were wiped out by being buried under technical debt. They had all the data catalogs & 'ontologies' with none of the governance to go make it work. Turns out that benefiting from all this tech requires you to re-organize and change your culture. For a lot of companies, they're just not going to see big gains from AI or tech in general at this point.
The key issue is that the current version of AI has no concept of understanding anything. Without understanding, anything is possible, and bad outcomes are almost guaranteed outside of the trivial. Throw a non-trivial codebase at any AI tool and watch as it utterly destroys it, introduces lots of new bugs, adds massive amounts of bloat and, in general, makes it incomprehensible and impossible to support.
I ran a three month experiment with two of our projects, one Django and the other embedded C and ARM assembler. You start with "oh wow, that's cool!" and not too long after that you end up in hell. I used both ChatGPT and Cursor for this.
The only way to use LLMs effectively was to carefully select small chunks of code to work on, have it write the code and then manually integrate into the codebase after carefully checking it and ensuring it didn't want to destroy 10 other files. In other words, use a very tight leash.
I'm about to run a six month LLM experiment now. This time it will be Verilog FPGA code (starting with an existing project). We'll see how that goes.
My conclusion at this instant in time is that LLMs are useful if you are knowledgeable and capable in the domain they are being applied to. If you are not, shit show potential is high.
I think AI would have better general acceptance if we stopped mythologizing its utility. It's so wildly exaggerated that it can't ever live up to the hype. If AI can't adapt to a reality-based universe, the bubble is going to burst all the sooner.
The METR study cited here is very interesting.
"In the METR study, developers predicted AI would make them 24% faster before starting. After finishing 19% slower, they still believed they'd been 20% faster."
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
> Do you think the way you specified the task at such a high level made it easier for Claude?
Absolutely. The trick I've found works best for these longer tasks is to give it an existing test suite and a goal to get those tests to pass, see also: https://simonwillison.net/2025/Dec/15/porting-justhtml/
In this case ripping off the MicroQuickJS test suite was the big unlock.
I have a WebAssembly runtime demo I need tonics where I used the WebAssembly specification itself, which it turns out has a comprehensive test suite built in as well.
"There is more work to review all around and much of it is of poor quality."
This is the average software developer's experience of LLMs
It's like Excel: It's really powerful to enable someone who actually knows what needs done to build a little tool that does that thing. It often doesn't have to be professional-quality, let alone perfect. It just has to be better than doing the same thing manually. There are massive productivity gains to be had there... for people with that kind of problem.
This is completely orthogonal to productivity gains for full time professional developers.
If we take out most of frontend work, and the easy backend/Ops tasks where writing the code/config is 99% of the work, I think my overall productivity with the latest gen (basically Opus 4.5) improved by 15-20%. I am also _very_ sure that with the previous generation (Sonnet 4, Sonnet 4.5, Codex 5.1), my team's overall velocity decreased, even taking into account the frontend and the "easy" tasks. The number of production bugs we had to deal with this year is crazy. Too much code is generated, and the other senior on my team and I just can't carefully review everything, so sometimes we have to trust it (especially data structures).
The worst part is reading a PR and catching a reintroduced bug that was fixed a few commits ago. The first time, I almost lost my cool at work and said a negative thing to a coworker.
This would be my advice to juniors (and I mean basically: devs who don't yet understand the underlying business/architecture): use the AI to explain how stuff works, generate basic functions maybe, but write the code logic/algorithms yourself until you are sure you understand what you're doing and why. Work and reflect on the data structures by yourself, even if generated by the AI, and ask for alternatives. Always ask for alternatives, it helps understanding. You might not see huge productivity gains from AI, but you will improve first, and then productivity will improve very fast, from your brain first, then from AI.
> The worst part is reading a PR and catching a reintroduced bug that was fixed a few commits ago. The first time, I almost lost my cool at work and said a negative thing to a coworker.
Losing your cool is never a good idea, but this is absolutely a time when you should give negative feedback to that coworker.
Feedback is what reviews are for; in this case, this aspect of the feedback should neither be positive nor neutral.
Just to add to your advice to juniors working with AI:
* Force the AI to write tests for everything. Ensure those tests function. Writing boring unit tests used to be arduous. Now the machine can do it for you. There's no excuse for a code regression making its way into a PR because you actually ran the tests before you did the commit, right? Right? RIGHT?
* Force the AI to write documentation and properly comment code, then (this is the tricky part) you actually read what it said it was doing and ensure that this is what you wanted it to do before you commit.
Just doing these two things will vastly improve the quality and prevent most of the dumb regressions that are common with AI generated code. Even if you're too busy/lazy to read every line of code the AI outputs just ensuring that it passes the tests and that the comments/docs describe the behavior you asked for will get you 90% of the way there.
Sometimes the AI is all too good at writing tests.
I agree with the idea, I do it too, but you need to make sure the tests don't just validate the incorrect behavior, and that the code is not updated to pass the test in a way that actually "misses the point".
I've had this happen to me on one or two tests every time
I agree 100%.
For some reason Gemini seems to be worse at it than Claude lately. Since mostly moving to 3 I've had it go back and change the tests rather than fixing the bug on what seems to be a regular basis. It's like it's gotten smart enough to "cheat" more. You really do still have to pay attention that the tests are valid.
Even more important, those tests need to be useful. Often unit tests are simply testing the code works as written which is generally doing more harm than good.
To give some further advice to juniors: if somebody is telling you writing unit tests is boring, they haven’t learned how to write good tests. There appears to be a large intersection between devs who think testing is a dull task and devs who see a self proclaimed speed up from AI. I don’t think this is a coincidence.
Writing useful tests is just as important as writing app code, and should be reviewed with equal scrutiny.
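To make the "testing the code works as written" problem concrete, here is a minimal sketch. The `apply_discount` function and its values are made up for illustration, not taken from anyone's codebase. The first test restates the implementation's formula, so it passes even when the formula is wrong; the second pins down behavior with values worked out by hand.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount, in cents."""
    return price_cents * (100 - percent) // 100


def test_mirrors_implementation():
    # Restates the formula from the code. If the formula is wrong, this
    # expectation is wrong in exactly the same way, so the test cannot fail.
    assert apply_discount(1999, 15) == 1999 * (100 - 15) // 100


def test_states_expected_behavior():
    # Expected values worked out by hand from the requirement, not from the code.
    assert apply_discount(2000, 25) == 1500
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0


if __name__ == "__main__":
    test_mirrors_implementation()
    test_states_expected_behavior()
    print("all tests passed")
```

When reviewing AI-written tests, the first kind is the one to watch for: it adds coverage numbers without adding any check on correctness.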
And, you actually wrote the regression test when you fixed the bug, right? Right?
>> Kernighan's Law - Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
Now the question is: is AI providing solutions smarter than the developer using it might have produced?
And perhaps more importantly: how much time does it take for AI to write the code and a human to debug it, even if both are producing equally smart solutions?
AI almost always reduces the time from "I need to implement this feature" to "there is some code that implements this feature".
However in my experience, the issue with AI is the potential hidden cost down the road. We either have to:
1. Code review the AI generated code line by line to ensure it's exactly what you'd have produced yourself when it is generated or
2. Pay an unknown amount of tech debt down the road when it inevitably wasn't what you'd have done yourself and it isn't extensible, scalable, well written code.
Exactly. Optimizations in one area will simply move the bottleneck so in order to truly recognize gains you have to optimize the entire software pipeline.
Exactly right. It turns out that writing code is hardly ever the real bottleneck. People should spend some time learning the basics of queueing theory.
http://lpd2.com/
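A toy illustration of the bottleneck point, as a rough sketch: the stage names and weekly capacities below are invented numbers, and the model (serial stages, steady state, throughput capped by the slowest stage) is a deliberate simplification rather than a full queueing model.

```python
def pipeline_throughput(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Return (bottleneck stage, changes per week) for stages run in series.

    In steady state, a serial pipeline cannot ship faster than its slowest stage.
    """
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Made-up capacities, in changes per week.
before = {"write code": 10.0, "review": 6.0, "test and deploy": 8.0}
# Coding gets 3x faster; nothing else changes.
after = {**before, "write code": 30.0}

print(pipeline_throughput(before))  # ('review', 6.0)
print(pipeline_throughput(after))   # ('review', 6.0)
```

In this toy model, tripling coding speed ships nothing extra; if the authors actually produce at the new rate, the additional changes just accumulate in front of review.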
RE 2: It's not that far down the road either. Lazily reviewed or unreviewed LLM code rapidly turns your codebase into an absolute mess that LLMs can't maintain either. Very quickly you find yourself with lots of redundant code and duplicated logic, random unused code that's called by other unused code that gets called inside a branch that only tests will trigger, stuff like that. Eventually LLMs start fixing the code that isn't used and then confidently report that they solved the problem, filling up a context window with redundant nonsense every prompt, so they can't get anywhere. Yolo AI coding is like the payday loan of tech debt.
This can happen sooner than you think too. I asked for what I thought was a simple feature and the AI wrote and rewrote a number of times trying to get it right, and eventually (not making this up) it told me the file was corrupt and could I please restore it from backup. This happened within about 20-30 minutes of asking for the change.
This is why I say LLMs are for idiots
>Code review the AI generated code line by line
Have you considered having AI code review the AI code before giving them off to a human? I've been experimenting with having claude work on some code and commit it, and then having codex review the changes in the most recent git commit, then eyeballing the recommendations and either having codex work the changes, or giving them back to claude. That has seemed to be quite effective so far.
Maybe it's turtles all the way down?
If anyone ever wonders why they don't see productivity improvement, they really need to read The Mythical Man-Month.
A garage duo can out-compete corporate because there is less overhead. But a garage duo can't possibly match the sheer amount of work corporate puts out.
In my view, the reasons why LLMs may be less effective in a corporate environment are quite different from the human factors in The Mythical Man-Month.
I think that the reason LLMs don't work as well in a corporate environment with large codebases and complex business logic, but do work well in greenfield projects, is linked to the amount of context the agents can maintain.
Many types of corporate overhead can be reduced using an LLM. Especially following "well meant but inefficient" process around JIRA tickets, testing evidence, code review, documentation etc.
I've found that something very similar to those "inefficient" processes works incredibly well when applied to LLMs. All of those processes are designed to allow for seamless handoff to different people who may not be familiar with the project or code which is exactly what an LLM behaves like when you clear its context.
The limited LLM context windows could be an argument in favor of a microservices architecture with each service or library in its own repository.
>>there is less overhead.
Methods to reduce overhead have been available throughout the history of our industry. Unfortunately, almost every time, they involve productive tools that would in some way reduce the head count required to do large projects.
The way this works is you eventually have to work with languages like Lisp, Perl, Prolog, and then someone comes up with a theory that programming must be optimised mostly for beginners and power tooling must be avoided. Now you are forced to use verbose languages, and writing, maintaining and troubleshooting take a lot of people.
The thing is this time around, we have a way to make code by asking an AI tool questions. So you get the same effect but now with languages like JS and Python.
the productivity improvement is the Big Lie
What the AI speed increase on Greenfield projects with modern stacks does do is reduce the cost of replacement.
Expect to see more “replace rather than repair” projects springing up
If people need AI assistance to handle all their "boilerplate" all the time, the much larger problem is needing so much damn boilerplate written all the time.
The job of anyone developing an application framework, whether that's off the shelf or in-house, is to reduce the amount of boilerplate any individual developer needs to write to an absolute bare minimum. The ultimate win isn't to get "AI to write all your boilerplate." It's to not need to write boilerplate at all.
I think coding agents require fundamentally different development practices in order to produce efficiency improvements. And just like any new tool, they benefit from wisdom in how they are applied, which we are just starting to develop as an industry. I expect that over time we will grow to understand and also expand the circumstances in which they are a net benefit, while also appreciating where they are a hindrance, leading to an overall efficiency increase as we avoid the productivity hit resulting from their misapplication.
Sounds like an AI-slop-ish article. A whole section about "Why most enterprises don't" with many words but no actual data or analysis. Just assumptions based on an orthogonal report.
AI won't give you much productivity if the problem you're challenged with is the human problem. That could happen both to startups and enterprises.
This article simply reinforces existing (and outdated) biases.
Complex legacy refactoring + Systems with poor documentation or unusual patterns + Architectural decisions requiring deep context: These go hand in hand. LLMs are really good at pulling these older systems apart, documenting, then refactoring them, tests and all. Exacerbated by poor documentation of domain expectations. Get your experts in a room weekly and record their rambling ideas and history of the system. Synthesize with an LLM against existing codebase. You'll get to 80% system comprehension in a matter of months.
Novel problem-solving with high stakes: This is the true bottleneck, and where engineers can shine. Risk assessment and recombination of ideas, with rapid prototyping.
When producing code is cheap, you can spend more time on verification testing.
Force the LLM to follow a workflow, have it do TDD, use task lists, have it write implementation plans.
LLMs are great coders, but subpar developers, help them be a good developer and you will see massive returns.
Because I have always done coding this way with humans, I started out using LLMs to do simple bits of refactoring where tests could be used to validate that the output still worked.
I did not get the impression from this that LLMs were great coders. They would frequently miss stuff, make mistakes and often just ignore the instructions I gave them.
Sometimes they would get it right, but not often enough. The agentic coding loop still slowed me down overall. Perhaps if I were more junior it would have been a net boost.
In my experience, it’s basically impossible to accurately measure productivity of knowledge work. Whenever I see a stat associated to productivity gain/loss I get skeptical.
If you go the pure subjective route, I’ve found that people conflate “speed” or “productivity” with “ease.”
A key point missing from a lot of the AI debate is how much work is useless. From as simple as a feature that’s never turned on to a more extreme version of a job that doesn’t need to exist.
We have a lot of useless work being done, and AI is absolutely going to be a 10x speed up for this kind of work.
Let's give 99% of the company devices with 16 GB of RAM or less and force them to use 85% of it for security scans
- corporate
WHY CANT OUR DEVICES RUN TECHNOLOGIES ??????
- also corporate
Actually though. We had one device that was over 10 years older without any MDM etc. and it outperformed a new laptop building the same product because of the corporate anti virus crap.
If you don't exclude your build folders from the scan it will slow everything down tremendously.
>The AI fluency tax. This isn't free to learn.
In programming we've often embraced spending time to learn new tools. The AI tools are just another set of tools, and they're rapidly changing as well.
I've been experimenting seriously with the tools for ~3 years now, and I'm still learning a lot about their use. Just this past weekend I started using a whole new workflow, and it one-shotted building a PWA that implements a fully-featured calorie tracking app (with social features, pre-populating foods from online databases, weight tracking and graphing, avatars, it's on par with many I've used in the past that cost $30+/year).
Someone just starting out at chat.openai.com isn't going to get close to this. You absolutely have to spend time learning the tooling for it to be at all effective.
Another day, another evidently AI-written article about AI on the front page of HN...
Yup, closed as soon as I saw the classic "it's not x, it's y" pattern.
I've worked at a number of non-tech companies the past few years. They bought every SaaS product, Palantir, Databricks, multi-cloud, their dev teams adopted every pattern popularized by big tech and the results were always mixed. Any gains were wiped out by being buried under technical debt. They had all the data catalogs & 'ontologies' with none of the governance to go make it work. Turns out that benefiting from all this tech requires you to re-organize and change your culture. For a lot of companies, they're just not going to see big gains from AI or tech in general at this point.
The key issue is that the current version of AI has no concept of understanding anything. Without understanding, anything is possible and bad outcomes are almost guaranteed outside of the trivial. Throw a non-trivial codebase at any AI tool and watch as it utterly destroys it, introduces lots of new bugs, adds massive amounts of bloat and, in general, makes it incomprehensible and impossible to support.
I ran a three month experiment with two of our projects, one Django and the other embedded C and ARM assembler. You start with "oh wow, that's cool!" and not too long after that you end up in hell. I used both ChatGPT and Cursor for this.
The only way to use LLMs effectively was to carefully select small chunks of code to work on, have it write the code and then manually integrate into the codebase after carefully checking it and ensuring it didn't want to destroy 10 other files. In other words, use a very tight leash.
I'm about to run a six month LLM experiment now. This time it will be Verilog FPGA code (starting with an existing project). We'll see how that goes.
My conclusion at this instant in time is that LLMs are useful if you are knowledgeable and capable in the domain they are being applied to. If you are not, shit show potential is high.
I think AI would have better general acceptance if we stopped mythologizing its utility. It's so wildly exaggerated it can't ever live up to the hype. If AI can't adapt to a reality-based universe, the bubble is going to burst all the sooner.