The article asserts that the quality of human knowledge work was easier to judge based on proxy measures such as typos and errors, and that the lack of such "tells" in AI poses a problem.
I don't know if I agree with either assertion… I've seen plenty of human-generated knowledge work that was factually correct, well-formatted, and extremely low quality on a conceptual level.
And AI signatures are now easy for people to recognize. In fact, these turns of phrase aren't just recognizable—they're unmistakable. <-- See what I did there?
Having worked with corporate clients for 10 years, I don't view the pre-LLM era as a golden age of high-quality knowledge work. There was a lot of junk that I would also classify as a "working simulacrum of knowledge work."
For me the issue is the lack of human explanation for mistakes. With a person, low quality comes from a source. Sometimes the source is lack of knowledge, sometimes time pressure, sometimes selfish goals.
Most importantly, those sources of errors tend to be consistent. I can trust a certain intern to be careful but ignorant, or my senior colleague with a newborn daughter to be a well of knowledge who sometimes misses obvious things due to lack of sleep.
With AI it's anyone's guess. They implement a paper in code flawlessly and make freshman-level mistakes in the same run. So you have to engage in the non-intuitive task of reviewing while assuming total incompetence, for a machine that shows extreme competence. Sometimes.
It's not that the pre-LLM era was a "golden age of quality", far from it. It's that LLMs have removed yet another tell-tale of rushed bullshit jobs.
Have they though?
Absolutely. Our heuristics for judging human output are useless with LLMs. We can either trust it blindly, or tediously pick over every word (guess which one people do). I've watched this cause havoc over and over at my job (I work with many different teams, one at a time).
AI signatures don't mean low quality, they just mean AI. And humans do use them (I have always used the common AI signatures). And yes, humans produce good-looking garbage, but much more commonly they produce bad-looking garbage. This is all tangential to the point.
For example, science articles written in Word vs. Latex helped filter out total cranks.
It was and still is a negative filter, not a positive one. Meaning it is easy to reject work because there are typos and basic factual errors; their absence is not a good measure of quality. Typically such checks are the first pass, not the only criterion.
It is valuable to have this, because if the work passes the first check, it is easier to identify the actual problems. Same reason we have code quality and lint/style issues fixed before reasoning about the actual logic being written.
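To make the distinction concrete, here is a minimal sketch of a negative filter (the specific checks and word list are hypothetical, just to illustrate the idea): it can cheaply reject a report, but passing it says nothing about conceptual quality.

    import re

    def first_pass_reject(report: str) -> list[str]:
        """Negative filter: collect cheap reasons to reject a report.
        An empty result means 'passed the first pass', not 'good work'."""
        reasons = []
        known_typos = {"recieve", "seperate", "occured"}  # hypothetical list
        words = re.findall(r"[a-z']+", report.lower())
        if any(w in known_typos for w in words):
            reasons.append("contains basic spelling errors")
        if len(words) < 50:
            reasons.append("too short to be a serious report")
        return reasons

    problems = first_pass_reject("We recieve the Q3 numbers next week ...")
    print(problems or "passed first pass; conceptual review still needed")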
Ironic, you've got some typos but make a good point :)
> I don't know if I agree with either assertion…
Yes, I don't think this matters. Much of "knowledge work" was always a proxy for something else.
High quality in terms of typos and errors is mainly a signal of respect in a similar way to wearing ironed white shirts with neck-ties. "Walls of text" that no one is expected to read in depth. Basically a symbolic demonstration of sacrifice and subservience (or something). LLMs remove this mode of signalling.
If quality of content wasn't examined before, it was probably never particularly important.
> And AI signatures are now easy for people to recognize. In fact, these turns of phrase aren't just recognizable—they're unmistakable. <-- See what I did there?
You might spot these very obvious constructs and still miss 99% of AI-generated text because it has no tells. Yet you don't know that 99% was generated, and since you spot 100% of the pattern you outlined, you think no AI-generated text makes it past you.
I'm also not sure I agree with the assertion that LLMs will produce a high-quality (looking) report with correct time frames, lack of typos, and good-looking figures. I'm just as willing to disregard human or LLM reports with obvious tells. An LLM or a person can produce work that's shoddy or error-filled. It may be getting harder to differentiate between a good and a bad report, but that just shifts the burden more onto the evaluator.
This is especially true if we start to see more of a split in usage between LLMs based on cost. High quality frontier models might produce better work at a higher cost, but there is also economic cost pressure from the bottom. And just like with human consultants or employees, you’ll pay more for higher quality work.
I’m not quite sure what I’m trying to argue here. But the idea that an LLM won’t produce a low quality report just seemed silly to me.
You’ve missed the point of original article about the proxy for quality disappearing. LLMs are trained adversarially, if that’s a word. They are trained to not have any “tells”.
Working in a team isn't adversarial; if I'm reviewing my colleague's PR they are not trying to skirt around a feature or cheat on tests.
I can tell when a human PR needs more in-depth reviewing because small things may be out of place, a mutex that may not be needed, etc. I can ask them about it and their response will tell me whether they know what they are on about, or whether they need help in this area.
I've had LLM PRs be defended by their creator until proven to be a pile of bullshit; unfortunately only deep analysis gets you there.
Yes. I think the main warning here is that it is an added risk. A little glitch here and there until something breaks.
> I don't know if I agree with either assertion… I've seen plenty of human-generated knowledge work that was factually correct, well-formatted, and extremely low quality on a conceptual level.
Putting a high level of polish on bad ideas is basically the grifter playbook. Throughout the business world you will find workers and entire businesses who get their success by dressing up poor ideas and bad products with all of the polish and trimmings associated with high quality work.
The goal of automation is to automate consistently perfect competence, not human failures.
You wouldn't use a calculator that is as good as a human and makes mistakes as often.
This is an already apparent problem in academia, though not for the reasons the article suggests.
It is not so much that the "tells" of poor-quality work are vanishing, but that even careful scrutiny of work done with AI is going to become too costly to be done only by humans. One only has so much time to read while, say, in economics journals, the appendices extend to hundreds of pages.
Would love to hear if other fields' journals are experiencing similar pressure, not only at the extensive margin (number of new submissions) but also at the intensive margin (effort needed to check each work).
To be fair, a lot of academic fields are such that anything at a Master's level or above requires serious competence to judge and for anyone below there's no distinction between what's right and what looks right.
With AI, we're cargo-culting understanding. We're reproducing the surface of having understood something, but we're robbing ourselves of the time and effort to truly do it.
i've been telling my coworker this, whose only use case he can conjure up with AI is simply "im going to give claude snowflake cortex, our integration code, all our documentation, jira tickets and its gonna make everything so much better. we'll be able to ask him anything and get the answer", and he's just lost the plot because there wasn't much of a plot. Sci-fi's infused him with how great it would be to have something to answer any question he had. he's hung up on this possibility of having his own tony stark jarvis at his disposal; in his head this is going to be the thing that speeds him up.
i'd say it's been a huge distraction for him, and the obsession over using LLMs for Big Wikiz hasn't yielded anything near what he thought the tech was for. on a few occasions now he's learned the hard way how imperfect the technology is.
between that and everyone's grand visions for agentic workflows i've mostly just receded into being one of the few who is still regularly delivering stuff. i'm using AI to speed my delivery up quite a bit, i'm just not wasting my time taking it on some big grand adventure. the irony is that a lot of people pushed back on companies who wanted to implement chat bots, and now they spend most of their credits/tokens making their own chat bots by collecting six trillion .md files and adding skill files.
my real takeaway is this: i've come to reason that there is some sort of loss in actual real institutional knowledge when we attempt to take shortcuts to growing the breadth of our own knowledge. i don't mean "hey claude give me some examples of how companies typically design x to solve for y" or "golang is new to me, what are the benefits of a compiled language versus something that requires a runtime going".
no, i'm talking about these kinds of questions:
"/somePersonalBigWikiProjectInvokedBySkill.md claude review our current tooling and infrastructure, how can we 5x our deployment speed, then search the web for <some SaaS company> and put a proposal together to get it implemented at the organization and include a 5 year cost benefit analysis and ... "
i look around and it feels like everyone is nerfing themselves. that latter question? people are just sending claude proposals left and right. my eyes have completely glazed over. is it really that hard to do some digging yourself? we're already ceding the ability to just go grab an architect or senior engineer and ask him what he thinks about how <some SaaS company> will fit with the broader suite of technologies and visions on the horizon. we're just skipping the pieces where we do a little discovery together and work together on an outcome. we're walking away with surface level understanding of many things.
this clearly has visible impacts on how we engage with each other; there's something there that I'm noticing and don't have the words for. it's mostly that people are less able to explain what it is they're talking about when pressed for deeper details, but also everyone's behavior is now different because AI sort of... makes them feel like they have definitive answers/strategies and they're no longer willing to have their ideas challenged. they no longer see that as a learning experience, a chance to learn from someone who has wisdom, who is already a walking wikipedia on something. the perfect technology for people who hate when someone with way more experience than them says "maybe not a good idea and here's why"
i've met some interesting people who are just... walking encyclopedias on some or many domains. incredibly smart people who have so much knowledge and wisdom and so many years of experience not just with tech but with people and failures and successes. i don't doubt for a second that the human brain is capable of holding an unbelievable index of information in a natural way that marries well with decision making processes that come from experience. i'm not sure what gap people are trying to close building themselves some proverbial great library here, but i would encourage people to just sit back and trust that their brain is still one of the greatest technologies at their disposal.
> im going to give claude snowflake cortex, our integration code, all our documentation, jira tickets and its gonna make everything so much better. we'll be able to ask him anything and get the answer
This is actually a good idea because it's a very cheap way to build your own industrial-strength search engine. We've forgotten how cool search engines are because Google's is so shit now.
(Although you don't need Claude, you can self-host this with minimal effort now.)
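For what it's worth, a bare-bones version of "search over your own docs" really is only a few lines. Here is a minimal, hypothetical sketch using scikit-learn's TF-IDF (a real setup would add chunking, embeddings, and a proper index; the docs/ path is made up):

    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical corpus: every markdown file under ./docs
    paths = list(Path("docs").rglob("*.md"))
    texts = [p.read_text(encoding="utf-8", errors="ignore") for p in paths]

    # Build a TF-IDF index over the documents.
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(texts)

    def search(query: str, top_k: int = 5):
        """Return the top_k documents most similar to the query."""
        scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
        return [(paths[i], float(scores[i])) for i in scores.argsort()[::-1][:top_k]]

    for path, score in search("how do we deploy the integration service?"):
        print(f"{score:.3f}  {path}")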
I feel the exact same way, it helps speed up development a lot (and eliminates a lot of really annoying grunt work). But I see people I work with doing shit with it that doesn't make any sense, e.g. writing 50k lines of code for a "compiler" when it's really just an interpreter under the hood. Like they never take the time to understand the domain more deeply, they just use claude to sling some shit that barely works
> i'm not sure what gap people are trying to close building themselves some proverbial great library here, but i would encourage people to just sit back and trust that their brain is still one of the greatest technologies at their disposal.
Culturally I think this is going to fuck things up significantly. If I take the time to read all of the latest papers in the LLM space, I'm damn well not going to summarize it or document what I've learned for anyone. (Maybe this is why there are not many high quality books aggregating all of this information in all the latest papers, all of the advancements, etc. All the people doing this work would rather (smartly) milk the cash cow and maintain the information asymmetry.)
Or think about open source, this will kill it for people trying to make money off a product and keep it open source. Because someone could spin up a competitor overnight.
AI is going to make the information easier to acquire for cheap. But it's going to absolutely destroy the incentive structure and trust required to have an open exchange of information. It was already bad enough because the industry is not incentivized to produce quality literature for educational purposes like academia is. But after this, it'll be a complete shit show
AI can do things on its own, without you understanding them, yes.
But if you are trying to understand something well, there is no better tool for helping you than AI.
I think that AI can sometimes help a lot. But I think doing it correctly is a tightrope and one misstep can easily have terrible results.
The first issue is a result from reinforcement learning that tells you that you really want to be doing a large fraction of the work on-policy when possible.
It's true of RL agents, but I think it's actually just a universal learning result that applies to humans. Sure you could ask AI to solve a difficult math problem step by step, and what it can expose you to is tricks you had no idea about and the general method of solving such a problem.
But there is something about the work that you produced without external influence (the on-policy episode) that is sort of irreplaceably important.
The second is that there is something about the speed and conciseness of the information AI presents to you. It seems like a superpower, but there are two problems I have with it.
A) It's too fast. Unless you are artificially slowing yourself down by reading like one sentence per minute, there is something about how quickly everything you want gets presented to you that seems to have a strong in-one-ear-out-the-other sort of effect. You need to slow down. You need to appreciate the details.
B) It's also often too concise. There is something about doing research yourself that lets you stumble upon something new that you might not have thought was helpful. Lots of times I've found amazing nuggets on missteps and tangents.
There are more issues as well, but these are the major two I get concerned about. You need to be cognizant of the work not being done when you are using AI to do research. And imo it's deeply problematic for young students who have literally never done the hard work of trying to answer questions themselves, because they might not realize the problem.
> But if you are trying to understand something well, there is no better tool for helping you than AI
Could not disagree more.
The best way to understand something deeply is to practice it. AI is anti-practice. It's like trying to learn something by following a YouTube video step by step: it has an outcome and it feels productive, but it's not going to stick in your head at all. It's not practice.
I would say a better analogy is using Google… you can use it as a tool to seek information and deepen your understanding. But it requires your brain to be engaged and to be putting that stream of knowledge into practice.
you can use AI to get a faster explanation of what's happening in a big codebase; it makes the timelines on developing features much shorter, in my experience
am I losing out on something by not having to spend hours clicking through redundant parts of a large codebase to get a concrete answer on something? doesn't feel like it
I find AI code usually looks worse than it actually is. It's overly verbose, confusing, and littered with fallbacks, so if something goes wrong it falls through a million layers of try/catch and the stack trace ends up somewhere completely unrelated to where the error actually happened. But in terms of actual functionality it works much better than any similar-looking code written by a human would.
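A small, hypothetical Python example of the fallback layering being described: each layer swallows the real exception, so the eventual failure surfaces far from the malformed input that caused it.

    import json

    def parse_config(raw: str) -> dict:
        try:
            return json.loads(raw)
        except Exception:
            return {}  # fallback 1: the parse error is swallowed here

    def get_timeout(raw: str) -> float:
        try:
            return float(parse_config(raw)["timeout"])
        except Exception:
            return -1.0  # fallback 2: the KeyError is swallowed too

    def connect(raw: str) -> None:
        timeout = get_timeout(raw)
        if timeout <= 0:
            # The stack trace points here, not at the broken config string.
            raise RuntimeError("invalid timeout")

    connect("{not valid json")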
What if subatomic particles are actually whole universes, and their properties are a reflection of... what kind of peoples dominated, conquered their universe, and what kind of automation was left running after them themselves were gone. Some kinds of entropy harvesting automata that perpetually self build and become everything in their spacetime.
We're creating forces bigger than ourselves, and we may reach a point of no return.
I don't totally understand, but I like where you're going with this. I picture a cosmological history, the rise and fall of billions of subatomic universes and civilizations, many of them consumed by their own autonomous pseudo-intelligent technologies for better or worse, which on a macro scale are behaviors of particles. We're currently working on our particle, making collective decisions that will affect the super-universe we're a part of, in a tiny but significant way.
Ultimately to understand a thing is to do the thing. And to not understand (which is ok!) is to trust others to, proxy measures or not. Agreed that the future of work is in a precarious place: doing less and trusting more only works up to a point.
`simulacrum` is a great word, gotta add that to my vocabulary.
I think this is why middle managers seemed to be the first acolytes to the church of LLM supremacy.
It's a weird space in middle management where all of the incentives other than true competency in the role push you to abstract the knowledge work that you're managing, and that abstraction seems to be well describable in embedding space.
Everybody's output is someone else's input. When you generate quantity by using an LLM, the other person uses an LLM to parse it and generate their own output from their input. When the very last consumer of the product complains, no one can figure out which part went wrong.
Well the last consumer is holding it wrong of course. Why? The last consumer is present, and everyone else is behind 7 proxies.
I think this is pretty obvious for many of us in the industry. Unfortunately, there is so much money on the table that the big players will shove whatever they want down our throats
> The training doesn't evaluate "is the answer true" or "is the answer useful." It's either "is the answer likely to appear in the training corpus" or "is the RLHF judge happy with the answer." We are optimising LLMs to produce output which looks like high quality output.
It's not quite as dire as this. One of the main reasons why LLMs are getting better over time is that they are themselves used to bootstrap the next generation, by sifting through the training set to do 'various things' to it.
People often forget that the training corpus contains everything humanity ever produced, and anything new humanity produces will likely come from it as well. Torturing it with current-generation models is among the most productive things you can do to improve the next-generation systems.
It's a funny thing to write, like an article in an old newspaper that aged quickly. I suspect that this will be wildly out of date within 2-3 years.
I think it's already out of date with verifiable-reward-based RL, e.g. in the maths domain. When the "correctness" arguments fall, the argument will probably just shift to whether it's just "intelligent brute force".
The set of tasks for which "correctness" is formally verifiable (in a way that doesn't put Goodhart's law into hyperdrive) is vanishingly small.
"stochastic genius"
A corollary of this could be that people interested in Serious Work will never use LLMs. Could be the new "tell".
"They sound very confident," was a warning a gave a lot on a project a year ago, before I gave up trying to get developers to stop blindly trusting the output and submitting things that were just wrong. The documentation of that team went to absolute shit because the developers thought LLMs magically knew everything.
If you have a test that fails 50% times - is that test valuable or not? A 50% failure rate alone looks like a coin toss, but by itself that does not tell us whether the test is noise or whether it is separating bad states from good ones. For a test to be useful it needs to have positive Youden’s statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic): sensitivity + specificity - 1. A 50% failure rate alone does not let us calculate sensitivity and specificity.
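A minimal sketch of that calculation (the confusion-matrix counts are made up, just to show that the same 50% flag rate can be pure noise or highly informative):

    def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
        """Youden's J = sensitivity + specificity - 1."""
        sensitivity = tp / (tp + fn)  # true positive rate
        specificity = tn / (tn + fp)  # true negative rate
        return sensitivity + specificity - 1

    # Both tests flag 50 of 100 cases, i.e. a 50% rate.
    print(youden_j(tp=25, fn=25, tn=25, fp=25))  # 0.0 -> coin toss
    print(youden_j(tp=45, fn=5, tn=45, fp=5))    # 0.8 -> very informative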
I can see a similar problem with this article: the author notices that LLMs produce a lot of errors, then concludes that they are useless and produce only a simulacrum of work. The author has an interesting observation about how LLMs disrupt the way we judge knowledge work, but when he concludes that LLMs do only a simulacrum of work, that is where his argument fails.
Gee, a thing by a guy, with a name. What are you saying exactly? So the test in question is a test the LLM is asked to carry out, right? Then your point is that if it's a load of vacuous flannel 49% of the time, but meaningful 51% of the time, on average this is genuine work so we can't complain about the 49%?
Wait, you're probably talking about the test of discarding a report based on something superficial like spelling errors. Which fails with LLMs due to their basic conman personalities and smooth talking. And therefore ..?
> For a test to be useful it needs to have positive Youden’s statistic
This is not true as stated. I'd try to gloss over the absolutes relative to the context, but if I'm totally honest, I'm not sure I understand what idea you're trying to communicate.
"The simulacrum is never what hides the truth - it is truth that hides the fact that there is none. The simulacrum is true." - Jean Baudrillard
Aligned with the theory of Bullshit Jobs: LLMs expose the fact that the white-collar work most of us have been doing at this point was actually bullshit. When LLMs "fake" work, it actually hides the reality that there was no meaningful work here in the first place.
Layers of reading internal docs to synthesize new docs to turn into slides to aggregate into docs, where a different set of people only partially read or understand what they're seeing at any given mutation cycle...... it's all a farce of earnest but ultimately useless Productivity. The LLMs are just making it more obvious.
"How do you know the output is good without redoing the work yourself?"
Verifying the correctness of solutions is often much easier than finding correct solutions yourself. Examples: Sudoku and most practical problems in just about any field.
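Sudoku makes the asymmetry easy to see: checking a completed grid is a few loops, while producing the solution takes search. A minimal verifier, assuming grid is a 9x9 list of lists of ints:

    def is_valid_sudoku(grid: list[list[int]]) -> bool:
        """Verify a completed 9x9 Sudoku: every row, column and 3x3 box
        must contain the digits 1..9 exactly once."""
        digits = set(range(1, 10))
        rows = [set(row) for row in grid]
        cols = [set(col) for col in zip(*grid)]
        boxes = [
            {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
            for r in range(0, 9, 3)
            for c in range(0, 9, 3)
        ]
        return all(group == digits for group in rows + cols + boxes)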
-
"The training doesn't evaluate 'is the answer true' or "is the answer useful.'"
Lets pretend RLVF does not exist to give this argument a chance. Then, while the training loop does not validate accuracy directly I guess, the meta-training loop still does. When someone prompts a model, the resulting execution trace shows if the generated answer is correct or not, and this trace is kept for subsequent training runs. The way coding agents are used productively is not: a) generate code with AI and b) run it yourself; its a) ask the AI to do something, including generating the code and running it too, no step b. This naturally creates large training sets of correct and incorrect solutions.
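Whether providers actually train on such traces is the claim being made here, but the mechanics of collecting them are simple. A hypothetical sketch (file names and schema invented for illustration):

    import json
    import subprocess
    import tempfile

    def record_trace(prompt: str, generated_code: str, log_path: str = "traces.jsonl") -> bool:
        """Run model-generated code and log (prompt, code, outcome) so a later
        training run could, in principle, learn from correct and incorrect attempts."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            script = f.name
        result = subprocess.run(["python", script], capture_output=True, text=True)
        passed = result.returncode == 0
        with open(log_path, "a") as log:
            log.write(json.dumps({"prompt": prompt, "code": generated_code,
                                  "passed": passed, "stderr": result.stderr[-1000:]}) + "\n")
        return passed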
-
"We spent billions to create systems used to perform a simulacrum of work."
Have you even tried using these systems to produce valuable work? How could this possibly be your conclusion after having tried them?
>"We spent billions to create systems used to perform a simulacrum of work."
>Have you even tried using these systems to produce valuable work? How could this possibly be your conclusion after having tried them?
The operative words there are "used to", as opposed to "only able to". The conclusion isn't derived from using the tools; it's from observing how other people tend to use them.
> Verifying the correctness of solutions is often much easier than finding correct solutions yourself
In order to verify correctness you need to understand what correctness is in context, which is actually pretty hard to do if you can't actually find correct solutions yourself, or even if you can but haven't bothered to do so.
Why is it not more of a scandal that all these anti-AI articles are written using large language models?
Why is that not an embarrassment for everyone who moans and carps and complains about the craft?
“/reliable-resources-skill Claude, using the list of approved resources, evaluate the report I’m attaching”
I don't really agree with the premise of the article. Sure proxy measures are everywhere. But for knowledge work specifically you can usually check real quality. Of course it's not as extremely easy as "oh this report contains a few spelling errors", but it is doable. If you accepted work purely based on superficial proxy measures you were not fairly evaluating work at all.
I think there's a weaker claim that holds true: we were able to ignore lots of content based on the superficial (and pay proper attention to work that passed this test), and now we are overwhelmed because everything meets the superficial criteria and we can't pay proper attention to all of it.
That's what I had in mind! The whole post is a claim that evaluating knowledge work got more expensive because cheaper measures stopped correlating well with quality.
If someone was already evaluating the work output using a metric closer to the underlying quality then it might not have been a big shift for them (other than having much more work to evaluate).
Yes, I agree that this is true!
You could however only do that if you were fine with unfairly judging the quality of work, as you now readily discarded quality work based on superficial proxies. Which admittedly is done in a lot of cases.
>"is the RLHF judge happy with the answer."
Reinforcement Learning with Verifiable Rewards (RLVR) to improve math and coding success rates seems like an exception.
>We've automated ourselves into Goodhart's law.
Yes.
This does not however mean that progress is not being made.
It just means the progress is happening along such dimensions that are completely illegible in terms of the culture of the early XXI century Internet, which is to say in terms of the values of the society which produced it.
Feels like a parallel with https://en.wikipedia.org/wiki/Constructivism_%28philosophy_o... where "it's not valid until you checked"
I didn't see the connection initially.
The FUD about LLMs will never get old. The way I know and trust LLMs is the same way a manager would trust their reportees to do good work.
For most tasks, the complexity/time required to verify a task is << the time required to do the task itself. Sure, there can be hallucinations in the graph that the LLM made. But LLMs are hallucinating much less than before. And the time to verify is much lower than the time required for a human to do the task.
I wrote a post detailing this argument https://simianwords.bearblog.dev/the-generation-vs-verificat...
FUD? You are missing the point entirely, and so does your blog post.
Are LLMs a good dictionary of synonyms? Perhaps, but is it relevant? Not at all.
Are you biased when a solution is presented to you? Yes, like all humans.
Is it damaging when said solution is brain-dead? Obviously.
Are you failing to understand that most (if not all) of a manager's work is human-centric and, as such, cannot be applied to a non-human? Obviously...
You trust a machine's intent. Joke's on you: it has no intent at all, and it will break that "trust" you pour into it without even realizing it.
You say that the LLM does a better job than you. Perhaps that says it all?
Are you asking yourself questions and answering them without seeing my point? Yes