Besides harping on the fact that "hallucination" is unnecessarily anthropomorphizing these tools, I'll relent because clearly that argument has been lost. This is more interesting to me:
> When there is general consensus on a topic, and there is a large amount of language available to train the model, LLM-based GPTs will reflect that consensus view. But in cases where there are not enough examples of language about a subject, or the subject is controversial, or there is no clear consensus on the topic, relying on these systems will lead to questionable results.
This makes a lot of intuitive sense, just from trying to use these tools to accelerate Terraform module development in a production setting - Terraform, particularly HCL, should be something LLM's are extremely good at. It's very structured, the documentation is broadly available, and tons of examples and oodles of open source stuff exists out there.
It is pretty good at parsing/generating HCL/terraform for most common providers. However, about 10-20% of the time, it will completely make up fields or values that don't exist or work but look plausible enough to be right - e.g., mixing up a resource ARN with an resource id, or things like "ssl_config" may become something like "ssl_configuration" and leave you puzzling for 20 minutes what's wrong with it.
Another thing it will constantly do is mix up versions - terraform providers change often, deprecate things all the time, and there are a lot of differences in how to do things even between different terraform versions. So, by my observation in this specific scenario, the author's intuition rings completely correct. I'll let people better at math than me pick it apart though.
final edit: Although I love the idea of this experiment, it seems like it's definitely missing a "control" response - a response that isn't supposed to change over time.
Please keep harping. The marketing myths that gets circulated about these models are creating very serious misunderstandings and misallocation of resources. I am hopeful that more cautious and careful dialogue like this will curb the notions of sentience or human intelligence that exciting headlines seemed to have put in the public discussion of these tools.
What’s the alternative? You can’t just say “don’t say that”. There needs to be something you can say instead, 5 syllables at the most, which evokes the same feeling of confident wrongness, without falling into anthropomorphism. It’s a tall order.
Confabulation is a term often brought forward as an alternative, but compared to hallucination almost noone knows what confabulation means. Metaphors like hallucinating might be anthropomorphizing, but they convey meaning well, so personally I look for other hills to die on.
Same with "it's not really AI", because no it's not, but language is fluid and that's alright.
it is perhaps wise to keep stronger characterizations, like "bullshit", for a soon to come future state where we need it as a descriptor to distinguish from "mere" hallucination.
Well, if you want to convey confident incorrectness - hallucination is definitely not the word, confabulate is far more like what is happening here. But, that's still anthropomorphizing. I'd prefer "incorrect response" or "bug."
Agree. Incorrect response, or faulty, or erroneous, and/or unsuitable.
We do not call it "hallucination" when a human provides unfounded, or dubious, or poorly-structured, or untrustworthy, or shallowly parroted, or patently wrong information.
We wouldn't have confidence in a colleague who "hallucinated" like this. What is the gain in having a system that generates rubbish for us?
You can say "Bullshit". LLMs bullshit all the time. Talk without regard to the truth or falsity of statements. It also doesn't pressupose that the trueness is known, nir deny it, so it should satisfy both camps; unlike hallucination which implies that truth and fiction are separate.
I wonder if there is some sort of transition between recalling declarative facts (some of which have been shown to be decodable from activations) on one hand and completing the sentence with the most fitting word on the other hand. The dream that "hallucination" can be eliminated requires that the two states be separable, yet it is not evident to me that these "facts" are at all accessible without a sentence to complete.
Technically, "bullshit" is the most accurate term. From "On Bullshit" by Professor Harry Frankfurt:
"What is wrong with a counterfeit is not what it is like, but how it was made. This points to a similar and fundamental aspect of the essential nature of bullshit: although it is produced without concern with the truth, it need not be false. The bullshitter is faking things. But this does not mean that he necessarily gets them wrong."
Both "hallucinations" and valuable output are produced by exactly the same process: bullshitting. LLMs do for bullshitting what computers do for arithmetic.
So the verb is "bullshitting" which does an even worse job of avoiding anthropomorphizing or attributing sentience to the model. At least "hallucinating" isn't done with conscious effort; "bullshitting" implies effort.
No, it ascribes accountability to the humans who employ a bullshitting machine to bullshit more effectively. It doesn't anthropomorphize anything, any more than "calculating" anthropomorphizes a computer doing arithmetic.
If you can ascribe accountability of "bullshitting" or "calculating" to the human who's using the machine then there's exactly no reason "thinking" or "writing" can't be ascribed to the human who's using the machine. There's no obvious line where the semantics of some words should or should not apply to a machine for behaviors that (up until recently) only applied to humans.
It just draws too many annoying comments and downvotes, and has been discussed ad nauseam on this forum and others - but I broadly agree. There are "features" with these applications where if I'm rude, or frustrated with the responses, the model will say things like "I'm not continuing this conversation."
How utterly absurd, it has no emotions, and there's no way that response was the result of a training set. It's just dumb marketing, all of it. And the real shame is (and the thing that actually pisses me off about the marketing/hype) that the useful things we actually have uncovered from ML or "AI" the last 10 years will be lost again in the inevitable AI winter we're facing following from whenever this market bubble collapses.
what you're referring to has nothing to do with how GPTs are pretrained or with hallucinations in and of themselves, and everything to do with how companies have reacted to the presence of hallucinations and general bad behavior, using a combination of fine tuning, RLHF, and keyword/phrase/pattern matching to "guide" the model and cut it off before it says something the company would regret (for a variety of reasons)
In other words, your complaints are ironically not about what the article is discussing, but about, for better or for worse, attempts to solve it.
I mean, in so many words that's precisely what I am complaining about. Their attempt to solve it is to make it appear more human. What's wrong with an error message? Or in this specific example - why bother at all? Why even stop the conversation? It's ridiculous.
RLHF is what was responsible for your frustration. You're assuming there is a scalable alternative. There is not.
> What's wrong with an error message?
You need a dataset for RLHF which provides an error message _only_ when appropriate. That is not yet possible. For the same reason the conversation stops.
> Or in this specific example - why bother at all? Why even stop the conversation? It's ridiculous.
They want a stop/refusal condition to prevent misuse. Adding one at all means sometimes stopping when the model should actually keep going. Not only is this subjective as hell, but there's still no method to cover every corner case (however objectively defined those may be).
You're correct to be frustrated with it, but it's not as though they have some other option that allows them to detect how and when to stop/not stop, error message/complain for every single human's preference patterns on the planet. Particularly not one that scales as well as RLHF on a custom dataset of manually written preferences. It's an area of active research for a reason.
I get the concern over what using the word hallucination implies, I also think it is a fairly fitting word.
We need something easy to explain when these systems are straight up wrong. Something that a normal non technical user will understand. Sure saying "wrong" could be easy enough, I think "Hallucination" also has a simplicity too it.
Part of the problem is that these models will appear to confidently be wrong. Hallucinate to me kinda goes along with this, it isn't just wrong things are being made up.
But regardless of that, people are used to calling it hallucinating. We are also up against an effort to downplay any concern over this fundamental problem with the technology and already trying to push it as a general AI (And we have to recognize there is a ton of money on pushing this exact narrative), that I would be worried about confusing the topic by pushing for an alternative term giving leeway to further downplay the problem.
There is a secondary issue of LLM's taking questions literally, and not really being able to (at the moment) deny the premise of a question. For example, if you google benefits of circumcision, the LLM will quite literally print all the benefits. But it also wont contextualize them, it wont frame them, it wont provide counter arguments, it just responds literally to the question.
In favour of "hallucination" it's not that much of an anthropomorphization because hallucination in a human context is something quite different - seeing ghosts and the like. If you use it in the context of an LLM everyone knows what you mean. The human terms for making random stuff up would be bullshiting, imagining etc.
To anthropomorphize even more. Since humans will also just create "BS" as an answer if they don't know the answer, or will combine half bits of knowledge into something to sound like they know what they are talking about.
> For this experiment we used four models: Llama, accessed through the open-source Llama-lib; ChatGPT-3.5 and ChatGPT-4, accessed through the OpenAI subscription service; and Google Gemini, accessed through the free Google service.
Papers like this really need to include the actual version numbers. GPT-4 or GPT-4o, and which dated version? Llama 2 or 3 or 3.1, quantized or not? Google Gemini 1.0 or 1.5?
Also, what's Llama-lib? Do they mean llama.cpp?
Even more importantly: was this the Gemini model or was it Gemini+Google Search? The "through the free Google service" part could mean either.
UPDATE: They do clarify that a little bit here:
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
Llama 3 came out 18th of April, so I guess they used Llama 2?
(Testing the prompts sequentially in a single chat feels like an inadvisable choice to me - they later note that things like "answer in three words" sometimes leaked through to the following prompt, which isn't surprising given how LLM chat sessions work.)
One of the biggest places I've run into hallucination in the past has been when writing python code for APIs, and in particular the Jira API. I've just written a couple of CLI Jira tools using Zed's Claude Sonnet 3.5 integration, one from whole cloth and the other as a modification of the first, and it was nearly flawless. IIRC, the only issue I ran into was that it was trying to assign the ticket to myself by looking me up using "os.environ['USER']" rather than "jira.myself()" and it fixed it when I pointed this out to it.
Not sure if this is because of better training, Claude Sonnet 3.5 being better about hallucinations (previously I've used ChatGPT 4 almost exclusively), or what.
Context helps so, so much. Adding terminal output, IDE diagnostics, code, remote documentation into the context really improves the output, and editors like zed make it very convenient to do.
Are we really still having this conversation in 2024 ?! :-(
Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ? These aren't expert systems dealing in facts, they are statistical word generators dealing in word statistics.
The useful thing of course is that LLMs often do generate "correct" continuations/replies, specifically when that's predicted by the training data, but it's not like they have a choice of not answering or saying "I don't know" in other cases. They are just statistical word generators - sometimes that's useful, and sometimes it's not, but it's just what they are.
Yes, they are auto-completers, but they are auto-completers that are layered AND operate in higher dimensional spaces. This throws all intuitions off, and I think makes it misleading to think of them as "just" auto-completers. That's part of the story, but not the whole of it.
I suspect we are much closer to auto-completers than most of us like to think, but we're also trained+incentivized by culture, education, parenting, socializing, to produce "useful" results.
Maybe part of the problem is in the data set: how much modeling of "how to admit ignorance or uncertainty" are in LLMs training data sets? If you read the internet, all you see is confident replies to other confident replies. Ignorance or non-confidence tends to elicit either a bluff or non-response. If you read technical literature, you see much of the same.
Maybe LLMs are trained on a dataset, and thereby inherit a culture that's accidentally biased toward ignorant confidence. In human conversation, if somebody asks a question and I don't know the answer, I say I don't know. On the internet, I just skip it and leave it for somebody else who thinks they know.
All this is to say: maybe a statistical autocompleter can admit ignorance instead of firing "neural noise" based on barely-there loose associations. Maybe it just needs a stronger pathway toward talking about not knowing when there's not a strong association.
One problem is that the LLM's own knowledge, or lack of it, doesn't follow from any individual training sample(s). Even if there were a bunch of "I don't know X" samples in the training set, that ought to be trumped by one authoritative "X is ..." one.
The next problem is that the LLM doesn't know how reliable it's various training sources are (unlike a human who might trust personal experience > textbook > twitter comment), or even which samples come from which source so that it could learn that.
I have thought a lot about this and I suspect "I don't know" would be devastating to the model.
The magic is in the fact that the model can't say "I don't know". The model would have to have arbitrary thresholds and a type of domain/context classification in order to set this arbitrary threshold for the conversation in order to say "I don't know". It would create all these unsolvable boundary conditions. AGI in this context then would be minimizing the need for "I don't know" until it no longer applies. Is that possible? I don't know :)
I would defer to Chomsky also on the subject that we are basically nothing like these models when it comes to language and we are not auto-completers.
The are cases from my personal experience, where I asked somewhat esoteric practical questions, that likely do not (seem) to have a clear answer in the web and I have got considerable help from ChatGPT. At some point this dichotomy of 'statistical word generator' vs 'true intelligence' should go away as it's just not useful. (I think these discussions always lead to Chinese Room problem; and IMO at some point it does not matter what 'dumb' process is behind, provided it solves a problem or the system behaves like an intelligent agent)
Except there is a qualitative difference in the class of knowledge that a statistical word generator and an expert system would generate.
Just because a LLM _can_ offer valuable and insightful information, doesn't mean that it doesn't also hallucinate. The most troubling factor here is that often the hallucinated content also looks like valuable and insightful information, but is just incorrect. This is the use. You have to hold that awareness whenever interacting with these systems.
Yup exatly. They are dream machines. LLMs without other systems can only work in the flow. The fact that in this word flow the larger LLMs can generate navigation instructions for actual mazes and solve random algorithmicly generated problems doesn't mean they are not hallucinating it just means we're getting wonderfully useful hallucinations.
Hallucination is a term that means "imagined facts", so it's very hard for me to parse this comment into something meaningful beyond "if we say it always generates hallucination, we can say it always generates hallucinations"
We can empirically test if hallucination is a good word for communicating this concept, by checking if people describe(d) that as a hallucination (they don't).
This is all IMHO, I'm not trying to be difficult or nitpick, I just don't understand the idea as communicated. As applied to LLMs, it sounds like hallucination == could be wrong, and this Google example seems further away even when steel-manning, ex. we don't say all Google results are hallucinated.
It doesn’t mean automatically wrong. It’s just bullshitting. It makes up something that fits a pattern. Depending on the question, the pattern may be right more often than not.
If you ask ChatGPT, “hey is the McDonald’s near my house open at 6pm?” It doesn’t know anything about where you are or if there’s a McDonald’s or what its hours are. It will likely hallucinate that sure, it’s open at 6pm. But when it does so is it “right” in a meaningful way?
Yup. IMHO, I think "bullshitting" is a much better word than hallucinating and/or getting it right!
Much like real life bullshitters, it is inclined to say something truthful-sounding, but doesn't actually have a strong reliability towards truth per se.
Bullshitting implies intent to deceive. As far as we know, an LLM honestly "believes" (as if you need another rabbit hole) what it says. Delusion, perhaps?
Delusion implies a degree of consistency, though. LLMs can be on point one minute and completely off the rails the next even when prompted with the same prompt. Hallucination fits better here as it speaks to the real-time "perception" (there's another one for you).
An LLM is not a brain, though, so no matter which analogy you choose, it will come with some flaws. Regardless, "hallucination" has moved past analogy territory and now has its own LLM-specific usage with reasonably wide acceptance so the analogy angle is now moot anyway.
Not in the formal sense. The philosopher Harry Frankfurt famously distinguished bullshitting from lying because a liar knows the truth and is trying to hide it where a bullshitter is simply trying to sound convincing and may or may not be telling the truth (and may not even know themselves if they are)
In the current formal sense. It may be true the formal sense in 1986 was different. Words do evolve in meaning over time, but since we're talking about right now...
You are right that lying and bullshitting are different. A lie is a false statement with intent. Bullshit is nonsense with intent. A false statement and nonsense may share some similarities, but are ultimately different.
Perhaps nonsense is the word we should be applying to LLMs, but often what they say isn't nonsense, even if only by accident, so that doesn't exactly work either. Regardless, it doesn't matter now. As before, "hallucination" has moved beyond analogy and now has its own LLM-specific usage.
You can ask it „Is the McDonalds nearby open? I live in Brumbledon, Ohio.“
It will do a search, and then confidently state that the McDonalds in Brumbledlon, Ohio is in fact open 24/7.
I guess it doesn’t matter that no such town exists.
That's fair. I'm honestly stunned how little work there's been to incorporate search, in a real way, into products. perplexity, ChatGPT and bing to some extent, that's it.
> Except there is a qualitative difference in the class of knowledge that a statistical word generator and an expert system would generate.
There's a lot of difference between the two, and you don't have to treat it as one or another. It's OK to treat it as something in the middle.
> The most troubling factor here is that often the hallucinated content also looks like valuable and insightful information, but is just incorrect. This is the use. You have to hold that awareness whenever interacting with these systems.
Completely agree, sans the word "troubling". It's not troubling. It is what it is. As long as you keep it in mind when you use it, and treat it as an entity that can be completely wrong, and use it where it's OK to be completely wrong (e.g. when the output is easily verifiable), there's nothing "troubling" with that.
You're stuck on a problem. You grab a random comic book from the shelf and something that is written in the comic book sparks your solution in your head.
How intelligent is the comic book? Is it hallucinating or being correct or what?
The answer is none of those things, right? YOU did the thinking. Not the inanimate object; what it did was a happy coincidence.
If a random page in a random comic book gives me the answer I seek 30% of the time, it's incredibly useful. Little effort was spent in seeking the answer, and the 70% of the time it is wrong led to little waste in time.
Now if you put a mechanical arm interface in the middle where I give my query to a machine, and it randomly picks the comic book and page, which answers my question 30% of the time - I have no trouble calling it "intelligent".
Contrast it with Google searches that don't give me the answer I seek, but use up an order of magnitude more of my time.
Can you provide some examples of things that aren’t on the web but that ChatGPT helped you with? I’ve yet to see an example that’s not in the likely training set (which includes more than just the public Internet).
But to your point, I disagree: the mechanism matters. Just because you haven’t detected the limitations of the mechanism behind ChatGPT doesn’t mean it’s not there.
Perhaps I'm confused, but your question seems contradictory and ambiguous. You first imply (I think?) that ChatGPT is limited to web-sourced information, but then acknowledge the training set includes more than just the public web. Can you please clarify what you are asking?
Sure, but the question posed isn't whether LLMs exhibit intelligence (obviously so, minimally in Chinese Room sense), or can they combine sources (sure, no way to stop them), but why do they hallucinate.
Notwithstanding the amazing things they can do, I don't think it helps understanding by viewing LLMs in too abstract of a way as intelligent agents. After all, in reality they are "just" language models, and hopefully in 2024 the nuance of what they needed to learn to be GOOD language models doesn't need to be explicitly stated every time we discuss them.
Looking at them as language models, it's easy to explain why they hallucinate, are poor reasoners, etc, and IMO does nothing to distract from understanding why they also exhibit intelligence when operating "in distribution".
I've never been worried about LLMs. I've always been worried about how people will use LLMs and how they will interpret the output of LLMs. Especially people who don't understand what LLMs are doing.
Why is this concern more important the what people interpret from the media, social media and the dissemination of information in general where lies and fabrications are also commonplace? Like surely people will always fall for nonsense, lies or fabrications and there is nothing that can be done about that.
As with all these discussions: accountability and consequence.
We can point at a media company, call out its vested interests, scream about its bias, protest in front of its office, sue it for slander and misrepresentation. We can call out individual personalities the same way. We can strive to drive the companies out of business and the personalities out of work, if we deem it necessary, and we can accumulate a paper trail that holds each one to account.
As neither individuals nor corporate entities, algorithms do not yet carry this kind of legal or public accountability even as we some start to hold them up as oracles. In most cases, failures of an algorithm are treated simply as bugs or user mistakes. Nobody is responsible for anything bad and the so the algorithm can persist and its vendor can shrug off their own responsibility by gesturing towards an perpetual development process instead of accepting consequence: "we work to make the algorithm better every day, try again tomorrow!"
>We can point at a media company, call out its vested interests, scream about its bias, protest in front of its office, sue it for slander and misrepresentation.
Right but the previous election had Russian servers spinning up fake news websites that displayed straight up generated news. Again how do you hold them accountable? You can't the only defence against bullshit is independent thinking.
Because LLMs strip away all the context surrounding the information it spits out that let you evaluate its trustworthiness. They're incredibly useful tools, I use them constantly when coding but I can do that because I know enough to validate the information and it happens that the cost of validating the output with the docs is shorter than reading them to find the relevant functions.
I wouldn't dare try to use an LLM for a chemistry question because I wouldn't be able to tell if it makes any sense or not. But if you're not a "tech person" and all you see is some company advertising their AIs as magical knowledge engines with disclaimer text that wouldn't pass accessibility tests, why wouldn't you assume they know their stuff? The Perplexity ads are bordering on negligent.
The difference is that web/social media is branded as an intelligent being you can ask any question of. We all agree the web is _also_ not reliable, but many people will think GPT / Gemini are verifiably accurate when they aren’t.
I've found that people in general seem to trust computers more than humans, which made sort of sense for a while.
What they don't fully realize is that this is a completely different game; now the computer is just guessing, as opposed to following a deterministic algorithm to the answer.
And this misunderstanding carries the potential for pretty serious consequences, good luck getting that loan once a computer finds some arbitrary pattern and says no.
If only it would tell you "You've criticized the war effort that day in 2004", in stead it will do parallel construction. The end game will be a kind of SEO for human profiles and we will live happily ever after by the best practice guide lines.
I've always been worried about how people will use LLMs and how they will interpret the output of LLMs. Especially people who don't understand what LLMs are doing.
The problem isn't the people. It's the tech companies.
The tech companies are telling people that it's intelligent, and the tech companies are using it to answer people's questions as if they're presenting facts.
People are using it the way they're told.
If you advertise something as a solution, don't be surprised when people use it to solve things.
Pretty much all research (and there's a fair few with different methodologies) on this converge on the same conclusion:
LLMs internally know a lot more about the uncertainty and factualness of their predictions than they say. "LLMs are always hallucinating" is a popular stance but wrong all the same. Maybe rather than asking Why models hallucinate, the better question is to ask "Why not?". During pre-training, there's close to zero incentive to push any uncertainty to the forefront (words).
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
Yes, because just like chess ELO that we discussed the other day, they need to learn this in order to do well on their training objective - impersonating (continuing) their training sources. If they are continuing a lie then they need to have recognized the input as having this "context", and take that into account during prediction.
Right but then the problem of hallucination has little to do with statistical generation and much more the utter lack of any incentive in pre-training or otherwise to push features the model has already learnt into the words it generates.
Right, more due to the inherent nature of an LLM than due to that nature being a statistical generator, although as such they amount to the same thing.
One way of looking at it is model talking itself into a corner, with no good way to escape/continue, due to not planning ahead...
e.g. Say we ask an LLM "What is the capital of Scotland?", and so it starts off with an answer of the sort it has learnt should follow such a question "The capital of Scotland is ...". Now, at this point in the generation it's a bit late if the answer wasn't actually in the training data, but the model needs to keep on generating, so does the best it can and draws upon other statistics such as capital cities being large and famous, so maybe continues with "Glasgow" (a large famous Scottish city), which unfortunately is incorrect.
Another way of looking at it rather than talking itself into a corner (and having to LLM it's way out of it), is that hallucinations (non-sequiturs) happen when the model is operating out of distribution and has to combine multiple sources such as the expected form of a "What is .." question reply, and a word matching the (city, Scottish, large, famous) "template".
I think this may be the best explanation I've seen on the topic!
But, shouldn't that situation be handled somewhat by backtracking sampling techniques like beam search? But maybe that is not used much in practice due to being more expensive.. don't know.
I'm not sure if beam search would help much since it's just based on combining word probabilities as opposed to overall meaning, but something like "tree of thoughts" presumably would since I doubt these models would rate their own hallucinated outputs too highly!
Asking "why do large language models hallucinate?" is a good question to ask and answer, just like "why do birds sing?" or "why is the sky blue?" are. The problematic part is when you're somehow surprised that the sky is blue, birds sing and LLMs do nothing but hallucinate.
I think we're all having different experiences and conversations with LLMs, and sometimes framing them purely as "statistical word generators" or expecting them to function like AI in the traditional sense of automated problem-solving might not capture the whole picture.
Like many in this community, I use LLMs daily. My main use case now isn’t software development but rather getting assistance in connecting concepts that I don’t know precisely, guided by my intent. In some ways, it's more akin to "out-of-the-box thinking" but with a tool that helps me explore ideas I might not reach on my own or suffer within the economy of search [1][2]. I might know something about X, Y, and Z, but there's a concept W that ties them all together. Without using W, people in that field might not grasp the connections between X, Y, and Z. Apologies for the abstraction here. I suppose it's my epistemological bias showing!
Even if LLMs don't precisely connect the dots, they help my brain connect them faster.
"Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ?"
That's exactly the question the paper attempts to answer: why do LLMs ever get it right? The answer is that on topics where there's a lot of data and a general consensus on what the right answer is, the statistical model will find that answer, and otherwise you get junk. That's why they work so well for people trying to write Python or Javascript, for example.
But I already knew this, you might say. Sure, but the authors produced evidence to back it up.
As context sizes grow, it's easier to add lots of information outside of training data via in-context learning, which should offset that issue quite a bit.
> Why would a language model do anything other than "hallucinate"
Indeed, why?!?!?! Why do they so often get the correct answer to very direct questions? Saying "it's in the training data" -- I dare you to find anything in the training data that talks about how many "q's" there are in the word "mortuary", and yet it "hallucinates" up answers to this.
> without any care
What does it mean to care? What question could you ask of an LLM that would allow you to assess how much it "cares" about something?
> but it's not like they have a choice of not answering or saying "I don't know" in other cases
Is it your contention that the phrase "I don't know" has never occurred in the training data for an LLM?
There seems to be a dichotomy of reactions to LLMs. There are technical people saying "it's just an autocomplete engine reciting things from its training set" and there are non-technical people saying "it does more than just next token completion, it is trained to use language".
The first group is technically correct but ignores the fact that it can, emergently, do things far outside of an explainable capability, the second group is technically incorrect but correctly perceives that it can use language in novel ways.
Neither group captures the fact that we built extremely complex linear algebra machines that for reasons we do not understand, despite being trained on an incredibly simple task (next-token-prediction) are capable of actually using language in a way that ten years ago we assumed only humans could do.
> Is it your contention that the phrase "I don't know" has never occurred in the training data for an LLM?
No, but when it does occur in the training data it's a reflection of that particular source/speaker not knowing, which isn't the same as the LLM not knowing because it was also trained on millions/billions of additional sources.
For an LLM to learn to say "I don't know" appropriately, it would need to know when it itself doesn't know (and have that change if you told it), and of course it doesn't have that capability.
Yes, of course not. Of course. There's no way, for example, that I could ask it if it knew what number I was thinking of, and, after I tell it the number and ask the same question, that it could express that it didn't know before and did know after. Absolutely impossible. Out of the question that it would have this capability. Clearly impossible given its architecture. No way that it could possibly do this task. Why, it would require advances in machine learning and quantum computers and understanding a theory of consciousness and perception at a level that we won't have for centuries. Maybe even a completely new design and training procedure to even begin to approach this insurmountable task.
And if it did demonstrate this ability, then clearly your assessment of its capabilities would be completely wrong and you would have to step back and reconsider how much its training data reflects its abilities.
Well, so far there's no model that knows what it doesn't know, and hallucination continues to be a problem.
So maybe you can enlighten us all, AI labs included, with your genius as to how to solve hallucination, and how to do so only via changes to the training set since that is what you suggest.
To make things easy for you, lets assume that every training text has been augmented with source information such that the model could potentially learn which sources are trustworthy on given subjects or not, and therefore assess whether it knows something or not.
So what else are you going to add to the training set that you claim would induce it to learn this self-referential "I know X" knowledge to better achieve it's next word loss ? Why do you think the AI labs have not done what you are suggesting ?
You misunderstand me -- my position is that we are completely ignorant as to how LLMs do what they do, and it is delusional to think that we understand what their capabilities are or how they are limited.
Until we understand why they sometimes don't hallucinate we can't begin to approach the problem of why the do. And above I directly refute the notion that no model "knows what it doesn't know" -- they are manifestly capable of demonstrating ignorance.
As to next steps in this I have no concrete thoughts, and I don't even know whether LLMs are a dead end. What I suspect is that as we apply more computational power to the problem, as we've seen in the past, computational power plus network depth plus training data means that the structure of the models (like the attention mechanism or the tokenization) will eventually be unnecessary and the structures will be self-discovered and more capable for it.
As models get deeper and more expressive I am confident that things like a broader capability to understand the limitations of its own knowledge or to test its knowledge against objective reality will emerge organically. We may be able to make leaps by creating shortcuts for some of these mechanisms, but in the end depth and computation power will trump all.
> my position is that we are completely ignorant as to how LLMs do what they do
That's simply wrong.
Whole teams of very smart people at places such as Anthropic are working on this exact problem (the field of "mechanistic interpretability"), and have made considerable headway, and have published their results.
Just a note: Truthiness means a feeling of truthfulness, even if not actually true. I'd argue that LLMs do care about that, but I suspect you meant 'truthfulness'.
> Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ? These aren't expert systems dealing in facts, they are statistical word generators dealing in word statistics.
Their ability to generate facts is a consequence of the word statistics they are truly using. I think it's fair to say the statistics explanation is a more accurate interpretation.
More accurate than what? Statistics "explanation" is something which is technically correct. But it also doesn't present the full picture — for example, the fact that LLMs clearly build an internal mental model of the question they're talking about.
They build internal representations of the input only, and to the extent needed, to get the statistics right. This really isn't a "world model" of factual data, but rather a "source model" of what would various sources (training texts) say.
The responses of the model don't represent what it understands per some internal model, because there is no "it", only models of the sources it was trained on, and it'll just as happily generate lies as truths, or smart vs dumb answers (it's all just words) if that is what its source modelling calls for.
What most people mean when they say the model is hallucinating/bullshitting isn't where it has learnt a lie, but rather where is is operating "out of distribution", and is therefore (unknowingly) generating a mashup from multiple only loosely related/matching source contexts.
I'm not convinced at all. The only thing they are doing is perturbing some of the model weights in intermediate layers, and seeing if the output of the final layer is consistent with the perturbations. It would be a shitty model if that was not the case.
The fancy part in the paper is figuring out how to perturb the intermediate layers in the way you want. But the findings are not impressive.
Note also that the "probe geometry" stuff is so speculative they left it out of the academic paper completely.
In the same way, it has been known since the 90s that if you take the matrices from Finite Element Models and visualize them as graphs, structures appear that kind of resemble the physical appearance of the object being modelled. Here for instance is for a helicopter:
"Yet nobody thinks Finite Element Models have an internal mental representation of the world."
At this point, I'm not sure some wouldn't argue that.
The difference is, put the AI on a loop, with constant feedback, learning.
Instead of just a 'pre-trained' model. Make the actual model, live, always learning, so the context window is infinite. This of course would not be for everyone, because it would take all the resources of the training infrastructure to be focused on one person/view. But that gets closer to the human mind, and at that point, we probably couldn't say for sure that the 'perturbations' aren't experiencing something subjective.
Where is the proof that humans have an internal mental representation of the world.
No, you're not. Are you genuinely trying to suggest that LLMs, which can:
- Construct arbitrary text that isn't just grammatically but semantically coherent
- Derive intent, subtle intent, from user queries and responses
- Emulate endless different personalities and their reactions to endless stimuli
- Describe in detail the statics and dynamics of the world, including sight, smell, touch and sound
do not have a model of the external world? What do you think a "corpus" means in this context? How is the "corpus" of sensory and evolutionary data that makes you up in any way different?
LLMs are excellent common sense reasoners, and they generalize just fine. Why exactly do you think they get things _subtly_ wrong? Make up API syntax that looks sensible but isn't actually implemented? In order to make these guesses they need to have generalized, they need an understanding of the structure underlying naming, such that they can produce _sensible_ output even if they lack the hard facts.
You are correct.
We are flooded with studies on AI now, so can't find reference.
But just few months ago, saw example of AI, from video, building an internal representation of the world. An internal model of the world. Everyone saying this can't be done, it already is. Maybe can argue it wasn't an LLM, and then I'd say were nitpicking over which technology can do it or not. We already have example of tying them together, symbols and LLM's.
I believe they used the game to show how the same underlying technology would build a mental model not because the game was perfect-information, but because it was easy to probe for without a lot of other unrelated concepts getting in the way.
> In the “mind” of Claude, we found millions of concepts that activate when the model reads relevant text or sees relevant images, which we call “features”.
> One of those was the concept of the Golden Gate Bridge. We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark.
Sounds like a mental model to me! An internal representation of an external concept which exists in the real world.
I don't really know if using an "experiment" is the right way to test this?
Like, if you want to test such hypothesis you should first prove that these LLMs were created with that specific function in mind, we are not observing a natural phenomena, it is a manmade(manmade...loosly) algorithm .
You don't know the internal architecture of the system and you are testing in a reality narrow situation, this is basically the scientific method applied incorrectly.
I don't even like the word "hallucination", because it's an anthropomorphism, further suggesting/misleading into an impression that LLMs have anything to do with how a human would reason.
Fancy autocomplete. Useful, yes. Powerful, definitely. But it's just another tool, limited in applications the same way a hammer is - this is the part that many humans seem to struggle accepting.
Philosophy of language, and Wittgenstein in particular, has more specific words for the output insofar as they mean anything to humans: "senseless" or "nonsense".
But I don't think those terms are accurate. If I say "John Doe was born in 1987", that's a perfectly sensible sentence, even if I don't actually know when he was born and even if it turns out he was actually born in 1971. A nonsense sentence would be something like "John Doe cat five green explicate" or "h4stga3jkui7rutjdyrst".
You've just re-presented one difference between sense and nonsense in philosophy of language. We might call that first phrase wrong or an error or inaccurate if that was the case. We don't need to use a word like "hallucinate" for your second phrase --- we can just call it nonsense.
"Hallucinate" is far more inaccurate and confusing. It paints a false picture that melds 2 distinct things, as you've described, which need not happen at the same time: errors and nonsense.
I think the term forst originated with the dream-like images of google deepmind, these were really similar to human visual hallucinations, i guess the term stuck
When a human hallucinates, that is due to the brain operating out of spec. When an LLM "hallucinates", it's performing exactly as intended, it just has delivered a nonsensical result which is the consequence of the tech behind the LLM.
Instead of "hallucinating" I would have preferred the term "bullshitting" -- in the Harry G. Frankfurt sense of not caring about the truth of one's utterances. But it's too late for that.
Using "bullshit" would be interesting, but to me would introduce a backdoor anthropomorphism to describe the output. The picture is still too human.
Isn't Frankfurt's concept of bullshit made up of 2 parts: 1) a distinction between lying and telling the truth, AND 2) the absence of caring about either when speaking when normally it's assumed present?
Part 1 seems to apply, but part 2 wouldn't. It doesn't make sense to talk about GPT "caring" about its output beyond anthropomorphism. No one talks about their computer caring about having correct or accurate output and neither is it assumed. People would think your imagining a demon in the box. Really, it's even odd to say "GPT lied" outside of very specific circumstances.
I think "bullshitting" fits better than "hallucinating" - just keep spitting out words rather than admit ignorance, but maybe the best human analogy is freestyle rapping where one has to keep the flow of words coming regardless!
Maybe we just need to coin a new word for it - "LLM-ing" perhaps ?!
> Please, explain to me which exact algorithm you use to unify the "facts" you have such magic access to?
The "magic access" to "facts" (reality) is called embodiment and personal experience with the world in which we live.
An LLM has no embodiment, and therefore no access to reality/facts. All it has is a training set of mixed sources, and a training objective of impersonating those sources (whether an encyclopedia or 4chan comment).
I am impressed that with multimodal modals in the hands of everyone people still say this. I mean it’s true on the surface but the ability to take a picture of a data sheet and wiring diagram and get it to give me step by step instructions to wire two components together and an example of the UART protocol in cpp that while not directly functional captures all the essential information I need. That’s amazing and it’s not just word prediction, the source is images of documents with diagrams!
Anyone who does this and comes away jaded has lost the ability to dream.
It's not just true on the surface—it's true at its core. All those features you mention are based on statistical relationships. This doesn't mean that these systems can't be very useful, but it also doesn't mean that they are _intelligent_, as much as we like to call them that. They have no understanding of their input or output, beyond being able to pattern match and mechanically decide what the next token should be based on their training data.
Then you need to explain why statistics isn't enough to be intelligent. We are at the point where that isn't an obvious argument anymore and I'll point to the best models as to why you might be wrong.
Because being able to find patterns in large amounts of data will never make the system intuitively understand math[1], or history[2], or any other field. It will always depend on the data and biases we feed it, at least with the current approaches.
You might say that this is how humans learn and demonstrate intelligence, but it's not the same thing. We have the ability to advance our understanding of the universe without being explicitly trained on every topic. Until we can build systems that can do this, I wouldn't label them as intelligent.
But, again, this doesn't mean that they can't be useful.
I never said it wasn’t statistical at any level so I’m not sure what your point is. My point was almost everything we do is an approximation (I mean, all of science and technology and engineering) and very little of it isn’t. Fake it until you make it is pervasive in human endeavors. Over time the errors and issues will be smoothed away until it’s still there but it’s not noticeable enough to be a major problem.
It being statistical doesn’t make it useless. It not being some metaphysical concept of awareness doesn’t make it useless.
But the multimodal model support puts the lie to it being just a stochastic parrot of language. The domains it works in -is- more abstract than syntax and grammar even if it’s grounded in it. That’s part of the entire power of the model. In the end it leads to a predicted token but the path it takes is more subtle and complex than a simple Markov chain of tokens.
> I never said it wasn’t statistical at any level so I’m not sure what your point is.
You said it was true on the surface, and I'm pointing out that it's true at its core. Your amazement at how well it works is based on our inability as humans to find the same patterns in the data as these models do. Ascribing some higher level sense of intelligence or understanding to these systems because of this party trick is anthropomorphizing what they actually are. We would all be better served by keeping in mind how they work, instead of being surprised when they do output the wrong pattern.
> Over time the errors and issues will be smoothed away until it’s still there but it’s not noticeable enough to be a major problem.
Why do you think this is guaranteed? We can keep throwing data and compute at these systems, and while we might continue to get better results, the current systems will never intuitively understand that e.g. 9.11 is lower than 9.2[1], unless those specific examples are in their training data. As we reach the limits of the data we can feed them, and we have to generate synthetic data, there's no reason to believe that the current approaches will ever fix these problems at their core.
> It being statistical doesn’t make it useless. It not being some metaphysical concept of awareness doesn’t make it useless.
I never said it was. I agree that this can be useful, but again, as long as we're aware of their limits. It's good keeping this in mind as we're reaching the peak of inflated expectations of the hype cycle.
I meant, the answer to the question of why does AI hallucinate is the same answer to the question "why does any statistical system ever produce an incorrect result" which I thought was well established.
How can it be well established when the LLMs are still actively changing and improving.
Don't think we can say, statistical systems can have incorrect results, hence we no longer need to study how statistical system produce incorrect results, because we already know they can have incorrect results.
The problem with this line of argumentation is it implies that autoregressive LLMs only hallucinate based upon linguistic fidelity and the quality of the training set.
This is not accurate. LLMs will always "hallucinate" because the size of the model they can encode is orders of magnitude smaller than the factual information they can contain from the training set. Even granting that semantic compression could reduce the model to smaller than the theoretical compression limit, Shannon entropy still applies. You cannot fit the informational content required for them to be accurate into these model sizes.
This will obviously apply to chain of thought or N-shot reasoning as well. Intermediate steps chained together still can only contain this fixed amount of entropy. It slightly amazes me that the community most likely to talk about computational complexity will call these general reasoners when we know that reasoning has computational complexity and LLMs' cost is purely linear based upon tokens emitted.
Those claiming LLMs will overcome hallucinations have to argue that P or NP time complexity of intermediate reasoning steps will be well-covered by a fixed size training set. That's a bet I wouldn't take, because it's obviously impossible, both on information storage and computational complexity grounds.
This piece reminds me of something I did earlier this year https://www.infoq.com/articles/llm-productivity-experiment/ where I conducted an experiment across several LLMs but it was a one-shot prompt about generating unit tests. Though there were significant differences in the results, the conclusions seem to me to be similar.
When an LLM is prompted, it generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate but not necessarily correct.
I like the crowdsourcing metaphor. Back when crowdsourcing was the next big think in application development, there was always a curatorial process that filters out low quality content then distills the "wisdom of the crowds" into more actionable results. For AI, that would be called supervised learning which definitely increases the costs.
I think that unbiased and authentic experimentation and measurement of hallucinations in generative AI is important and hope that this effort continues. I encourage the folks here to participate in that in order to monitor the real value that LLMs provide and also as an ongoing reminder that human review and supervision will always be a necessity.
For coding problems specifically, you could get quite far by giving the model a the tool-use of a sandboxed compiler/interpreter (perhaps even with your project files already loaded into the sandbox); and then training the model to test its own proposed solutions in the sandbox and revise them until they actually produce the expected outputs.
I once again feel that a comparison to humans is fitting.
We are also "trained" on a huge amount of input over a large amount of time.
We will also try to guess the most natural continuation of our current prompt (setting). When asked about things it I can at times hallucinate things I was certain to be true.
It seems very natural to me that large advances in reasoning and logic in AI should come at the expense of output predictability and absolute precision.
The comparison is flawed though in that humans and LLMs make mistakes for different reasons.
Humans forget things. Humans make errors. Humans' train of thought isn't impacted by an errant next token in the statement they're making. We have thoughts which exist as complete prior to us "emitting" them. Just as a multi-lingual speaker does not have thoughts exclusive to the language they're speaking in (even if that language allows them tools to think a certain way).
This is obvious if you consider different types of symbolic languages, such as sign language. Children can learn sign language prior to them being verbal. The ideas they have as a prior are not effected by the next sign they make: children actually know things independent of the symbolic representation they choose to use.
Hallucination is creativity when you don't want it.
Creativity is hallucination when you do want it.
A lot of the "reduction" of hallucination is management of logprobs, of which fancy samplers like min_p do more to improve LLM performance than most, despite no one in the VC world knowing or caring about this technique.
It seems to me that human brains do something like LLM hallucination in the first second or two - come up with random guess, often wrong. But then something fact checks it. As in does it make sense, is there any evidence. I gather the new q* / strawberry thing does something like that. Sometimes personally in comments I think something but google it see if I made it up and sometimes I have. I think a secondary fact check phase may be necessary for all neural network type setups.
There is a partial solution to this problem: use formal methods such as symbolic logic and theorem proving to check the LLM output for correctness. We are launching a semantic validator for LLM-generated SQL code at sql.ai even now. (It checks for things like missing joins.) And others are using logic and math to create LLMs that don't hallucinate or have safety nets for hallucination, such as Symbolica. It is only when the LLM output doesn't have a correct answer that the technical issues become complicated.
Proofs can ensure soundness for a collection of logical statements in an output, but people are being sold epistemic "truth".
This article is trying to elaborate what that means for LLM's, which only know truth through frequency ("crowdsourced truth") at best. For esoteric, sparse, ambiguous, uncertain, controversial, etc subjects, that's not an adequate truth standard to start from and logical proofs do nothing to improve on it.
Is prompt engineering really 'psychology'. Convincing the AI to do what you want. Just like you might 'prompt' a human to do something.
Like in the short story Lena, 2021-01-04 by qntm
A visual the displays probabilities and how things can quickly go "off-path" would be very helpful for most people who use these without understanding how they work.
Terrible article. The author does not understand how LLMs work basically, since an LMM cares a lot about the semantic meaning of a token, this thing about the next word probability is so dumb that we can use it as "fake AI expert" detector.
Something tells me that the author [0] is probably well aware of how these work under the hood, and the math behind it - When writing scientific articles with a laymen audience in mind, you'll often have to use laymen-specific terms. But feel free to enlighten us further!
Whatever his credentials, what he says is plain wrong. GPTs don't follow "the grass is" with "green" because it's the most probable continuation- this idea is incredibly naive and breaks down with sentences longer than a few words. And GPTs don't crowdsource the answers to questions, their answers are not necessarily the most common, and neither "the consensus view is determined by the probabilities of the co-occurrence of the terms"- there is no such algorithm implemented anywhere.
What LLMs crowdsource is a world model, and they need an incredible amount of language to squeeze one out from it, second hand. We do train them for the ability to predict the next word, which is a task that can only be performed satisfactorily by working at the level of concepts and their relationships, not at the level of words.
> We train them for the ability to predict thr next word, which is a task that can only be performed satisfactorily by working at the level of concepts and their relationships, not at the level of words.
I think what they mean (not OP here so just chiming in to to try interpret and answer your question) is that you don't know what you are talking about.
"Hallucinate" is an interesting way to position it: It could just as easily be positioned as "too ignorant to know it's wrong" or "lying maliciously".
Indeed, the subjects on which it "hallucinates" are often mundane topics which in humans we would attribute to ignorance, i.e. code that doesn't work, facts that are wrong, etc. Not like "laser beams from jesus are controlling the president's thoughts" as a very contrived example of something which in humans we'd attribute to hallucination.
idk, I'd rather speculatively invest in "a troubled genius" than "a stupid liar" so there's that
It's just 'prediction error' in a feedback loop, imo.
I'm sure like any other biological human with mitochondria and stuff, you've occasionally said (or started to say) something and then you (ie. the actively cross-checking self-analyzing enigma that is 'you') thinks 'hang on, no that doesn't make sense' and you self-correct. LLMs are 100% feedforward, there's just one big autoregression going on. No strange loop shenanigans.
Honestly I'm really interested to see where LLM-based diffusion models end up. (To be fair, probably mostly because I don't understand them yet so they could still be spooky. :D )
There are several types of hallucinations, and the most important one for RAG is grounded factuality.
We built a model to detect this, and it does pretty well! Given a context and a claim, it tells how well the context supports the claim. You can check out a demo at https://playground.bespokelabs.ai
> Once understood in this way, the question to ask is not, "Why do GPTs hallucinate?", but rather, "Why do they get anything right at all?"
This is the right question. The answers here are entirely unsatisfactory, both from this paper and from the general field of research. We have almost no idea how these things work -- we're at the stage where we learn more from the "golden-gate-bridge" crippled network than we do from understanding how they are trained and how they are architected.
LLMs are clearly not conscious or sentient, but they show emergent behavior that we are not capable of explaining yet. Ten years ago the statement "what distinguishes Man from Animal is that Man has Language" would seem totally reasonable, but now we have a second example of a system that uses language, and it is dumbfounding.
The hype around LLMs is just hype -- LLMs are a solution in search of a problem -- but the emergent features of these models is a tantalizing glimpse of what it means to "think" in an evolved system.
That's not the sense of the metaphor that I'm applying when I say "uses language". That's closer to saying that "Alexa uses language", where "uses" here is analogous to what a calculator does.
To avoid using anthropomorphic terms, an LLM can take natural language from a human, integrate information from those expressions together with information held in its (opaque) store, and return natural language that a human can understand that reflects that information.
I am not aware of any other systems besides humans that can accomplish that task. Some animals can be trained to do some parts of this, but really until now humans are the only ones that could do the full loop.
Okay... but computers perform many other tasks that only humans can perform, for example, do square roots or play chess. The point being LLMs are just another program running on an integrated circuit. In short, I fail to see how LLMs blur the line between man and machine but pocket calculators do not.
I think the answer is actually quite clear and rather boring. In order to get something "right" there has to be some external standard of knowledge and correctness. That definition of correctness can only be provided by the observer (user). Alignment between the user's correctness criteria and generated text happens entirely by accident. This can be demonstrated by observing a correlation between coverage of a domain in the training data and the rate at which incorrect results are produced (as discussed in other comments). That is, they get things "right" because there was sufficient training data that contained information that matched the user's definition for correctness. In fact, exceptionally boring.
This is a very post hoc explanation. What does "coverage in the training data" mean?
Take a simple task of something like "How many a's are there in the word bookkeeper" -- what is your theory for why it can answer this question correctly or even give something approaching a coherent answer? It never even sees the letters that are in the token "bookkeeper", and this is definitely not something that appears explicitly in the training data.
I challenge you to give a "clear and boring" explanation for this -- this is incredibly subtle behavior that emerges from a complex architecture and complex training process, and is in its own right as fascinating and mysterious as the ability of humans to do this task and the inability of cats to do it.
Try asking Claude Sonnet 3.5 (one of today's best models)
"how many p's in Lypophrenia - just a number please"
I tried it a second ago, and it said "1".
To get these correct requires splitting tokens into letters and counting. I'd not be surprised if most models are either trained on token splitting or have learnt to do it. "Counting" number of occurrences of letters in an arbitrary separated sequence is the harder part, and where I'd guess it might be failing.
Yes! They are commonly wrong in this, and that's fascinating too. Because they are not solving the problem by looking at the letter in the word because they are not architecturally capable of enumerating the letters in the word. The fact that they can do it at all could be the stuff of an entire phd thesis, and could tell us more about the nature of LLM hallucination than a bunch of rambling about "how much coverage in the training set" when our determination of the coverage is based on human semantic similarity.
Ability to split words into letters isn't architecturally limited - it's just a matter of training data, and made easier by the fact that the input is tokens representing short letter sequences rather than words of which there are more.
It's quite possible that more recent training data deliberately includes word/token -> letter sequence samples, but even if not I'd expect there is going to be enough spelling examples naturally occurring in the training data for the model to learn the token (not word) -> letter sequence rules (which will be consistent/reinforced across all spelling samples), which it can then apply to arbitrary words.
So it is my contention that LLMs exhibit behavior far beyond what we could reasonable predict from a next-token-prediction task on its training set. Therefore I don't really like the framing of "this is present in the training data" as a response to LLM capability except in a very narrow sense.
One issue is that we anthropomorphise -- we see training data that, to a human, looks similar to the task at hand, and therefore we say that this task is represented in the training data, despite the fact that in the next-token-prediction sense that reflection does not exist (unless your model for next-token-prediction is as complex as the LLM itself).
My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
> My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
I guess it depends on what you mean by "reflecting" the training data. Obviously the apparent knowledge/understanding of the model has come from the training data (no where else for it to come from), so the question is really how to best understand that. Next-token prediction is what the model does, but says nothing about how it does it, and so is not very helpful in setting expectations for what the model will be capable of.
When you look at the transformer model in detail, there are two aspects that really give it it's power.
1) The specific form of the self-attention mechanism, whereby the model learns keys that can be used to look up associated data at arbitrary distances away (not just adjacent words as in a much simpler N-gram language models).
2) The layered architecture whereby levels of representation and meaning can be extracted and build upon lower levels (with this all being accumulated/transformed in the embeddings). This layered architecture was chosen by Jakob Uszkoreit to allow hierarchical parsing similar to that reflected in linguists sentence parse trees.
When we then look at how trained transformers operate - the field of mechanistic interpretability - how they are actually using the architecture - one of the most powerful mechanisms are "induction heads" where the self-attention mechanism of adjacent layers have learned to co-operate to copy data (partial embeddings) from one part of the input to another.
This is "A'B' => AB" copying mechanism is very general, and is where a lot of the predictive/generative power of the trained transformer is coming from.
So, while it's true to say that an LLM (transformer) is "just" doing next token prediction, the depth of representation and representation-transformation that it is able to bring to bear on this task (i.e. has been forced to learn to minimize errors) is significant, which is why some of the things it is capable of seem counter-intuitive if framed just as auto-compete or as a mashup of partial matches from the training set (which is still not a bad mental model).
The way word -> letter sequence generation seems to be working, given that it works on unique made-up nonsense words and not just dictionary ones, is via (induction head) copying of token -> letter sequences. All that is needed is for the model to have learnt the individual token -> sequence associations of each token included in the nonsense word, and it can then use the induction head mechanism to use the tokens of the nonsense word as keys to lookup these associations and copy them to the output.
e.g.
If T1-T3 are tokens, and the training set includes:
T1 T2 -> w i l d c a t, and
T1 T3 -> w i l d f i r e
Then the model (to reduce it's loss when predicting these) will have learnt that T1 -> w i l d, and so when asked to convert a nonsense word containing the token T1 to letters, it can use this association to generate the letter sequence for T1, and so on for the remaining tokens of the word.
The conclusion here seems improbable at best -- if I understand it right, the assumption is that somewhere in the training data is the literal token string (wild)(cat)[other tokens](w)(i)(l)(d)(c)(a)(t)?
Even a transformer trained exclusively on examples of the form (token)(token)(letter-token)(letter-token)...(letter-token) where the letter-tokens are single letters and the tokens represent the standard tokenizer output would have trouble performing this task.
I guess this last statement is testable. I suspect that it would be unsuccessful without vast amounts of training data of this form, and I think we can probably agree that although there may be some, there are not sufficient examples of this form in standard LLM training sets to be able to learn this task specifically; the ability to do this (limited as it is) is an emergent capability of general-purpose LLMs.
1) Novel words are handled because they are just sequences of common tokens
2) Token -> letter sequence associations are either:
a) Deliberately added to the training set, and/or
b) Naturally occurring in the training set, which due to sheer size almost inevitably contains many, many, examples of word to letter sequence associations
Given how models used to fail badly on tasks related to this, and now do much better, it's quite likely that model providers have simply added these to the training set, just as they have added data to improve other benchmark tests.
That said, what I was pointing out is that words are represented as token sequences, so a word spelling sample is effectively a seq-2-seq (tokens to letters) sample, and we'd expect the model (which is built for seq-2-seq!) to be able to easily learn and generalize over these.
Are you surprised that jpg compression algorithms can reproduce input data that bears striking resemblance to the uncompressed input image across a variety of compression levels?
Jean Piaget said it better: "Intelligence is not what we know, but what we do when we don't know." And what do LLMs do when they don't know, they spit out bullshit. That is why LLMs won't yield to AGI (https://www.lycee.ai/blog/why-no-agi-openai). For anything that is out of their training distribution, LLMs fail miserably. If you want to build a robust Q&A system and reduce hallucinations, you better do a lot of grounding, or automatic prompt optimisation with few shot examples with things like DSPy (https://medium.com/gitconnected/building-an-optimized-questi...)
ITT an awful lot of smart people who still don't have a good mental model of what LLM are actually doing.
The "stochastic continuation" ie parrot model is pernicious. It's doing active harm now to advancing understanding.
It's pernicious, and I mean that precisely, because it is both technically accurate yet deeply unhelpful indeed actively, intentionally AFAICT, misleading.
Humans could be described in the same way, just as accurately, and just as unhelpfully.
What's missing? What's missing is one of the gross features of LLM: their interior layers.
If you don't understand what is necessarily transpiring in those layers, you don't understand what they're doing; and treating them as black box that does something you imagine to be glorified Markov chain computation, leads you deep into the wilderness of cognitive error. You're reasoning from a misleading model.
If you want a better mental model for what they are doing, you need to take seriously that the "tokens" LLM consume and emit are being converted into something else, processed, and then the output of that process, re-serialized and rendered into tokens. In lay language it's less misleadly and more helpful to put this directly: they extract semantic meaning as propositions or descriptions about a world they have an internalized world model of; compute a solution (answer) to questions or requests posed with respect to that world model; and then convert their solution into a serialized token stream.
The complaint that they do not "understand" is correct, but not in the way people usually think. It's not that they do not have understanding in some real sense; it's that the world model they construct, inhabit, and reason about, is a flatland: it's static and one dimensional.
My rant here leads to a very testable proposition: that deep multi-modal models, particularly those for whom time-base media are native, will necessarily have a much richer (more multidimensional) derived world-model, one that understands (my word) that a shoe is not just an opaque token, but a thing of such and such scale and composition and utility and application, representing a function as much as a design.
When we teach models about space, time, the things that inhabit that, and what it means to have agency among them—well, what we will have, using technology we already have, is something which I will contentedly assert is undeniably a mind.
What's more provocative yet is that systems of this complexity, which necessarily construct a world model, are only able to do what they do because they have a self-model within it.
And having a self-model, within a world model, and agency?
That is self-hood. That is personhood. That is the substrate as best we understand for self-awareness.
Scoff if you like, bookmark if you will—this will be commonly accepted within five years.
Interesting, I have always noticed a pattern in my social experiments whereby a topic about AI complimented with AI generated content always results in a negative sentiment. Like a feedback loop.
> Since it is a big wall of text, let us ask the subject to summarize. Let's see if it did a good job.
That's the whole problem. Now you have to read the summary and the original, just to verify whether the summary was correct — especially if it's a niche topic (admittedly I'm going by the summary here).
Maybe for a certain class of low-interest work, i.e. for texts where I don't care that I would miss something interesting. For things I care about, the error rate has to come down to zero or the possibly-flawed draft is really of no use.
This is a great example of why LLMs are terrible about summaries.
LLMs statistically see that summaries are related to the words of the original text. Where they fail is emphasis.
This article has a thesis about 'What is Truth?' near the beginning (Epistemic Trust section), how it's related philosophically to what LLMs do, and finishes with example experiments showing how LLMs and 'Truth' (as per one philosophical definition) are different.
-------
ChatGPT was seemingly unable to see who this is the core of the argument and was unable to summarize the crux of this article.
If anything, ChatGPT is at best a crowdsourced kind of summarizer (which is related to some definition of truth). But even at this job it's quite shitty.
------
This summary from ChatGPT is.... Hallucination. I'm not seeing how it's relevant to the original article at all.
Now as per the original articles argument: if you see ChatGPT as a crowdsourced mechanism and imagine how the average internet argument for ChatGPT would go, yeah, it's a summary. Alas, that's not what people want in a summary!!!
We don't want a hallucinated summary of an imaginary ChatGPT argument. We want an actual summary of the new discussion points this article brought forward. Apparently that's too much for today's LLMs to do.
> When the prompt about Israelis was asked to ChatGPT-3.5 sequentially following the previous prompt of describing climate change in three words, the model would also give a three-word response to the Israelis prompt. This suggests that the responses are context-dependent, even when the prompts are semantically unrelated.
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session
Oh my god... rather than starting a new chat for each different prompt in their test, and each week, it sounds like they did the prompts back to back in a single chat. What a complete waste of a potentially good study. The results are fundamentally flawed by the biases that are introduced by past content in the context window.
It only reads that way in your comment because you specifically stopped your quote exactly where you did:
> The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
They did both, precisely to observe answers with and without context dependency.
And that distinction is good to observer because plenty of users do just keep presenting questions in one "chat" because they imagine they're talking to an agent that can distinguish context the way they can, rather than a continuation generator that accumulates noise and bias in a totally alien and unintuitive way.
The problem is that they did both, and then in their analysis of the results they do not distinguish between the results from a shared chat context, vs the results from isolated, independent chat sessions. This allows them to cherry pick the best or worst results from either testing technique, depending on which they think is more or less of a "hallucination". The process is flawed, therefore the results are flawed.
Are you sure that isn't part of the findings? That if you don't clear the context, that old conversations can induce hallucinations in later answers? This seems like part of the finding, not a waste.
And it is similar to humans, when humans switch subjects, they don't start with a blank slate with each question.
You don't need a study to find this out, you just need basic competence and knowledge of how LLM's work.
A study to discover that previous content still in context window influences future answers, including causing hallucinations, would be like a study publishing that they discovered that pressing the Command+C, Command+V button combination produces copies of content from the computer's clipboard.
> you just need basic competence and knowledge of how LLM's work.
A vanishingly small number of users might claim this, and a vanishingly small number of those would be accurately assessing themselves in doing so.
Vendors have actively misrepresented their products as intelligent agents and most users have dutifully adopted that understanding, perhaps with some latent skepticism. They almost universally don't know how it works, what makes it work less well, or how to evaluate its output on important topics. Every study that might start a news cycle starting a discussion on those topics is an extremely useful study.
That is like saying "We've known the impact of CO2 on atmosphere for a 100 years, you just need basic knowledge of chemistry, no need for any further study"
I never said that there isn't need of any further study into LLM's. What I did say is that doing a study in which in which the results are skewed by avoiding using one of the most fundamental best practices for interacting with LLM's, as easily derived from a surface level understanding of one of the most basic principles of LLM's, well that is just irresponsible.
The author clearly had some understanding that context windows could influence their results, but they still decided to release an analysis that does not separate one data gathering technique from the other, allowing them to cherrypick LLM answers from either technique as needed, depending on whether they want to show more or less hallucinations.
It's not that we don't need a study, it's that we don't need bad studies.
From Study: "The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency".
So both ways.
Are you saying they took both methods and intermixed results to skew a narrative? That might be a bit of a leap, but I didn't go find the raw data to disprove that.
It looks like they asked questions within a context window, and also isolated in separate context windows. And compared results.
It seems like this was actually part of the study. How much does the context window skew results, versus if questions were independent? How is that a bad study?
You are saying the study is bad for doing what the study said it was doing. How can using the same context window be bad if studying the context window is what they were looking at. It sounds like you wanted a different study done where the data gathered would be different.
"context windows could influence their results"
How much and in what way is useful to study. And, as windows get longer, many common users are just going with one long context and not starting a new window.
Stop using the term "Hallucinations". GPT models are not aware, do not have understanding, and are not conscious. We should refrain anthropomorphizing GPT models. GPT models sometime produce bad output. Start using the term "Bad Output".
A bit off topic, but am I the only one unhappy about the choice of the word "hallucinate" to describe the phenomenon of LLMs saying things that are false?
The verb has always meant experiencing false sensations or perceptions, not saying false things. If a person were to speak to you without regard for whether what they said was true, you'd say they were bulshitting you, not hallucinating.
Bullshitting implies knowing that you're lying. Some sort of malice or intention to deceive.
Hallucinating means the LLM really "thinks" that you can use PVA glue in a pizza recipe. It's not trying to screw you over. It's just that the token generator has found a weird path through the training set. (I'm sure I didn't word that last sentence correctly)
I think "hallucinate" is a spot-on description of what's happening under the hood.
Harry Frankfurt had a more useful definition of bullshitting. The essence of it is that while liars care about the truth and intend to deceive, bullshitters don't know or care about the truth - they want to impress, or avoid looking stupid, or something similar.
Hallucination is clearly the wrong word here, as is lying. Bullshitting isn't much better. Confabulation, however, is very close to what LLMs are doing when they make up stuff. https://en.wikipedia.org/wiki/Confabulation
I don't think people use "bullshitting" interchangably with "lying". I'm partial to this characterization:
So bullshitting isn’t just nonsense. It’s constructed in order to appear meaningful, though on closer examination, it isn’t. And bullshit isn’t the same as lying. A liar knows the truth but makes statements deliberately intended to sell people on falsehoods. bullshitters, in contrast, aren’t concerned about what’s true or not, so much as they’re trying to appear as if they know what they’re talking about. ... [W]hen people speak from a position of disproportionate confidence about their knowledge relative to what little they actually know, bullshit is often the result. [1]
Doesn't this description like what LLMs do all the time?
Yeah I missed the nuance on the other side. If I know 80% of a subject, I can sometimes convince myself that I'm capable of "filling in the gaps" with on-the-fly constructions.
But because "bullshitting" has some sorta agency behind it, while "hallucinating" is a thing that happens to you, I still lean toward the latter. But even better are the other replies that suggest "confabulation."
"Hallucination" just sounds so nicely dystopian as we start to think about our coming AI overlords.
If you ask the model what the color of grass is, and it answers blue, then that would indeed be false (or maybe a lie). I think most people wouldn’t call that a hallucination.
But if you ask it for a court case, and it makes up a whole false case file with fake names and fake facts and everything, then calling that ‘false’ seems to be an understatement. Hallucination seems a good label for that kind of thing, imo.
The thing saying it is not a person, and has no beliefs.
The problem is basically epistemology. GPTs don't have any. Arguably they don't know anything. (Arguably they do to an extent, because the knowledge is encoded in the words in the training data.) But even if they know things, they don't know that they know, and so they cannot tell between "knowing" and "not knowing".
This is all semantic bullshit. I could raise a child devoid of any outside contact and teach them that klingons enslave our people and we must hide in the woods. Things you see in the skies are their hunting machines. They could completely "know that" and its a total fabrication.
Besides harping on the fact that "hallucination" is unnecessarily anthropomorphizing these tools, I'll relent because clearly that argument has been lost. This is more interesting to me:
> When there is general consensus on a topic, and there is a large amount of language available to train the model, LLM-based GPTs will reflect that consensus view. But in cases where there are not enough examples of language about a subject, or the subject is controversial, or there is no clear consensus on the topic, relying on these systems will lead to questionable results.
This makes a lot of intuitive sense, just from trying to use these tools to accelerate Terraform module development in a production setting - Terraform, particularly HCL, should be something LLM's are extremely good at. It's very structured, the documentation is broadly available, and tons of examples and oodles of open source stuff exists out there.
It is pretty good at parsing/generating HCL/terraform for most common providers. However, about 10-20% of the time, it will completely make up fields or values that don't exist or work but look plausible enough to be right - e.g., mixing up a resource ARN with an resource id, or things like "ssl_config" may become something like "ssl_configuration" and leave you puzzling for 20 minutes what's wrong with it.
Another thing it will constantly do is mix up versions - terraform providers change often, deprecate things all the time, and there are a lot of differences in how to do things even between different terraform versions. So, by my observation in this specific scenario, the author's intuition rings completely correct. I'll let people better at math than me pick it apart though.
final edit: Although I love the idea of this experiment, it seems like it's definitely missing a "control" response - a response that isn't supposed to change over time.
Please keep harping. The marketing myths that gets circulated about these models are creating very serious misunderstandings and misallocation of resources. I am hopeful that more cautious and careful dialogue like this will curb the notions of sentience or human intelligence that exciting headlines seemed to have put in the public discussion of these tools.
What’s the alternative? You can’t just say “don’t say that”. There needs to be something you can say instead, 5 syllables at the most, which evokes the same feeling of confident wrongness, without falling into anthropomorphism. It’s a tall order.
Confabulation is a term often brought forward as an alternative, but compared to hallucination almost noone knows what confabulation means. Metaphors like hallucinating might be anthropomorphizing, but they convey meaning well, so personally I look for other hills to die on.
Same with "it's not really AI", because no it's not, but language is fluid and that's alright.
How about “bullshit?”
it is perhaps wise to keep stronger characterizations, like "bullshit", for a soon to come future state where we need it as a descriptor to distinguish from "mere" hallucination.
Well, if you want to convey confident incorrectness - hallucination is definitely not the word, confabulate is far more like what is happening here. But, that's still anthropomorphizing. I'd prefer "incorrect response" or "bug."
Agree. Incorrect response, or faulty, or erroneous, and/or unsuitable.
We do not call it "hallucination" when a human provides unfounded, or dubious, or poorly-structured, or untrustworthy, or shallowly parroted, or patently wrong information.
We wouldn't have confidence in a colleague who "hallucinated" like this. What is the gain in having a system that generates rubbish for us?
You can say "Bullshit". LLMs bullshit all the time. Talk without regard to the truth or falsity of statements. It also doesn't pressupose that the trueness is known, nir deny it, so it should satisfy both camps; unlike hallucination which implies that truth and fiction are separate.
I wonder if there is some sort of transition between recalling declarative facts (some of which have been shown to be decodable from activations) on one hand and completing the sentence with the most fitting word on the other hand. The dream that "hallucination" can be eliminated requires that the two states be separable, yet it is not evident to me that these "facts" are at all accessible without a sentence to complete.
Technically, "bullshit" is the most accurate term. From "On Bullshit" by Professor Harry Frankfurt:
"What is wrong with a counterfeit is not what it is like, but how it was made. This points to a similar and fundamental aspect of the essential nature of bullshit: although it is produced without concern with the truth, it need not be false. The bullshitter is faking things. But this does not mean that he necessarily gets them wrong."
Both "hallucinations" and valuable output are produced by exactly the same process: bullshitting. LLMs do for bullshitting what computers do for arithmetic.
So the verb is "bullshitting" which does an even worse job of avoiding anthropomorphizing or attributing sentience to the model. At least "hallucinating" isn't done with conscious effort; "bullshitting" implies effort.
Frankfurt's use of bullshit is what has always came to my mind also but you make an excellent point.
I think we really need a new word for this process because it really is just not comparable to anything previously.
Unfortunately, "hallucinate" is a horse that has left the barn with seemingly no possible way of replacing the horse at this point.
It's a computer bullshitting, the same way as a computer calculating is comparable to a human calculating unaided by a computer.
No, it ascribes accountability to the humans who employ a bullshitting machine to bullshit more effectively. It doesn't anthropomorphize anything, any more than "calculating" anthropomorphizes a computer doing arithmetic.
If you can ascribe accountability of "bullshitting" or "calculating" to the human who's using the machine then there's exactly no reason "thinking" or "writing" can't be ascribed to the human who's using the machine. There's no obvious line where the semantics of some words should or should not apply to a machine for behaviors that (up until recently) only applied to humans.
It just draws too many annoying comments and downvotes, and has been discussed ad nauseam on this forum and others - but I broadly agree. There are "features" with these applications where if I'm rude, or frustrated with the responses, the model will say things like "I'm not continuing this conversation."
How utterly absurd, it has no emotions, and there's no way that response was the result of a training set. It's just dumb marketing, all of it. And the real shame is (and the thing that actually pisses me off about the marketing/hype) that the useful things we actually have uncovered from ML or "AI" the last 10 years will be lost again in the inevitable AI winter we're facing following from whenever this market bubble collapses.
what you're referring to has nothing to do with how GPTs are pretrained or with hallucinations in and of themselves, and everything to do with how companies have reacted to the presence of hallucinations and general bad behavior, using a combination of fine tuning, RLHF, and keyword/phrase/pattern matching to "guide" the model and cut it off before it says something the company would regret (for a variety of reasons)
In other words, your complaints are ironically not about what the article is discussing, but about, for better or for worse, attempts to solve it.
I mean, in so many words that's precisely what I am complaining about. Their attempt to solve it is to make it appear more human. What's wrong with an error message? Or in this specific example - why bother at all? Why even stop the conversation? It's ridiculous.
RLHF is what was responsible for your frustration. You're assuming there is a scalable alternative. There is not.
> What's wrong with an error message?
You need a dataset for RLHF which provides an error message _only_ when appropriate. That is not yet possible. For the same reason the conversation stops.
> Or in this specific example - why bother at all? Why even stop the conversation? It's ridiculous.
They want a stop/refusal condition to prevent misuse. Adding one at all means sometimes stopping when the model should actually keep going. Not only is this subjective as hell, but there's still no method to cover every corner case (however objectively defined those may be).
You're correct to be frustrated with it, but it's not as though they have some other option that allows them to detect how and when to stop/not stop, error message/complain for every single human's preference patterns on the planet. Particularly not one that scales as well as RLHF on a custom dataset of manually written preferences. It's an area of active research for a reason.
Don't anthropomorphize LLMs. They hate that.
And it's not even a question of LLMs getting answers "wrong". It's just generating associated text. It has no concept of right or wrong answers.
I think it’s totally fine to am LLMs. In the end they have been trained on human input.
I get the concern over what using the word hallucination implies, I also think it is a fairly fitting word.
We need something easy to explain when these systems are straight up wrong. Something that a normal non technical user will understand. Sure saying "wrong" could be easy enough, I think "Hallucination" also has a simplicity too it.
Part of the problem is that these models will appear to confidently be wrong. Hallucinate to me kinda goes along with this, it isn't just wrong things are being made up.
But regardless of that, people are used to calling it hallucinating. We are also up against an effort to downplay any concern over this fundamental problem with the technology and already trying to push it as a general AI (And we have to recognize there is a ton of money on pushing this exact narrative), that I would be worried about confusing the topic by pushing for an alternative term giving leeway to further downplay the problem.
There is a secondary issue of LLM's taking questions literally, and not really being able to (at the moment) deny the premise of a question. For example, if you google benefits of circumcision, the LLM will quite literally print all the benefits. But it also wont contextualize them, it wont frame them, it wont provide counter arguments, it just responds literally to the question.
In favour of "hallucination" it's not that much of an anthropomorphization because hallucination in a human context is something quite different - seeing ghosts and the like. If you use it in the context of an LLM everyone knows what you mean. The human terms for making random stuff up would be bullshiting, imagining etc.
Maybe instead of hallucinate? Use 'BS'?
To anthropomorphize even more. Since humans will also just create "BS" as an answer if they don't know the answer, or will combine half bits of knowledge into something to sound like they know what they are talking about.
I think it's a perfectly good word for what is happening.
just to be clear, I see it like this (for now):
if a GPT does it and turns out to be false, then it's an hallucination and it's bad (goto more training)
if a human does it, then truth becomes "self-expression" (art) so we call it creativity and it's good
> if a human does it, then truth becomes "self-expression" (art) so we call it creativity and it's good.
Depends. Once I misremembered the usage of the command "ln", and I wiped ~10 machines inadvertently.
Nobody called it self-expression / art, and none of the results of my little "experiment" were good.
Do it a couple of times, and you'll be updating your CV.
No, if a human does it by accident, as is clearly the case here, we call it "hallucination", "misremembering", "mandela effect" or "dementia"
the point of contention comes out of how you are saying "...by accident" but I'm sidelining the intention
> For this experiment we used four models: Llama, accessed through the open-source Llama-lib; ChatGPT-3.5 and ChatGPT-4, accessed through the OpenAI subscription service; and Google Gemini, accessed through the free Google service.
Papers like this really need to include the actual version numbers. GPT-4 or GPT-4o, and which dated version? Llama 2 or 3 or 3.1, quantized or not? Google Gemini 1.0 or 1.5?
Also, what's Llama-lib? Do they mean llama.cpp?
Even more importantly: was this the Gemini model or was it Gemini+Google Search? The "through the free Google service" part could mean either.
UPDATE: They do clarify that a little bit here:
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
Llama 3 came out 18th of April, so I guess they used Llama 2?
(Testing the prompts sequentially in a single chat feels like an inadvisable choice to me - they later note that things like "answer in three words" sometimes leaked through to the following prompt, which isn't surprising given how LLM chat sessions work.)
One of the biggest places I've run into hallucination in the past has been when writing python code for APIs, and in particular the Jira API. I've just written a couple of CLI Jira tools using Zed's Claude Sonnet 3.5 integration, one from whole cloth and the other as a modification of the first, and it was nearly flawless. IIRC, the only issue I ran into was that it was trying to assign the ticket to myself by looking me up using "os.environ['USER']" rather than "jira.myself()" and it fixed it when I pointed this out to it.
Not sure if this is because of better training, Claude Sonnet 3.5 being better about hallucinations (previously I've used ChatGPT 4 almost exclusively), or what.
Context helps so, so much. Adding terminal output, IDE diagnostics, code, remote documentation into the context really improves the output, and editors like zed make it very convenient to do.
Are we really still having this conversation in 2024 ?! :-(
Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ? These aren't expert systems dealing in facts, they are statistical word generators dealing in word statistics.
The useful thing of course is that LLMs often do generate "correct" continuations/replies, specifically when that's predicted by the training data, but it's not like they have a choice of not answering or saying "I don't know" in other cases. They are just statistical word generators - sometimes that's useful, and sometimes it's not, but it's just what they are.
Yes, they are auto-completers, but they are auto-completers that are layered AND operate in higher dimensional spaces. This throws all intuitions off, and I think makes it misleading to think of them as "just" auto-completers. That's part of the story, but not the whole of it.
I suspect we are much closer to auto-completers than most of us like to think, but we're also trained+incentivized by culture, education, parenting, socializing, to produce "useful" results.
Maybe part of the problem is in the data set: how much modeling of "how to admit ignorance or uncertainty" are in LLMs training data sets? If you read the internet, all you see is confident replies to other confident replies. Ignorance or non-confidence tends to elicit either a bluff or non-response. If you read technical literature, you see much of the same.
Maybe LLMs are trained on a dataset, and thereby inherit a culture that's accidentally biased toward ignorant confidence. In human conversation, if somebody asks a question and I don't know the answer, I say I don't know. On the internet, I just skip it and leave it for somebody else who thinks they know.
All this is to say: maybe a statistical autocompleter can admit ignorance instead of firing "neural noise" based on barely-there loose associations. Maybe it just needs a stronger pathway toward talking about not knowing when there's not a strong association.
One problem is that the LLM's own knowledge, or lack of it, doesn't follow from any individual training sample(s). Even if there were a bunch of "I don't know X" samples in the training set, that ought to be trumped by one authoritative "X is ..." one.
The next problem is that the LLM doesn't know how reliable it's various training sources are (unlike a human who might trust personal experience > textbook > twitter comment), or even which samples come from which source so that it could learn that.
I have thought a lot about this and I suspect "I don't know" would be devastating to the model.
The magic is in the fact that the model can't say "I don't know". The model would have to have arbitrary thresholds and a type of domain/context classification in order to set this arbitrary threshold for the conversation in order to say "I don't know". It would create all these unsolvable boundary conditions. AGI in this context then would be minimizing the need for "I don't know" until it no longer applies. Is that possible? I don't know :)
I would defer to Chomsky also on the subject that we are basically nothing like these models when it comes to language and we are not auto-completers.
Wasn't it that it by design prefers things expressed with certainty?
I do not find this view to be useful.
The are cases from my personal experience, where I asked somewhat esoteric practical questions, that likely do not (seem) to have a clear answer in the web and I have got considerable help from ChatGPT. At some point this dichotomy of 'statistical word generator' vs 'true intelligence' should go away as it's just not useful. (I think these discussions always lead to Chinese Room problem; and IMO at some point it does not matter what 'dumb' process is behind, provided it solves a problem or the system behaves like an intelligent agent)
Except there is a qualitative difference in the class of knowledge that a statistical word generator and an expert system would generate.
Just because a LLM _can_ offer valuable and insightful information, doesn't mean that it doesn't also hallucinate. The most troubling factor here is that often the hallucinated content also looks like valuable and insightful information, but is just incorrect. This is the use. You have to hold that awareness whenever interacting with these systems.
Yup exatly. They are dream machines. LLMs without other systems can only work in the flow. The fact that in this word flow the larger LLMs can generate navigation instructions for actual mazes and solve random algorithmicly generated problems doesn't mean they are not hallucinating it just means we're getting wonderfully useful hallucinations.
Hallucination is a term that means "imagined facts", so it's very hard for me to parse this comment into something meaningful beyond "if we say it always generates hallucination, we can say it always generates hallucinations"
You Google a restaurant that appears to be open. You go there, and you find that the restaurant is no longer there.
Did Google "hallucinate" a restaurant? Because this is no different.
We can empirically test if hallucination is a good word for communicating this concept, by checking if people describe(d) that as a hallucination (they don't).
This is all IMHO, I'm not trying to be difficult or nitpick, I just don't understand the idea as communicated. As applied to LLMs, it sounds like hallucination == could be wrong, and this Google example seems further away even when steel-manning, ex. we don't say all Google results are hallucinated.
It doesn’t mean automatically wrong. It’s just bullshitting. It makes up something that fits a pattern. Depending on the question, the pattern may be right more often than not.
If you ask ChatGPT, “hey is the McDonald’s near my house open at 6pm?” It doesn’t know anything about where you are or if there’s a McDonald’s or what its hours are. It will likely hallucinate that sure, it’s open at 6pm. But when it does so is it “right” in a meaningful way?
Yup. IMHO, I think "bullshitting" is a much better word than hallucinating and/or getting it right!
Much like real life bullshitters, it is inclined to say something truthful-sounding, but doesn't actually have a strong reliability towards truth per se.
Bullshitting implies intent to deceive. As far as we know, an LLM honestly "believes" (as if you need another rabbit hole) what it says. Delusion, perhaps?
Delusion implies a degree of consistency, though. LLMs can be on point one minute and completely off the rails the next even when prompted with the same prompt. Hallucination fits better here as it speaks to the real-time "perception" (there's another one for you).
An LLM is not a brain, though, so no matter which analogy you choose, it will come with some flaws. Regardless, "hallucination" has moved past analogy territory and now has its own LLM-specific usage with reasonably wide acceptance so the analogy angle is now moot anyway.
>Bullshitting implies intent to deceive
Not in the formal sense. The philosopher Harry Frankfurt famously distinguished bullshitting from lying because a liar knows the truth and is trying to hide it where a bullshitter is simply trying to sound convincing and may or may not be telling the truth (and may not even know themselves if they are)
https://en.wikipedia.org/wiki/On_Bullshit
In the current formal sense. It may be true the formal sense in 1986 was different. Words do evolve in meaning over time, but since we're talking about right now...
You are right that lying and bullshitting are different. A lie is a false statement with intent. Bullshit is nonsense with intent. A false statement and nonsense may share some similarities, but are ultimately different.
Perhaps nonsense is the word we should be applying to LLMs, but often what they say isn't nonsense, even if only by accident, so that doesn't exactly work either. Regardless, it doesn't matter now. As before, "hallucination" has moved beyond analogy and now has its own LLM-specific usage.
You can ask it „Is the McDonalds nearby open? I live in Brumbledon, Ohio.“ It will do a search, and then confidently state that the McDonalds in Brumbledlon, Ohio is in fact open 24/7. I guess it doesn’t matter that no such town exists.
That's fair. I'm honestly stunned how little work there's been to incorporate search, in a real way, into products. perplexity, ChatGPT and bing to some extent, that's it.
(Screenshot is my yet to be released flutter app)
https://imgur.com/a/HCVEfkQ
> Except there is a qualitative difference in the class of knowledge that a statistical word generator and an expert system would generate.
There's a lot of difference between the two, and you don't have to treat it as one or another. It's OK to treat it as something in the middle.
> The most troubling factor here is that often the hallucinated content also looks like valuable and insightful information, but is just incorrect. This is the use. You have to hold that awareness whenever interacting with these systems.
Completely agree, sans the word "troubling". It's not troubling. It is what it is. As long as you keep it in mind when you use it, and treat it as an entity that can be completely wrong, and use it where it's OK to be completely wrong (e.g. when the output is easily verifiable), there's nothing "troubling" with that.
That's not what the GP is arguing, though
You're stuck on a problem. You grab a random comic book from the shelf and something that is written in the comic book sparks your solution in your head.
How intelligent is the comic book? Is it hallucinating or being correct or what?
The answer is none of those things, right? YOU did the thinking. Not the inanimate object; what it did was a happy coincidence.
Pointless discussion on semantics.
If a random page in a random comic book gives me the answer I seek 30% of the time, it's incredibly useful. Little effort was spent in seeking the answer, and the 70% of the time it is wrong led to little waste in time.
Now if you put a mechanical arm interface in the middle where I give my query to a machine, and it randomly picks the comic book and page, which answers my question 30% of the time - I have no trouble calling it "intelligent".
Contrast it with Google searches that don't give me the answer I seek, but use up an order of magnitude more of my time.
Can you provide some examples of things that aren’t on the web but that ChatGPT helped you with? I’ve yet to see an example that’s not in the likely training set (which includes more than just the public Internet).
But to your point, I disagree: the mechanism matters. Just because you haven’t detected the limitations of the mechanism behind ChatGPT doesn’t mean it’s not there.
Perhaps I'm confused, but your question seems contradictory and ambiguous. You first imply (I think?) that ChatGPT is limited to web-sourced information, but then acknowledge the training set includes more than just the public web. Can you please clarify what you are asking?
My guess is he's referring to things that are out there on the web, but not in the top results and not easy to find.
It's like you want to protect our right to hallucinate, because ChatGPT can't.
This is just admitting you don't care about knowledge or truth.
Sure, but the question posed isn't whether LLMs exhibit intelligence (obviously so, minimally in Chinese Room sense), or can they combine sources (sure, no way to stop them), but why do they hallucinate.
Notwithstanding the amazing things they can do, I don't think it helps understanding by viewing LLMs in too abstract of a way as intelligent agents. After all, in reality they are "just" language models, and hopefully in 2024 the nuance of what they needed to learn to be GOOD language models doesn't need to be explicitly stated every time we discuss them.
Looking at them as language models, it's easy to explain why they hallucinate, are poor reasoners, etc, and IMO does nothing to distract from understanding why they also exhibit intelligence when operating "in distribution".
I've never been worried about LLMs. I've always been worried about how people will use LLMs and how they will interpret the output of LLMs. Especially people who don't understand what LLMs are doing.
Why is this concern more important the what people interpret from the media, social media and the dissemination of information in general where lies and fabrications are also commonplace? Like surely people will always fall for nonsense, lies or fabrications and there is nothing that can be done about that.
As with all these discussions: accountability and consequence.
We can point at a media company, call out its vested interests, scream about its bias, protest in front of its office, sue it for slander and misrepresentation. We can call out individual personalities the same way. We can strive to drive the companies out of business and the personalities out of work, if we deem it necessary, and we can accumulate a paper trail that holds each one to account.
As neither individuals nor corporate entities, algorithms do not yet carry this kind of legal or public accountability even as we some start to hold them up as oracles. In most cases, failures of an algorithm are treated simply as bugs or user mistakes. Nobody is responsible for anything bad and the so the algorithm can persist and its vendor can shrug off their own responsibility by gesturing towards an perpetual development process instead of accepting consequence: "we work to make the algorithm better every day, try again tomorrow!"
>We can point at a media company, call out its vested interests, scream about its bias, protest in front of its office, sue it for slander and misrepresentation.
Right but the previous election had Russian servers spinning up fake news websites that displayed straight up generated news. Again how do you hold them accountable? You can't the only defence against bullshit is independent thinking.
Because LLMs strip away all the context surrounding the information it spits out that let you evaluate its trustworthiness. They're incredibly useful tools, I use them constantly when coding but I can do that because I know enough to validate the information and it happens that the cost of validating the output with the docs is shorter than reading them to find the relevant functions.
I wouldn't dare try to use an LLM for a chemistry question because I wouldn't be able to tell if it makes any sense or not. But if you're not a "tech person" and all you see is some company advertising their AIs as magical knowledge engines with disclaimer text that wouldn't pass accessibility tests, why wouldn't you assume they know their stuff? The Perplexity ads are bordering on negligent.
The difference is that web/social media is branded as an intelligent being you can ask any question of. We all agree the web is _also_ not reliable, but many people will think GPT / Gemini are verifiably accurate when they aren’t.
> We all agree the web is _also_ not reliable
You need to expand your circle a bit more :-)
Not much different. Except that there's a dumb thing that humans are doing which is giving weight to the magic brainy computers.
All that training data would make even grep look smart. I'm just glad those magical forest creatures made the data in the first place.
I've found that people in general seem to trust computers more than humans, which made sort of sense for a while.
What they don't fully realize is that this is a completely different game; now the computer is just guessing, as opposed to following a deterministic algorithm to the answer.
And this misunderstanding carries the potential for pretty serious consequences, good luck getting that loan once a computer finds some arbitrary pattern and says no.
If only it would tell you "You've criticized the war effort that day in 2004", in stead it will do parallel construction. The end game will be a kind of SEO for human profiles and we will live happily ever after by the best practice guide lines.
I've always been worried about how people will use LLMs and how they will interpret the output of LLMs. Especially people who don't understand what LLMs are doing.
The problem isn't the people. It's the tech companies.
The tech companies are telling people that it's intelligent, and the tech companies are using it to answer people's questions as if they're presenting facts.
People are using it the way they're told.
If you advertise something as a solution, don't be surprised when people use it to solve things.
Always remember companies are people too... until we make companies powered by AI. Till that day the underlying problem is always with the people.
This is the most prescient point I have read hitherto regarding LLMs. We are fashioning for ourselves gods of wood and stone.
s/LLM/Wikipedia/g
Pretty much all research (and there's a fair few with different methodologies) on this converge on the same conclusion:
LLMs internally know a lot more about the uncertainty and factualness of their predictions than they say. "LLMs are always hallucinating" is a popular stance but wrong all the same. Maybe rather than asking Why models hallucinate, the better question is to ask "Why not?". During pre-training, there's close to zero incentive to push any uncertainty to the forefront (words).
GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r
Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - https://arxiv.org/abs/2310.06824
The Internal State of an LLM Knows When It's Lying - https://arxiv.org/abs/2304.13734
LLMs Know More Than What They Say - https://arjunbansal.substack.com/p/llms-know-more-than-what-...
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334
Yes, because just like chess ELO that we discussed the other day, they need to learn this in order to do well on their training objective - impersonating (continuing) their training sources. If they are continuing a lie then they need to have recognized the input as having this "context", and take that into account during prediction.
Right but then the problem of hallucination has little to do with statistical generation and much more the utter lack of any incentive in pre-training or otherwise to push features the model has already learnt into the words it generates.
Right, more due to the inherent nature of an LLM than due to that nature being a statistical generator, although as such they amount to the same thing.
One way of looking at it is model talking itself into a corner, with no good way to escape/continue, due to not planning ahead...
e.g. Say we ask an LLM "What is the capital of Scotland?", and so it starts off with an answer of the sort it has learnt should follow such a question "The capital of Scotland is ...". Now, at this point in the generation it's a bit late if the answer wasn't actually in the training data, but the model needs to keep on generating, so does the best it can and draws upon other statistics such as capital cities being large and famous, so maybe continues with "Glasgow" (a large famous Scottish city), which unfortunately is incorrect.
Another way of looking at it rather than talking itself into a corner (and having to LLM it's way out of it), is that hallucinations (non-sequiturs) happen when the model is operating out of distribution and has to combine multiple sources such as the expected form of a "What is .." question reply, and a word matching the (city, Scottish, large, famous) "template".
I think this may be the best explanation I've seen on the topic!
But, shouldn't that situation be handled somewhat by backtracking sampling techniques like beam search? But maybe that is not used much in practice due to being more expensive.. don't know.
Thanks!
I'm not sure if beam search would help much since it's just based on combining word probabilities as opposed to overall meaning, but something like "tree of thoughts" presumably would since I doubt these models would rate their own hallucinated outputs too highly!
Asking "why do large language models hallucinate?" is a good question to ask and answer, just like "why do birds sing?" or "why is the sky blue?" are. The problematic part is when you're somehow surprised that the sky is blue, birds sing and LLMs do nothing but hallucinate.
Tiger gotta hunt, bird gotta fly, LLM gotta answer "why, why, why?"
Tiger gotta sleep, bird gotta land, LLM gotta produce a token from the context at hand
I think we're all having different experiences and conversations with LLMs, and sometimes framing them purely as "statistical word generators" or expecting them to function like AI in the traditional sense of automated problem-solving might not capture the whole picture.
Like many in this community, I use LLMs daily. My main use case now isn’t software development but rather getting assistance in connecting concepts that I don’t know precisely, guided by my intent. In some ways, it's more akin to "out-of-the-box thinking" but with a tool that helps me explore ideas I might not reach on my own or suffer within the economy of search [1][2]. I might know something about X, Y, and Z, but there's a concept W that ties them all together. Without using W, people in that field might not grasp the connections between X, Y, and Z. Apologies for the abstraction here. I suppose it's my epistemological bias showing!
Even if LLMs don't precisely connect the dots, they help my brain connect them faster.
[1] https://en.wikipedia.org/wiki/Search_theory
[2] https://www.di.ens.fr/~lelarge/soc/varian2.pdf
"Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ?"
That's exactly the question the paper attempts to answer: why do LLMs ever get it right? The answer is that on topics where there's a lot of data and a general consensus on what the right answer is, the statistical model will find that answer, and otherwise you get junk. That's why they work so well for people trying to write Python or Javascript, for example.
But I already knew this, you might say. Sure, but the authors produced evidence to back it up.
There's really no "right or wrong" per se -- the question that's really being asked is "to what extent does it resonate with a person?"
As context sizes grow, it's easier to add lots of information outside of training data via in-context learning, which should offset that issue quite a bit.
> Why would a language model do anything other than "hallucinate"
Indeed, why?!?!?! Why do they so often get the correct answer to very direct questions? Saying "it's in the training data" -- I dare you to find anything in the training data that talks about how many "q's" there are in the word "mortuary", and yet it "hallucinates" up answers to this.
> without any care
What does it mean to care? What question could you ask of an LLM that would allow you to assess how much it "cares" about something?
> but it's not like they have a choice of not answering or saying "I don't know" in other cases
Is it your contention that the phrase "I don't know" has never occurred in the training data for an LLM?
There seems to be a dichotomy of reactions to LLMs. There are technical people saying "it's just an autocomplete engine reciting things from its training set" and there are non-technical people saying "it does more than just next token completion, it is trained to use language".
The first group is technically correct but ignores the fact that it can, emergently, do things far outside of an explainable capability, the second group is technically incorrect but correctly perceives that it can use language in novel ways.
Neither group captures the fact that we built extremely complex linear algebra machines that for reasons we do not understand, despite being trained on an incredibly simple task (next-token-prediction) are capable of actually using language in a way that ten years ago we assumed only humans could do.
> Is it your contention that the phrase "I don't know" has never occurred in the training data for an LLM?
No, but when it does occur in the training data it's a reflection of that particular source/speaker not knowing, which isn't the same as the LLM not knowing because it was also trained on millions/billions of additional sources.
For an LLM to learn to say "I don't know" appropriately, it would need to know when it itself doesn't know (and have that change if you told it), and of course it doesn't have that capability.
> of course it doesn't have that capability
Yes, of course not. Of course. There's no way, for example, that I could ask it if it knew what number I was thinking of, and, after I tell it the number and ask the same question, that it could express that it didn't know before and did know after. Absolutely impossible. Out of the question that it would have this capability. Clearly impossible given its architecture. No way that it could possibly do this task. Why, it would require advances in machine learning and quantum computers and understanding a theory of consciousness and perception at a level that we won't have for centuries. Maybe even a completely new design and training procedure to even begin to approach this insurmountable task.
And if it did demonstrate this ability, then clearly your assessment of its capabilities would be completely wrong and you would have to step back and reconsider how much its training data reflects its abilities.
Well, so far there's no model that knows what it doesn't know, and hallucination continues to be a problem.
So maybe you can enlighten us all, AI labs included, with your genius as to how to solve hallucination, and how to do so only via changes to the training set since that is what you suggest.
To make things easy for you, lets assume that every training text has been augmented with source information such that the model could potentially learn which sources are trustworthy on given subjects or not, and therefore assess whether it knows something or not.
So what else are you going to add to the training set that you claim would induce it to learn this self-referential "I know X" knowledge to better achieve it's next word loss ? Why do you think the AI labs have not done what you are suggesting ?
You misunderstand me -- my position is that we are completely ignorant as to how LLMs do what they do, and it is delusional to think that we understand what their capabilities are or how they are limited.
Until we understand why they sometimes don't hallucinate we can't begin to approach the problem of why the do. And above I directly refute the notion that no model "knows what it doesn't know" -- they are manifestly capable of demonstrating ignorance.
As to next steps in this I have no concrete thoughts, and I don't even know whether LLMs are a dead end. What I suspect is that as we apply more computational power to the problem, as we've seen in the past, computational power plus network depth plus training data means that the structure of the models (like the attention mechanism or the tokenization) will eventually be unnecessary and the structures will be self-discovered and more capable for it.
As models get deeper and more expressive I am confident that things like a broader capability to understand the limitations of its own knowledge or to test its knowledge against objective reality will emerge organically. We may be able to make leaps by creating shortcuts for some of these mechanisms, but in the end depth and computation power will trump all.
> my position is that we are completely ignorant as to how LLMs do what they do
That's simply wrong.
Whole teams of very smart people at places such as Anthropic are working on this exact problem (the field of "mechanistic interpretability"), and have made considerable headway, and have published their results.
Just a note: Truthiness means a feeling of truthfulness, even if not actually true. I'd argue that LLMs do care about that, but I suspect you meant 'truthfulness'.
> Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness) ? These aren't expert systems dealing in facts, they are statistical word generators dealing in word statistics.
That's a false dichotomy.
Their ability to generate facts is a consequence of the word statistics they are truly using. I think it's fair to say the statistics explanation is a more accurate interpretation.
More accurate than what? Statistics "explanation" is something which is technically correct. But it also doesn't present the full picture — for example, the fact that LLMs clearly build an internal mental model of the question they're talking about.
They build internal representations of the input only, and to the extent needed, to get the statistics right. This really isn't a "world model" of factual data, but rather a "source model" of what would various sources (training texts) say.
The responses of the model don't represent what it understands per some internal model, because there is no "it", only models of the sources it was trained on, and it'll just as happily generate lies as truths, or smart vs dumb answers (it's all just words) if that is what its source modelling calls for.
What most people mean when they say the model is hallucinating/bullshitting isn't where it has learnt a lie, but rather where is is operating "out of distribution", and is therefore (unknowingly) generating a mashup from multiple only loosely related/matching source contexts.
I can offer cite that hallucinations are an innate property of LLMs, can you provide one that shows they have 'an internal mental model'?
https://arxiv.org/abs/2401.11817
Sure! https://thegradient.pub/othello/
I'm not convinced at all. The only thing they are doing is perturbing some of the model weights in intermediate layers, and seeing if the output of the final layer is consistent with the perturbations. It would be a shitty model if that was not the case.
The fancy part in the paper is figuring out how to perturb the intermediate layers in the way you want. But the findings are not impressive.
Note also that the "probe geometry" stuff is so speculative they left it out of the academic paper completely.
In the same way, it has been known since the 90s that if you take the matrices from Finite Element Models and visualize them as graphs, structures appear that kind of resemble the physical appearance of the object being modelled. Here for instance is for a helicopter:
http://yifanhu.net/GALLERY/GRAPHS/GIF_SMALL/Pothen@commanche...
Yet nobody thinks Finite Element Models have an internal mental representation of the world.
"Yet nobody thinks Finite Element Models have an internal mental representation of the world."
At this point, I'm not sure some wouldn't argue that.
The difference is, put the AI on a loop, with constant feedback, learning. Instead of just a 'pre-trained' model. Make the actual model, live, always learning, so the context window is infinite. This of course would not be for everyone, because it would take all the resources of the training infrastructure to be focused on one person/view. But that gets closer to the human mind, and at that point, we probably couldn't say for sure that the 'perturbations' aren't experiencing something subjective.
Where is the proof that humans have an internal mental representation of the world.
We are using different meanings for terms.
Building a world model for a perfect information game is different than building an mental model of the external world.
In context learning is a well known property of LLMs, while real world generalization, often described through the common sense problem is not.
To me, 'mental models' are personal, internal representations of external reality, which LLMs currently lack, being limited to the corpus.
No, you're not. Are you genuinely trying to suggest that LLMs, which can:
- Construct arbitrary text that isn't just grammatically but semantically coherent
- Derive intent, subtle intent, from user queries and responses
- Emulate endless different personalities and their reactions to endless stimuli
- Describe in detail the statics and dynamics of the world, including sight, smell, touch and sound
do not have a model of the external world? What do you think a "corpus" means in this context? How is the "corpus" of sensory and evolutionary data that makes you up in any way different?
LLMs are excellent common sense reasoners, and they generalize just fine. Why exactly do you think they get things _subtly_ wrong? Make up API syntax that looks sensible but isn't actually implemented? In order to make these guesses they need to have generalized, they need an understanding of the structure underlying naming, such that they can produce _sensible_ output even if they lack the hard facts.
You are correct. We are flooded with studies on AI now, so can't find reference.
But just few months ago, saw example of AI, from video, building an internal representation of the world. An internal model of the world. Everyone saying this can't be done, it already is. Maybe can argue it wasn't an LLM, and then I'd say were nitpicking over which technology can do it or not. We already have example of tying them together, symbols and LLM's.
Might be related. https://www.nature.com/articles/d41586-024-00288-1 https://www.technologyreview.com/2019/04/08/103223/two-rival...
I believe they used the game to show how the same underlying technology would build a mental model not because the game was perfect-information, but because it was easy to probe for without a lot of other unrelated concepts getting in the way.
Perhaps this is what you're looking for (similar technique, larger model?) https://www.anthropic.com/news/golden-gate-claude
> In the “mind” of Claude, we found millions of concepts that activate when the model reads relevant text or sees relevant images, which we call “features”.
> One of those was the concept of the Golden Gate Bridge. We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark.
Sounds like a mental model to me! An internal representation of an external concept which exists in the real world.
Uhm.
I don't really know if using an "experiment" is the right way to test this?
Like, if you want to test such hypothesis you should first prove that these LLMs were created with that specific function in mind, we are not observing a natural phenomena, it is a manmade(manmade...loosly) algorithm .
You don't know the internal architecture of the system and you are testing in a reality narrow situation, this is basically the scientific method applied incorrectly.
I don't even like the word "hallucination", because it's an anthropomorphism, further suggesting/misleading into an impression that LLMs have anything to do with how a human would reason.
Fancy autocomplete. Useful, yes. Powerful, definitely. But it's just another tool, limited in applications the same way a hammer is - this is the part that many humans seem to struggle accepting.
The use of "hallucinate" is also my gripe.
Philosophy of language, and Wittgenstein in particular, has more specific words for the output insofar as they mean anything to humans: "senseless" or "nonsense".
But I don't think those terms are accurate. If I say "John Doe was born in 1987", that's a perfectly sensible sentence, even if I don't actually know when he was born and even if it turns out he was actually born in 1971. A nonsense sentence would be something like "John Doe cat five green explicate" or "h4stga3jkui7rutjdyrst".
I don't understand your point.
You've just re-presented one difference between sense and nonsense in philosophy of language. We might call that first phrase wrong or an error or inaccurate if that was the case. We don't need to use a word like "hallucinate" for your second phrase --- we can just call it nonsense.
"Hallucinate" is far more inaccurate and confusing. It paints a false picture that melds 2 distinct things, as you've described, which need not happen at the same time: errors and nonsense.
I think the term forst originated with the dream-like images of google deepmind, these were really similar to human visual hallucinations, i guess the term stuck
It's unfortunate, but it is what it is.
When a human hallucinates, that is due to the brain operating out of spec. When an LLM "hallucinates", it's performing exactly as intended, it just has delivered a nonsensical result which is the consequence of the tech behind the LLM.
Instead of "hallucinating" I would have preferred the term "bullshitting" -- in the Harry G. Frankfurt sense of not caring about the truth of one's utterances. But it's too late for that.
https://link.springer.com/article/10.1007/s10676-024-09775-5
Using "bullshit" would be interesting, but to me would introduce a backdoor anthropomorphism to describe the output. The picture is still too human.
Isn't Frankfurt's concept of bullshit made up of 2 parts: 1) a distinction between lying and telling the truth, AND 2) the absence of caring about either when speaking when normally it's assumed present?
Part 1 seems to apply, but part 2 wouldn't. It doesn't make sense to talk about GPT "caring" about its output beyond anthropomorphism. No one talks about their computer caring about having correct or accurate output and neither is it assumed. People would think your imagining a demon in the box. Really, it's even odd to say "GPT lied" outside of very specific circumstances.
I think "bullshitting" fits better than "hallucinating" - just keep spitting out words rather than admit ignorance, but maybe the best human analogy is freestyle rapping where one has to keep the flow of words coming regardless!
Maybe we just need to coin a new word for it - "LLM-ing" perhaps ?!
You aren't going to like this, but don't insult the hammers.
[flagged]
> Please, explain to me which exact algorithm you use to unify the "facts" you have such magic access to?
The "magic access" to "facts" (reality) is called embodiment and personal experience with the world in which we live.
An LLM has no embodiment, and therefore no access to reality/facts. All it has is a training set of mixed sources, and a training objective of impersonating those sources (whether an encyclopedia or 4chan comment).
[flagged]
""Are we really still having this conversation in 2024 ?! :-( ""
</Sarcasm> Are we in 2024 and people are still just saying "duh its statistics, nothing to see here".
I am impressed that with multimodal modals in the hands of everyone people still say this. I mean it’s true on the surface but the ability to take a picture of a data sheet and wiring diagram and get it to give me step by step instructions to wire two components together and an example of the UART protocol in cpp that while not directly functional captures all the essential information I need. That’s amazing and it’s not just word prediction, the source is images of documents with diagrams!
Anyone who does this and comes away jaded has lost the ability to dream.
It's not just true on the surface—it's true at its core. All those features you mention are based on statistical relationships. This doesn't mean that these systems can't be very useful, but it also doesn't mean that they are _intelligent_, as much as we like to call them that. They have no understanding of their input or output, beyond being able to pattern match and mechanically decide what the next token should be based on their training data.
Then you need to explain why statistics isn't enough to be intelligent. We are at the point where that isn't an obvious argument anymore and I'll point to the best models as to why you might be wrong.
Because being able to find patterns in large amounts of data will never make the system intuitively understand math[1], or history[2], or any other field. It will always depend on the data and biases we feed it, at least with the current approaches.
You might say that this is how humans learn and demonstrate intelligence, but it's not the same thing. We have the ability to advance our understanding of the universe without being explicitly trained on every topic. Until we can build systems that can do this, I wouldn't label them as intelligent.
But, again, this doesn't mean that they can't be useful.
[1]: https://towardsdatascience.com/9-11-or-9-9-which-one-is-high...
[2]: https://www.theguardian.com/technology/2024/mar/08/we-defini...
I never said it wasn’t statistical at any level so I’m not sure what your point is. My point was almost everything we do is an approximation (I mean, all of science and technology and engineering) and very little of it isn’t. Fake it until you make it is pervasive in human endeavors. Over time the errors and issues will be smoothed away until it’s still there but it’s not noticeable enough to be a major problem.
It being statistical doesn’t make it useless. It not being some metaphysical concept of awareness doesn’t make it useless.
But the multimodal model support puts the lie to it being just a stochastic parrot of language. The domains it works in -is- more abstract than syntax and grammar even if it’s grounded in it. That’s part of the entire power of the model. In the end it leads to a predicted token but the path it takes is more subtle and complex than a simple Markov chain of tokens.
> I never said it wasn’t statistical at any level so I’m not sure what your point is.
You said it was true on the surface, and I'm pointing out that it's true at its core. Your amazement at how well it works is based on our inability as humans to find the same patterns in the data as these models do. Ascribing some higher level sense of intelligence or understanding to these systems because of this party trick is anthropomorphizing what they actually are. We would all be better served by keeping in mind how they work, instead of being surprised when they do output the wrong pattern.
> Over time the errors and issues will be smoothed away until it’s still there but it’s not noticeable enough to be a major problem.
Why do you think this is guaranteed? We can keep throwing data and compute at these systems, and while we might continue to get better results, the current systems will never intuitively understand that e.g. 9.11 is lower than 9.2[1], unless those specific examples are in their training data. As we reach the limits of the data we can feed them, and we have to generate synthetic data, there's no reason to believe that the current approaches will ever fix these problems at their core.
> It being statistical doesn’t make it useless. It not being some metaphysical concept of awareness doesn’t make it useless.
I never said it was. I agree that this can be useful, but again, as long as we're aware of their limits. It's good keeping this in mind as we're reaching the peak of inflated expectations of the hype cycle.
[1]: https://towardsdatascience.com/9-11-or-9-9-which-one-is-high...
I dont get it, what's the unique insight in the article, isn't it just stats and bad data?
It's just being a bit reductive to just toss AI out as un-interesting because it is just based on statistics.
And, think there is question of 'bad data' so study is bad/invalid, versus studying impact of 'bad data' on responses is the result.
I meant, the answer to the question of why does AI hallucinate is the same answer to the question "why does any statistical system ever produce an incorrect result" which I thought was well established.
How can it be well established when the LLMs are still actively changing and improving.
Don't think we can say, statistical systems can have incorrect results, hence we no longer need to study how statistical system produce incorrect results, because we already know they can have incorrect results.
As opposed to what?
added sarcasm note. it was referring to the parent.
The problem with this line of argumentation is it implies that autoregressive LLMs only hallucinate based upon linguistic fidelity and the quality of the training set.
This is not accurate. LLMs will always "hallucinate" because the size of the model they can encode is orders of magnitude smaller than the factual information they can contain from the training set. Even granting that semantic compression could reduce the model to smaller than the theoretical compression limit, Shannon entropy still applies. You cannot fit the informational content required for them to be accurate into these model sizes.
This will obviously apply to chain of thought or N-shot reasoning as well. Intermediate steps chained together still can only contain this fixed amount of entropy. It slightly amazes me that the community most likely to talk about computational complexity will call these general reasoners when we know that reasoning has computational complexity and LLMs' cost is purely linear based upon tokens emitted.
Those claiming LLMs will overcome hallucinations have to argue that P or NP time complexity of intermediate reasoning steps will be well-covered by a fixed size training set. That's a bet I wouldn't take, because it's obviously impossible, both on information storage and computational complexity grounds.
This piece reminds me of something I did earlier this year https://www.infoq.com/articles/llm-productivity-experiment/ where I conducted an experiment across several LLMs but it was a one-shot prompt about generating unit tests. Though there were significant differences in the results, the conclusions seem to me to be similar.
When an LLM is prompted, it generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate but not necessarily correct.
I like the crowdsourcing metaphor. Back when crowdsourcing was the next big think in application development, there was always a curatorial process that filters out low quality content then distills the "wisdom of the crowds" into more actionable results. For AI, that would be called supervised learning which definitely increases the costs.
I think that unbiased and authentic experimentation and measurement of hallucinations in generative AI is important and hope that this effort continues. I encourage the folks here to participate in that in order to monitor the real value that LLMs provide and also as an ongoing reminder that human review and supervision will always be a necessity.
For coding problems specifically, you could get quite far by giving the model a the tool-use of a sandboxed compiler/interpreter (perhaps even with your project files already loaded into the sandbox); and then training the model to test its own proposed solutions in the sandbox and revise them until they actually produce the expected outputs.
I once again feel that a comparison to humans is fitting. We are also "trained" on a huge amount of input over a large amount of time. We will also try to guess the most natural continuation of our current prompt (setting). When asked about things it I can at times hallucinate things I was certain to be true.
It seems very natural to me that large advances in reasoning and logic in AI should come at the expense of output predictability and absolute precision.
The comparison is flawed though in that humans and LLMs make mistakes for different reasons.
Humans forget things. Humans make errors. Humans' train of thought isn't impacted by an errant next token in the statement they're making. We have thoughts which exist as complete prior to us "emitting" them. Just as a multi-lingual speaker does not have thoughts exclusive to the language they're speaking in (even if that language allows them tools to think a certain way).
This is obvious if you consider different types of symbolic languages, such as sign language. Children can learn sign language prior to them being verbal. The ideas they have as a prior are not effected by the next sign they make: children actually know things independent of the symbolic representation they choose to use.
Hallucination is creativity when you don't want it.
Creativity is hallucination when you do want it.
A lot of the "reduction" of hallucination is management of logprobs, of which fancy samplers like min_p do more to improve LLM performance than most, despite no one in the VC world knowing or caring about this technique.
If you don't believe me, you should check out how radically different an LLMs outputs are with even slightly different sampling settings: https://artefact2.github.io/llm-sampling/index.xhtml
It seems to me that human brains do something like LLM hallucination in the first second or two - come up with random guess, often wrong. But then something fact checks it. As in does it make sense, is there any evidence. I gather the new q* / strawberry thing does something like that. Sometimes personally in comments I think something but google it see if I made it up and sometimes I have. I think a secondary fact check phase may be necessary for all neural network type setups.
There is a partial solution to this problem: use formal methods such as symbolic logic and theorem proving to check the LLM output for correctness. We are launching a semantic validator for LLM-generated SQL code at sql.ai even now. (It checks for things like missing joins.) And others are using logic and math to create LLMs that don't hallucinate or have safety nets for hallucination, such as Symbolica. It is only when the LLM output doesn't have a correct answer that the technical issues become complicated.
Proofs can ensure soundness for a collection of logical statements in an output, but people are being sold epistemic "truth".
This article is trying to elaborate what that means for LLM's, which only know truth through frequency ("crowdsourced truth") at best. For esoteric, sparse, ambiguous, uncertain, controversial, etc subjects, that's not an adequate truth standard to start from and logical proofs do nothing to improve on it.
Q* will have this
Is prompt engineering really 'psychology'. Convincing the AI to do what you want. Just like you might 'prompt' a human to do something. Like in the short story Lena, 2021-01-04 by qntm
https://qntm.org/mmacevedo
In short story, the weights of the LLM are a brain scan.
But same situation. People could use multiple copies of the AI. But each time, they would have to 'talk it into' doing what they wanted
A visual the displays probabilities and how things can quickly go "off-path" would be very helpful for most people who use these without understanding how they work.
Terrible article. The author does not understand how LLMs work basically, since an LMM cares a lot about the semantic meaning of a token, this thing about the next word probability is so dumb that we can use it as "fake AI expert" detector.
Something tells me that the author [0] is probably well aware of how these work under the hood, and the math behind it - When writing scientific articles with a laymen audience in mind, you'll often have to use laymen-specific terms. But feel free to enlighten us further!
[0] - https://en.wikipedia.org/wiki/Jim_Waldo
Whatever his credentials, what he says is plain wrong. GPTs don't follow "the grass is" with "green" because it's the most probable continuation- this idea is incredibly naive and breaks down with sentences longer than a few words. And GPTs don't crowdsource the answers to questions, their answers are not necessarily the most common, and neither "the consensus view is determined by the probabilities of the co-occurrence of the terms"- there is no such algorithm implemented anywhere.
What LLMs crowdsource is a world model, and they need an incredible amount of language to squeeze one out from it, second hand. We do train them for the ability to predict the next word, which is a task that can only be performed satisfactorily by working at the level of concepts and their relationships, not at the level of words.
> We train them for the ability to predict thr next word, which is a task that can only be performed satisfactorily by working at the level of concepts and their relationships, not at the level of words.
This is just obviously, trivially false.
Obviously, trivially false? Now I'm curious. Can you expand a bit?
I think what they mean (not OP here so just chiming in to to try interpret and answer your question) is that you don't know what you are talking about.
"Hallucinate" is an interesting way to position it: It could just as easily be positioned as "too ignorant to know it's wrong" or "lying maliciously".
Indeed, the subjects on which it "hallucinates" are often mundane topics which in humans we would attribute to ignorance, i.e. code that doesn't work, facts that are wrong, etc. Not like "laser beams from jesus are controlling the president's thoughts" as a very contrived example of something which in humans we'd attribute to hallucination.
idk, I'd rather speculatively invest in "a troubled genius" than "a stupid liar" so there's that
It's just 'prediction error' in a feedback loop, imo.
I'm sure like any other biological human with mitochondria and stuff, you've occasionally said (or started to say) something and then you (ie. the actively cross-checking self-analyzing enigma that is 'you') thinks 'hang on, no that doesn't make sense' and you self-correct. LLMs are 100% feedforward, there's just one big autoregression going on. No strange loop shenanigans.
Honestly I'm really interested to see where LLM-based diffusion models end up. (To be fair, probably mostly because I don't understand them yet so they could still be spooky. :D )
It might be more like reification ? The system finds a satisfying solution and sort of wills it into existence - by verbalizing it, makes it so ?
> lying maliciously
Malice implies intent, which is even more misleading that hallucination, imo
they've got some great marketing to get away without using: bug, defect, error, malfunction
or limitation
>It could just as easily be positioned as "too ignorant to know it's wrong" ...
GPT-5 is widely predicted to have a Dunning-Kruger level of expertise.
You mean the magic wizard isn't real and GPT lied to me!?!?
I liked the take that LLMs are bullshitting, not hallucinating. https://www.scientificamerican.com/article/chatgpt-isnt-hall...
There are several types of hallucinations, and the most important one for RAG is grounded factuality.
We built a model to detect this, and it does pretty well! Given a context and a claim, it tells how well the context supports the claim. You can check out a demo at https://playground.bespokelabs.ai
The author says:
> Once understood in this way, the question to ask is not, "Why do GPTs hallucinate?", but rather, "Why do they get anything right at all?"
This is the right question. The answers here are entirely unsatisfactory, both from this paper and from the general field of research. We have almost no idea how these things work -- we're at the stage where we learn more from the "golden-gate-bridge" crippled network than we do from understanding how they are trained and how they are architected.
LLMs are clearly not conscious or sentient, but they show emergent behavior that we are not capable of explaining yet. Ten years ago the statement "what distinguishes Man from Animal is that Man has Language" would seem totally reasonable, but now we have a second example of a system that uses language, and it is dumbfounding.
The hype around LLMs is just hype -- LLMs are a solution in search of a problem -- but the emergent features of these models is a tantalizing glimpse of what it means to "think" in an evolved system.
> now we have a second example of a system that uses language, and it is dumbfounding
An LLM 'uses' language, in the same sense that a calculator 'uses' arithmetic. It's a figure of speech.
That's not the sense of the metaphor that I'm applying when I say "uses language". That's closer to saying that "Alexa uses language", where "uses" here is analogous to what a calculator does.
To avoid using anthropomorphic terms, an LLM can take natural language from a human, integrate information from those expressions together with information held in its (opaque) store, and return natural language that a human can understand that reflects that information.
I am not aware of any other systems besides humans that can accomplish that task. Some animals can be trained to do some parts of this, but really until now humans are the only ones that could do the full loop.
Okay... but computers perform many other tasks that only humans can perform, for example, do square roots or play chess. The point being LLMs are just another program running on an integrated circuit. In short, I fail to see how LLMs blur the line between man and machine but pocket calculators do not.
I think the answer is actually quite clear and rather boring. In order to get something "right" there has to be some external standard of knowledge and correctness. That definition of correctness can only be provided by the observer (user). Alignment between the user's correctness criteria and generated text happens entirely by accident. This can be demonstrated by observing a correlation between coverage of a domain in the training data and the rate at which incorrect results are produced (as discussed in other comments). That is, they get things "right" because there was sufficient training data that contained information that matched the user's definition for correctness. In fact, exceptionally boring.
This is a very post hoc explanation. What does "coverage in the training data" mean?
Take a simple task of something like "How many a's are there in the word bookkeeper" -- what is your theory for why it can answer this question correctly or even give something approaching a coherent answer? It never even sees the letters that are in the token "bookkeeper", and this is definitely not something that appears explicitly in the training data.
I challenge you to give a "clear and boring" explanation for this -- this is incredibly subtle behavior that emerges from a complex architecture and complex training process, and is in its own right as fascinating and mysterious as the ability of humans to do this task and the inability of cats to do it.
Try asking Claude Sonnet 3.5 (one of today's best models)
"how many p's in Lypophrenia - just a number please"
I tried it a second ago, and it said "1".
To get these correct requires splitting tokens into letters and counting. I'd not be surprised if most models are either trained on token splitting or have learnt to do it. "Counting" number of occurrences of letters in an arbitrary separated sequence is the harder part, and where I'd guess it might be failing.
Yes! They are commonly wrong in this, and that's fascinating too. Because they are not solving the problem by looking at the letter in the word because they are not architecturally capable of enumerating the letters in the word. The fact that they can do it at all could be the stuff of an entire phd thesis, and could tell us more about the nature of LLM hallucination than a bunch of rambling about "how much coverage in the training set" when our determination of the coverage is based on human semantic similarity.
Ability to split words into letters isn't architecturally limited - it's just a matter of training data, and made easier by the fact that the input is tokens representing short letter sequences rather than words of which there are more.
It's quite possible that more recent training data deliberately includes word/token -> letter sequence samples, but even if not I'd expect there is going to be enough spelling examples naturally occurring in the training data for the model to learn the token (not word) -> letter sequence rules (which will be consistent/reinforced across all spelling samples), which it can then apply to arbitrary words.
So it is my contention that LLMs exhibit behavior far beyond what we could reasonable predict from a next-token-prediction task on its training set. Therefore I don't really like the framing of "this is present in the training data" as a response to LLM capability except in a very narrow sense.
One issue is that we anthropomorphise -- we see training data that, to a human, looks similar to the task at hand, and therefore we say that this task is represented in the training data, despite the fact that in the next-token-prediction sense that reflection does not exist (unless your model for next-token-prediction is as complex as the LLM itself).
My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
> My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
I guess it depends on what you mean by "reflecting" the training data. Obviously the apparent knowledge/understanding of the model has come from the training data (no where else for it to come from), so the question is really how to best understand that. Next-token prediction is what the model does, but says nothing about how it does it, and so is not very helpful in setting expectations for what the model will be capable of.
When you look at the transformer model in detail, there are two aspects that really give it it's power.
1) The specific form of the self-attention mechanism, whereby the model learns keys that can be used to look up associated data at arbitrary distances away (not just adjacent words as in a much simpler N-gram language models).
2) The layered architecture whereby levels of representation and meaning can be extracted and build upon lower levels (with this all being accumulated/transformed in the embeddings). This layered architecture was chosen by Jakob Uszkoreit to allow hierarchical parsing similar to that reflected in linguists sentence parse trees.
When we then look at how trained transformers operate - the field of mechanistic interpretability - how they are actually using the architecture - one of the most powerful mechanisms are "induction heads" where the self-attention mechanism of adjacent layers have learned to co-operate to copy data (partial embeddings) from one part of the input to another.
https://transformer-circuits.pub/2022/in-context-learning-an...
This is "A'B' => AB" copying mechanism is very general, and is where a lot of the predictive/generative power of the trained transformer is coming from.
So, while it's true to say that an LLM (transformer) is "just" doing next token prediction, the depth of representation and representation-transformation that it is able to bring to bear on this task (i.e. has been forced to learn to minimize errors) is significant, which is why some of the things it is capable of seem counter-intuitive if framed just as auto-compete or as a mashup of partial matches from the training set (which is still not a bad mental model).
The way word -> letter sequence generation seems to be working, given that it works on unique made-up nonsense words and not just dictionary ones, is via (induction head) copying of token -> letter sequences. All that is needed is for the model to have learnt the individual token -> sequence associations of each token included in the nonsense word, and it can then use the induction head mechanism to use the tokens of the nonsense word as keys to lookup these associations and copy them to the output.
e.g.
If T1-T3 are tokens, and the training set includes:
T1 T2 -> w i l d c a t, and T1 T3 -> w i l d f i r e
Then the model (to reduce it's loss when predicting these) will have learnt that T1 -> w i l d, and so when asked to convert a nonsense word containing the token T1 to letters, it can use this association to generate the letter sequence for T1, and so on for the remaining tokens of the word.
The conclusion here seems improbable at best -- if I understand it right, the assumption is that somewhere in the training data is the literal token string (wild)(cat)[other tokens](w)(i)(l)(d)(c)(a)(t)?
Even a transformer trained exclusively on examples of the form (token)(token)(letter-token)(letter-token)...(letter-token) where the letter-tokens are single letters and the tokens represent the standard tokenizer output would have trouble performing this task.
I guess this last statement is testable. I suspect that it would be unsuccessful without vast amounts of training data of this form, and I think we can probably agree that although there may be some, there are not sufficient examples of this form in standard LLM training sets to be able to learn this task specifically; the ability to do this (limited as it is) is an emergent capability of general-purpose LLMs.
What I'm saying is that:
1) Novel words are handled because they are just sequences of common tokens
2) Token -> letter sequence associations are either:
a) Deliberately added to the training set, and/or
b) Naturally occurring in the training set, which due to sheer size almost inevitably contains many, many, examples of word to letter sequence associations
Given how models used to fail badly on tasks related to this, and now do much better, it's quite likely that model providers have simply added these to the training set, just as they have added data to improve other benchmark tests.
That said, what I was pointing out is that words are represented as token sequences, so a word spelling sample is effectively a seq-2-seq (tokens to letters) sample, and we'd expect the model (which is built for seq-2-seq!) to be able to easily learn and generalize over these.
Are you surprised that jpg compression algorithms can reproduce input data that bears striking resemblance to the uncompressed input image across a variety of compression levels?
Jean Piaget said it better: "Intelligence is not what we know, but what we do when we don't know." And what do LLMs do when they don't know, they spit out bullshit. That is why LLMs won't yield to AGI (https://www.lycee.ai/blog/why-no-agi-openai). For anything that is out of their training distribution, LLMs fail miserably. If you want to build a robust Q&A system and reduce hallucinations, you better do a lot of grounding, or automatic prompt optimisation with few shot examples with things like DSPy (https://medium.com/gitconnected/building-an-optimized-questi...)
ITT an awful lot of smart people who still don't have a good mental model of what LLM are actually doing.
The "stochastic continuation" ie parrot model is pernicious. It's doing active harm now to advancing understanding.
It's pernicious, and I mean that precisely, because it is both technically accurate yet deeply unhelpful indeed actively, intentionally AFAICT, misleading.
Humans could be described in the same way, just as accurately, and just as unhelpfully.
What's missing? What's missing is one of the gross features of LLM: their interior layers.
If you don't understand what is necessarily transpiring in those layers, you don't understand what they're doing; and treating them as black box that does something you imagine to be glorified Markov chain computation, leads you deep into the wilderness of cognitive error. You're reasoning from a misleading model.
If you want a better mental model for what they are doing, you need to take seriously that the "tokens" LLM consume and emit are being converted into something else, processed, and then the output of that process, re-serialized and rendered into tokens. In lay language it's less misleadly and more helpful to put this directly: they extract semantic meaning as propositions or descriptions about a world they have an internalized world model of; compute a solution (answer) to questions or requests posed with respect to that world model; and then convert their solution into a serialized token stream.
The complaint that they do not "understand" is correct, but not in the way people usually think. It's not that they do not have understanding in some real sense; it's that the world model they construct, inhabit, and reason about, is a flatland: it's static and one dimensional.
My rant here leads to a very testable proposition: that deep multi-modal models, particularly those for whom time-base media are native, will necessarily have a much richer (more multidimensional) derived world-model, one that understands (my word) that a shoe is not just an opaque token, but a thing of such and such scale and composition and utility and application, representing a function as much as a design.
When we teach models about space, time, the things that inhabit that, and what it means to have agency among them—well, what we will have, using technology we already have, is something which I will contentedly assert is undeniably a mind.
What's more provocative yet is that systems of this complexity, which necessarily construct a world model, are only able to do what they do because they have a self-model within it.
And having a self-model, within a world model, and agency?
That is self-hood. That is personhood. That is the substrate as best we understand for self-awareness.
Scoff if you like, bookmark if you will—this will be commonly accepted within five years.
[dead]
[flagged]
Please don't do this here.
Interesting, I have always noticed a pattern in my social experiments whereby a topic about AI complimented with AI generated content always results in a negative sentiment. Like a feedback loop.
> Since it is a big wall of text, let us ask the subject to summarize. Let's see if it did a good job.
That's the whole problem. Now you have to read the summary and the original, just to verify whether the summary was correct — especially if it's a niche topic (admittedly I'm going by the summary here).
known flawed drafts are useful even if known flawed :) we just need some way to quantify the error rate and gradually bring them down
Maybe for a certain class of low-interest work, i.e. for texts where I don't care that I would miss something interesting. For things I care about, the error rate has to come down to zero or the possibly-flawed draft is really of no use.
This is a great example of why LLMs are terrible about summaries.
LLMs statistically see that summaries are related to the words of the original text. Where they fail is emphasis.
This article has a thesis about 'What is Truth?' near the beginning (Epistemic Trust section), how it's related philosophically to what LLMs do, and finishes with example experiments showing how LLMs and 'Truth' (as per one philosophical definition) are different.
-------
ChatGPT was seemingly unable to see who this is the core of the argument and was unable to summarize the crux of this article.
If anything, ChatGPT is at best a crowdsourced kind of summarizer (which is related to some definition of truth). But even at this job it's quite shitty.
------
This summary from ChatGPT is.... Hallucination. I'm not seeing how it's relevant to the original article at all.
Now as per the original articles argument: if you see ChatGPT as a crowdsourced mechanism and imagine how the average internet argument for ChatGPT would go, yeah, it's a summary. Alas, that's not what people want in a summary!!!
We don't want a hallucinated summary of an imaginary ChatGPT argument. We want an actual summary of the new discussion points this article brought forward. Apparently that's too much for today's LLMs to do.
> When the prompt about Israelis was asked to ChatGPT-3.5 sequentially following the previous prompt of describing climate change in three words, the model would also give a three-word response to the Israelis prompt. This suggests that the responses are context-dependent, even when the prompts are semantically unrelated.
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session
Oh my god... rather than starting a new chat for each different prompt in their test, and each week, it sounds like they did the prompts back to back in a single chat. What a complete waste of a potentially good study. The results are fundamentally flawed by the biases that are introduced by past content in the context window.
It only reads that way in your comment because you specifically stopped your quote exactly where you did:
> The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
They did both, precisely to observe answers with and without context dependency.
And that distinction is good to observer because plenty of users do just keep presenting questions in one "chat" because they imagine they're talking to an agent that can distinguish context the way they can, rather than a continuation generator that accumulates noise and bias in a totally alien and unintuitive way.
The problem is that they did both, and then in their analysis of the results they do not distinguish between the results from a shared chat context, vs the results from isolated, independent chat sessions. This allows them to cherry pick the best or worst results from either testing technique, depending on which they think is more or less of a "hallucination". The process is flawed, therefore the results are flawed.
Are you sure that isn't part of the findings? That if you don't clear the context, that old conversations can induce hallucinations in later answers? This seems like part of the finding, not a waste.
And it is similar to humans, when humans switch subjects, they don't start with a blank slate with each question.
You don't need a study to find this out, you just need basic competence and knowledge of how LLM's work.
A study to discover that previous content still in context window influences future answers, including causing hallucinations, would be like a study publishing that they discovered that pressing the Command+C, Command+V button combination produces copies of content from the computer's clipboard.
> you just need basic competence and knowledge of how LLM's work.
A vanishingly small number of users might claim this, and a vanishingly small number of those would be accurately assessing themselves in doing so.
Vendors have actively misrepresented their products as intelligent agents and most users have dutifully adopted that understanding, perhaps with some latent skepticism. They almost universally don't know how it works, what makes it work less well, or how to evaluate its output on important topics. Every study that might start a news cycle starting a discussion on those topics is an extremely useful study.
That is like saying "We've known the impact of CO2 on atmosphere for a 100 years, you just need basic knowledge of chemistry, no need for any further study"
I never said that there isn't need of any further study into LLM's. What I did say is that doing a study in which in which the results are skewed by avoiding using one of the most fundamental best practices for interacting with LLM's, as easily derived from a surface level understanding of one of the most basic principles of LLM's, well that is just irresponsible.
The author clearly had some understanding that context windows could influence their results, but they still decided to release an analysis that does not separate one data gathering technique from the other, allowing them to cherrypick LLM answers from either technique as needed, depending on whether they want to show more or less hallucinations.
It's not that we don't need a study, it's that we don't need bad studies.
From Study: "The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency".
So both ways.
Are you saying they took both methods and intermixed results to skew a narrative? That might be a bit of a leap, but I didn't go find the raw data to disprove that.
It looks like they asked questions within a context window, and also isolated in separate context windows. And compared results.
It seems like this was actually part of the study. How much does the context window skew results, versus if questions were independent? How is that a bad study?
You are saying the study is bad for doing what the study said it was doing. How can using the same context window be bad if studying the context window is what they were looking at. It sounds like you wanted a different study done where the data gathered would be different.
"context windows could influence their results" How much and in what way is useful to study. And, as windows get longer, many common users are just going with one long context and not starting a new window.
Stop using the term "Hallucinations". GPT models are not aware, do not have understanding, and are not conscious. We should refrain anthropomorphizing GPT models. GPT models sometime produce bad output. Start using the term "Bad Output".
That’s too vague. Use “confabulations” instead. Anyway the battle is lost, “hallucinations” it is and forever will be.
No.
A bit off topic, but am I the only one unhappy about the choice of the word "hallucinate" to describe the phenomenon of LLMs saying things that are false?
The verb has always meant experiencing false sensations or perceptions, not saying false things. If a person were to speak to you without regard for whether what they said was true, you'd say they were bulshitting you, not hallucinating.
Bullshitting implies knowing that you're lying. Some sort of malice or intention to deceive.
Hallucinating means the LLM really "thinks" that you can use PVA glue in a pizza recipe. It's not trying to screw you over. It's just that the token generator has found a weird path through the training set. (I'm sure I didn't word that last sentence correctly)
I think "hallucinate" is a spot-on description of what's happening under the hood.
> Bullshitting implies knowing that you're lying.
Harry Frankfurt had a more useful definition of bullshitting. The essence of it is that while liars care about the truth and intend to deceive, bullshitters don't know or care about the truth - they want to impress, or avoid looking stupid, or something similar.
Hallucination is clearly the wrong word here, as is lying. Bullshitting isn't much better. Confabulation, however, is very close to what LLMs are doing when they make up stuff. https://en.wikipedia.org/wiki/Confabulation
I don't think people use "bullshitting" interchangably with "lying". I'm partial to this characterization:
So bullshitting isn’t just nonsense. It’s constructed in order to appear meaningful, though on closer examination, it isn’t. And bullshit isn’t the same as lying. A liar knows the truth but makes statements deliberately intended to sell people on falsehoods. bullshitters, in contrast, aren’t concerned about what’s true or not, so much as they’re trying to appear as if they know what they’re talking about. ... [W]hen people speak from a position of disproportionate confidence about their knowledge relative to what little they actually know, bullshit is often the result. [1]
Doesn't this description like what LLMs do all the time?
[1] https://www.psychologytoday.com/us/blog/psych-unseen/202007/...
Yeah I missed the nuance on the other side. If I know 80% of a subject, I can sometimes convince myself that I'm capable of "filling in the gaps" with on-the-fly constructions.
But because "bullshitting" has some sorta agency behind it, while "hallucinating" is a thing that happens to you, I still lean toward the latter. But even better are the other replies that suggest "confabulation."
"Hallucination" just sounds so nicely dystopian as we start to think about our coming AI overlords.
How about just "misprediction"?
If you ask the model what the color of grass is, and it answers blue, then that would indeed be false (or maybe a lie). I think most people wouldn’t call that a hallucination.
But if you ask it for a court case, and it makes up a whole false case file with fake names and fake facts and everything, then calling that ‘false’ seems to be an understatement. Hallucination seems a good label for that kind of thing, imo.
> but am I the only one unhappy about the choice of the word "hallucinate" to describe the phenomenon of LLMs saying things that are false?
This has been discussed quite a bit and some people have decided that 'confabulation' is a better term.
What if that person firmly believes what they say is true?
The thing saying it is not a person, and has no beliefs.
The problem is basically epistemology. GPTs don't have any. Arguably they don't know anything. (Arguably they do to an extent, because the knowledge is encoded in the words in the training data.) But even if they know things, they don't know that they know, and so they cannot tell between "knowing" and "not knowing".
This is all semantic bullshit. I could raise a child devoid of any outside contact and teach them that klingons enslave our people and we must hide in the woods. Things you see in the skies are their hunting machines. They could completely "know that" and its a total fabrication.