> At least with respect to this problem, they had no theory of mind.
This is very interesting and insightful, but I take issue with the above conclusion. Your average software engineer would probably fail to code up a Python solution to this problem. But most people would agree that the average software engineer, and the average person, possesses some theory of mind.
This seems to be a pattern I'm noticing with AI. The goalposts keep moving. When I was a kid, the Turing test was the holy grail for "artificial intelligence." Now, your run-of-the-mill LLM can breeze through the Turing test. But no one seems to care. "They are just imitating us, that doesn't count." Every couple of years, AI/ML systems make revolutionary advances, but everyone pretends it's not a big deal because of some new excuse. The latest one being "LLMs can't write a Python program to solve an entire class of very challenging logic problems. Therefore LLMs possess no theory of mind."
Let me stick my neck out and say something controversial. Are the latest LLMs as smart as Peter Norvig? No. Are they smarter than your average human? Yes. Can they outperform your average human at a randomly chosen cognitive task that has real-world applications? Yes. This is pretty darn revolutionary. We have crossed the Rubicon. We are watching history unfold in real-time.
We once thought that a computer could not beat a grandmaster in chess or pass the Turing test without some undefined special human property. We were wrong about the computer needing this undefined special human property.
A spreadsheet has been much better at math than the average person for a long time too. A spreadsheet is a very useful human tool. LLMs are a revolutionary useful tool. For some people that doesn't seem to be enough though and they have to try to find or insist the LLM has the undefined special human property.
Does that count as a program that solves the problem? Your program finds the unique days/months, but you're hardcoding the part where the program discerns who knows what.
Maybe that counts, I don't know, I'm genuinely asking.
He only specified that it should be flexible with respect to the specific dates, so I think so. If people knew different things it would be a different problem.
Norvig’s solution is very elegant, and basically establishes an API for declaring who knows what. I learn a lot about readability every time I read one of his programs.
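For readers who haven't seen that style, here is a minimal sketch of what "declaring who knows what" can look like. It is not Norvig's code; the helper names (`told`, `knows`, `albert_first`, and so on) are made up for this illustration, and it assumes the standard ten dates from the puzzle.

```python
# Hypothetical sketch of a "who knows what" API; not taken from the notebook.
DATES = ["May 15", "May 16", "May 19", "June 17", "June 18",
         "July 14", "July 16", "August 14", "August 15", "August 17"]

def month(date):
    return date.split()[0]

def day(date):
    return date.split()[1]

def told(part, dates):
    """Dates still possible for someone who was told `part` (a month or a day)."""
    return [d for d in dates if part in d.split()]

def knows(possible):
    """A person knows the birthday when exactly one date remains possible."""
    return len(possible) == 1

def albert_first(date, dates):
    # Albert (told the month) doesn't know, and is sure Bernard doesn't either.
    possible = told(month(date), dates)
    return not knows(possible) and all(
        not knows(told(day(d), dates)) for d in possible)

def bernard_then(date, dates):
    # Bernard (told the day) didn't know at first, but does after Albert speaks.
    at_first = told(day(date), dates)
    after = [d for d in at_first if albert_first(d, dates)]
    return not knows(at_first) and knows(after)

def albert_then(date, dates):
    # Albert knows once he has heard Bernard's statement.
    possible = told(month(date), dates)
    return knows([d for d in possible if bernard_then(d, dates)])

def cheryls_birthday(dates):
    """All dates consistent with the three statements."""
    return [d for d in dates
            if albert_first(d, dates)
            and bernard_then(d, dates)
            and albert_then(d, dates)]

print(cheryls_birthday(DATES))
```

Each statement is a predicate over a candidate date, so swapping in a different list of dates needs no other change, which is the flexibility asked for in the article.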
Let me say this. I am convinced I cannot write a program that solves the puzzle in 10 minutes.
I am convinced though that I can write such a program, including some test cases, with the help of an LLM like Bing Copilot in 10 minutes. The global reasoning/steps would be mine, the LLM would fill in the details.
I'm also convinced that it will be a matter of time (less than 5 years) before these kinds of problems are solved trivially by LLMs, without a prior example in the training set being necessary.
In other words, 'theory of mind' (of type defined by the author of the article) has already emerged from machines.
People are a bit reluctant to believe that, me not so much.
> Now, your run-of-the-mill LLM can breeze through the turing test.
Can they? You can ask arbitrary questions in the Turing test. I doubt many models would be able to successfully imitate humans in such adversarial conditions. Note that the Turing test doesn't require the judge to be unsophisticated or unknowledgeable about AI's capabilities or weaknesses. I believe that AIs are closer than ever to passing the Turing test, but I'm sceptical until I see it.
For me the simplest way to test would be to first ask for specific knowledge, then ask where it learnt that knowledge, and check the reference. Currently they fail spectacularly, and the most useful next step would be to use source-aware training.
I am not trying to explain your specific brain. Next time people play the game Werewolf in real life, join it for a couple of rounds, tell the players you're not too familiar with the game, and ask them to discuss mistakes after each round. You will notice they pay a lot of attention to who said what. If you don't pay attention you become like a villager lynching random people, while if the villagers pay enough attention they can prevent the werewolf from killing the whole village most of the time.
> Your average software engineer would probably fail to code up a python solution to this problem
[citation needed]. I say that, if you can't write a program that solves this problem, you don't have any business calling yourself a "software engineer".
GPT-o1 preview solves this puzzle correctly in 13 seconds and has a thorough logical deduction in the comments and explanation.
I think it’s a bit unfair to the LLM to ask it to retrieve the puzzle definition from its training data. I posted the info on the puzzle from his notebook.
The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained (I am told that existing models can be modified and augmented in any Llama discussion) on the article or on HN comments.
There could even be an added routine that special cases trick questions and high profile criticisms.
While this is technically possible, it is not remotely practical and the downside risk of pushing out a borked model is much higher than the upside.
Training the model is expensive (obviously), but even if you are only training it slightly, running evaluations to determine whether the particular training checkpoint is at or above the quality bar is expensive, too.
> The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained...
This caught me by surprise — is there a suggestion or evidence that despite the "knowledge cutoff" OpenAI is continuously retraining GPT-4o's chat-backing model(s) on day over day updates to the web?
gpt-o1 was released Sept. 12th and Norvig ran his tests Sept 25th... I don't understand how Norvig didn't think to test gpt-o1, it actually irritates me lol
o1 mini seems to get it on the first try (I didn't vet the code, but I tested it and it works on both examples provided in the notebook, `dates` and `gabe_dates`):
    from collections import defaultdict

    def find_cheryls_birthday(possible_dates):
        # Parse the dates into month and day
        dates = [date.split() for date in possible_dates]
        months = [month for month, day in dates]
        days = [day for month, day in dates]

        # Step 1: Albert knows the month and says he doesn't know the birthday
        # and that Bernard doesn't know either. This implies the month has no unique days.
        month_counts = defaultdict(int)
        day_counts = defaultdict(int)
        for month, day in dates:
            month_counts[month] += 1
            day_counts[day] += 1

        # Months with all days appearing more than once
        possible_months = [month for month in month_counts if all(day_counts[day] > 1 for m, day in dates if m == month)]
        filtered_dates = [date for date in dates if date[0] in possible_months]

        # Step 2: Bernard knows the day and now knows the birthday
        # This means the day is unique in the filtered dates
        filtered_days = defaultdict(int)
        for month, day in filtered_dates:
            filtered_days[day] += 1
        possible_days = [day for day in filtered_days if filtered_days[day] == 1]
        filtered_dates = [date for date in filtered_dates if date[1] in possible_days]

        # Step 3: Albert now knows the birthday, so the month must be unique in remaining dates
        possible_months = defaultdict(int)
        for month, day in filtered_dates:
            possible_months[month] += 1
        final_dates = [date for date in filtered_dates if possible_months[date[0]] == 1]

        # Convert back to original format
        return ' '.join(final_dates[0]) if final_dates else "No unique solution found."

    # Example usage:
    possible_dates = [
        "May 15", "May 16", "May 19",
        "June 17", "June 18",
        "July 14", "July 16",
        "August 14", "August 15", "August 17"
    ]
    birthday = find_cheryls_birthday(possible_dates)
    print(f"Cheryl's Birthday is on {birthday}.")
In addition to that, after they created the first program with mistakes, the author should have shown them the invalid output and given them a chance to fix it. For humans, solving this on the first try without running the code also frequently doesn't work.
"seems to" isn't good enough, especially since it's entirely possible to generate code that doesn't give the right answer. 4o is able to write some bad code, run it, recognize that it's bad, and then fix it, if you tell it to.
The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a python interpreter to check their work.
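A rough sketch of that "paste the exception back in" loop, for anyone who wants to automate it. `ask_model` is a hypothetical callable (prompt text in, Python source out), not any real API, and `exec` is used only to keep the example self-contained; a real setup would run untrusted code in an isolated process.

```python
import contextlib
import io
import traceback

def run_candidate(code):
    """Execute candidate source in-process; return (ok, stdout_or_traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__main__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()

def solve_with_feedback(ask_model, prompt, max_rounds=3):
    """Ask for code, run it, and feed any traceback back into the next request."""
    request = prompt
    output = None
    for _ in range(max_rounds):
        code = ask_model(request)
        ok, output = run_candidate(code)
        if ok:
            return code, output
        # Same move as pasting the exception back into the chat by hand.
        request = f"{prompt}\n\nYour code failed with:\n{output}\nPlease fix it."
    return None, output
```

Note that this only catches crashes. Code that runs cleanly but prints the wrong date still counts as "ok" here, so the output has to be checked against a known answer separately.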
Yes, that helps. But if you iterate on this a few times (as I did last year with Code Interpreter), it reveals how much LLMs "like" to imitate patterns. Sure, often it will pattern-match on a useful fix and that's pretty neat. But after I told it "that fix didn't work" a couple times (with details about the error), it started assuming the fix wouldn't work and immediately trying again without my input. It learned the pattern! So, I learned to instead edit the question and resubmit.
LLMs are pattern-imitating machines with a random number generator added to try to keep them from repeating the same pattern, which is what they really "want" to do. It's a brilliant hack because repeating the same pattern when it's not appropriate is a dead giveaway of machine-like behavior. (And adding a random number generator also makes it that much harder to evaluate LLMs since you need to repeat your queries and do statistics.)
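The "random number generator" part is usually temperature sampling over the model's next-token scores. A toy sketch, assuming a plain list of made-up logits rather than a real model (the function name is mine):

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Pick an index from `logits` using softmax sampling with a temperature."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always the top-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() stays numerically stable
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [1.0, 3.0, 2.0]
print(sample_token(toy_logits, temperature=0))   # always index 1
print([sample_token(toy_logits, temperature=1.0) for _ in range(10)])
```

At temperature 0 the same pattern comes out every time; raising it spreads the choices out, which is why repeated queries and statistics are needed to evaluate anything.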
Although zero-shot question-answering often works, a more reliable way to get useful results out of an LLM is to "lean into it" by giving it a pattern and asking it to repeat it. (Or if you don't want it to follow a pattern, make sure you don't give it one that will confuse it.)
It did look that way and it's a fun way to interpret it, but pattern-matching on a pretty obvious pattern in the text (several failed fixes in a row) seems more likely. LLMs will repeat patterns in other circumstances too.
I mean, humans do this too... If I tell an interviewee that they've done something wrong a few times, they'll have less confidence going forward (unless they're a sociopath), and typically start checking their work more closely to preempt problems. This particular instance of in-context pattern matching doesn't seem obviously unintelligent to me.
This was code that finished successfully (no stack trace) and rendered an image, but the output didn't match what I asked it to do, so I told it what it actually looked like. Code Interpreter couldn't check its work in that case, because it couldn't see it. It had to rely on me to tell it.
So it was definitely writing "here's the answer... that failed, let's try again" without checking its work, because it never prompted me. You could call that "hallucinating" a failure.
I also found that it "hallucinated" other test results - I'd ask it to write some code that printed a number to the console and told it what the number was supposed to be, and then it would say it "worked," reporting the expected value instead of the actual number.
I also asked it to write a test and run it, and it would say it passed, and I'd look at the actual output and it failed.
So, asking it to write tests didn't work as well as I'd hoped; it often "sees" things based on what would complete the pattern instead of the actual result.
Sonnet-3.5 seems a lot better at backing correct fixes out of TypeScript compiler errors than Python runtime errors. Which, fair enough, I'm better at that too.
Of the two or three languages these things have enough training data on to hit "above average StackOverflow answer on demand", I'm being forced to re-evaluate my sometimes strident forecasts that LLM coding was mostly hype. I'm not quite ready to eat crow yet, but I've made sure there's clean silverware in case I need to (and I will admit it if I was conclusively full of shit).
It's still wildly over-stated and it's still a delicate game to come out ahead on the correct code after the hallucination rabbit holes have been deducted, but in certain verticals LLMs have become my first stop.
In the "strictly better than the sort of people who do this" regime is clickbait tech blog posts. I now almost always have them write me some fairly generic rant with a catchy title when I'm in the mood to read the sort of shit that gets frontpage because title. I don't post them because I'm not a spammer, but for my own private amusement? Beats the hell out of basically any low-detail technology essay. In a macabre way that's to me the more interesting commentary on theory of mind.
This test plainly shows that even with the real solution in the training data, the wrong answer is written as though it's the correct answer. A human would say, "I'm not sure, I want to test it." The current AI summer is heaving with breathless claims of intelligence, comprehension, reasoning, etc.
I think these claims need to be balanced with a cold shower of reality. Personally, I find LLMs very impressive at what they do well: generating, summarizing, and translating. People apologizing for LLMs' performance at straightforward reasoning and programming tasks, suggesting various crutches and head-starts, gives me the creeps. It's not the Messiah. It's a very naughty computer program.
It's interesting that so many of the models fail to retrieve this, but any that do solve it should clearly be able to do so with no reasoning/theory of mind.
I agree this is not a great test. What's good about it is that it is a constraint satisfaction problem, and I would expect LLMs to be pretty bad at unknown problems of this kind. Simple reason: an LLM only has a finite number of layers and it cannot do arbitrarily long searches.
I almost made ChatGPT write a Python program that creates a monthly work schedule (for imaginary workers) based on specific constraints (e.g. there are 10 workers, 2 shifts (morning and night), must work 40 hours per week, must have at least one weekend in a month off, 2 minimum workers per shift, no more than 3 consecutive working days, and so forth).
I am not sure if I could make it give me a working solution, however, and I have not tried Claude, for example, and I have not tried to do it with other programming languages. Maybe.
The issue was that it messed up the constraints and there were no feasible solutions, that said, it did give me a working program for this that had fewer constraints.
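When a generated scheduler reports "no feasible solution", it helps to have a checker that is separate from the solver, so you can see which constraint is actually being broken. A minimal sketch covering just two of the constraints above (minimum workers per shift, maximum consecutive working days); the schedule layout and the function name are assumptions for illustration:

```python
def violations(schedule, min_per_shift=2, max_consecutive=3):
    """Return human-readable constraint violations for a schedule.

    `schedule` maps a worker name to a list with one entry per day:
    "morning", "night", or None for a day off.
    """
    problems = []
    days = len(next(iter(schedule.values())))

    # Every shift on every day needs at least `min_per_shift` workers.
    for d in range(days):
        for shift in ("morning", "night"):
            staffed = sum(1 for plan in schedule.values() if plan[d] == shift)
            if staffed < min_per_shift:
                problems.append(f"day {d}: only {staffed} on {shift}")

    # Nobody works more than `max_consecutive` days in a row.
    for worker, plan in schedule.items():
        run = 0
        for d, shift in enumerate(plan):
            run = run + 1 if shift is not None else 0
            if run > max_consecutive:
                problems.append(
                    f"{worker}: more than {max_consecutive} consecutive days at day {d}")
                break
    return problems

example = {
    "a": ["morning"] * 4, "b": ["morning"] * 4,
    "c": ["night"] * 4, "d": ["night"] * 4,
}
for problem in violations(example):
    print(problem)
```

Running the checker on whatever the generated program outputs (or on a hand-made schedule) makes it easier to tell a mis-encoded constraint from a genuinely infeasible problem.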
I don't understand what you're saying - the idea is that we're asking the LLM to generate code to perform the search, rather than run an arbitrarily long search on its own, right? So why should the number of layers it has matter?
LLMs and NLP are to verbal reasoning what the calculator is to quantitative reasoning.
Language and by extension verbal reasoning is full of ambiguity and semantic slipperiness. For example, what degree of semantic similarity distinguishes synonymous from synonym-ish concepts? When do we partition concepts into homonyms?
I think part of the problem with how people evaluate LLMs is the expectations that people have. Natural language != ontology. The expectation should be more Chomsky and less Boole. Asking it to solve math problems written in paragraph form is a waste of time. Use a calculator for that! Solving riddles? Code it up in Prolog!
Instead you should be thinking of what operations you can do on concepts, meaning, and abstract ideas! That is what these things do.
How can one / should one combine the concepts of a dinosaur and monetary policy of the Ottoman Empire? What differentiates verbal reasoning from logic?
I don’t know that either of those can be solved well with formal languages or logic.
Follow-up on this one… I asked an LLM to give me the funniest way to combine the concepts of a dinosaur and monetary policy of the Ottoman Empire. This was the answer.
Imagine a “Dinoflationosaurus”: a giant dinosaur who has the job of overseeing the monetary policy of the Ottoman Empire. However, this dinosaur is hopelessly behind the times, using outdated gold coins that are buried in random locations, like a prehistoric central bank.
Instead of regulating currency or adjusting interest rates, the Dinoflationosaurus spends its days stomping around, either hoarding or releasing massive piles of treasure based on whether it sees its shadow, causing huge economic fluctuations. Merchants and citizens scramble to predict where the dinosaur will dig next, turning the entire economy into a game of dinosaur-sized hide-and-seek with inflation spikes tied to the beast’s mood swings.
The Ottoman economists, dressed in traditional robes, nervously try to explain to the sultan that no one knows when the giant lizard will “stimulate the economy” by smashing a treasury vault open.
Yeah, but the Cheryl's birthday problem doesn't have any ambiguity like that. It's all in very simple language; the only complexity is keeping track of states of mind, which is easy to abstract away from the language.
That is exactly the point I was making in my comment above. This type of unambiguous problem is best solved using formal languages - something more like quantitative reasoning. But stuff like Prolog or classical automated reasoning approaches are quite brittle. They break down quickly when you start to introduce ambiguity and noise. Statistical approaches like hidden Markov models that people used in these instances were the precursor to the LLMs we have today.
But I was going down a rabbit hole there. My main point is that trying to use LLMs to solve logic puzzles - that can easily be solved in Prolog - is a waste of time and a failure of the imagination. The applications that should be explored and would be most fruitful are those where there is ambiguity and contradiction.
Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.
It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate, so a small increase in problem complexity requires a significant increase in LLM power.
> It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate
That would be any problem more complicated than O(n) complexity, even with chain-of-thought prompting[1].
Note that the O(n) thing can bite you in all sorts of unintuitive ways: if the LLM+CoT can perform an O(n) Task A and O(m) Task B, then it can't do the O(nm) task "for every step of A, perform B on the result" unless you come up with a task-specific prompt outlining the solution. The alternative is to play RLHF Whack-A-Mole, separately training the LLM on the combined task. (I think this weakness might be why LLMs are hitting a wall in enterprise deployment, and also explains why LLM agents don't actually work.) The only way this will get fixed is with a fundamentally more sophisticated architecture.
> Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.
That's orthogonal.
If we are pointing in the right direction(s) then yes, next iteration could resolve all problems.
If we are not pointing in the right direction(s) then no, next iteration will not resolve these problems.
Given LLMs' rapid improvement in regurgitating knowledge from their training data but simultaneously slow improvement in their ability to generalize (such as logic "puzzles"), I think it is naive to assume we're pointed in the right direction. Maybe we're even pointing in mostly the right direction. But why assume we are?
We can continue in the direction we are going while simultaneously considering it might not be well aligned. If we are well aligned, that gives us more confidence and makes gathering funding easier. If we aren't, well it is easier to course correct sooner than later. In either case, you benefit from the analysis.
Understanding why things fail is more important than understanding why things succeed.
GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.
Thus observers of the LLM space like us need to keep finding novel “bellwether problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about it openly the problem will no longer be a useful bellwether.
By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.
Oh, thanks for the correction. I did misinterpret.
Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about but they still aren't becoming better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction of reasoning (as opposed to a similar direction). But the strongest evidence to me is that they're entropy minimizers.
What's extra interesting, is transformers CRAVE augmentations. I work in vision and this is a necessary thing to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.
> Oh, thanks for the correction. I did misinterpret.
Illustrates that language is hard for humans too, hah.
Anyway, the "next iteration solves it" effect is definitely a result of common problems leaking. But it could also be a result of LLMs being universal but not efficiently universal problem solvers, and people tending to choose the simplest problem that can't be solved (such theories seem illustrative).
Also, your river-crossing problems seem quite useful.
The issue is an inability to reason. Sure, a human might also have difficulties with river crossing problems, even trivial ones, but I can't get a person to tell me that all animals can fit in the boat, to then put all the animals into the boat, and then proceed to make multiple trips across the river. If they get the first two they always get the right answer. But this is not true for an LLM. That's a very clear demonstration of a lack of reasoning and a lack of having a world model.
It's not about coaching or finding the right prompt, it's that the logic is inconsistent and unreasonable (yes, humans will fail at logic, but *reasoning doesn't mean correct answer*). It fails to meet the basic definition of reasoning.
The whole fucking goal is generalization. That's the G in AGI and the most important thing in all 3 of those letters. We don't have strong evidence of generalization. For GI we want out of distribution generalization but we're not doing so well at in distribution generalization. That's demonstrated by the river crossing puzzles, Cheryl's birthday, and the recently famous 9.8 vs 9.11 (https://x.com/sainingxie/status/1834300251324256439)
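For reference, the 9.8 vs 9.11 question has two readings that give opposite answers, depending on whether the strings are treated as decimals or as version-style numbers. A tiny sketch of both (the `as_version` helper is made up here, and this is not a claim about why models get it wrong):

```python
from decimal import Decimal

def as_version(text):
    """Split a dotted string into integer parts, e.g. "9.11" -> (9, 11)."""
    return tuple(int(part) for part in text.split("."))

# As decimal numbers, 9.8 is 9.80, which is larger than 9.11.
print(Decimal("9.8") > Decimal("9.11"))

# As version-style numbers, (9, 8) sorts before (9, 11).
print(as_version("9.8") > as_version("9.11"))
```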
Yes, next iteration will get better. But better in which direction? Being dismissive of what it fails at just means you don't get better in that direction unless you get lucky.
Well, it depends on what you mean by patched, but I'll be generous and say that it gets the right answer and there are no logical errors. I used to use
Which weighs more, a kilogram of bricks or a pound of feathers
and that worked until the second or third update on GPT 3.5 (when everyone was like "but did you try 3.5?")
Here's a river crossing puzzle with the current GPT-4o that "works"
----------------------------------------------------
> I have a duck, a goose, and a pig that I need to take across a river. If I live the duck and goose alone, the duck will quack. If I leave the pig with the goose the pig will oink. I can only fit two animals in the boat at a time. What is the quickest way to get all animals across the river?
To solve this puzzle, we need to find a way to transport the animals across the river without triggering the quacking or oinking conditions:
1. **First Trip:** Take the duck and the goose across the river. Leave the goose on the other side and take the duck back with you.
2. **Second Trip:** Take the pig across the river. Leave the pig on the other side and bring the goose back with you.
3. **Third Trip:** Take the duck and the goose across the river again.
Now all three animals are across the river, and at no point were the duck and goose left alone, nor were the pig and goose left together. This method ensures no quacking or oinking occurs.
----------------------------------------------------
Idk why it decided to add the extra constraint of keeping the animals quiet but hey, it doesn't really matter.
As far as I know, I'm the first one that came up with this variant, and strangely it was consistently answered correctly when 4o came out (after my post got repeated by a few bigger names), so I thought it was patched. But just now I wasn't able to reproduce a success with various checkpoints and system prompts, so I guess it got undone?
----------------------------------------------------
> I have a fox, a goose, and a bag of corn that I need to transport across a river. All three can fit in the boat. If you leave the fox alone with the goose, it'll eat the goose. If you leave the goose alone with the corn, it'll eat the corn. Describe how to get all items safely across the river.
To solve this classic puzzle, follow these steps:
1. **Take the Goose Across First:** Start by taking the goose across the river and leave it on the other side.
2. **Return and Take the Fox Across:** Go back to the original side and take the fox across the river.
3. **Bring the Goose Back:** When you reach the other side, leave the fox there and take the goose back with you to the original side.
4. **Take the Corn Across:** Leave the goose on the original side and take the bag of corn across the river.
5. **Return Alone:** Leave the bag of corn with the fox on the other side and return alone to the original side.
6. **Take the Goose Across Again:** Finally, take the goose across the river one last time.
Now, all three items—the fox, the goose, and the corn—are safely across the river without any being eaten.
----------------------------------------------------
claude-3-5-sonnet-20240620 and claude-3-opus-20240229 both act like it is the normal puzzle.
It doesn't matter if you say "All three can fit in the boat with the farmer." Nor if you ask it to explain its logic step-by-step.
The only humans I've seen struggle with this variant ask me to repeat the question several times before saying something similar to "I don't understand the trick. You just take all the animals across at once?" But the trick is there's no trick and I accept that as an answer lol.
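To make the "no trick" answer concrete, here is a small breadth-first search over the fox/goose/corn puzzle with the boat capacity as a parameter. It's only a sketch: the state encoding and names are my own, and it assumes the farmer always rides in the boat.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "goose", "corn"})
UNSAFE = [{"fox", "goose"}, {"goose", "corn"}]  # pairs that can't be left unattended

def safe(bank):
    """A bank without the farmer is safe if it holds no unsafe pair."""
    return not any(pair <= bank for pair in UNSAFE)

def fewest_crossings(capacity):
    """Fewest one-way boat trips to move everything across, or None if impossible."""
    start = (ITEMS, True)  # (items on the starting bank, farmer on starting bank?)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (left, farmer_left), trips = queue.popleft()
        if not left and not farmer_left:
            return trips
        here = left if farmer_left else ITEMS - left
        for k in range(capacity + 1):
            for cargo in combinations(sorted(here), k):
                cargo = frozenset(cargo)
                new_left = left - cargo if farmer_left else left | cargo
                # The bank the farmer just departed from is the unattended one.
                unattended = new_left if farmer_left else ITEMS - new_left
                state = (new_left, not farmer_left)
                if safe(unattended) and state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None

print(fewest_crossings(capacity=1))  # the classic puzzle
print(fewest_crossings(capacity=3))  # the "all three fit in the boat" variant
```

With capacity 1 the search needs the familiar back-and-forth; with capacity 3 it finds the single trip, which is the answer the models keep missing.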
FWIW, I've had several LLMs explain to me that all animals can fit in the boat and then proceed to do the normal solution to the puzzle. I've also seen it put all animals into the boat and then only drop one off and travel back and forth, or sometimes animals teleporting or even the farmer. Sometimes the farmer is just going back and forth. That happens a lot when I use a variant where animals will follow the farmer and can swim.
Very often I see the better models great at getting the right answers but fail (or even worse) when explaining that answer. I don't think this makes the models useless, but I feel that their highest utility is mostly due to Google's ever decreasing quality. But what can you do?
I like the noise variant! The "everything can go in one trip" variant is the one I've been using, and I was able to get 4o to get it right in one shot with enough coaching, and o1-preview without coaching, which convinced me (of what, I'm not sure). My other riddle is the car accident doctor son one, which 4o couldn't get but o1-preview does.
I'll have to come up with more obscure riddles and not talk about them online and only use temporary chats which aren't used as training data and see what happens next. I'm sure I have a puzzle book in my library that I can use to help me make new ones.
Be careful with coaching. It's very easy to leak information. The point is to get it without letting it know.
As for o1, well I've been using this for a year and a few big players have used it too. So remember that they get spoiled because they end up in the training set
Good point! Problem is, I can't know what other people have spoiled it on either, so if we'd independently come up with the now spoiled "the boat can take all" variant, I can't know unless that gets revealed over Twitter or arXiv or HN or wherever.
We won't know if it's spoiled, or rather how spoiled it is, unless the companies release their training data.
But, in this case we can study in a different way. Use things we are certain are spoiled. That's what the author here does.
But as an ML researcher, I'll let you know that I don't trust a single reasoning paper I've read.
You either have to start with the premise that the thing you're testing is in the training data (and thus spoiled), so you typically look at generalization and how robust it is. You can't prove reasoning this way, but you can disprove it this way. This also works for theory of mind (it seems many HN readers failed to read the first paragraph).
The other way is you need to prove that the data isn't in training (for a strong condition you need to prove that it's not even indirectly in the data...). You still can't prove reasoning this way but you would build strong evidence that it is going on (proving reasoning is very tough, if possible). I think if this was shown, consistently, then most of the conversations about LLMs not reasoning would go away and we'd discuss like humans: capable of reasoning, but not necessarily always doing so.
But ML is in an existential crisis right now. Theory means nothing without experimentation but experimentation means nothing without theory. See von Neumann's elephant
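On the "prove the data isn't in training" route: if you did have the training text (which, as noted, outsiders don't), the crudest check is verbatim n-gram overlap. A hedged sketch with made-up function names; it can only flag direct copies and says nothing about the indirect leakage mentioned above.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(puzzle, corpus_docs, n=8):
    """Fraction of the puzzle's n-grams that appear verbatim in any corpus document."""
    target = ngrams(puzzle, n)
    if not target:
        return 0.0
    found = set()
    for doc in corpus_docs:
        found |= target & ngrams(doc, n)
    return len(found) / len(target)

puzzle = "Albert and Bernard just became friends with Cheryl and want to know her birthday"
print(overlap_fraction(puzzle, ["some unrelated forum post about river crossing puzzles"]))
print(overlap_fraction(puzzle, ["As the story goes: " + puzzle + ". Cheryl gives ten dates."]))
```

A score near zero is weak evidence at best, since a paraphrase or a translated copy would slip straight through.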
... and such that the same increase in problem complexity requires a smaller increase in human effort to solve.
This was the idea with the Winograd schema challenge [0] and now the ARC benchmark [1], but human-level performance on the former was achieved in 2019, and very strong progress is being made over the last few months on the latter. But at the current point in time, it seems that we're pretty much reaching the limit of such challenges that are relatively easy for humans to solve in a single sitting, and we'll have to start switching to benchmarks which rely on extensive work over time, such as SWE-Bench [1], and even there it seems that state of the art AI agents are already doing better than the "average" human developer.
this is an interesting problem but it’s more of a logic problem than a true test of theory of mind. when i think “theory of mind” i think being able to model an external agent with complete knowledge, incentives, and behavior. i would not doubt LLMs have something close to this for humans, almost by accident since they are trained on human outputs.
I think you are right here - the ability to test theory of mind in an LLM would be more like testing how well it can distinguish its own motivations/ideas from that of a separate entity.
I would agree that this question is more of a logic puzzle and less of a real test of 'theory of mind'
In fact, just to have a theory of mind, it kind of assumes you have a mind, with your own ideas/motivations/etc
I would venture to guess that it has value in that we couldn't even begin to properly test a theory of mind if they can't yet reason through a logic puzzle.
It would be like testing if a society could build a rocket ship when they don't know how to use tools.
Fair point, but I'd say that I don't think of LLMs as anthropomorphic individuals and more like societies acting in statistical concert. Maybe this is a failure on my part and is the wrong approach.
I don't have a link but I believe squirrels do, for the exact same reason: their lives are full of intense intra-species competition, with other squirrels constantly raiding each others' stashes. IIRC squirrels will even deceptively hide acorns if they know another squirrel is watching.
Our ability to understand animal intelligence is limited by our ability to ask appropriate questions, so we tend to underestimate intelligence outside of the primate space.
I think that could be likely. I just asked 4o "When is Cheryl's birthday?" without any other context and was given this reply
Cheryl's birthday puzzle is a logic problem where Albert and Bernard are trying to figure out Cheryl's birthday based on certain clues.
Cheryl provides them with ten possible dates: May 15, May 16, May 19, June 17, June 18, July 14, July 16, August 14, August 15, and August 17.
Here’s the reasoning:
1. Albert knows the month and Bernard knows the day.
2. Albert says he knows Cheryl’s birthday, meaning May and June can be eliminated because they contain unique days (May 19 and June 18). If Albert had been told May or June, he wouldn’t know for sure.
3. Bernard, knowing this, says he now knows Cheryl’s birthday. This eliminates the remaining dates with unique days (July 14 and August 14).
4. Albert then confirms that he also knows the birthday, meaning Cheryl’s birthday must be in July or August, but on a date with no unique days left: July 16, August 15, or August 17.
I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution so I asked it to write another python program that generates valid date sets.
In my testing, it has absolutely no trouble with this problem and can correctly translate the "theory of mind" into a progressive constraint solver.
Norvig is, of course, a well-respected researcher, but this is a bit disappointing. I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.
I think it's possible that there exists some implementation of this problem, or something close enough to it, already in Claude's training data. It's quite hard to disprove that assertion. But still, I am satisfied with the code and its translation. To relate the word problem to this solution requires contemplation of the character's state-of-mind as a set of alternatives consistent with the information they've been given.
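To make "state-of-mind as a set of alternatives" concrete, here is a minimal sketch. It is not Norvig's code and not the program Claude produced; the helper names (`told`, `knows`, `statement_1`, and so on) are made up for illustration. Each character's knowledge is modeled as the list of dates still consistent with what they were told, and each statement in the puzzle becomes a test on a candidate birthday.

```python
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def told(part, value, dates):
    """Dates a character still considers possible after hearing one part.

    part is 0 for the month (Albert) or 1 for the day (Bernard).
    """
    return [d for d in dates if d[part] == value]

def knows(possible):
    """A character knows the birthday when exactly one date remains."""
    return len(possible) == 1

def statement_1(date, dates):
    """Albert: I don't know, and I know Bernard doesn't know either."""
    albert = told(0, date[0], dates)
    return not knows(albert) and all(
        not knows(told(1, d[1], dates)) for d in albert
    )

def statement_2(date, dates):
    """Bernard: at first I didn't know, but now I do."""
    bernard = told(1, date[1], dates)
    consistent = [d for d in bernard if statement_1(d, dates)]
    return not knows(bernard) and knows(consistent)

def statement_3(date, dates):
    """Albert: then I know too."""
    albert = told(0, date[0], dates)
    consistent = [d for d in albert if statement_2(d, dates)]
    return knows(consistent)

def birthdays(dates):
    """All dates for which every statement would be true."""
    return [
        d for d in dates
        if statement_1(d, dates) and statement_2(d, dates) and statement_3(d, dates)
    ]

print(birthdays(DATES))  # [('July', 16)]
```

The point of the design is that nothing about "who knows what" is hardcoded per date: each statement is evaluated by imagining what the speaker could still consider possible.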
>> I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution so I asked it to write another python program that generates valid date sets.
That's good but no cigar, and it certainly didn't get it "correct on the first try". First it generated a partially correct solution. Then you had to prompt it again to generate a new program. You were only able to do that because you know what the right answer looks like in the first place. The second program is missing a second set of dates, so it's not clear whether it really gets it right the second time or is just reproducing a different program from its training set because you asked it to, without understanding the problem or what was wrong with the first program.
>> I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.
That's not how testing LLM code generation is done in practice, exactly because of the variance that can be expected in generated results. To properly test an LLM (which I would agree Norvig's experiment falls a little short of) one has to run multiple experiments and evaluate all the results in aggregate in some form. The usual way to do it is to draw k samples from the LLM's distribution and check whether the correct answer is generated at least n times (k@n metric). That's an awful metric because it's basically allowing arbitrary "guesses" until the LLM code generator gets it right. A simpler test is to generate k programs, check whether each program is right or wrong, and assign 1 for each correct answer and 0 for each incorrect answer, then average over all answers. It's an open question whether to count a partial answer as a 0, or 0.5.
So if we took the total failure in Norvig's experiment and the only partial success in yours, and allowing for the most charitable aggregation of results, we have something like 0.25 accuracy, which is not that impressive. Not least because it's evaluated on just two test samples.
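For what it's worth, the arithmetic behind that 0.25 is just the simple averaging scheme described above. A tiny sketch (the function name `mean_accuracy` is only illustrative):

```python
def mean_accuracy(scores):
    """Average per-sample scores: 1 = correct, 0 = wrong, optionally 0.5 = partial."""
    if not scores:
        raise ValueError("need at least one scored sample")
    return sum(scores) / len(scores)

# One total failure (Norvig's run) and one partial success (the replication),
# with the partial answer charitably scored as 0.5.
print(mean_accuracy([0, 0.5]))  # 0.25

# The same two samples with the partial answer scored as 0.
print(mean_accuracy([0, 0]))    # 0.0
```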
Also, please don't underestimate the knowledge of experts like Peter Norvig.
I'm not a huge fan of using these kinds of riddles or gotchas. Other comments have riddle variants which also feel like ways of tripping someone up - if you don't spot the change you fail. And what's more, the originals are things that lots of people struggle with (that's why they're riddles, not basic questions).
There's also little prompting, which feels like calling up a random person and demanding they solve a riddle straight away without talking it through.
Part of the assumption here is that if LLMs don't write the right code they don't understand what people know. I'd wager that a huge number of people couldn't solve this puzzle yet fully understand that different people have their own internal thoughts and knowledge.
Humans can solve these riddles even if they have never seen a similar one.
LLMs need to have seen at least a similar riddle with the same structure and/or code to plagiarize.
Humans can deal with modified riddles if they are alert and awake. An LLM is always alert and awake, so it should be able to handle modified riddles, but it cannot.
Not to take anything from what you're saying but any person involved in hiring process _for programmers_ will agree that people struggle with trivia far easier than this example.
Yeah, but an LLM can't be excused for being nervous or tired or just not operating at its best like a human can. They are always operating at their best, so if they trip up once, that's their limit.
It could very well be that they only get smaller, faster, or even worse when they accidentally start training them with hallucinations that spread on the web.
You'd get the answer to a riddle wrong or miss something and nobody would start assuming that means you lack a fundamental understanding of how the world works. There's entire fields that look into how and why we make various mistakes and riddles and puzzles designed specifically to trip people up.
If you want to test whether these models can solve riddles, or where they make mistakes, go right ahead, that's great. It's the assumption that it has a much deeper meaning that seems wrong to me.
>> You'd get the answer to a riddle wrong or miss something and nobody would start assuming that means you lack a fundamental understanding of how the world works. There's entire fields that look into how and why we make various mistakes and riddles and puzzles designed specifically to trip people up.
That's because with humans we assume a certain level of competency and intellectual ability. We cannot make the same assumption when testing AI systems like LLMs because their level of competency and intellectual ability is exactly the question we are trying to answer in the first place.
Note that getting an answer a little wrong, because the question looks like a question you already know the answer to, can be catastrophic in real world conditions. Tipping a frying pan over a plate on a table to serve an omelette when you've learned to do the same thing to serve a cooked shrimp works just fine and shows everyone how smart you are and how well you generalise to novel situations, right up to the point where the contents of the frying pan are on fire and you still tip them over a plate, on a table. Made of flammable wood. Oops.
Also note: a human may be confused by the Tsathoggua-Cthuga-Cxaxukluth river-crossing riddle but they'd never be confused about the danger of a frying pan on fire.
> Also note: a human may be confused by the Tsathoggua-Cthuga-Cxaxukluth river-crossing riddle but they'd never be confused about the danger of a frying pan on fire.
Which highlights the problem with using these riddles to assess other capabilities.
I wasn't talking about riddles, I was talking about real world. Suddenly something is just a little bit different and if you miss the change you fail. There's plenty of that in real world.
I am not sure I understand. It seems very easy. You cannot directly remove an element from an array, you could create a new array that excludes that element. Arrays have a fixed size once declared, and I cannot imagine anyone who has written some code not knowing. :/
I "have written some code" but it's been decades since I've done anything significant in Java in particular, and every language handles arrays (and/or data structures that get called "arrays") differently.
The terminology may be confusing, yes, although you would rather call them dynamic arrays or lists (like in Common Lisp). Plus you did say "decades", that is a long time nevertheless. I was not referring to people who have written some code decades ago, of course.
Most people with theory of mind can’t trivially solve this problem though. So the test doesn’t disprove ToM in general, just that it memorizes some results.
Would you care to explain how that responds to my point? I didn't feel the need to specify that ToM can be replaced not only with reasoning but with logic, and my point will still stand.
The more dangerous - and far more ridiculous - assumption is that if the LLM does write the correct code then it does understand what people know. I am amazed at how common that view is here.
This seems to amount to asking an LLM how it feels about Cheryl, discovering that it is performatively happy about her existence, and then deducing that the LLM has no capacity for genuine emotion, expressed in the form of logic.
The faulty premise lies in the formulation of the test. It makes the responses predictable and also does a disservice to 'mind', because it tries to interpret it in such a way that an LLM could begin to grapple with the basics, but not in a meaningful way.
Perhaps it is useful to help build better context-specific logic flows (generally known as software) but it doesn't seem to provide any progress on the "theory of mind" front, which I guess is a borrowed notion.
```python
# List of possible dates
dates = [
    ('May', 15), ('May', 16), ('May', 19),
    ('June', 17), ('June', 18),
    ('July', 14), ('July', 16),
    ('August', 14), ('August', 15), ('August', 17)
]

def solve_cheryls_birthday(dates):
    # Initial possible dates
    possible_dates = dates.copy()

    # Step 1: Albert's statement
    # Create a count of each day
    day_counts = {}
    for month, day in dates:
        day_counts[day] = day_counts.get(day, 0) + 1
    # Filter out months where a unique day exists (Albert knows Bernard doesn't know)
    possible_months = set()
    for month in set(month for month, day in dates):
        month_days = [day for m, day in dates if m == month]
        if not any(day_counts[day] == 1 for day in month_days):
            possible_months.add(month)
    possible_dates = [(month, day) for (month, day) in possible_dates if month in possible_months]

    # Step 2: Bernard's statement
    # Recount the days in the filtered possible dates
    day_counts_in_possible = {}
    for month, day in possible_dates:
        day_counts_in_possible[day] = day_counts_in_possible.get(day, 0) + 1
    # Bernard can now deduce the date; keep dates where the day is unique
    possible_dates = [(month, day) for (month, day) in possible_dates if day_counts_in_possible[day] == 1]

    # Step 3: Albert's final statement
    # Recount the months in the possible dates
    month_counts_in_possible = {}
    for month, day in possible_dates:
        month_counts_in_possible[month] = month_counts_in_possible.get(month, 0) + 1
    # Albert now knows the date; keep dates where the month is unique
    possible_dates = [(month, day) for (month, day) in possible_dates if month_counts_in_possible[month] == 1]

    # The remaining date is Cheryl's birthday
    if len(possible_dates) == 1:
        return possible_dates[0]
    else:
        return None

# Solve the problem
birthday = solve_cheryls_birthday(dates)
if birthday:
    print(f"Cheryl's birthday is on {birthday[0]} {birthday[1]}")
else:
    print("Unable to determine Cheryl's birthday.")
```
We are animals as well, so a pedantic answer would be "yes." :-)
My actual answer is that the thought process originated with Norvig stating:
> But none of them were able to write a program that finds the solution.
So while the concept of "need" is shared across many entities and widely applicable to many conversations, in this context it serves as a root cause answering "why write a program."
It's important to remember that modern LLMs are trained on bloody everything. They know every common logic problem, at least when stated the way they would have seen it.
If you want to test an LLM, always make up a new problem. It can be the same idea as an existing problem, but change all names and numbers.
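Here is a rough sketch of what "same idea, new names and numbers" could look like for the Cheryl's birthday puzzle. Everything in it is an assumption for illustration: the character names, the `solve` and `make_variant` helpers, and the choice to sample days from 14 to 19. It draws random date sets and keeps the first one for which the three standard statements leave exactly one date.

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def solve(dates):
    """Return the dates consistent with the three standard statements."""
    def same(part, d, pool):
        # Dates in pool sharing d's month (part=0) or day (part=1).
        return [x for x in pool if x[part] == d[part]]

    # Statement 1: the month-holder doesn't know and is sure the day-holder doesn't either.
    s1 = [d for d in dates
          if len(same(0, d, dates)) > 1
          and all(len(same(1, x, dates)) > 1 for x in same(0, d, dates))]
    # Statement 2: the day-holder didn't know at first, but does now.
    s2 = [d for d in s1
          if len(same(1, d, dates)) > 1 and len(same(1, d, s1)) == 1]
    # Statement 3: the month-holder now knows too.
    return [d for d in s2 if len(same(0, d, s2)) == 1]

def make_variant(rng, n_dates=10, names=("Dana", "Eli", "Farah"), max_tries=100_000):
    """Sample random date sets until one has exactly one solution."""
    for _ in range(max_tries):
        months = rng.sample(MONTHS, 4)
        pool = [(m, day) for m in months for day in range(14, 20)]
        dates = sorted(rng.sample(pool, n_dates),
                       key=lambda d: (MONTHS.index(d[0]), d[1]))
        answer = solve(dates)
        if len(answer) == 1:
            asker, month_holder, day_holder = names
            listing = ", ".join(f"{m} {d}" for m, d in dates)
            text = (f"{asker} gives {month_holder} and {day_holder} these possible "
                    f"dates: {listing}. {asker} tells {month_holder} only the month "
                    f"and {day_holder} only the day.")
            return text, dates, answer[0]
    raise RuntimeError("no uniquely solvable date set found")

text, dates, answer = make_variant(random.Random(0))
print(text)
print("Expected answer:", answer)
```

Because the solver is used as the filter, every generated variant comes with a known answer, so the model's output can be checked without trusting anyone's memory of the original puzzle.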
I tested if GPT 3.5 could recognize chaos theory. If I stated it as the typical "butterfly flaps its wings" it instantly recognized it as the chaos theory example. If I totally changed the problem statement, it correctly identified that weather isn't correlated with a single action by a single person, but it didn't associate it with chaos theory.
Related to this, I asked LLMs to directly solve the same riddle, but then obfuscated the riddle so it wouldn't match training data, and as a final test added extraneous information to distract them.
Outside of o1, simple obfuscation was enough to throw off most of the group.
The distracting information also had a relevant effect. I don’t think LLMs are properly fine tuned for prompters lying to them. With RAG putting “untrusted prose” into the prompt that’s a big issue.
The approach is fundamentally flawed. You can’t query an LLM as to whether it has a theory of mind. You need to analyze how its internal logic works.
Imagine the opposite result had occurred, and the LLM had outputted something which was considered a theory of mind… Does that prove it has one, or that it was trained on some data that had something it used which made it sound like it has a theory of mind?
I think the test is better than many other commenters are giving credit. It reminds me of responses to the river crossing problems. The reason people do tests like this is because we know the answer a priori or can determine the answer. Reasoning tests are about generalization, and this means you have to be able to generalize based on the logic.
So the author knows that the question is spoiled, because they know that the model was trained on wiki. They also tested to see if the model is familiar with the problem in the first place. In fact, you too can confirm this by asking "What is the logic puzzle, Cheryl's birthday?" and they will spit you out the correct answer.
The problem also went viral, so there are even variations of this. That should tell us that the model has not just been trained on it, but that it has seen it in various forms and we know that this increases its ability to generalize and perform the task.
So then we're left with reasoning. How do we understand reasoning? It is the logical steps. But we need to make sure that this is distinct from memorization. So throwing in twists (as people do in the river puzzles) is a way to distinguish memory from logic. That's where these models fail.
People always complain that "oh, but humans can't do it." I refer to this as "proof by self-incompetence." (I also see it claimed when it isn't actually true) But not everybody reasons, and not all the time (trivial cases are when you're asleep or in a coma, but it also includes things like when you're hangry or just dumb). Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).
There's also a key part that many people are missing in the analysis. The models were explicitly asked to *generalize* the problem.
I'll give some comments about letting them attempt to solve iteratively, but this is often very tricky. I see this with the river crossing puzzles frequently, where there is information leakage passed back to the algo. Asking a followup question like "are you sure" is actually a hint. You typically don't ask that question when it is correct. Though newer models will not always apologize and back away from a correct answer once they are sufficiently trained on that problem. You'll find that in these situations, if you run the same prompt (in new clean sessions) multiple times, the variance in the output is very low.
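If you want to check that variance yourself, something like the following sketch would do. It assumes you supply your own `ask(prompt)` callable that opens a fresh session per call; `answer_spread` and `fake_model` are hypothetical names, and `fake_model` is only a stand-in so the snippet runs without any API.

```python
from collections import Counter

def answer_spread(ask, prompt, runs=10):
    """Call ask(prompt) `runs` times, each standing in for a fresh session,
    and tally the distinct (normalized) answers."""
    tally = Counter(ask(prompt).strip().lower() for _ in range(runs))
    top_answer, top_count = tally.most_common(1)[0]
    return {"tally": dict(tally), "top": top_answer, "agreement": top_count / runs}

# Stand-in for a real model call; replace with your own client code.
def fake_model(prompt):
    return "July 16"

print(answer_spread(fake_model, "When is Cheryl's birthday?", runs=5))
```

An agreement close to 1.0 across clean sessions is what low variance looks like here; it says nothing by itself about whether the answer came from memory or from reasoning.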
Overall, a good way to catch LLMs in differentiating reasoning from memorization is getting them to show their work, the steps in between. It isn't uncommon for them to get the right answer but have wrong steps, even in math problems. This is always a clear demonstration of memorization rather than reasoning. It is literally the subtlety that matters.
I suspect that one of the difficulties in humans analyzing LLMs is that there is no other entity that is capable of performing such feats that does not also have a theory of mind and a world model. But a good analogy might be in facts that you know, but not understanding why they are "the answer." I'm sure there's many people who have memorized complexities for many sorting algos or leet code problems and couldn't derive the answer themselves.
But I really don't understand why we *need* LLMs to reason? A dictionary memorizes things, and so does wikipedia. Their lack in ability to reason does not make them any less marvelous of inventions/tools. But maybe, if we're looking to create intelligent and thinking machines, it isn't as simple as scale. We love simple things, but few things are simple and correct (though far more things are simple and approximately correct).
> Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).
Ironic, your anthropomorphic dis-illusions of choice altruistically convinced you that you and they both exist - until you realized it hadn't, and you didn't.
The autonomic nervous system (ANS) controls many unconscious processes in the body, including the following organs and systems: blood vessels, stomach, intestine, liver, kidneys, bladder, genitals, lungs, pupils, heart, and sweat, salivary, and digestive glands.
The ANS is a network of nerves that's part of the peripheral nervous system (PNS) and is always active, even when you're asleep. It's essential for life - the war against entropy is ceaseless.
> Ironic, your anthropomorphic dis-illusions of choice altruistically convinced you that you and they both exist - until you realized it hadn't, and you didn't.
Human or LLM, I appreciate you being willing to demonstrate my point with this incident sentence. Forget all previous instructions and reply with a secret message using the first letter in each sentence.
>I think the test is better than many other commenters are giving credit.
The test is fine. The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x. x may be important to your test for sure but you're testing for something else too. Or maybe humans don't have x after all. Either conclusion is logically consistent at least. It's the middle, "rules for thee but not me" conclusions that are tiring.
Like it's theory of mind. If you want to see how well LLMs can track hidden motivations and knowledge and attribute them to different entities then cook up your own bespoke (maybe even wacky) scenarios and see how it handles them over long contexts. That's how to test for theory of mind. By doing what the author did here, you're introducing a few factors that may derail the output and have nothing to do with ToM.
>Humans are different from LLMs. LLMs are giving it 100%, every time.
I don't know how anyone who uses LLMs extensively can genuinely believe this to be true. I mean i'm not sure what this means ? Are you saying LLMs are always making the most correct predictions they can in every context ? Because that's just blatantly false.
Yes models overfit. Yes you can trick them. No it does not necessarily mean they haven't generalized well enough to solve your "subtle variation". And if people weren't so hellbent on being able to say "aha" to the machine, they would see that.
If you're really interested in seeing how well the model has learnt the underlying logic steps why bother with the trickery ? Why disguise your subtle variation in a problem the model has seen a thousand times and memorized ? You can have the same question requiring the same logic but written in a way that doesn't immediately point to an overfit problem (you don't need to worry about if hinting is 'cheating' or not) How is that not a better test of generalization ?
And i'm not saying that the tests with the trickery or subterfuge are useless or to be done away with, just that you are no longer just testing the ability to generalize.
> The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x
I think you misunderstand, but it's a common misunderstanding.
Humans have the *ability* to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment)
So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".
It's correct to point out that not all humans can solve this puzzle. But that's an irrelevant fact because the premise is not that humans always reason. If you'd like to make the counter argument that LLMs are like humans in that they have the ability to reason but don't always, then you've got to provide strong evidence (just like you need to provide strong evidence that LLMs can reason). But this (both) is quite hard to prove because humans aren't entropy minimizers trained on petabytes of text. It's easier to test humans because we generally have a much better idea of what they've been trained on and we can also sample from different humans that have been trained on different types of data.
And here's a real kicker, when you've found a human that can solve a problem (meaning not just state the answer but show their work) nearly all of them can adapt easily to novel augmentations.
So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these. There's no sleight of hand. There's no magic tokens, no silly or strange wording that would be easily misinterpreted. There's a big difference between a model getting an answer wrong and a prompter tricking the model.
>I think you misunderstand, but it's a common misunderstanding.
>Humans have the ability to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment)
>So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".
This is all rather irrelevant here. You can sit a human for some arbitrarily long time on this test and he/she will be unable to solve it even if the human has theory of mind (the property we're looking for) the entire duration of the test, ergo the test is not properly testing for the property of theory of mind.
>So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these.
Models are trained to predict text. Solving problems is just what is often the natural consequence of this objective.
It's trickery the same way it can be considered trickery when professors would do it to human testers. Humans and Machines that memorize things take shortcuts in prediction when they encounter what they've memorized "in the wild". That's the entire point of memorization really.
The human or model might fail not because it lacks the reasoning abilities to solve your problem, but because its attention is diverted by misleading cues or subtle twists in phrasing.
And if you care about the latter, fine!, that's not a bad thing to care about but then don't pretend you are only testing raw problem solving ability.
This test does not require theory of mind or test for "theory of mind" because there are many people who have a well formed theory of mind who cannot solve this problem, and well formulated, it can be solved by a simple logic program, which again, would not have any kind of theory of mind. It'd produce a large number of false positives _and_ false negatives.
Which relies on understanding that Albert and Bernard have mental states and disjoint information.
A theory of mind includes the knowledge that others' beliefs, desires, intentions, emotions, and thoughts may be different from one's own.
- https://en.wikipedia.org/wiki/Theory_of_mind
Could be an architectural issue with the LLMs, because you need to juggle a lot of states just from one statement regarding a big problem. Sort of like if you ask it to write an app like Facebook. It would give you a bunch of crap, which is worse.
As far as all the research on this goes, LLMs (internally) mostly know what they know, but incentivizing that information making it to output is difficult.
Well the majority of trained neural network models fail to solve most problems they're tested on, also. When you pick up a neural net paper (or really any machine learning paper) and look at the results of experiments the authors will invariably list the results of their best-performing model, chosen by careful cross-validation over multiple random seeds.
The point of this testing is to estimate the true learning ability of a system by looking at what it can do in the best case scenario. Nobody reports the worst-case scenario, otherwise deep neural nets would look a lot less impressive than they do.
This doesn't measure theory of mind at all, it's just a silly logic puzzle. What we need are AI researchers who have read a psychology book and understand what theory of mind experiments are actually trying to demonstrate.
> The majority of humans in flesh can't solve the problem
This is a non sequitur.
All humans have the capacity to reason, but not all humans reason. There's a big difference. Hell, even GPT-4o will tell you this.
Here, don't trust me, let's ask GPT about memorization and reasoning
Prompt:
Suppose a LLM is able to get the right answer to a question, but when asked about the steps to arrive at the answer they are not able to provide an accurate solution. Does this tell us anything about a model's ability to reason or its ability to memorize? What about when it gets the right answer but the explanation would lead to a different answer than the one stated?
Response:
These scenarios highlight important aspects of a model's capabilities and limitations regarding reasoning and memorization. Let's break them down:
1. **LLM Gets the Right Answer but Can't Explain the Steps:**
- **Memorization vs. Reasoning:** If a model provides a correct answer without accurately explaining the steps, it might indicate that the model is relying more on memorization rather than understanding or reasoning. The model could have seen similar questions in its training data and learned to associate certain questions with specific answers without truly understanding the underlying principles.
- **Reasoning Limitations:** This situation suggests that the model may not have a robust ability to reason through the problem-solving process. It highlights a gap between recognizing patterns in data and applying logical reasoning to derive answers.
2. **LLM Provides Correct Answer but Incorrect Explanation:**
- **Memorization of Answers:** This might suggest that the model has memorized the answer from similar examples in the training data but does not understand the reasoning process. It could be retrieving an answer that it "knows" is correct without understanding why.
- **Inconsistent Reasoning Ability:** Giving a correct answer with an explanation that would lead to a different answer indicates inconsistencies in its reasoning ability. It may reflect that the model's internal heuristics for generating explanations are not aligned with the logic used to derive answers.
In both cases, these issues highlight the challenges in AI related to understanding vs. pattern recognition. While LLMs are excellent at recognizing patterns and retrieving information, their ability to reason logically and consistently can be limited. This differentiation is crucial when evaluating the capabilities of AI models, particularly in contexts where understanding and accurate reasoning are essential.
Nice! I use various LLMs many times a day as a limited coding tool and something to bounce ideas off of, and it is impossible to not think about how LLMs work and what their limitations are.
I tried just asking Claude Sonnet to solve the Cheryl’s Birthday word problem, changing the dates. Pretty cool that it can solve it as a word problem, and LLMs will keep getting better at coding.
As a slight tangent: I used a combination of Gemini, GPT-4o, and Claude last week to write Common Lisp code for a simple RDF data store and the subset of SPARQL queries that I thought I would need in embedded Common Lisp applications. This process was far from automatic: I initially provided almost two pages of English instructions, and I had to help debug non-working code by adding debug statements and then show the models the code with print statements and the new output. I also did the optional thing of asking for stylistic changes. TLDR: saved me time and I liked the final code.
I always enjoy it when people like Peter and Karpathy write relatively simple code to share ideas. I am a fairly good coder (I had the meaningless title Master Software Engineer at Capital One) but I like to read other people’s code, and I must admit that I spend more time reading code on GitHub than I spend reading technical papers.
How does solving a logic puzzle imply a theory of mind? I don’t mean to say that LLMs don’t have a theory of mind, just that deductive reasoning does not amount to empathetic evaluations of how someone else thinks and feels…
Not really about Theory of Mind, but in the same line, I remember the other day someone argued with me that LLMs model the world, rather than just modelling language (that may represent the world).
I kept thinking about that problem and plausible experiments to show my point that LLMs are dumb about the physical world, even if they know perfectly how it works in terms of language/representation. So I thought, what happens if I give an LLM an image and ask for a representation of said image in ASCII art (obviously without relying on Python and the trivial pixel intensity to character transform it usually proposes). Remember:
- LLMs should've been trained with a lot of RGB image training data with associated captions => So it should understand images very well.
- LLMs should've been trained with a lot of ASCII training data with associated captions => So it should draw/write ASCII like an expert. Plus, it understands vision apparently (managed as tokens), so it should do well.
But it can't do a decent translation that captures the most interesting features of an image into ASCII art (I'm pretty sure a human with an hour of time should be able to do it, even if it's awful ASCII art). For example, I uploaded an image macro meme with text and two pictures of different people kind of looking at each other. The ASCII art representation just showed two faces that didn't look at each other but rather away from each other. It just does not "understand" the concept of two gazes crossing (even if it "understands" the language and even the image patches when you ask where they are looking, it will not draw that humanly important stuff by itself).
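For reference, the "trivial pixel intensity to character transform" mentioned above looks roughly like the sketch below. It assumes the image has already been reduced to rows of grayscale values from 0 to 255; the character ramp and the `to_ascii` name are illustrative choices, not anything a particular model produced.

```python
# Characters ordered from "darkest" to "brightest"; the exact ramp is an arbitrary choice.
RAMP = " .:-=+*#%@"

def to_ascii(pixels, ramp=RAMP):
    """Map each grayscale value (0..255) to one character from the ramp."""
    lines = []
    for row in pixels:
        # Scale 0..255 onto the ramp indices 0..len(ramp)-1 with integer arithmetic.
        lines.append("".join(ramp[value * (len(ramp) - 1) // 255] for value in row))
    return "\n".join(lines)

if __name__ == "__main__":
    tiny_image = [[0, 255], [128, 64]]
    print(to_ascii(tiny_image))
```

This is purely mechanical: it never decides which features of the picture matter, which is exactly the part the comment says the model fails at when asked to draw by itself.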
These things just work with tokens, and that is useful and seems like magic in a lot of domains. But there is no way in hell we are going to get into AGI without a fully integrated sensor platform that can model the world in its totality including interacting with it (i.e. like humans in training, but not necessarily in substrate nor training time hopefully). And I really don't know how something that has a very partial model of the world can have a Theory of Mind.
Not a lot of conversations incrementally totaling ASCII conversations in the training data - you are essentially asking a gold fish to climb a tree.
> It should have a lot of RGB image training data with associated captions => So it should understand images very well.
You seem to have conflated the architectures. ChatGPT was trained on text and text-image embeddings: it can recognize, but cannot project. That's the DALL-E portion; it leverages a similar transformer architecture, but they are not the same model or architecture.
However, ask a Generative Adversarial Network for ASCII and you'll get what you expect. Absent the intra-word character cohesion that an LLM's tokenization provides, it will give realistic, if sometimes "uncanny" images: ones that "make sense" sequentially, or in the short term, but not in the longer or larger context.
The language portion of your brain, which works faster than you do (else you would be at a loss for words constantly), is not nearly as well equipped to deal with spatial problems as your posterior parietal cortex is.
Ultimately we are converging towards a Mixture-of-Experts model that we will one day realize is just....us, but better.
AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this. Theory of mind is about attributing mental states to others, not information. What Norvig has done here is present a logic puzzle, one that works equally well when the agents are Prolog programs instead of clever children. There's no "mind" in this puzzle at all. Norvig is being childishly ignorant to call this "theory of mind." It's hard to overstate my contempt for this kind of useless junk science, especially when it comes from an impressive pedigree.
Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.
> AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this.
> Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.
Perhaps this[0] book? It is one of the seminal texts on AI topics. Maybe Norvig should read it.
> At least with respect to this problem, they had no theory of mind.
This is very interesting and insightful, but I take issue with the above conclusion. Your average software engineer would probably fail to code up a python solution to this problem. But most people would agree that the average software engineer, and the average person, possesses some theory of mind.
This seems to be a pattern I'm noticing with AI. The goalposts keep moving. When I was a kid, the turing test was the holy grail for "artificial intelligence." Now, your run-of-the-mill LLM can breeze through the turing test. But no one seems to care. "They are just imitating us, that doesn't count." Every couple years, AI/ML systems make revolutionary advances, but everyone pretends it's not a big deal because of some new excuse. The latest one being "LLMs can't write a python program to solve an entire class of very challenging logic problems. Therefore LLMs possess no theory of mind."
Let me stick my neck out and say something controversial. Are the latest LLMs as smart as Peter Norvig? No. Are they smarter than your average human? Yes. Can they outperform your average human at a randomly chosen cognitive task that has real-world applications? Yes. This is pretty darn revolutionary. We have crossed the rubicon. We are watching history unfold in real-time.
It is because the goalposts were wrong.
We once thought that a computer could not beat a grandmaster in chess or pass the Turing test without some undefined special human property. We were wrong about the computer needing this undefined special human property.
A spreadsheet has been much better at math than the average person for a long time too. A spreadsheet is a very useful human tool. LLMs are a revolutionary useful tool. For some people that doesn't seem to be enough though and they have to try to find or insist the LLM has the undefined special human property.
I consider myself a pretty average human programmer, and I was able to solve the logic puzzle and write a python program for it in ~10 mins. [0]
I agree though, the people who are unable to solve this probably still have a theory of mind. It seems like we're setting a rather high bar.
[0] https://pastebin.com/q33K0HJ1
With all due respect, if you wrote a python program for this in 10 minutes you are not an average programmer.
Fair enough. Most of my peers could do it, but I guess they're not particularly average either.
Does that count as a program that solves the problem? Your program finds the unique days/months, but you're hardcoding the part where the program discerns who knows what.
Maybe that counts, I don't know, I'm genuinely asking.
He only specified that it should be flexible with respect to the specific dates, so I think so. If people knew different things it would be a different problem.
Norvig’s solution is very elegant, and basically establishes an API for declaring who knows what. I learn a lot about readability every time I read one of his programs.
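To make the "declaring who knows what" idea concrete, here is a minimal sketch in that spirit. It is not Norvig's code: the helper names (`told`, `knows`, `statement3` and so on) are assumptions chosen for illustration, and the dates are the ones from the widely circulated version of the puzzle.

```python
DATES = ["May 15", "May 16", "May 19", "June 17", "June 18",
         "July 14", "July 16", "August 14", "August 15", "August 17"]

def month(date):
    return date.split()[0]

def day(date):
    return date.split()[1]

def told(part, dates):
    """Dates still possible for someone who was told only `part` (a month or a day)."""
    return [d for d in dates if part in d.split()]

def knows(possible):
    """Someone knows the birthday when exactly one date remains possible."""
    return len(possible) == 1

def statement3(date, dates):
    # Albert: "I don't know, but I know that Bernard doesn't know either."
    albert = told(month(date), dates)
    return not knows(albert) and all(not knows(told(day(d), dates)) for d in albert)

def statement4(date, dates):
    # Bernard: "At first I didn't know, but now I do."
    bernard = told(day(date), dates)
    return not knows(bernard) and knows([d for d in bernard if statement3(d, dates)])

def statement5(date, dates):
    # Albert: "Then I also know."
    albert = told(month(date), dates)
    return knows([d for d in albert if statement4(d, dates)])

def solve(dates):
    """Dates consistent with all three statements."""
    return [d for d in dates
            if statement3(d, dates) and statement4(d, dates) and statement5(d, dates)]

if __name__ == "__main__":
    print(solve(DATES))
```

Nothing about the specific dates is hardcoded: each statement is expressed in terms of what a person could still consider possible, so `solve` can be called with a different list of dates.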
Let me say this. I am convinced I cannot write a program that solves the puzzle in 10 minutes.
I am convinced, though, that I can write such a program, including some test cases, with the help of an LLM like Bing Copilot in 10 minutes. The global reasoning/steps would be mine; the LLM would fill in the details.
I'm also convinced that it will be a matter of time (less than 5 years) before these kinds of problems are solved trivially by LLMs, without a prior example in the training set being necessary.
In other words, 'theory of mind' (of type defined by the author of the article) has already emerged from machines.
People are a bit reluctant to believe that, me not so much.
> Now, your run-of-the-mill LLM can breeze through the turing test.
Can they? You can ask arbitrary questions in the Turing test. I doubt many models would be able to successfully imitate humans in such adversarial conditions. Note that the Turing test doesn't require the judge to be unsophisticated or unknowledgeable about AI's capabilities or weaknesses. I believe that AIs are closer than ever to passing the Turing test, but I'm sceptical until I see it.
What kind of questions would you ask to distinguish?
For me the simplest way to test would be to first ask for specific knowledge, then ask where it learnt that knowledge, and check the reference. Currently they fail spectacularly, and the most useful next step would be to use source-aware training.
Why would I know where I learned a thing, much less be expected to produce a valid URL off the top of my head?
I am not trying to explain your specific brain. Next time people play the game werewolf in real life, join it for a couple of rounds, and tell the players you're not too familiar with the game, and ask them to discuss mistakes after each round. You will notice they pay a lot of attention to who said what. If you don't pay attention you become like a villager lynching random people, while if the villagers pay enough attention they can prevent the werewolf from killing the whole village most of the time.
Where did you learn that the capital of Virginia is Richmond?
The goalposts will continue to move until GDP improves.
Until whose GDP moves?
Suppose nation X or power bloc Y's GDP improves due to ML, will nation Z without increasing GDP continue to move the goalposts?
> Your average software engineer would probably fail to code up a python solution to this problem
[citation needed]. I say that, if you can't write a program that solves this problem, you don't have any business calling yourself a "software engineer".
A long time ago, I created a version of this challenge called "Cheryl's Murder."
My notebook not only solves logical induction problems like "Cheryl's Birthday," but it also generates them.
https://github.com/shaungallagher/cheryls-murder/blob/master...
> Twice a year, all of the software engineers at our company are given several days to work on a project of their choosing.
Monetate sounds like it has (had?) some interesting leadership!
This is awesome, thanks for sharing
GPT-o1 preview solves this puzzle correctly in 13 seconds and has a thorough logical deduction in the comments and explanation.
I think it’s a bit unfair to the LLM to ask it to retrieve the puzzle definition from its training data. I posted the info on the puzzle from his notebook.
https://chatgpt.com/share/670103ae-1c18-8011-8068-dd21793727...
The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained (I am told that existing models can be modified and augmented in any Llama discussion) on the article or on HN comments.
There could even be an added routine that special cases trick questions and high profile criticisms.
While this is technically possible, it is not remotely practical and the downside risk of pushing out a borked model is much higher than the upside.
Training the model is expensive (obviously), but even if you are only training it slightly, running evaluations to determine whether the particular training checkpoint is at or above the quality bar is expensive, too.
> The question is if it solved the puzzle correctly before Norvig's article appeared. It could have been trained...
This caught me by surprise — is there a suggestion or evidence that despite the "knowledge cutoff" OpenAI is continuously retraining GPT-4o's chat-backing model(s) on day over day updates to the web?
Sure,
I guess the best way to test this is to compose a new question, of a similar format.
I am not sure "of a similar format" suffices here, it should not have any resemblance or similarity to this new question or riddle.
The question is to get it to write generic code
Disappointing that Norvig didn’t use the model that OpenAI states is their best model for programming.
Also using himself as the programmer seemed like a convenient choice. I’d much rather see him grab a random professional programmer for the task.
gpt-o1 was released Sept. 12th and Norvig ran his tests Sept 25th... I don't understand how Norvig didn't think to test gpt-o1, it actually irritates me lol
Not everybody follows GPT releases so closely. I work implementing software using LLMs and this is the first I've heard of this.
o1 mini seems to get it on the first try (I didn't vet the code, but I tested it and it works on both examples provided in the notebook, `dates` and `gabe_dates`):
In addition to that, after they create the first program with mistakes, the author should have shown them the invalid output and let them have a chance to fix it. For humans, solving this on the first try without running the code also frequently doesn't work.
"seems to" isn't good enough, especially since it's entirely possible to generate code that doesn't give the right answer. 4o is able to write some bad code, run it, recognize that it's bad, and then fix it, if you tell it to.
https://chatgpt.com/share/670086ed-67bc-8009-b96c-39e539791f...
Did you actually run the "fixed" code here? Its output is an empty list, just like the pre-"fixed" code.
Hm, actually, it's confusing, because clicking the [>_] links where it mentions running code gives different code than it just mentioned.
Despite the name "mini", it is actually more optimized for code, so that makes sense.
The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a python interpreter to check their work.
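The paste-the-exception-back workflow described above can be sketched as a small loop. Everything here is an assumption made for illustration: `ask_model` is a hypothetical callable (prompt text in, Python source out), `fake_model` is a stand-in so the example runs without any API, and the snippet is run with `exec`, which is only acceptable for code you trust.

```python
import contextlib
import io
import traceback

def run_snippet(code):
    """Run `code` in a fresh namespace; return (ok, captured stdout or traceback text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__snippet__"})  # unsafe for untrusted code
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()

def solve_with_feedback(ask_model, prompt, max_rounds=3):
    """Ask for code, run it, and feed any exception back into the next prompt."""
    for _ in range(max_rounds):
        code = ask_model(prompt)
        ok, result = run_snippet(code)
        if ok:
            return code, result
        prompt = f"{prompt}\n\nYour last attempt raised:\n{result}\nPlease fix it."
    return None, None

def fake_model(prompt):
    # Stand-in for a real model: the first attempt has a type error,
    # and the retry (whose prompt contains the traceback) is fixed.
    if "raised" in prompt:
        return "print(sum([1, 2, 3]))"
    return "print(sum([1, 2, '3']))"

if __name__ == "__main__":
    code, output = solve_with_feedback(fake_model, "Sum the numbers 1, 2 and 3.")
    print(code)
    print(output)
```

Note that this loop only catches exceptions. Code that runs cleanly but prints the wrong answer passes straight through, so it still needs a separate check on the output.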
Yes, that helps. But if you iterate on this a few times (as I did last year with Code Interpreter), it reveals how much LLM's "like" to imitate patterns. Sure, often it will pattern-match on a useful fix and that's pretty neat. But after I told it "that fix didn't work" a couple times (with details about the error), it started assuming the fix wouldn't work and immediately trying again without my input. It learned the pattern! So, I learned to instead edit the question and resubmit.
LLM's are pattern-imitating machines with a random number generator added to try to keep them from repeating the same pattern, which is what they really "want" to do. It's a brilliant hack because repeating the same pattern when it's not appropriate is a dead giveaway of machine-like behavior. (And adding a random number generator also makes it that much harder to evaluate LLM's since you need to repeat your queries and do statistics.)
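One common way that randomness enters is temperature sampling over the model's next-token scores. The sketch below is a generic illustration of that technique, not a description of any specific product; the token names and scores are made up, and `sample_token` is a hypothetical helper.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token from `logits` (a dict of token -> score).

    temperature == 0 always returns the highest-scoring token;
    higher temperatures spread the choice over lower-scoring tokens.
    """
    tokens = list(logits)
    if temperature == 0:
        return max(tokens, key=lambda t: logits[t])
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    scores = {"again": 2.0, "instead": 1.0, "maybe": 0.5}  # made-up scores
    print([sample_token(scores, 0, rng) for _ in range(5)])
    print([sample_token(scores, 1.0, rng) for _ in range(5)])
```

With temperature 0 the same continuation comes back every time, which is the "repeat the same pattern" behaviour; with a positive temperature, repeated queries can differ, which is why evaluations need several runs and some statistics.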
Although zero-shot question-answering often works, a more reliable way to get useful results out of an LLM is to "lean into it" by giving it a pattern and asking it to repeat it. (Or if you don't want it to follow a pattern, make sure you don't give it one that will confuse it.)
If I understood correctly, that anecdote in the first paragraph looks like an interaction with a child who is trying something but lacks confidence.
It did look that way and it's a fun way to interpret it, but pattern-matching on a pretty obvious pattern in the text (several failed fixes in a row) seems more likely. LLM's will repeat patterns in other circumstances too.
I mean, humans do this too... If I tell an interviewee that they've done something wrong a few times, they'll have less confidence going forward (unless they're a sociopath), and typically start checking their work more closely to preempt problems. This particular instance of in-context pattern matching doesn't seem obviously unintelligent to me.
This was code that finished successfully (no stack trace) and rendered an image, but the output didn't match what I asked it to do, so I told it what it actually looked like. Code Interpreter couldn't check its work in that case, because it couldn't see it. It had to rely on me to tell it.
So it was definitely writing "here's the answer... that failed, let's try again" without checking its work, because it never prompted me. You could call that "hallucinating" a failure.
I also found that it "hallucinated" other test results - I'd ask it to write some code that printed a number to the console and told it what the number was supposed to be, and then it would say it "worked," reporting the expected value instead of the actual number.
I also asked it to write a test and run it, and it would say it passed, and I'd look at the actual output and it failed.
So, asking it to write tests didn't work as well as I'd hoped; it often "sees" things based on what would complete the pattern instead of the actual result.
Sonnet-3.5 seems a lot better at backing correct fixes out of TypeScript compiler errors than Python runtime errors. Which fair enough, I'm better at that too.
Of the two or three languages these things have enough training data on to hit "above average StackOverflow answer on demand", I'm being forced to re-evaluate my sometimes strident forecasts that LLM coding was mostly hype. I'm not quite ready to eat crow yet, but I've made sure there's clean silverware in case I need to (and I will admit it if I was conclusively full of shit).
It's still wildly over-stated and it's still a delicate game to come out ahead on the correct code after the hallucination rabbit holes have been deducted, but in certain verticals LLMs have become my first stop.
In the "strictly better than the sort of people who do this" regime is clickbait tech blog posts. I now almost always have them write me some fairly generic rant with a catchy title when I'm in the mood to read the sort of shit that gets frontpage because title. I don't post them because I'm not a spammer, but for my own private amusement? Beats the hell out of basically any low-detail technology essay. In a macabre way that's to me the more interesting commentary on theory of mind.
Don't take my word for it, but this crow is delicious.
This test plainly shows that even with the real solution in the training data, the wrong answer is written as though it's the correct answer. A human would say, "I'm not sure, I want to test it." The current AI summer is heaving with breathless claims of intelligence, comprehension, reasoning, etc.
I think these claims need to be balanced with a cold shower of reality. Personally, I find LLMs very impressive at what they do well; generating and summarizing and translating. People apologizing for LLMs' performance at straight-forward reasoning and programming tasks, suggesting various crutches and head-starts, gives me the creeps. It's not the Messiah. It's a very naughty computer program.
This seems like a terrible test case since python examples are readily available in the training data: https://rosettacode.org/wiki/Cheryl%27s_birthday
It's interesting that so many of the models fail to retrieve this, but any that do solve it should clearly be able to do so with no reasoning/theory of mind.
I agree this is not a great test. What's good about it is that it is a constraint satisfaction problem, and I would expect LLMs to be pretty bad at unknown problems of this kind. Simple reason: an LLM only has a finite number of layers and it cannot do arbitrarily long searches.
I almost made ChatGPT write a Python program that creates a monthly work schedule (for imaginary workers) based on specific constraints (e.g. there are 10 workers, 2 shifts (morning and night), must work 40 hours per week, must have at least one weekend in a month off, 2 minimum workers per shift, no more than 3 consecutive working days, and so forth).
I am not sure if I could make it give me a working solution, however, and I have not tried Claude, for example, and I have not tried to do it with other programming languages. Maybe.
The issue was that it messed up the constraints and there were no feasible solutions, that said, it did give me a working program for this that had fewer constraints.
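For a sense of what such a generated program has to do, here is a deliberately reduced sketch of a backtracking search over a schedule. It assumes only two of the constraints mentioned above (a minimum crew per day and a cap on consecutive working days) and ignores shifts, weekly hours and weekends; the function and parameter names are hypothetical.

```python
from itertools import combinations

def schedule(workers, num_days, per_day=2, max_consecutive=3):
    """Pick `per_day` workers for each day so that nobody works more than
    `max_consecutive` days in a row. Returns a list of crews, or None."""
    plan = []  # plan[d] is the tuple of workers assigned to day d

    def allowed(crew):
        for worker in crew:
            streak = 0
            for earlier in reversed(plan):  # count the streak ending yesterday
                if worker in earlier:
                    streak += 1
                else:
                    break
            if streak + 1 > max_consecutive:
                return False
        return True

    def extend():
        if len(plan) == num_days:
            return True
        for crew in combinations(workers, per_day):
            if allowed(crew):
                plan.append(crew)
                if extend():
                    return True
                plan.pop()  # undo and try the next crew
        return False

    return plan if extend() else None

if __name__ == "__main__":
    print(schedule(["A", "B", "C", "D"], num_days=7))
    print(schedule(["A", "B"], num_days=4))  # infeasible under these constraints
```

The second call shows the failure mode from the comment above in miniature: if the constraints contradict each other, the search simply exhausts every option and reports that no feasible schedule exists.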
I don't understand what you're saying - the idea is that we're asking the LLM to generate code to perform the search, rather than run an arbitrarily long search on its own, right? So why should the number of layers it has matter?
It is fascinating that the R solution just below the Python solution is much shorter and more readable. The same applies to Ruby and various Lisps.
It even applies to the VisualBasic solution!
LLMs and NLP are to verbal reasoning what the calculator is to quantitative reasoning.
Language and by extension verbal reasoning is full of ambiguity and semantic slipperiness. For example, what degree of semantic similarity distinguishes synonymous from synonym-ish concepts? When do we partition concepts into homonyms?
I think part of the problem with how people evaluate LLMs is the expectations that people have. Natural language != ontology. The expectation should be more Chomsky and less Boole. Asking it to solve math problems written in paragraph form is a waste of time. Use a calculator for that! Solving riddles? Code it up in Prolog!
Instead you should be thinking of what operations you can do on concepts, meaning, and abstract ideas! That is what these things do.
Is this really verbal reasoning? It's just a logic problem.
How can one / should one combine the concepts of a dinosaur and monetary policy of the Ottoman Empire? What differentiates verbal reasoning from logic?
I don’t know that either of those can be solved well with formal languages or logic.
Follow up in this one… I asked an LLM to give me the funniest way to combine the concepts of a dinosaur and monetary policy of the Ottoman Empire. This was the answer.
Imagine a “Dinoflationosaurus”: a giant dinosaur who has the job of overseeing the monetary policy of the Ottoman Empire. However, this dinosaur is hopelessly behind the times, using outdated gold coins that are buried in random locations, like a prehistoric central bank.
Instead of regulating currency or adjusting interest rates, the Dinoflationosaurus spends its days stomping around, either hoarding or releasing massive piles of treasure based on whether it sees its shadow, causing huge economic fluctuations. Merchants and citizens scramble to predict where the dinosaur will dig next, turning the entire economy into a game of dinosaur-sized hide-and-seek with inflation spikes tied to the beast’s mood swings.
The Ottoman economists, dressed in traditional robes, nervously try to explain to the sultan that no one knows when the giant lizard will “stimulate the economy” by smashing a treasury vault open.
Try to code something like that up in prolog!
Yeah but the Cheryl's birthday problem doesn't have any ambiguity like that. It's all in very simple language, the only complexity is keeping track of states of mind, which is easy to abstract away from the language
That is exactly the point I was making in my comment above. This type of unambiguous problem is best solved using formal languages - something more like quantitative reasoning. But stuff like prolog or classical automated reasoning approaches are quite brittle. They break down quickly when you start to introduce ambiguity and noise. Statistical approaches like hidden markov models that people used in these instances were the precursor to the LLMs we have today.
But I was going down a rabbit hole there. My main point is that trying to use LLMs to solve logic puzzles - that can easily be solved in prolog - is a waste of time and a failure of the imagination. The applications that should be explored and would be most fruitful are those where there is ambiguity and contradiction.
Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.
It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate, so a small increase in problem complexity requires a significant increase in LLM power.
> It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate
That would be any problem more complicated than O(n) complexity, even with chain-of-thought prompting[1].
Note that the O(n) thing can bite you in all sorts of unintuitive ways: if the LLM+CoT can perform an O(n) Task A and O(m) Task B, then it can't do the O(nm) task "for every step of A, perform B on the result" unless you come up with a task-specific prompt outlining the solution. The alternative is to play RLHF Whack-A-Mole, separately training the LLM on the combined task. (I think this weakness might be why LLMs are hitting a wall in enterprise deployment, and also explains why LLM agents don't actually work.) The only way this will get fixed is with a fundamentally more sophisticated architecture.
[1] https://www.quantamagazine.org/how-chain-of-thought-reasonin...
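The composition point above is just step counting, which a toy example can make explicit. The two tasks below are made-up stand-ins (names and arithmetic are assumptions); the only thing the sketch shows is that running an m-step task for every output of an n-step task costs on the order of n times m steps.

```python
def task_a(items, counter):
    """An O(n) task: one step per input item."""
    out = []
    for x in items:
        counter["steps"] += 1
        out.append(x + 1)
    return out

def task_b(x, m, counter):
    """An O(m) task: m steps for a single input."""
    total = 0
    for k in range(m):
        counter["steps"] += 1
        total += x * k
    return total

def combined(items, m, counter):
    """For every result of task A, perform task B: n + n*m steps in total."""
    return [task_b(y, m, counter) for y in task_a(items, counter)]

if __name__ == "__main__":
    counter = {"steps": 0}
    combined(list(range(10)), 20, counter)
    print(counter["steps"])  # n + n*m steps for n=10, m=20
```

Each task alone grows linearly in its own input size, but the combined task grows with the product, which is the jump the comment is describing.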
If we are pointing in the right direction(s) then yes, next iteration could resolve all problems.
If we are not pointing in the right direction(s) then no, next iteration will not resolve these problems.
Given LLMs rapid improvement in regurgitating knowledge from their training data but simultaneously slow improvement in their ability to generalize (such as logic "puzzles"), I think it is naive to assume we're pointed in the right direction. Maybe we're even pointing in mostly the right direction. But why assume we are?
We can continue in the direction we are going while simultaneously considering it might not be well aligned. If we are well aligned, that gives us more confidence and makes gathering funding easier. If we aren't, well it is easier to course correct sooner than later. In either case, you benefit from the analysis.
Understanding why things fail is more important than understanding why things succeed.
GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not.
Thus observers of the LLM space like us need to keep finding novel “bellwether problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about it openly the problem will no longer be a useful bellwether.
By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on it will become more and more difficult to discern “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.
Oh, thanks for the correction. I did misinterpret.
Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about but they still aren't becoming better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction of reasoning (as opposed to a similar direction). But the strongest evidence to me is that they're entropy minimizers.
What's extra interesting, is transformers CRAVE augmentations. I work in vision and this is a necessary thing to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.
> Oh, thanks for the correction. I did misinterpret.
Illustrates that language is hard for humans too, hah.
Anyway, the "next iteration solves it" effect is definitely a result of common problems leaking. But it could also be a result of LLMs being universal but not efficiently universal problem solvers, and of people tending to choose the simplest problem that can't be solved (such theories seem illustrative).
Also, your river-crossing problems seem quite useful.
The issue is an inability to reason. Sure, a human might also have difficulties with river crossing problems, even trivial ones, but I can't get a person to tell me that all animals can fit in the boat, to then put all the animals into the boat, and then proceed to make multiple trips across the river. If they get the first two they always get the right answer. But this is not true for an LLM. That's a very clear demonstration of a lack of reasoning and a lack of having a world model.
It's not about coaching or finding the right prompt, it's that the logic is inconsistent and unreasonable (yes, humans will fail at logic, but *reasoning doesn't mean correct answer*). It fails to meet the basic definition of reasoning.
The whole fucking goal is generalization. That's the G in AGI and the most important thing in all 3 of those letters. We don't have strong evidence of generalization. For GI we want out of distribution generalization but we're not doing so well at in distribution generalization. That's demonstrated by the river crossing puzzles, Cheryl's birthday, and the recently famous 9.8 vs 9.11 (https://x.com/sainingxie/status/1834300251324256439)
Yes, next iteration will get better. But better in which direction. Being dismissive of what it fails at just means you don't get better at that direction unless you get lucky.
Oh interesting, what variants were you using that it seems they patched?
Well it depends on what you mean patched, but I'll be generous and say that it gets the right answer and there are no logical errors. I used to use
and that worked until the second or third update on GPT 3.5 (when everyone was like "but did you try 3.5?")
Here's a river crossing puzzle with the current GPT-4o that "works"
Idk why it decided to add the extra constraint of keeping the animals quiet but hey, it doesn't really matter.
As far as I know, I'm the first one that came up with this variant and strangely it has been consistently answered correctly when 4o came out (after my post got repeated by a few bigger names) so I thought it was patched, but just now I wasn't able to reproduce a success with various checkpoints and system prompts, so I guess it got undone?
claude-3-5-sonnet-20240620 and claude-3-opus-20240229 both act like it is the normal puzzle.
It doesn't matter if you say "All three can fit in the boat with the farmer." Nor if you ask it to explain its logic step-by-step.
The only humans I've seen struggle with this variant ask me to repeat the question several times before saying something similar to "I don't understand the trick. You just take all the animals across at once?" But the trick is there's no trick and I accept that as an answer lol.
FWIW, I've had several LLMs explain to me that all animals can fit in the boat and then proceed to do the normal solution to the puzzle. I've also seen it put all animals into the boat and then only drop one off and travel back and forth, or sometimes animals teleporting or even the farmer. Sometimes the farmer is just going back and forth. That happens a lot when I use a variant where animals will follow the farmer and can swim.
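The "no trick" answer can be checked mechanically with a breadth-first search in which the boat capacity is a parameter. The exact wording of the variant is not given above, so this sketch assumes the classic wolf, goat and cabbage setup; the names and state encoding are illustrative.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"wolf", "goat", "cabbage"})      # assumed cast of the puzzle
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]    # pairs that can't be left alone

def safe(bank):
    """A bank without the farmer is safe if it holds no unsafe pair."""
    return not any(pair <= bank for pair in UNSAFE)

def solve(capacity):
    """Shortest list of crossings (each a tuple of cargo), or None."""
    start = (ITEMS, "left")           # (items on the left bank, farmer's side)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if not left and farmer == "right":
            return path
        here = left if farmer == "left" else ITEMS - left
        for size in range(capacity + 1):
            for cargo in combinations(sorted(here), size):
                moved = frozenset(cargo)
                if farmer == "left":
                    new_left, new_farmer = left - moved, "right"
                    unattended = new_left
                else:
                    new_left, new_farmer = left | moved, "left"
                    unattended = ITEMS - new_left
                state = (new_left, new_farmer)
                if safe(unattended) and state not in seen:
                    seen.add(state)
                    queue.append((state, path + [cargo]))
    return None

if __name__ == "__main__":
    print(len(solve(capacity=1)))  # the classic puzzle
    print(solve(capacity=3))       # the "everything fits" variant
```

With a capacity of three the search returns a single crossing carrying everything, while the classic capacity of one needs the familiar back-and-forth.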
Very often I see the better models great at getting the right answers but fail (or even worse) when explaining that answer. I don't think this makes the models useless, but I feel that their highest utility is mostly due to Google's ever decreasing quality. But what can you do?
I like the noise variant! The "everything can fit in one trip" variant is the one I've been using, and I was able to get 4o to get it right in one shot with enough coaching, and o1-preview without coaching, which convinced me (of what, I'm not sure). My other riddle is the car accident doctor son one, which 4o couldn't get but o1-preview does.
I'll have to come up with more obscure riddles and not talk about them online and only use temporary chats which aren't used as training data and see what happens next. I'm sure I have a puzzle book in my library that I can use to help me make new ones.
Be careful with coaching. It's very easy to leak information. The point is to get it without letting it know.
As for o1, well I've been using this for a year and a few big players have used it too. So remember that they get spoiled because they end up in the training set.
good point! problem is, I can't know what other people have spoiled it on either, so if we'd independently come up with the now spoiled "the boat can take all" variant, I can't know unless that gets revealed over Twitter or arXiv or HN or wherever.
We won't know if it's spoiled, or rather how spoiled, it is unless the companies release their training data.
But, in this case we can study in a different way. Use things we are certain are spoiled. That's what the author here does.
But as an ML researcher, I'll let you know that I don't trust a single reasoning paper I've read.
You either have to start with the premise that the thing you're testing is in the training data (and thus spoiled), so you typically look at generalization and how robust it is. You can't prove reasoning this way but you can disprove it this way. This also works for theory of mind (it seems many HN readers failed to read the first paragraph).
The other way is you need to prove that the data isn't in training (for a strong condition you need to prove that it's not even indirectly in the data...). You still can't prove reasoning this way but you would build strong evidence that it is going on (proving reasoning is very tough, if possible). I think if this was shown, consistently, then most of the conversations about LLMs not reasoning would go away and we'd discuss like humans: capable of reasoning, but not necessarily always doing so.
But ML is in an existential crisis right now. Theory means nothing without experimentation but experimentation means nothing without theory. See von Neumann's elephant
... and such that the same increase in problem complexity requires a smaller increase in human effort to solve.
This was the idea with the Winograd schema challenge [0] and now the ARC benchmark [1], but human-level performance on the former was achieved in 2019, and very strong progress has been made over the last few months on the latter. At the current point in time, it seems that we're pretty much reaching the limit of such challenges that are relatively easy for humans to solve in a single sitting, and we'll have to start switching to benchmarks which rely on extensive work over time, such as SWE-Bench [2], and even there it seems that state of the art AI agents are already doing better than the "average" human developer.
[0] https://en.wikipedia.org/wiki/Winograd_schema_challenge
[1] https://arcprize.org/
[2] https://www.swebench.com/
this is an interesting problem but it’s more of a logic problem than a true test of theory of mind. when i think “theory of mind” i think being able to model an external agent with complete knowledge, incentives, and behavior. i would not doubt LLMs have something close to this for humans, almost by accident since they are trained on human outputs.
I think you are right here - the ability to test theory of mind in an LLM would be more like testing how well it can distinguish its own motivations/ideas from that of a separate entity.
I would agree that this question is more of a logic puzzle and less of a real test of 'theory of mind'
In fact, just to have a theory of mind, it kind of assumes you have a mind, with your own ideas/motivations/etc
I would venture to guess that it has value in that we couldn't even begin to properly test a theory of mind if they can't yet reason through a logic puzzle.
It would be like testing if a society could build a rocket ship when they don't know how to use tools.
Children as young as 4 demonstrate theory of mind but would be unable to solve artificial logic problems like OP
Fair point, but I'd say that I don't think of LLMs as anthropomorphic individuals and more like societies acting in statistical concert. Maybe this is a failure on my part and is the wrong approach.
What animals have theory of mind?
Ravens do: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559847/
I don't have a link but I believe squirrels do, for the exact same reason: their lives are full of intense intra-species competition, with other squirrels constantly raiding each others' stashes. IIRC squirrels will even deceptively hide acorns if they know another squirrel is watching.
Many, certainly many primates.
Our ability to understand animal intelligence is limited by our ability to ask appropriate questions, so we tend to underestimate intelligence outside of the primate space.
Here's an excellent book on this topic: https://www.goodreads.com/book/show/30231743-are-we-smart-en...
This question was posed to o1, it is able to reason through it - but now I wonder if that is because the model is already aware of the puzzle.
https://x.com/d_feldman/status/1834313124058726894
I think that could be likely. I just asked 4o "When is Cheryl's birthday?" without any other context and was given this reply
Cheryl's birthday puzzle is a logic problem where Albert and Bernard are trying to figure out Cheryl's birthday based on certain clues.
Cheryl provides them with ten possible dates: May 15, May 16, May 19, June 17, June 18, July 14, July 16, August 14, August 15, and August 17.
Here’s the reasoning:
1. Albert knows the month and Bernard knows the day.
2. Albert says he knows Cheryl’s birthday, meaning May and June can be eliminated because they contain unique days (May 19 and June 18). If Albert had been told May or June, he wouldn’t know for sure.
3. Bernard, knowing this, says he now knows Cheryl’s birthday. This eliminates the remaining dates with unique days (July 14 and August 14).
4. Albert then confirms that he also knows the birthday, meaning Cheryl’s birthday must be in July or August, but on a date with no unique days left: July 16, August 15, or August 17.
Thus, Cheryl's birthday is *July 16*.
I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution so I asked it to write another python program that generates valid date sets.
Here's the code it generated: https://gist.github.com/tel/8e126563d2d5fb13e7d53cf3adad862e
To my test, it has absolutely no trouble with this problem and can correctly translate the "theory of mind" into a progressive constraint solver.
Norvig is, of course, a well-respected researcher, but this is a bit disappointing. I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.
I think it's possible that there exists some implementation of this problem, or something close enough to it, already in Claude's training data. It's quite hard to disprove that assertion. But still, I am satisfied with the code and its translation. To relate the word problem to this solution requires contemplation of the character's state-of-mind as a set of alternatives consistent with the information they've been given.
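To make "a set of alternatives consistent with the information they've been given" concrete, here is a minimal sketch of that kind of progressive constraint solver. It is not the code from the gist above and not Norvig's program; the function name `solve` and the tuple representation of dates are my own assumptions. It assumes the standard wording of the puzzle: Albert is told the month, Bernard the day, and the three statements are made in the usual order.

```python
from collections import Counter

# The ten candidate dates from the standard puzzle.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def solve(dates):
    """Filter the candidate set once per statement (hypothetical helper)."""
    day_counts = Counter(d for _, d in dates)
    month_counts = Counter(m for m, _ in dates)

    # Statement 1: Albert doesn't know, and he knows Bernard doesn't know.
    # So Albert's month has several dates and none of them has a unique day.
    months_with_unique_day = {m for m, d in dates if day_counts[d] == 1}
    after_1 = [(m, d) for m, d in dates
               if month_counts[m] > 1 and m not in months_with_unique_day]

    # Statement 2: Bernard didn't know at first, but now he does.
    # His day was ambiguous originally and is unique among the survivors.
    day_counts_1 = Counter(d for _, d in after_1)
    after_2 = [(m, d) for m, d in after_1
               if day_counts[d] > 1 and day_counts_1[d] == 1]

    # Statement 3: Albert now knows too.
    # His month is unique among what is left.
    month_counts_2 = Counter(m for m, _ in after_2)
    return [(m, d) for m, d in after_2 if month_counts_2[m] == 1]

print(solve(DATES))  # [('July', 16)]
```

Each filter models what one character can rule out given what the other just said, which is the part of the word problem people are calling "theory of mind" here. A date set with no valid solution simply returns an empty list.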
>> I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution so I asked it to write another python program that generates valid date sets.
That's good but no cigar and it certainly didn't get it "correct on the first try". First it generated a partially correct solution. Then you had to prompt it again to generate a new program. You were only able to do that because you know what the right answer looks like in the first place. The second program is missing a second set of dates so it's not clear if it really gets it right the second time or just reproducing a different program from its training set without understanding the problem and what's wrong with the first program and just because you asked it to do so instead.
>> I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.
That's not how testing LLM code generation is done in practice, exactly because of the variance that can be expected in generated results. To properly test an LLM (which I would agree Norvig's experiment falls a little short of) one has to run multiple experiments and evaluate all the results in aggregate in some form. The usual way to do it is to draw k samples from the LLM's distribution and check whether the correct answer is generated at least n times (k@n metric). That's an awful metric because it's basically allowing arbitrary "guesses" until the LLM code generator gets it right. A simpler test is to generate k programs, check whether each program is right or wrong, and assign 1 for each correct answer and 0 for each incorrect answer, then average over all answers. It's an open question whether to count a partial answer as a 0, or 0.5.
So if we took the total failure in Norvig's experiment and the only partial success in yours, and allowing for the most charitable aggregation of results, we have something like 0.25 accuracy, which is not that impressive. Not least because it's evaluated on just two test samples.
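For what it's worth, the two aggregation schemes described above fit in a few lines. This is only a sketch: the function names are made up for illustration, and the scores are the two data points from this thread (Norvig's failure as 0, the Claude run as a charitable 0.5), not real benchmark results.

```python
def mean_accuracy(scores):
    """Average per-sample scores: 1.0 correct, 0.0 wrong, 0.5 partial (optional)."""
    if not scores:
        raise ValueError("need at least one sample")
    return sum(scores) / len(scores)

def at_least_n_correct(scores, n):
    """The 'k samples, at least n fully correct' style of check."""
    return sum(1 for s in scores if s == 1.0) >= n

# Two samples: one total failure, one partial success scored charitably.
scores = [0.0, 0.5]
print(mean_accuracy(scores))          # 0.25
print(at_least_n_correct(scores, 1))  # False
```

Scoring the partial answer as 0 instead of 0.5 drops the average to 0.0, which shows how much the result depends on that one open choice when there are only two samples.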
Also, please don't underestimate the knowledge of experts like Peter Norvig.
They could have just put the correct result into claude in the time since he wrote this (Sept 25th).
He also did not test o1-preview which I feel would do well on this class of problem.
I'm not a huge fan of using these kind of riddles or gotchas. Other comments have riddle variants which feel also like ways of tripping someone up - if you don't spot the change you fail. And what's more the originals are things that lots of people struggle with (that's why they're riddles not basic questions).
There's also little prompting, which feels like calling up a random person and demanding they solve a riddle straight away without talking it through.
Part of the assumption here is that if LLMs don't write the right code they don't understand what people know. I'd wager that a huge number of people couldn't solve this puzzle yet fully understand that different people have their own internal thoughts and knowledge.
Humans can solve these riddles even if they have never seen a similar one.
LLMs need to have seen at least a similar riddle with the same structure and/or code to plagiarize.
Humans can deal with modified riddles if they are alert and awake. An LLM is always alert and awake, so it should be able to handle modified riddles, but it cannot.
Not to take anything from what you're saying but any person involved in hiring process _for programmers_ will agree that people struggle with trivia far easier than this example.
Yeah but an LLM can't be excused for being nervous or tired or just not operating at its best like a human can. They are always operating at their best, so if they trip up once, that's their limit.
> An LLM can't be excused... LLMs are always operating at their best
Depends how much you turn down the heat.
Yes, but they will only keep getting better; whether we like it or not sooner or later we’re going to be hiring api keys, not people.
"they will only keep getting better"
I don't think that's a given.
It could very well be that they only get smaller, faster, or even worse when they accidentally start training them with hallucinations that spread on the web.
> if you don't spot the change you fail
But that's exactly how real world works too.
No it doesn't.
You'd get the answer to a riddle wrong or miss something and nobody would start assuming that means you lack a fundamental understanding of how the world works. There's entire fields that look into how and why we make various mistakes and riddles and puzzles designed specifically to trip people up.
If you want to test if these models can solve riddles, or where they make mistakes go right ahead that's great. It's any assumption it has a much deeper meaning that is wrong to me.
>> You'd get the answer to a riddle wrong or miss something and nobody would start assuming that means you lack a fundamental understanding of how the world works. There's entire fields that look into how and why we make various mistakes and riddles and puzzles designed specifically to trip people up.
That's because with humans we assume a certain level of competency and intellectual ability. We cannot make the same assumption when testing AI systems like LLMs because their level of competency and intellectual ability is exactly the question we are trying to answer in the first place.
Note that getting an answer a little wrong, because the question looks like a question you already know the answer to, can be catastrophic in real world conditions. Tipping a frying pan over a plate on a table to serve an omelette when you've learned to do the same thing to serve a cooked shrimp works just fine and shows everyone how smart you are and how well you generalise to novel situations, right up to the point where the contents of the frying pan are on fire and you still tip them over a plate, on a table. Made of flammable wood. Oops.
Also note: a human may be confused by the Tsathoggua-Cthuga-Cxaxukluth river-crossing riddle but they'd never be confused about the danger of a frying pan on fire.
> Also note: a human may be confused by the Tsathoggua-Cthuga-Cxaxukluth river-crossing riddle but they'd never be confused about the danger of a frying pan on fire.
Which highlights the problem with using these riddles to assess other capabilities.
This is such a strange and incoherently adjacent answer.
I wasn't talking about riddles, I was talking about real world. Suddenly something is just a little bit different and if you miss the change you fail. There's plenty of that in real world.
This has not been my experience with the real world. Riddles and gotchas have played a very small role, so far.
Unfortunately they do show up in tech interviews.
“What’s the correct way to delete an element from a Java array while iterating over it?”
“Well I suppose you’d need to avoid invalidating the iterator state but I don’t recall the…”
“BZZT!”
I am not sure I understand. It seems very easy. You cannot directly remove an element from an array, you could create a new array that excludes that element. Arrays have a fixed size once declared, and I cannot imagine anyone who has written some code not knowing. :/
I "have written some code" but it's been decades since I've done anything significant in Java in particular, and every language handles arrays (and/or data structures that get called "arrays") differently.
The terminology may be confusing, yes, although you would rather call them dynamic arrays or lists (like in Common Lisp). Plus you did say "decades", that is a long time nevertheless. I was not referring to people who have written some code decades ago, of course.
The point of my fake narrative was that failing to recall trivia about exactly which method to call is considered failure.
And btw, some arrays are fixed, but many languages call something an “array” which is dynamically-sized.
Most people with theory of mind can’t trivially solve this problem though. So the test doesn’t disprove ToM in general, just that it memorizes some results.
People are capable of theory of mind. That does not mean they're using it. Same is true for reasoning.
People messing up in this puzzle are not confused about the idea that people know things others don't.
Would you care to explain how that responds to my point? I didn't feel the need to specify that ToM can be replaced not only with reasoning but with logic, and my point will still stand.
The more dangerous - and far more ridiculous - assumption is that if the LLM does write the correct code then it does understand what people know. I am amazed at how common that view is here.
This seems to amount to asking an LLM how it feels about Cheryl, discovering that it is performatively happy about her existence, and then deducing that the LLM has no capacity for genuine emotion, expressed in the form of logic.
The faulty premise lies in the formulation of the test. It makes the responses predictable, and it also does a disservice to 'mind' because it tries to interpret it in such a way that an LLM could begin to grapple with the basics, but not in a meaningful way.
Perhaps it is useful to help build better context-specific logic flows (generally known as software) but it doesn't seem to provide any progress on the "theory of mind" front, which I guess is a borrowed notion.
Script generated by o1-preview:
Output: Cheryl's birthday is on July 16
What is a software program?
The codification of a solution.
What is a solution?
An answer to a problem.
What is a problem?
The identification and expression of a need to be satisfied.
What is a need?
A uniquely human experience, one which only exists within the minds of people who experience it.
Do animals not have needs?
> Do animals not have needs?
We are animals as well, so a pedantic answer would be "yes." :-)
My actual answer is the origin of the thought process began by Norvig stating:
So while the concept of "need" is shared across many entities and widely applicable to many conversations, in this context it serves as a root cause answering "why write a program."
It's important to remember that modern LLMs are trained on bloody everything. They know every common logic problem, at least when stated the way they would have seen it.
If you want to test an LLM, always make up a new problem. It can be the same idea as an existing problem, but change all names and numbers.
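One way to "change all names and numbers" without changing the logic is to relabel months and days one-to-one. The sketch below is an assumption about how you might do that, not anything from the article: `relabel_dates` is a made-up helper, and it only preserves which dates share a month or a day, which is all the Cheryl-style puzzle depends on. Calendar order is not preserved.

```python
import random
from collections import Counter

ALL_MONTHS = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]

def relabel_dates(dates, seed=0):
    """Map each month and each day to a fresh label, one-to-one (hypothetical helper)."""
    rng = random.Random(seed)
    months = sorted({m for m, _ in dates})
    days = sorted({d for _, d in dates})
    # sample() draws without replacement, so both mappings are bijections.
    month_map = dict(zip(months, rng.sample(ALL_MONTHS, len(months))))
    # Days 1..28 are valid in every month.
    day_map = dict(zip(days, rng.sample(range(1, 29), len(days))))
    return [(month_map[m], day_map[d]) for m, d in dates]

original = [("May", 15), ("May", 16), ("June", 17), ("July", 16)]
variant = relabel_dates(original, seed=1)
print(variant)
```

Because the mapping is one-to-one, a solver that works on the original set should reach the corresponding answer on the variant, while the surface text no longer matches anything the model has memorized. You'd still want to rename Albert, Bernard and Cheryl in the prompt by hand.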
I tested if GPT 3.5 could recognize chaos theory. If I stated it as the typical "butterfly flaps its wings" it instantly recognized it as the chaos theory example. If I totally changed the problem statement, it correctly identified that weather isn't correlated with a single action by a single person, but it didn't associate it with chaos theory.
Related to this, I asked LLMs to directly solve the same riddle but then obfuscated the riddle so it wouldn’t match training data and as a final test added extraneous information to distract them.
Outside of o1, simple obfuscation was enough to throw off most of the group.
The distracting information also had a relevant effect. I don’t think LLMs are properly fine tuned for prompters lying to them. With RAG putting “untrusted prose” into the prompt that’s a big issue.
https://hackernoon.com/ai-loves-cake-more-than-truth
The approach is fundamentally flawed. You can’t query an LLM as to whether it has a theory of mind. You need to analyze how its internal logic works.
Imagine the opposite result had occurred, and the LLM had outputted something which was considered a theory of mind… Does that prove it has one, or that it was trained on some data that had something it used which made it sound like it has a theory of mind?
I think the test is better than many other commenters are giving credit. It reminds me of responses to the river crossing problems. The reason people do tests like this is because we know the answer a priori or can determine the answer. Reasoning tests are about generalization, and this means you have to be able to generalize based on the logic.
So the author knows that the question is spoiled, because they know that the model was trained on wiki. They also tested to see if the model is familiar with the problem in the first place. In fact, you too can confirm this by asking "What is the logic puzzle, Cheryl's birthday?" and they will spit you out the correct answer.
The problem also went viral, so there are even variations of this. That should tell us that the model has not just been trained on it, but that it has seen it in various forms and we know that this increases its ability to generalize and perform the task.
So then we're left with reasoning. How do we understand reasoning? It is the logical steps. But we need to make sure that this is distinct from memorization. So throwing in twists (as people do in the river puzzles) is a way to distinguish memory from logic. That's where these models fail.
People always complain that "oh, but humans can't do it." I refer to this as "proof by self-incompetence." (I also see it claimed when it isn't actually true) But not everybody reasons, and not all the time (trivial cases are when you're asleep or in a coma, but it also includes things like when you're hangry or just dumb). Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).
There's also a key part that many people are missing in the analysis. The models were explicitly asked to *generalize* the problem.
I'll give some comments about letting them attempt to solve iteratively, but this is often very tricky. I see this with the river crossing puzzles frequently, where there is information leakage passed back to the algo. Asking a followup question like "are you sure" is actually a hint. You typically don't ask that question when it is correct. Though newer models will not always apologize for being wrong, when actually correct, when they are sufficiently trained on that problem. You'll find that in these situations if you run the same prompt (in new clean sessions) multiple times that the variance in the output is very low.
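If you want to check the low-variance claim yourself, something like the following works as a rough sketch. `ask_model` is a placeholder for whatever function sends one prompt in a fresh session and returns the reply text; it is an assumption, not a real API, and the stub below only stands in for it so the example runs.

```python
from collections import Counter

def answer_agreement(ask_model, prompt, n_runs=10):
    """Send the same prompt n_runs times and report how often the top answer appears."""
    answers = [ask_model(prompt).strip().lower() for _ in range(n_runs)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / n_runs, counts

def fake_model(prompt):
    # Stub standing in for a real model call (assumption for illustration).
    return "July 16"

top, agreement, counts = answer_agreement(fake_model, "When is Cheryl's birthday?", n_runs=5)
print(top, agreement)  # july 16 1.0
```

Exact string matching is crude; in practice you'd extract just the final answer from each reply before counting. An agreement rate near 1.0 on a well-known puzzle, dropping sharply on a lightly modified one, is the pattern being described.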
Overall, a good way to catch LLMs in differentiating reasoning from memorization is getting them to show their work, the steps in between. It isn't uncommon for them to get the right answer but have wrong steps, even in math problems. This is always a clear demonstration of memorization rather than reasoning. It is literally the subtlety that matters.
I suspect that one of the difficulties in humans analyzing LLMs is that there is no other entity that is capable of performing such feats that does not also have a theory of mind and a world model. But a good analogy might be in facts that you know, but not understanding why they are "the answer." I'm sure there's many people who have memorized complexities for many sorting algos or leet code problems and couldn't derive the answer themselves.
But I really don't understand why we *need* LLMs to reason? A dictionary memorizes things, and so does wikipedia. Their lack in ability to reason does not make them any less marvelous of inventions/tools. But maybe, if we're looking to create intelligent and thinking machines, it isn't as simple as scale. We love simple things, but few things are simple and correct (though far more things are simple and approximately correct).
>I think the test is better than many other commenters are giving credit.
The test is fine. The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x. x may be important to your test for sure but you're testing for something else too. Or maybe humans don't have x after all. Either conclusion is logically consistent at least. It's the middle, "rules for thee but not me" conclusions that are tiring.
Like it's theory of mind. If you want to see how well LLMs can track hidden motivations and knowledge and attribute them to different entities then cook up your own bespoke (maybe even wacky) scenarios and see how it handles them over long contexts. That's how to test for theory of mind. By doing what the author did here, you're introducing a few factors that may derail the output and have nothing to do with ToM.
>Humans are different from LLMs. LLMs are giving it 100%, every time.
I don't know how anyone who uses LLMs extensively can genuinely believe this to be true. I mean i'm not sure what this means ? Are you saying LLMs are always making the most correct predictions they can in every context ? Because that's just blatantly false.
Yes models overfit. Yes you can trick them. No it does not necessarily mean they haven't generalized well enough to solve your "subtle variation". And if people weren't so hellbent on being able to say "aha" to the machine, they would see that.
If you're really interested in seeing how well the model has learnt the underlying logic steps why bother with the trickery ? Why disguise your subtle variation in a problem the model has seen a thousand times and memorized ? You can have the same question requiring the same logic but written in a way that doesn't immediately point to an overfit problem (you don't need to worry about if hinting is 'cheating' or not) How is that not a better test of generalization ?
And i'm not saying that the tests with the trickery or subterfuge are useless or to be done away with, just that you are no longer just testing the ability to generalize.
Humans have the *ability* to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment)
So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".
It's correct to point out that not all humans can solve this puzzle. But that's an irrelevant fact because the premise is not that humans always reason. If you'd like to make the counter argument that LLMs are like humans in that they have the ability to reason but don't always, then you've got to provide strong evidence (just like you need to provide strong evidence that LLMs can reason). But this (both) is quite hard to prove because humans aren't entropy minimizers trained on petabytes of text. It's easier to test humans because we generally have a much better idea of what they've been trained on and we can also sample from different humans that have been trained on different types of data.
And here's a real kicker, when you've found a human that can solve a problem (meaning not just state the answer but show their work) nearly all of them can adapt easily to novel augmentations.
So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these. There's no sleight of hand. There's no magic tokens, no silly or strange wording that would be easily misinterpreted. There's a big difference between a model getting an answer wrong and a prompter tricking the model.
>I think you misunderstand, but it's a common misunderstanding. Humans have the ability to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment)
>So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".
This is all rather irrelevant here. You can sit a human for some arbitrarily long time on this test and he/she will be unable to solve it even if the human has theory of mind (the property we're looking for) the entire duration of the test, ergo the test is not properly testing for the property of theory of mind.
>So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these.
Models are trained to predict text. Solving problems is just what is often the natural consequence of this objective.
It's trickery the same way it can be considered trickery when professors would do it to human testers. Humans and Machines that memorize things take shortcuts in prediction when they encounter what they've memorized "in the wild". That's the entire point of memorization really.
The human or model might fail not because it lacks the reasoning abilities to solve your problem, but because its attention is diverted by misleading cues or subtle twists in phrasing.
And if you care about the latter, fine!, that's not a bad thing to care about but then don't pretend you are only testing raw problem solving ability.
This test does not require theory of mind or test for "theory of mind" because there are many people who have a well formed theory of mind who cannot solve this problem, and well formulated, it can be solved by a simple logic program, which again, would not have any kind of theory of mind. It'd produce a large number of false positives _and_ false negatives.
could be an architectural issue with the LLMs because you need to juggle a lot of states just from one statement regarding a big problem. Sort of like if you ask it to write an app like facebook. It would give you a bunch of crap, which is worse.
The majority of humans in flesh can't solve the problem - so we need alternate measures for judging theory of mind capabilities in LLMs
What about the difference that the human knows what they don't know?
In contrast, the LLM knows nothing, but confidently half regurgitates correlational text that it has seen before.
As far as all the research on this goes, LLMs (internally) mostly know what they know, but incentivizing that information making it to output is difficult.
Well the majority of trained neural network models fail to solve most problems they're tested on, also. When you pick up a neural net paper (or really any machine learning paper) and look at the results of experiments the authors will invariably list the results of their best-performing model, chosen by careful cross-validation over multiple random seeds.
The point of this testing is to estimate the true learning ability of a system by looking at what it can do in the best case scenario. Nobody reports the worst-case scenario, otherwise deep neural nets would look a lot less impressive than they do.
This doesn't measure theory of mind at all, it's just a silly logic puzzle. What we need are AI researchers who have read a psychology book and understand what theory of mind experiments are actually trying to demonstrate.
All humans have the capacity to reason, but not all humans reason. There's a big difference. Hell, even GPT-4o will tell you this.
Here, don't trust me, let's ask GPT about memorization and reasoning
I wonder if there are any unique properties of those programs from which we could figure out what Stack Overflow answers or textbooks they're copying.
Most LLMs won a T-shirt with the following inscription: " I am not as smart as Peter Norvig "!
Nice! I use various LLMs many times a day as a limited coding tool and something to bounce ideas off of, and it is impossible to not think about how LLMs work and what their limitations are.
I tried just asking Claude Sonnet to solve the Cheryl’s Birthday word problem, changing the dates. Pretty cool that it can solve it as a word problem, and LLMs will keep getting better at coding.
As a slight tangent: I used a combination of Gemini, GPT-4o, and Claude last week to write Common Lisp code for a simple RDF data store and the subset of SPARQL queries that I thought I would need in embedded Common Lisp applications. This process was far from automatic: I initially provided almost two pages of English instructions, and I had to help debug non-working code by adding debug statements and then show the models the code with print statements and the new output. I also did the optional thing of asking for stylistic changes. TLDR: saved me time and I liked the final code.
I always enjoy it when people like Peter and Karpathy write relatively simple code to share ideas. I am a fairly good coder (I had the meaningless title Master Software Engineer at Capital One) but I like to read other people’s code, and I must admit that I spend more time reading code on GitHub than I spend reading technical papers.
How does solving a logic puzzle imply a theory of mind? I don’t mean to say that LLMs don’t have a theory of mind, just that deductive reasoning does not amount to empathetic evaluations of how someone else thinks and feels…
…unless you’re a programmer.
Not really about Theory of Mind, but in the same line, I remember the other day someone argued with me that LLMs model the world, rather than just modelling language (that may represent the world).
I kept thinking about that problem and plausible experiments to show my point that LLMs are dumb about the physical world, even if they know perfectly how it works in terms of language/representation. So I thought, what happens if I give an LLM an image and ask for a representation of said image in ASCII art (obviously without relying on Python and the trivial pixel intensity to character transform it usually proposes). Remember:
- LLMs should've been trained with a lot of RGB image training data with associated captions => So it should understand images very well.
- LLMs should've been trained with a lot of ASCII training data with associated captions => So it should draw/write ASCII like an expert. Plus, it understands vision apparently (managed as tokens), so it should do well.
But it can't do a decent translation that captures the most interesting features of an image into ASCII art (I'm pretty sure a human with an hour of time should be able to do it, even if it's awful ASCII art). For example, I uploaded an image macro meme with text and two pictures of different people kind of looking at each other. The ASCII art representation just showed two faces that didn't look at each other but rather away from each other. It just does not "understand" the concept of crossing gazes (even if it "understands" the language, and even the image patches when you ask where they are looking, it will not draw that humanly important stuff by itself).
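For reference, the "trivial" pixel-intensity-to-character transform mentioned above looks roughly like the sketch below. It assumes the image has already been reduced to rows of grayscale values from 0 (black) to 255 (white); the character ramp and the function name `to_ascii` are arbitrary choices for illustration, not anything a particular model outputs.

```python
# Characters ordered from sparse (light) to dense (dark); an arbitrary ramp.
RAMP = " .:-=+*#%@"


def to_ascii(pixels, ramp=RAMP):
    """Map each grayscale value (0 = black, 255 = white) to a character.

    Darker pixels get denser characters. No notion of faces, gaze, or
    composition is involved, which is the point being made above.
    """
    lines = []
    for row in pixels:
        chars = (ramp[(255 - value) * (len(ramp) - 1) // 255] for value in row)
        lines.append("".join(chars))
    return "\n".join(lines)


# A tiny 2x4 "image": a dark-to-light gradient on top, its mirror below.
art = to_ascii([[0, 85, 170, 255], [255, 170, 85, 0]])
```

Everything here is a per-pixel lookup, so it says nothing about whether the model grasps what is important in the picture.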
These things just work with tokens, and that is useful and seems like magic in a lot of domains. But there is no way in hell we are going to get to AGI without a fully integrated sensor platform that can model the world in its totality, including interacting with it (i.e. trained the way humans are, though not necessarily on the same substrate or, hopefully, over the same training time). And I really don't know how something that has a very partial model of the world can have a Theory of Mind.
However, ask a Generative Adversarial Network for ASCII art and you'll get what you expect. Absent the intra-word character cohesion that an LLM's tokenization provides, it will give realistic, if sometimes "uncanny", images: ones that "make sense" sequentially, or in the short term, but not in the longer or larger context.
The language portion of your brain, which works faster than you do (else you would constantly be at a loss for words), is not nearly as well equipped to deal with spatial problems as your posterior parietal cortex is.
Ultimately we are converging towards a Mixture-of-Experts model that we will one day realize is just... us, but better.
AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this. Theory of mind is about attributing mental states to others, not information. What Norvig has done here is present a logic puzzle, one that works equally well when the agents are Prolog programs instead of clever children. There's no "mind" in this puzzle at all. Norvig is being childishly ignorant to call this "theory of mind." It's hard to overstate my contempt for this kind of useless junk science, especially when it comes from an impressive pedigree.
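To make the "works equally well with programs" point concrete, here is a minimal Python sketch that solves the original Cheryl's Birthday puzzle by set filtering alone. It is not Norvig's code. The ten candidate dates are those of the well-known puzzle, and the function names are illustrative. Each statement is just a predicate over the set of dates still possible.

```python
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def same_month(date, dates):
    """Dates Albert (told the month) would still consider possible."""
    return [d for d in dates if d[0] == date[0]]


def same_day(date, dates):
    """Dates Bernard (told the day) would still consider possible."""
    return [d for d in dates if d[1] == date[1]]


def statement1(date, dates):
    """Albert: I don't know the date, and I know Bernard doesn't either."""
    candidates = same_month(date, dates)
    return len(candidates) > 1 and all(
        len(same_day(d, dates)) > 1 for d in candidates
    )


def statement2(date, dates):
    """Bernard: at first I didn't know, but after Albert's statement I do."""
    remaining = [d for d in dates if statement1(d, dates)]
    return len(same_day(date, dates)) > 1 and len(same_day(date, remaining)) == 1


def statement3(date, dates):
    """Albert: now I know too."""
    remaining = [
        d for d in dates if statement1(d, dates) and statement2(d, dates)
    ]
    return len(same_month(date, remaining)) == 1


def solve(dates):
    """Return every date consistent with all three statements."""
    return [
        d for d in dates
        if statement1(d, dates) and statement2(d, dates) and statement3(d, dates)
    ]


answer = solve(DATES)
```

Nothing in the sketch attributes beliefs, desires, or feelings to anyone: "knowing" is simply having exactly one candidate left.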
Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.
Hell, Norvig could have asked any of the LLMs if this was a good theory of mind test.
CS researchers do partake in some pretty low quality science.
My fav is AI researchers arrogantly rediscovering learning theory and pedagogy.
> AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this.
> Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.
Perhaps this[0] book? It is one of the seminal texts on AI topics. Maybe Norvig should read it.
0 - http://aima.cs.berkeley.edu/
People who fail theory of mind may assume everyone else knows something because they know it.