I have encountered the opposite of this. All of the latest pro-tier models are still fighting for their lives to use PowerShell correctly: really basic things like quotes, escaping, heredocs. It doesn't matter what I put in agents.md or instruct it to do. You just have to accept the token tax of it stomping on rakes until it figures it out itself, and then keep using that session.
It's bad enough that I've considered writing some sort of cursed bash->posh translation layer
Yet it has no issues at all implementing and then writing slopjective-c 3.0
Opus 4.6 has gotten pretty good at writing Powershell.
It’s the first model where I didn’t have to ask, repeatedly, that it use Powershell 5, and never use emojis or other invalid characters, like Gemini and those non-ASCII spaces.
(founder of Lossfunk, the lab behind this research.)
Esolang-Bench went viral on X, and a lot of discussion ensued. Here we address a few of the common questions that came up. Hope it helps.
a) Why do it? Does it measure anything useful?
It was a curiosity-driven project. We're interested in how humans exhibit sample-efficiency in learning and OOD generalization. So we simply asked: if models can zero/few shot correct answers for simple programming problems in Python, can they do the same in esoteric languages as well?
The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.
b) But humans can't write esoteric languages well either. It's an unfair comparison.
Primarily, we're interested in measuring LLM capabilities. With the talk of ASI, it is supposed that their capabilities will soon be super-human. So, our primary motivation wasn't to compare to humans but to check what they can do on this difficult-by-construction benchmark.
However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. So this benchmark was to set a starting point to explore how AI systems can do the same as well (which is what we're exploring now)
c) But Claude Code crushes it. You limited models artificially.
Yes, we tested models in zero-shot and few-shot settings. And in the agentic loop we describe in the paper, we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed like this.
After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better.
The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning / learning like humans, or is it something else?
d) So, are LLMs hyped? Or is our study clickbait?
The paper, code and benchmark are all open source.
We encourage whoever is interested to read it, and make up their own minds.
(We couldn't help but notice that the same set of results was interpreted wildly differently within the community. A debate between opposing camps on LLMs ensued. Perhaps that's a good thing?)
Yep that's the implication. Anecdotally this is obvious to me. I'm using LLMs to write Java and C++, and they can churn out generic plumbing with no issues, but for novel code for a novel implementation of a novel idea, they have no idea what they're doing.
I'm getting good productivity gains, but it requires a lot of hand holding because AI does not know what it's doing.
On far less novel problems I get far better results.
I guess if you tell codex to build a transpiler from a subset of python to brainfuck, then solve in that subset of python, it would work much better. Would that be cheating?
Doing something like that is basically the only way to write unlambda: you start with a lambda calculus (or scheme or whatever) and reduce the lambdas away mechanically. (This is in the unlambda docs!)
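The mechanical step can be sketched in a few lines. This is a rough Python illustration of the textbook abstraction-elimination rules, not the procedure from the Unlambda docs verbatim. It assumes a made-up tuple encoding for terms (`("lam", v, body)` and `("app", f, x)`), assumes variables are never named `s`, `k`, or `i`, and does none of the usual size optimizations. All function names are invented for the example.

```python
def free_in(var, term):
    """True if variable `var` occurs in a lambda-free term."""
    if isinstance(term, str):
        return term == var
    return free_in(var, term[1]) or free_in(var, term[2])


def abstract(var, term):
    """Return a lambda-free term T such that applying T to var gives term."""
    if term == var:
        return "i"                      # [x]x = i
    if not free_in(var, term):
        return ("app", "k", term)       # [x]E = `kE when x is not in E
    _, f, x = term                      # otherwise term is an application
    return ("app", ("app", "s", abstract(var, f)), abstract(var, x))


def compile_term(term):
    """Eliminate every lambda, innermost first."""
    if isinstance(term, str):
        return term
    if term[0] == "lam":
        return abstract(term[1], compile_term(term[2]))
    return ("app", compile_term(term[1]), compile_term(term[2]))


def to_unlambda(term):
    """Print a lambda-free term using the backtick for application."""
    if isinstance(term, str):
        return term
    return "`" + to_unlambda(term[1]) + to_unlambda(term[2])


identity = ("lam", "x", "x")
const = ("lam", "x", ("lam", "y", "x"))
print(to_unlambda(compile_term(identity)))  # i
print(to_unlambda(compile_term(const)))     # ``s`kki
```

The second result is not the shortest possible answer (plain `k` behaves the same), which is the kind of bloat you get from applying the rules blindly.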
I am not surprised by this, and am glad to see a test like this. One thing that keeps popping up for me when using LLMs is the lack of actual understanding. I write Elixir primarily, and I can say without a doubt that none of the frontier models understand concurrency in OTP/BEAM. They look like they do, but they’ll often resort to weird code that shows no understanding of how “actors” work. It’s an imitation of understanding that averages all the concurrency code the model has seen in training. The end result is a huge amount of noise when those averages aren’t enough: guarding against things that won’t happen because they can’t, or actively introducing race conditions because the model doesn’t understand how message passing works.
Current frontier models are really good at generating boilerplate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that. And it is a nice reminder that LLMs are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see it get better at discovering new results, solutions, and approaches to questions/problems, compared to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.
Mhh... my hunch is that part of this is that all Python keywords are 1 token, I assume. And for those very weird languages, tokenization might make it harder to reason over those tokens.
Would love to see how the benchmark results change if the esoteric languages are changed a bit so that they have 1-token keywords only.
Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but that's not the core issue if your brain is not wired correctly to reason.
I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.
LLMs already use mixture-of-experts models; if you ensure the neurons are all glued together, then (I think) you train language and reason simultaneously.
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.
Finally! This is a really obvious test-case that I've wondered about myself, and have seen many casual skeptics and cautiously optimistic people independently raising for several years now. When megacorp is not crowing about such a test, the silence is deafening, and it was practically guaranteed that they tested, didn't like the results, and didn't publish.
I'm still surprised it took this long for academics to try it, and skimming cites, I don't see anything similar. Anyone know if this is the first paper to try this kind of thing, or just the first paper to put together an especially good suite of reusable benchies?
If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro, and then we gradually run out of esolangs to do honest testing with. SAT is a whole different animal admittedly, but comparable honest tests might involve just forcing models to use a randomly generated but easily checked EBNF grammar? I don't have a quick link to the relevant papers, but afaik results on benchmarks of strict adherence to non-simple JSON schemas are also still pretty bad, and we're just working around it with lots of retries/tokens. "But look how well it works for 10k lines of Kubernetes manifests!" Well yeah, maybe, but it barely needs to really follow a schema since that is more stuff that's in the training set..
> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro
https://x.com/lossfunk/status/2034637505916792886
"After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned"
A little harness engineering was enough!
> Stay tuned
Never heard that before! But ok, it seems like this entity is affiliated with the paper, I'm interested..
> A little harness engineering was enough
Enough for what? It's not enough to crush the benchmark if that just means showing it is feasible to generate esolang code. No one cares about that if we're using it as a proxy to investigate general reasoning. Given validation/execution feedback loops, and 1000 retries for hello-world where we succeed with trial and error, the case for reasoning still wouldn't look great.
Suppose it's way better than that though; maybe trials are few and show clear logical progression. Well, we needed a harness, and that's still damning for whether and to what extent models can reason. But with harnesses at least we have a way to do general reasoning well enough on novel problems, right?
> mimic how humans would learn to solve problems in esoteric languages
Well hold on, does the harness do that, or does it enable models to do reasoning? We've retreated back towards solving that thing we weren't actually interested in..
I don’t have much confidence in the premise. Where was the human control? I think most Python programmers when tasked with “now do it in brainfuck” would fail. There is not much meaningful overlap in how to express intent and solutions to problems. The ridiculous syntax is the joke.
But more importantly, I don’t have to solve any problems with languages that are elaborate practical jokes, so I’m not worried about the implications for an LLM's ability to be useful.
The point here is to test for "genuine reasoning" or something approaching it. If a model is truly reasoning it should be competent even in a new language you just made up (provided the language itself is competently designed)
So humans don't do "genuine reasoning"?
> I don’t have to solve any problems with languages that are elaborate practical jokes
This is just being needlessly dismissive. Esolangs are (and have been) an area of active CS research for decades. I know I'm a bit of an esolang nerd, and while some are jokes, most focus on specific paradigms (e.g. Piet is visual, bf is a Turing tarpit, etc.).
> I think most Python programmers when tasked with “now do it in brainfuck” would fail.
This is untrue. Given internet-level awareness and infinite time, virtually all developers should be able to go from Python to brainfuck (trivially, I might add.) Did you even look at the test sets? It's all pretty basic stuff (palindromes, array traversal, etc.—we aren't using pandas here). I mean, sure, it would take forever and be mega annoying, but manipulating a head and some tape is hardly difficult.
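For anyone who hasn't looked at it, the whole "head and some tape" model fits in a few dozen lines. Here is a minimal sketch of a Brainfuck interpreter in Python. It is not taken from the paper or its harness; the 30,000-cell tape, wrapping 8-bit cells, and reading 0 at end of input are common conventions I'm assuming, and `run_bf` is just a name picked for the example.

```python
def run_bf(program: str, stdin: str = "", tape_len: int = 30_000) -> str:
    """Run a Brainfuck program and return everything it printed."""
    # Pre-compute matching bracket positions so loops are direct jumps.
    jump, stack = {}, []
    for pos, op in enumerate(program):
        if op == "[":
            stack.append(pos)
        elif op == "]":
            start = stack.pop()
            jump[start], jump[pos] = pos, start

    tape = [0] * tape_len   # the tape: fixed-size array of byte cells
    head = 0                # the head: index of the current cell
    pc = 0                  # position in the program text
    inp = iter(stdin)
    out = []
    while pc < len(program):
        op = program[pc]
        if op == ">":
            head += 1
        elif op == "<":
            head -= 1
        elif op == "+":
            tape[head] = (tape[head] + 1) % 256
        elif op == "-":
            tape[head] = (tape[head] - 1) % 256
        elif op == ".":
            out.append(chr(tape[head]))
        elif op == ",":
            # Assumed convention: end of input reads as 0.
            tape[head] = ord(next(inp, "\0")) % 256
        elif op == "[" and tape[head] == 0:
            pc = jump[pc]   # skip the loop body
        elif op == "]" and tape[head] != 0:
            pc = jump[pc]   # go back to the loop start
        pc += 1             # any other character is a comment
    return "".join(out)


print(run_bf(",[.,]", "abc"))                 # abc
print(run_bf("++++++++[>++++++++<-]>+."))     # A
```

The second program computes 8 * 8 + 1 = 65 in a cell and prints it as a character, which gives a feel for how tedious even small tasks get.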
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.
I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
Unlike AI, you aren't able to regurgitate entire programs and patterns you've seen before.
AI's capacity for memorisation is unrivaled, I find it mind blowing that you can download a tiny ~4 GB model and it will have vastly more general knowledge than an average human (considering that the human is more likely to be wrong if you ask them trivia about e.g. the Spanish Civil War).
But the average human still has actual reasoning capabilities, which is still (I think?) a debated point with AI.
> which is still (I think?) a debated point with AI.
It's not, people misread an Apple study and it became a meme. It lost currency as a meme because it is impossible to use a model in 2026 and come away with the idea it cannot reason, for any reasonable definition of the word reason (pun intended). Most of the debate from there is just people misreading each other and imagining incentive structures at play. (I am not claiming they are never stupid, e.g. the car wash dilemma, but I am claiming they're gee-whiz enough at enough things that it's become de facto beyond honest debate.)
> AI's capacity for memorisation is unrivaled,
Much like "it just memorizes training data", "memorization" has a kernel of truth to it. But memorizing does not imply that the model must have 100% "learned" brainfuck (for some definition of "learned" similar to "guaranteed 100% reproducible, translatable computation") to the point where it's just as easy as writing any other program, and that if it hasn't, it cannot reason.
At the end of the day these are just mathematical objects. And while it's not discourse-contributing, the mundane truth is, those matmuls born from boring curve-fitting at scale know/memorized/can reason about/can parrot/have adjusted the float32s in such a way that it produces C a lot better than Brainfuck. Much like us. But they're just matmuls curve-fitting at scale.
> and come away with the idea it cannot reason
Reasoning and the "appearance" of reasoning are two different things. Some people intrinsically understand this. And some do not, and those people can never be made to understand it. I think it is one of those things that you either get automatically, or do not get at all.
So does a human engaged in rationalization or confabulation just appear to reason? We might be closer to these machines than you think, and I don’t mean that in a positive way.
Not OP, but as an LLM skeptic, I'd absolutely say that humans are natively very poor reasoners.
With effort, support, and resources, we can learn to reason well from first principles - call it reaching "intellectual maturity."
Catch an emotionally-immature human in a mistake or conflicting set of beliefs, and you'll be able to see them do exactly what you describe above: rationalize, deflect, and twist the data to support a more emotionally-comfortable narrative.
That usually holds even for intellectually-mature individuals who have not yet matured emotionally, even though they may reason quite well when the stakes are low.
Humans that have matured both emotionally and intellectually, however, are often able to keep themselves stable and reason well even in difficult circumstances.
The ways LLMs consistently fail spectacularly on out-of-distribution problems (like these esolangs) do seem to suggest they don't really mature intellectually, not the way humans can.
Maybe the Wiggum loop strategy shows otherwise? I'm not sure I know.
To me, it smells more like brute-forcing through to a result without fully understanding the problem, though.
Just look at what kinds of problems are in the easy task set (hello world, echo line, count vowels, etc.). With best being ~10% of total in brainfuck this is 10 out of 20. You can google more solutions to these problems than that.
It's pointless to argue, we exist in a world of "this technology will usher in the singularity" versus "this tech is useful but come on"
The singularity crowd has never listened to reason and never will.
Yeah there seem to be two axes here.
Esolang vs mainstream paradigm.
Popular vs scarce training data.
So you'd want to control for training data (e.g. brainfuck vs Odin?)
And ideally you'd control by getting it down to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.
I think that would be a useful benchmark for other reasons. It would measure the LLMs' ability to "learn" on the spot. From what I understand, this remains an underdeveloped area of their intelligence. (And may not be solvable with current architectures.)
> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
It doesn't even prove the models do that. The RLVR environments being mostly Python isn't "training data memorization". That's just the kind of dumb thing people say to sound savvy.
Try MUMPS, widely used but little training data online. Probably less than some esolangs
Frontier models have gotten much better at ObjectScript (the InterSystems evolution of MUMPS/M).
Palindrome:
https://chatgpt.com/s/t_69bc8d8c116c8191a339a33f0fbcc935
This is a noticeable improvement from a year ago.
I wish it would use Return instead of Quit but that’s a stochastic parrot for you.
> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.
Particularly if you'd already read approximately all written material in existence about those languages. Many humans are capable of learning a language from the documentation.
I had similar experiences with an unpopular but not "esoteric" language (Progress ABL) and so did some other developers in my team.
I don’t know your background, but I suspect that if you were given sufficient motivation, you could solve these problems in an esoteric language. It might be tedious, but I suspect that almost anyone with an undergraduate degree in computer science and sufficient experience in a couple of programming languages could meet the task.
I did something very similar last year, but with programming languages that were REALLY out of distribution; they were generated specifically for the benchmark. I call it TiānshūBench (天书Bench): https://jeepytea.github.io/general/introduction/2025/05/29/t...
Some models were OK at solving very simple problems, but nearly all of them would, for example, hallucinate control structures that did not exist in the target language.
A few months ago I created a little API to help with an obnoxious case in FFI [0] in an extremely esoteric language known as Python. It was straightforward, it had a fully typed signature, and I fully documented it. (And the entire implementation was only 50 lines or so of intentionally very straightforward code.) The LLM (Codex 5.2 IIRC) could not manage to call the function with the right arguments even after multiple rounds of prompting.
Sometimes I think LLMs are unbelievably, amazingly good at things. And sometimes I’m deeply suspicious that they're really not very smart, and this was an example of the latter.
[0] Python calling to C, passing a callback function pointer and a void *opaque that C will pass back to the callback. Short of writing an extension module, this is pretty much forced to go through an inherently nasty JIT codegen process in libffi, which is sort of tolerable, but you really don’t want to redo it for each object that gets opacified to void*. Codex passed a lambda, which did the nasty JIT thing every time. I wrote a little shim using weakref. Apparently no one has done this before, so Codex wasn’t trained on it, and it couldn’t make itself call the function. Maybe I should post it to PyPI.
I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
They are also weirdly bad at Brainfuck which is basically just a subset of C.
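To make the "subset of C" point concrete, here is a minimal sketch of a Brainfuck interpreter, with the usual C equivalent of each of the eight commands noted in the comments. The function name `run_bf`, the 30,000-cell tape, 8-bit wrapping cells, and EOF-reads-as-zero are conventions assumed here; implementations differ on these.

```python
def run_bf(program: str, stdin: str = "", tape_len: int = 30000) -> str:
    """Run a Brainfuck program and return everything it printed."""
    # Pre-match brackets so loops can jump in O(1).
    jump = {}
    stack = []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            j = stack.pop()
            jump[i], jump[j] = j, i
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * tape_len
    ptr = 0
    pc = 0
    inp = iter(stdin)
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1                                # ++ptr;
        elif ch == "<":
            ptr -= 1                                # --ptr;
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256       # ++*ptr;
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256       # --*ptr;
        elif ch == ".":
            out.append(chr(tape[ptr]))              # putchar(*ptr);
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # *ptr = getchar();
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]                           # while (*ptr) {
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]                           # }
        pc += 1  # any other character is a comment and is skipped
    return "".join(out)
```

For example, `run_bf("++++++++[>++++++++<-]>+.")` computes 8 * 8 + 1 = 65 in the second cell and prints `A`.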
Yeah well they also still struggle with "4 + 6 / 9" so I'm not sure why anyone is surprised with these findings
BF involves a lot of repeated symbols, which is hard for tokenized models. Same problem as r's in strawberry.
Interesting. So why do the models seem to handle deeply nested Lisp expressions just fine?
Probably because there's a ton of code that deals with nested parentheses across languages in the training data, and models have learned how to work around tokenization limitations, when it comes to parentheses.
I have encountered the opposite of this. All of the latest pro tier models are still fighting for their lives to use powershell correctly, really basic things like quotes, escaping, heredocs. Doesn't matter what I put in agents.md or instruct it to do. You just have to accept the token tax of it stomping on rakes until it figures it out itself and then keep using that session.
It's bad enough that I've considered writing some sort of cursed bash->posh translation layer
Yet it has no issues at all implementing and then writing slopjective-c 3.0
Opus 4.6 has gotten pretty good at writing Powershell.
It’s the first model where I didn’t have to ask, repeatedly, that it use Powershell 5, and never use emojis or other invalid characters, like Gemini and those non-ASCII spaces.
(founder of Lossfunk, the lab behind this research.)
Esolang-Bench went viral on X and a lot of discussion ensued, so I'm addressing some of the common points that came up. Hope it helps.
a) Why do it? Does it measure anything useful?
It was a curiosity-driven project. We're interested in how humans exhibit sample-efficiency in learning and OOD generalization. So we simply asked: if models can zero/few shot correct answers for simple programming problems in Python, can they do the same in esoteric languages as well?
The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.
b) But humans can't write esoteric languages well either. It's an unfair comparison.
Primarily, we're interested in measuring LLM capabilities. With the talk of ASI, it is supposed that their capabilities will soon be super-human. So, our primary motivation wasn't to compare to humans but to check what they can do on this difficult-by-construction benchmark.
However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. So this benchmark was to set a starting point to explore how AI systems can do the same as well (which is what we're exploring now)
c) But Claude Code crushes it. You limited models artificially.
Yes, we tested models' zero- and few-shot capabilities. And in the agentic loop we describe in the paper, we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way.
After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better.
The question that's relevant is what makes these models perform so well when you give them tools and iterations v/s when you don't. Are they reasoning / learning like humans or is it something else?
d) So, are LLMs hyped? Or is our study clickbait?
The paper, code and benchmark are all open source.
We encourage whoever is interested to read it, and make up their own minds.
(We couldn't help but notice that the same set of results was interpreted wildly differently within the community. A debate between opposing camps of LLMs ensued. Perhaps that's a good thing?)
I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.
Does this imply LLMs will not work well on novel reasoning problems?
Yep, that's the implication. Anecdotally this is obvious to me. I'm using LLMs to write Java and C++, and they can churn out generic plumbing with no issues, but for novel code for a novel implementation of a novel idea, they have no idea what they're doing.
I'm getting good productivity gains, but it requires a lot of hand holding because AI does not know what it's doing.
On far less novel problems I get far better results.
ARC-AGI is already testing that.
I guess if you tell codex to build a transpiler from a subset of python to brainfuck, then solve in that subset of python, it would work much better. Would that be cheating?
Doing something like that is basically the only way to write unlambda: you start with a lambda calculus (or scheme or whatever) and reduce the lambdas away mechanically. (This is in the unlambda docs!)
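As a rough illustration of that mechanical reduction, here is a minimal sketch of textbook bracket abstraction: it rewrites a closed lambda term into the S, K, and I combinators and prints it in Unlambda's backtick notation. The tuple encoding (`("var", x)`, `("lam", x, body)`, `("app", f, a)`) and the function names are assumptions made for this example, and it is the unoptimized version, so output grows quickly for larger terms.

```python
def free_in(name, term):
    """True if variable `name` occurs free in `term`."""
    if isinstance(term, str):  # a combinator: "s", "k" or "i"
        return False
    tag = term[0]
    if tag == "var":
        return term[1] == name
    if tag == "app":
        return free_in(name, term[1]) or free_in(name, term[2])
    # tag == "lam": an inner binder with the same name shadows it
    return term[1] != name and free_in(name, term[2])


def abstract(name, term):
    """Abstract `name` out of a lambda-free term."""
    if term == ("var", name):
        return "i"                     # [x] x     = I
    if not free_in(name, term):
        return ("app", "k", term)      # [x] E     = K E   (x not in E)
    _, f, a = term                     # [x] (F A) = S ([x] F) ([x] A)
    return ("app", ("app", "s", abstract(name, f)), abstract(name, a))


def eliminate(term):
    """Remove every lambda from `term`, innermost first."""
    if isinstance(term, str) or term[0] == "var":
        return term
    if term[0] == "app":
        return ("app", eliminate(term[1]), eliminate(term[2]))
    _, name, body = term
    return abstract(name, eliminate(body))


def to_unlambda(term):
    """Render a combinator term using Unlambda's prefix application."""
    if isinstance(term, str):
        return term
    if term[0] == "var":
        raise ValueError(f"free variable {term[1]!r} has no Unlambda form")
    return "`" + to_unlambda(term[1]) + to_unlambda(term[2])
```

For example, the constant function `λx.λy.x`, encoded as `("lam", "x", ("lam", "y", ("var", "x")))`, comes out as ``` ``s`kki ``` after `to_unlambda(eliminate(...))`.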
I bet I can do better by allowing this: the llm can pull documentation of the language from the web to understand how it works.
If the llm has “skills” for that language, it will definitely increase accuracy.
I am not surprised by this, and am glad to see a test like this. One thing that keeps popping up for me when using LLMs is the lack of actual understanding. I write Elixir primarily and I can say without a doubt that none of the frontier models understand concurrency in OTP/BEAM. They look like they do, but they’ll often resort to weird code that doesn’t understand how “actors” work. It’s an imitation of understanding that is averaging all the concurrency code it has seen in training. The end result is a huge amount of noise when those averages aren’t enough: they guard against things that won’t happen, because they can’t, or they actively introduce race conditions because they don’t understand how message passing works.
Current frontier models are really good at generating boilerplate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that. And it's a nice reminder that LLMs are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see them get better at discovering new results, solutions, and approaches to questions/problems. Compare that to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.
Mhh... my hunch is that part of this is that all python keywords are 1 token, I assume. And for those very weird languages, tokenizing might make it harder to reason over those tokens.
Would love to see how the benchmark results change if the esoteric languages are changed a bit to make them have 1-token keywords only.
Considering that brainfuck only has 8 characters and models are scoring at 6.2% I don't think tokenization is the issue
The only issue. *
Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but that's not the core issue if your brain is not wired correctly to reason.
I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.
LLMs already use mixture of experts models, if you ensure the neurons are all glued together then (i think) you train language and reason simultaneously
"Genuine Reasoning"