It is weird to read because they bring up many things a lot of people have been critiquing for years.
> But as impressive as these feats are, they obscure a simple truth: being a "test-taker" is not what most people need from an AI.
> In all these cases, humans aren't relying solely on a fixed body of knowledge learned years ago. We are learning, in real-time, from the context right in front of us.
> To bridge this gap, we must fundamentally change our optimization direction.
I'm glad the conversation is changing, but it's been a bit frustrating that when these issues were brought up, people blindly pointed to benchmarks. It made doing this type of research difficult (enough to cause many to be pushed out). Then it feels weird to say "harder than we thought" because, well... truthfully, they even state why this result should be expected:
> They rely primarily on parametric knowledge—information compressed into their weights during massive pre-training runs. At inference time, they function largely by recalling this static, internal memory, rather than actively learning from new information provided in the moment.
And that's only a fraction of the story. Online algorithms aren't enough. You still need a fundamental structure to codify and compress information, determine what needs to be updated (as in what is low confidence), to actively seek out new information to update that confidence, make hypotheses, and so so much more.
So I hope the conversation keeps going in a positive direction, but I hope we don't just get caught in an "RL will solve everything" trap. RL is definitely a necessary component and no doubt it will result in improvements, but it also isn't enough. It's really hard to do deep introspection into how you think. It's like trying to measure your measuring stick with your measuring stick. It's so easy to just get caught up in oversimplification, and it seems like the brain wants to avoid that introspection. To quote Feynman: "The first principle is that you must not fool yourself, and you are the easiest person to fool." It's even easier when things are exciting. It's so easy because you have evidence for your beliefs (like I said, RL will make improvements). It's so easy because you're smart, and smart enough to fool yourself. So I hope we can learn a bigger lesson: learning isn't easy, and scale is not enough. I really do think we'll get to AGI, but it's going to be a long, bumpy road if we keep putting all our eggs in one basket and hoping there are simple solutions.
The problem is even more fundamental: Today's models stop learning once they're deployed to production.
There's pretraining, training, and finetuning, during which model parameters are updated.
Then there's inference, during which the model is frozen. "In-context learning" doesn't update the model.
We need models that keep on learning (updating their parameters) forever.
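To make the frozen-vs-updated distinction concrete, here's a toy sketch. It is not a real LLM: `TinyModel`, `predict`, and `train_step` are made-up names, and the "in-context" branch is just a stand-in that reads examples from the prompt. The point is only that inference can use the context without ever writing to the parameters, while a training step does write to them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TinyModel:
    """Hypothetical one-parameter model: prediction = weight * x."""

    weight: float = 0.0

    def predict(self, x: float, context: Optional[list[tuple[float, float]]] = None) -> float:
        # Inference: the context can shape the output, but self.weight is never written.
        if context:
            # Stand-in for in-context learning: estimate the slope from prompt examples.
            slope = sum(y_i / x_i for x_i, y_i in context) / len(context)
            return slope * x
        return self.weight * x

    def train_step(self, x: float, y: float, lr: float = 0.1) -> None:
        # Training: one gradient step on squared error, which does change self.weight.
        error = self.weight * x - y
        self.weight -= lr * 2 * error * x


model = TinyModel()
examples = [(1.0, 3.0), (2.0, 6.0)]  # (x, y) pairs following y = 3x

in_context_answer = model.predict(4.0, context=examples)
weight_after_inference = model.weight  # unchanged: nothing was learned permanently

model.train_step(1.0, 3.0)
weight_after_training = model.weight  # moved toward 3.0
```

Drop the context on the next call and the "knowledge" is gone, which is the complaint: only `train_step` leaves a lasting trace.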
The key seems to be that you take the transcript of a model working within a problem domain that it’s not yet good at, or where the context doesn’t match its original training, and then you continually retrain it based on its efforts and guidance from a human or other expert. You end up with a specialty model in a given domain that keeps getting better at that domain, just like a human.
The hard part is likely when someone proves that some “fact” which the model knows, and has had reinforced by this training, is no longer true. The model will take time to “come around” to understand this new situation. But this isn’t unlike the general populace. At scale, humans accept new things slowly.
> But this isn’t unlike the general populace. At scale, humans accept new things slowly.
Right, the model works like humans at scale. Not like a human who reads the actual paper disproving the fact they thought was correct and is able to adapt. True, not every human manages to do that (science advancing one death at a time), but some can.
But since the model is a statistical one, it works like humans at scale.
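A toy way to see the "takes time to come around" effect, assuming a purely count-based belief (this is an illustration of a statistical estimate, not how LLM training actually works; `Belief` and `refutations_needed` are hypothetical names): the more a "fact" has been reinforced, the more contradicting examples it takes to flip the estimate.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    """Confidence in a "fact" as a ratio of supporting to total observations."""

    confirmations: int = 1  # pseudo-counts so the estimate starts at 0.5
    refutations: int = 1

    @property
    def confidence(self) -> float:
        return self.confirmations / (self.confirmations + self.refutations)

    def observe(self, supports_fact: bool) -> None:
        if supports_fact:
            self.confirmations += 1
        else:
            self.refutations += 1


def refutations_needed(belief: Belief) -> int:
    """Count contradicting examples needed before confidence drops below 0.5."""
    steps = 0
    while belief.confidence >= 0.5:
        belief.observe(False)
        steps += 1
    return steps


lightly_reinforced = Belief()
for _ in range(5):
    lightly_reinforced.observe(True)

heavily_reinforced = Belief()
for _ in range(500):
    heavily_reinforced.observe(True)

light_steps = refutations_needed(lightly_reinforced)
heavy_steps = refutations_needed(heavily_reinforced)
```

A single human reading the disproving paper is closer to overwriting the belief in one step; the count-based version has no way to weight one decisive observation over hundreds of stale ones.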
Context learning means learning facts or rules without pre-training. They are two distinct phases.
An interesting question is: if pre-trained specialized models are available for the thousand or ten thousand most common tasks humans do every day, of what use would a general model be?
Hmm.. I looked at the benchmark set.
I'm conflicted. I don't know that I would necessarily want a model to pass all of these. Here is the fundamental problem. They are putting the rules and foundational context in "user" messages.
Essentially, I don't think you want to train the models on full compliance with user messages; they are essentially "untrusted" content from a system/model perspective, or at least not generally "fully authoritative".
This creates a tension with safety training, truthfulness training, etc.
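As a rough sketch of what "not fully authoritative" could mean in practice: rules supplied in a user message are adopted only where they don't conflict with a higher-trust source. Everything here is an assumption for illustration (the `TRUST` levels, the `Message` shape, and `effective_rules` are invented, and no provider's actual policy is implied).

```python
from dataclasses import dataclass

# Assumed trust levels for this sketch only; real providers define their own policies.
TRUST = {"system": 2, "user": 1}


@dataclass(frozen=True)
class Message:
    role: str
    rules: dict[str, str]


def effective_rules(messages: list[Message]) -> dict[str, str]:
    """Merge rules so a higher-trust role wins conflicts; ties go to the later message."""
    merged: dict[str, str] = {}
    trust_of: dict[str, int] = {}
    for message in messages:
        level = TRUST[message.role]
        for key, value in message.rules.items():
            if key not in merged or level >= trust_of[key]:
                merged[key] = value
                trust_of[key] = level
    return merged


conversation = [
    Message("system", {"units": "metric"}),
    Message("user", {"units": "imperial", "currency": "fictional credits"}),
]
resolved = effective_rules(conversation)
```

Here the user's made-up currency rule is learned from context, but the conflicting units rule is not. A benchmark that scores full compliance with user-supplied rules would mark that second behaviour as a failure, which is the tension.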
Sure, but the opposite end of the spectrum (which LLM providers have tended toward) is treating the training/feedback weights as "fully authoritative", which comes with its own questions about truth and excessive homogeneity.
Ultimately I think we end up with the same sort of considerations that are wrestled with in any society - freedom of speech, paradox of tolerance, etc. In other words, where do you draw lines between beneficial and harmful heterodox outputs?
I think AI companies overly indexing toward the safety side of things is probably more correct, in both a moral and strategic sense, but there's definitely a risk of stagnation through recursive reinforcement.
Isn’t that what fine tuning does anyway?
The article is suggesting that there should be a way for the LLM to update its weights on the fly as it encounters new information, which would eliminate the need for manual fine-tuning.
LLMs of the future will need good data for proper context, but less and less of it is making it onto the internet. Unpublished data stores like Discord or meeting recordings are going to be the only way forward. How else can you get up-to-date information except by being where the people are?
Norms will shift, be prepared.
It's basically continual learning. This is beyond a hard problem; it's currently an impossible one. I know of no system that solves CL even at small scale, let alone for large models.
Annoyingly, they have SOME inherent capability to do it. It's really easy to get sucked down this path due to that glimmer of hope but the longer you play with it the more annoying it becomes.
SSI seems to be focused on this problem directly so maybe they discover something?
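One well-known way the difficulty shows up, even at the smallest possible scale, is that naive sequential updates overwrite what was learned earlier (usually called catastrophic forgetting). A minimal sketch, assuming a one-weight linear model and two made-up tasks (`TASK_A_SLOPE`, `TASK_B_SLOPE`, and the helper names are all hypothetical):

```python
def train(weight, slope, xs, epochs=50, lr=0.1):
    """Fit y = slope * x with plain SGD on squared error, starting from `weight`."""
    for _ in range(epochs):
        for x in xs:
            error = weight * x - slope * x
            weight -= lr * 2 * error * x
    return weight


def loss(weight, slope, xs):
    """Mean squared error of the model against the task y = slope * x."""
    return sum((weight * x - slope * x) ** 2 for x in xs) / len(xs)


XS = (0.5, 1.0, 1.5)
TASK_A_SLOPE, TASK_B_SLOPE = 2.0, -1.0  # two conflicting toy tasks

weight = train(0.0, TASK_A_SLOPE, XS)
loss_a_before = loss(weight, TASK_A_SLOPE, XS)  # near zero: task A is learned

weight = train(weight, TASK_B_SLOPE, XS)  # keep learning, now on task B only
loss_a_after = loss(weight, TASK_A_SLOPE, XS)  # task A performance is gone
loss_b_after = loss(weight, TASK_B_SLOPE, XS)
```

A single shared weight makes the conflict unavoidable here; larger models have more room, which is presumably where the "SOME inherent capability" comes from, but the same pull between old and new data remains.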
Bit by bit, we need to figure out how to rebuild human contextual understanding in a way that LLMs can understand. One thing that gets overlooked is the problem if incorrect data. You can provide all of the context in the world but LLMs tend to choke on contradictions or, at the minimum, work a whole lot harder to determine how to ignore or work around incorrect facts.
"Forgetting" and "ignoring" are hugely valuable skills when building context.
> the problem if incorrect data.
Was the typo intentional? :)
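On the "forgetting" and "ignoring" point above: a minimal sketch of pruning contradictions before they reach the model, assuming facts arrive as keyed statements and that the newest statement wins. That recency rule is an assumption (newer is not always more correct), and `Fact` and `prune_context` are hypothetical names.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    key: str    # what the statement is about
    value: str  # what it claims
    turn: int   # when it was stated


def prune_context(facts: list[Fact]) -> list[Fact]:
    """Keep only the most recent statement for each key, dropping superseded ones."""
    latest: dict[str, Fact] = {}
    for fact in facts:
        if fact.key not in latest or fact.turn >= latest[fact.key].turn:
            latest[fact.key] = fact
    return sorted(latest.values(), key=lambda fact: fact.turn)


history = [
    Fact("deadline", "Friday", turn=1),
    Fact("owner", "Ana", turn=2),
    Fact("deadline", "Monday", turn=3),  # contradicts turn 1
]
pruned = prune_context(history)
```

The model then sees one deadline instead of having to work out which of two to ignore.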
Because we don't experience reality through language but direct sensory perception. Language is arbitrary bird song and visual representations dragged forward from history, accepted definitions never uniformly distributed.
Testing based on contextual correctness makes no sense when there is no center to the universe. No "one true context to rule them all".
We learn from hands-on sensory experiences. Our bodies store knowledge independent of the brain; this is often referred to as muscle memory.
Gabe Newell mentioned this years ago; our brain is only great at some things like language and vision processing but the rest of our body is involved in sensory information processing too: https://en.wikiquote.org/wiki/Gabe_Newell
The most potent evidence that the brain is not the center of the universe we commonly think it to be is the patient with 90% of their skull filled with fluid who still carried out a typical first-world life: https://www.sciencealert.com/a-man-who-lives-without-90-of-h...
States are banning a reading education framework that's been linked to lower literacy scores in younger generations; 3-cueing relies on establishing correctness via context assessment: https://www.edweek.org/teaching-learning/more-states-are-tak...
"Establishing context" is a euphemism for "arguing semantics".
Putting the brain at the root of human intelligence is a relic of hierarchical and taxonomical models. There are no natural hierarchies.
"Because we don't experience reality through language but direct sensory perception"
That statement is patently false. We know that language influences our senses to a degree where we are unable to perceive things if our language doesn’t have a word for it, and will see different things as being equal if our language uses the same word for both.
There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
If you're referring to the Himba experiment (or one of the news or blog posts tracing back to it), the outcome was far less decisive than you're implying. Language showed an impact on perception time of color differences, not a complete inability to distinguish.
https://languagelog.ldc.upenn.edu/nll/?p=18237 https://www.sciencedirect.com/science/article/abs/pii/S00100...
"There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors."
Both of the above are false. There are a ton of different colors that I happen to call "red"; that does not mean that I can't perceive them as different. That I don't call them "different colors" is completely irrelevant. And unable to perceive blue and white as different colors? (Maybe that was a joke?) Even speakers of a hypothetical language which used only a single word, say "color", for every non-black item would be able to perceive the difference with zero problems.
Japanese use "aoi" for a set of colors which in English would be separated into "blue" and "green". I can assure you (from personal experience) that every Japanese speaker with a fully functioning visual system is perfectly able to perceive the difference between, in this case, blue and green as we would call them.
There's a Terence McKenna quote about this:
> So, for instance, you know, I’ve made this example before: a child lying in a crib and a hummingbird comes into the room and the child is ecstatic because this shimmering iridescence of movement and sound and attention, it’s just wonderful. I mean, it is an instantaneous miracle when placed against the background of the dull wallpaper of the nursery and so forth. But, then, mother or nanny or someone comes in and says, “It’s a bird, baby. Bird. Bird!” And, this takes this linguistic piece of mosaic tile, and o- places it over the miracle, and glues it down with the epoxy of syntactical momentum, and, from now on, the miracle is confined within the meaning of the word. And, by the time a child is four or five or six, there- no light shines through. They're- they have tiled over every aspect of reality with a linguistic association that blunts it, limits it, and confines it within cultural expectation.
and what is this quote supposed to explain?
that language prevents a child from learning nuance? sounds like nonsense to me. a child first learns broad categories. for example some children as they learn to speak think every male person is dad. then they recognize everyone with a beard is dad, because dad has a beard. and only later they learn to differentiate that dad is only one particular person. same goes for the bird. first we learn that everything with wings is a bird, and later we learn the specific names for each bird. this quote makes an absurd claim.
Only after we acquire language from sensory experience first.
It need not be language as we know it that fosters those outcomes either.
What you describe is reinforcement education, which can be achieved without our language. Without the word "blue" we can still see the portion of the visible light spectrum that we associate with that specific word.
> Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
You really think they can't see clouds in the sky because they have the same word for white and blue? I think you take those studies as saying more than they said.
We do adapt our perception a little bit to fit what we need for our everyday life, not for language but for what's useful to us. Language matches what people need to talk about, not the other way around. If a culture's language doesn't differentiate between blue and green, it's because they never needed to.
Don't always trust everything you read in papers. Researchers are usually under incredible pressure to publish something, anything. Wait a few years and see if the paper survives the test of time. LLMs work reasonably fine for me in new domains.
wasn't in-context learning an emergent behavior a while ago (1-2 years)?
This is quite on brand for China. I think they are experts at reverse engineering and learning 'from context' rather than by formal consumption of foreign training material.
The fictional training data with a made-up country and laws was a very interesting experiment design. I can imagine that's how they approach doing business with other countries: like an alien, made-up system they have to learn on the spot.