Here is another (Jupyter based) notebook solution supporting LLaMA models: https://raku.land/zef:antononcube/Jupyter::Chatbook .
Here is a demo movie: https://youtu.be/zVX-SqRfFPA
The more I listen to NotebookLM “episodes”, the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone. The two speakers interrupt and speak over each other in an uncannily humanlike manner. I wonder whether they basically fine tuned against a huge library of actual podcasts along with the podcast transcripts and perhaps generated synthetic “input material” from the transcripts to feed in as training samples.
In other words, take an episode of The Daily and have one language model write a hypothetical article that would summarize what the podcast was about. And then pass that article into the two-speaker model, transcribe the output, and see how well that transcript aligns with the article fed in as input.
I am sure I’m missing essential details, but the natural sound of these podcasts cannot possibly be coming from a text transcript.
Following up on swyx, the TTS is probably Google finally releasing Soundstorm from the basement.
https://google-research.github.io/seanet/soundstorm/examples...
> the more I am convinced that Google has trained a two-speaker “podcast discussion” model that directly generates the podcast off the back of an existing multimodal backbone.
I have good and bad news for you - they did not! We were the first podcast to interview the audio engineer who led the audio model:
https://www.latent.space/p/notebooklm
TLDR they did confirm that the transcript and the audio are generated separately, but yes the TTS model is trained far beyond anything we have in OSS or commercially available
Soundstorm is probably the TTS https://google-research.github.io/seanet/soundstorm/examples...
They didn't confirm or deny this in the episode - all I can say is there are about 1-2 years of additional research that went into NotebookLM's TTS. SoundStorm is more of an efficiency paper imo.
Really good catch. Ty.
Thank you swyx. How did I miss this episode?
did you LIKE and SUBSCRIBE?? :)
I feel similarly about NotebookLM, but have noticed one odd thing - occasionally Host A will be speaking, and suddenly Host B will complete their sentence. And usually when this happens, it's in a way that doesn't make sense, because Host A was just explaining something to or answering a question of Host B.
I'm actually not sure what to make of that, but it's interesting to note
That's the annoying part about NLM. It ruins the illusion of having one person explaining it to the other person.
It's speaker diarisation: the quality of the resulting labelling and the speaker-end marker tokens is what influences the rhythm of the conversation. (Or the input data just has many podcast hosts completing each other's... sandwiches?)
I think this is an important tell: it betrays that there are no two minds here creating 1+1=3.
One cheap trick to overcome this uncanny valley may be to actually use two separate LLMs or two separate contexts / channels to generate the conversations and take "turns" to generate the followup responses and even interruptions if warranted.
Might mimic a human conversation more closely.
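The two-context idea above can be sketched in a few lines. This is a hypothetical illustration, not how NotebookLM actually works: each "host" keeps its own history, so neither model sees the other's unfinished thought, and `generate_reply` is a placeholder for whatever chat-completion call you'd plug in.

```python
from typing import Callable, List, Tuple

def two_host_dialogue(
    generate_reply: Callable[[str, List[str]], str],
    source_material: str,
    num_turns: int = 6,
) -> List[Tuple[str, str]]:
    """Generate a dialogue from two separate conversation contexts."""
    transcript: List[Tuple[str, str]] = []
    histories = {"Host A": [], "Host B": []}  # one context per speaker
    speaker, listener = "Host A", "Host B"
    for _ in range(num_turns):
        # The speaker only sees its own history (its lines and what it heard).
        line = generate_reply(source_material, histories[speaker])
        transcript.append((speaker, line))
        # Each side records the line from its own perspective.
        histories[speaker].append(f"me: {line}")
        histories[listener].append(f"them: {line}")
        speaker, listener = listener, speaker  # take turns
    return transcript
```

With a real model you would additionally give each context a distinct persona prompt; interruptions could be modelled by letting the listener cut a long reply short.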
Funnily, even two different LLMs, when put in conversation with each other, can end up completing each other's sentence. I guess it has something to do with the sequence prediction training objective.
And this regularly happens with humans too
Those moments always make me think they’re going for a scripted conversation style where the “learner” is picking up the thread too quickly and interjecting their epiphany inline for the benefit of the listener.
This is in fact pretty explicitly not open source: https://github.com/meta-llama/llama-recipes/blob/d83d0ae7f5c...
(And given there is no LICENSE file, I’m afraid you can only use this code as reference at best right now)
It doesn’t look that useful to use as is. But the approach they are investigating is clear and well documented in plain text. Seems like a valid contribution to public knowledge to be grateful for, even if it can’t be used verbatim.
I would just hope that they stop disingenuously promoting these kinds of things as open source
This content could easily be a blog post and worth a read. But it’s in notebook form to make it interactive.
It’s a tired comment narrative about these being about open source.
It’s literally the title in the README: https://github.com/meta-llama/llama-recipes/tree/main/recipe...
(Please note that the parent poster has edited their comment. Before edit, they implied that it was the OP who included the words “open source” in the HN post title.)
Also the text is enjoyably playful “For our GPU poor friends” and “let's proceed to justify our distaste for writing regex”
It might be a mistake since it's different from what's stated in their readme:
https://github.com/meta-llama/llama-models/blob/main/models/...
(which is referring to the license of Meta Llama 3.2)
Oh, I see the links now, thanks! But they reference four different licenses, and those are the licenses just for model weights I think?
If the intention was to make something that you can only use with Llama models, stating that clearly in a separate code license file would be better IMO. (Of course, this would also mean that the code still isn’t open source.)
Thanks but I will use it anyway.
Great to see this: Fellow tech-geeks, ignore the NotebookLM thing at your peril.
NotebookLM, far and away, has been the "AI Killer App" for the VAST MAJORITY of bright-but-not-particularly-techy people I know. My 70ish parents and my 8 year old kid are both just blown away by this thing and can't stop playing with it.
Edit: As someone pointed out below, I absolutely mean just the "podcast" thing.
As someone who doesn’t listen to podcasts what perils will I suffer from not making podcasts in notebookLM?
Yeah, I deliberately worded it that way because I would have said the same as you.
I don't really see MYSELF being into it, but it just seems to WOW the hell out of a lot of people.
I can understand why it's cool for a lot of people, but it's the opposite of a time saver to me: it's a time loser, if that's a word. It's the same as those videos that serve a purpose only because some people (and developers) are not able to read, or feel intimidated by walls of text. They are at a competitive disadvantage, only partially mitigated by having videos for even the smallest text page.
I don't get it. Are you saying "bright but not particularly techy" people can't read? What would I be missing out on by ignoring this just like I do every other podcast? I've literally never heard of someone learning anything from a podcast except scattered knowledge from another field that will never be useful.
Oh, probably nothing.
Again, I'm absolutely like you and I'm with you. I don't much do podcasts either, but in a way this is why I worded it like this. It struck me as a fun party trick to ignore, but it really seems to GRAB a lot of other people.
Are we talking about NotebookLM generally or specifically the podcast stunt?
Good question: I absolutely mean the podcast stunt.
Idk if I’d call it a killer app.
The podcasts are grating to listen to and usually only contain very surface information I could gain from a paper’s abstract.
It’s a wildly impressive technical achievement though.
The point being made is that while this may be grating for you. It is magic for a large part of the population. This combined with chatgpt advanced voice mode shows a direction of travel for AI agents. It makes it possible to imagine a world where everyone has personalized tutors and that world isn't very far away.
> It makes it possible to imagine a world where everyone has personalized tutors and that world isn't very far away.
My issue with AI hype is exactly this. Everything is “imagine if this were just enough better to be useful”
“Imagine if we had an everything machine”
“Imagine everyone having a personal assistant/artist/tutor/programmer”
“Imagine a world where finance is decentralized and we all truly own our digital stuff”
<rant>
I’m not much of a visionary, admittedly, but it’s exhausting being told to imagine products that only half exist now.
Having worked with LLMs in the autonomous agent space, I think we’re very far away from agents actually doing useful work consistently.
There are still so many problems to be solved around the nature of statistical models. And they’re hard problems where the solution, at least at the product level, boils down to “wait for a better model to come out”
I’m just tired of people imagining a future instead of building useful things today
</rant>
At any given time there are millions of children who will fall for the coin behind the ears trick. It's magic to this large part of the population. That doesn't make it a technique I need to evaluate for my professional practice, because I'm not a clown.
Ariana already has personalized tutors. Wikipedia, for example, is just arriving in different forms. You could argue chatbots are superior in many ways versus a podcast, where you can't just scan the information.
It does have a tendency to meander or spend too much time reflecting on a topic instead of distilling the details. However, the new ability to add a prompt improves this greatly.
Some instructions that worked for me:
- Specifics instead of high level
- Approach from non-critical perspective
- Don't be philosophical
- Use direct quotes often
- Focus on the details. Provide a lesson, not reflections
- Provide a 'sparknotes' style thorough understanding of the subject
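Instructions like the list above can also be folded programmatically into a system prompt for any script-writing LLM. The prompt wording below is my own sketch, not NotebookLM's internal format:

```python
# Steering rules adapted from the list above; edit to taste.
STYLE_RULES = [
    "Give specifics instead of high-level summary.",
    "Approach the material from a non-critical perspective.",
    "Don't be philosophical.",
    "Use direct quotes from the source often.",
    "Focus on the details: provide a lesson, not reflections.",
    "Aim for a 'sparknotes'-style thorough understanding of the subject.",
]

def build_system_prompt(rules=STYLE_RULES) -> str:
    """Assemble a system prompt that steers a podcast-script generator."""
    header = "You are writing a two-host podcast script. Follow these rules:\n"
    return header + "\n".join(f"- {r}" for r in rules)
```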
Oh, when was this added? I'll have to check it out.
Added about a week ago
Kaleidoscopes also offer mindless fun, I would rather suggest those.
You might just know very old non-tech people. But the non-tech people who will make up the bulk of future tech users are Gen Z, and they're definitely not on NotebookLM. They are on AI character chatbots.
No dispute there.
I tried to build something kind of like NotebookLM (personalized news podcasts) over the past months (https://www.tailoredpod.ai), but the biggest issue is that the existing good TTS APIs are so expensive that a product like NotebookLM is not really possible for a normal company without internal access to Google's models. OpenAI has the cheapest good-enough-quality TTS API, but even then, generating hours of audio for free is way too expensive.
Open Source TTS models are slowly catching up, but they still need beefy hardware (e.g. https://github.com/SWivid/F5-TTS)
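The cost argument is easy to sanity-check with a back-of-envelope estimate. Both constants below are illustrative assumptions, not quoted prices — check your provider's current rate card before relying on this:

```python
# ~150 spoken words per minute at ~6-7 characters per word (assumption).
CHARS_PER_AUDIO_MINUTE = 1000
# Assumed per-character API price in USD per million characters (placeholder).
USD_PER_MILLION_CHARS = 15.0

def tts_cost_usd(audio_minutes: float) -> float:
    """Estimate the API cost of synthesizing the given minutes of audio."""
    chars = audio_minutes * CHARS_PER_AUDIO_MINUTE
    return chars / 1_000_000 * USD_PER_MILLION_CHARS
```

Under these assumptions an hour of audio costs well under a dollar, but thousands of free users each generating multi-hour feeds turns that into a real bill quickly.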
You have users? If TTS is your bottleneck, I might be able to help. Email in bio.
When you say beefy? How much beef?
Pretty weird choice of TTS engines. None of them are anywhere near state of the art as far as open TTS systems go. XTTSv2 or the new F5-TTS would have been much better choices.
You can always update the code to use those. Meta releasing stuff on GitHub is not about shipping the best possible version but about giving a proof of concept. The licenses of those TTS systems matter; it's not enough for them to be good, they also have to be open. If this were a product for their users, they would definitely have better TTS.
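One way to make the engine swap painless, in the spirit of the suggestion above, is to hide TTS behind a minimal interface so XTTSv2, F5-TTS, or a hosted API can be dropped in without touching the rest of the pipeline. The interface and names here are my own sketch, not the repo's actual structure:

```python
from typing import List, Protocol, Tuple

class TTSEngine(Protocol):
    """Anything that can turn one line of dialogue into audio bytes."""

    def synthesize(self, text: str, speaker: str) -> bytes:
        ...

def render_transcript(engine: TTSEngine,
                      transcript: List[Tuple[str, str]]) -> bytes:
    # Naive concatenation of segments; a real pipeline would mix or
    # crossfade them into a single properly encoded audio file.
    return b"".join(engine.synthesize(line, speaker)
                    for speaker, line in transcript)
```

Because `TTSEngine` is a structural protocol, any object with a matching `synthesize` method works — no inheritance required.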
From improvements needed on the page:
"Speech Model experimentation: The TTS model is the limitation of how natural this will sound. This probably be improved with a better pipeline and with the help of someone more knowledgable-PRs are welcome! :)"
The "PRs are welcome" posture from a for-profit entity that actively harms minds while pretending to be open source gives me the heebie-jeebies.
The sample output is very poor. Cool demo, but really just emphasizes how much of a hit product the NotebookLM team has managed to come up with, ostensibly with more or less the same foundation models already available.
This is not so much an open source NotebookLM as it is a few experiments in an iPython notebook. What NotebookLM does at an LLM level is not particularly novel; it's the packaging as a product, in a different way than what others are doing, that I think is interesting. Also, the "podcast" bit is really just an intro/overview of a large corpus; far more useful is being able to discuss that corpus with the bot and get cited references.
What this does however demonstrate is that prototyping with LLMs is very fast. I'd encourage anyone who hasn't had a play around with APIs to give it a go.
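To the point about fast prototyping: the whole "document to podcast" pipeline is essentially three stages of glue. The structure below is my own minimal sketch — `llm` and `tts` stand in for whatever API or local model you want to experiment with:

```python
from typing import Callable

def make_podcast(
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    document: str,
) -> bytes:
    """Summarize a document, turn it into a two-host script, synthesize it."""
    summary = llm(f"Summarize the key points of:\n{document}")
    script = llm(f"Write a two-host podcast script covering:\n{summary}")
    return tts(script)
```

Everything NotebookLM layers on top — citations, chat over the corpus, the overlapping-speech audio model — is refinement of this skeleton, which is why a weekend prototype can get surprisingly far.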
> What NotebookLM does at an LLM level is not particularly novel, it's the packaging as a product...
Disagreed. NLM is novel in how the two hosts interrupt and overlap each other. No other OSS solution does that, they just take turns talking.
Fair point, although to me the "audio overviews" are a minor feature of the product.
But that's a bad habit and we tell people not to do it. So it's a novel but undesirable feature IMHO.
Not necessarily - when you're really jiving with someone, the conversation flows really well. Notice this is also part of what separates really good television from bad; Pulp Fiction, for example.
It only creates the podcasts right?
I am more interested in the other features of NotebookLM. The podcasts are fun but gimmicky.
Counterpoint: I have used the podcast numerous times and shared it with many. Great system and medium to digest complex information that I otherwise wouldn’t have.
If we could have this running locally on a mobile phone, that would be pretty cool. Imagine receiving a work document (for example, a product requirements document) and having this turn it into a podcast to play while I'm driving. I think my productivity would be through the roof, and I wouldn't need to worry about compliance issues.
I wish ChatGPT or Claude would make an Android Auto app that I can use while driving.
you could just Bluetooth your speakers
It's more about using the microphones in the car rather than the phone's microphone, as they tend to work better for hearing the driver... or at least I think they would.
I wonder how soon they'll release this in other languages and with different accents, especially SE-Asian accents.
Man.. the sample is pretty rough
I’d love to hear the output if anyone has used this.
There’s an example output linked on the github page
Now I need something that pseudonymizes my PDFs/input as a first step.
Page title: NotebookLlama: An Open Source version of NotebookLM
Fixed. Thanks! (Submitted title was "Meta's Open Source NotebookLM")
"Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html