I gave this a shot using speech-to-speech¹ modified so that it skips the LLM/AI assistant part and just repeats back what it thinks I said and displays the text.
For longer sentences, my perception is that Moonshine performs at 80-90% of what Whisper² could do, while using considerably fewer resources. On shorter, two-word utterances it nosedived for some reason.
These numbers don't mean much, but when paired with MeloTTS, Moonshine and Whisper² ate up 1.2 and 2.5 GB of my GPU's memory, respectively.
¹ https://github.com/huggingface/speech-to-speech ² distil-whisper/distil-large-v3
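A minimal sketch of that kind of echo loop, for anyone curious (this is not the actual modified pipeline - the recording length, file handling, and model name are illustrative assumptions, and it assumes the moonshine package's transcribe() helper as shown in its README):

    # Minimal "echo" sketch: record from the mic, transcribe with Moonshine,
    # print what it heard. Assumes the useful-sensors moonshine package plus
    # sounddevice/soundfile are installed; details are illustrative.
    import sounddevice as sd
    import soundfile as sf
    import moonshine

    SAMPLE_RATE = 16_000  # Moonshine models expect 16 kHz mono audio
    SECONDS = 5           # assumed utterance length

    # Record a short utterance from the default microphone.
    audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()

    # Write a temporary wav and transcribe it with the tiny Moonshine model.
    sf.write("utterance.wav", audio, SAMPLE_RATE)
    print(moonshine.transcribe("utterance.wav", "moonshine/tiny"))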
GitHub: https://github.com/usefulsensors/moonshine
Having played with the GB-sized Whisper models, I'm amazed to learn the 80 MB version is actually useful for anything.
I was aiming for an agent-like experience, and found the accuracy dropped below what I'd consider useful even with models above the 1 GB mark.
Perhaps for shorter, few-word utterances like "lights on"?
I've played quite a lot with all the whisper models up to "medium" size, mostly via faster-whisper, since the original OpenAI whisper only seems to optimize performance for GPU.
I would agree that the "tiny" model has a clear drop-off in accuracy; it's not good enough for anything real (even when transcribing your own speech, the error rate means too much editing is needed). In my experience, accuracy can be more of a problem on shorter sentences because there is less context to help it.
I think for serious use (on GPU) it would be the "medium" or "large" models only. There is now a "large-turbo" model which is apparently both faster and more accurate than "medium" (on GPU) - I haven't tried it yet.
On CPU for personal use (with faster-whisper) I have found "base" is usable and "small" is good, though on a laptop CPU "small" is too slow for real time. "Medium" is more accurate - though mostly just on punctuation - and far too slow for CPU. Of course, all models will get some uncommon surnames and place names wrong.
Since OpenAI have re-released the "large" models twice and now released a "large-turbo", I hope they will re-release the smaller models too, so that the smallest models become more useful.
These Moonshine models are compared to the original OpenAI whisper, but really I'd say they need to compare to faster-whisper: multiple projects are faster than the original OpenAI whisper.
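For reference, running faster-whisper looks roughly like this; the "small" model and int8 compute type below just mirror the CPU-friendly setup described above, not a recommendation:

    # Minimal faster-whisper sketch for CPU use; model size and compute_type
    # mirror the "small"-on-CPU setup discussed above.
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, info = model.transcribe("audio.wav")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")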
There are libraries that can help with this, such as SpeechRecognition for Python. If all you're looking for is short terms with minimal background noise, this should do it for you.
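A minimal sketch of that kind of short-command recognition with SpeechRecognition (the Google Web Speech backend below is just the library's default free option, and microphone input additionally needs PyAudio):

    # Quick SpeechRecognition sketch for short commands with little noise.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # brief noise calibration
        audio = recognizer.listen(source)

    try:
        print(recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand audio")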
Looks like Moonshine is competing against the Whisper-tiny model. There isn't any information in the paper on how it compares to the larger whisper-large-v3.
Yeah I was just mildly surprised such a small variant would be useful. Will certainly try when I get back home.
This looks awesome! Actually something I’m looking at playing with this evening!
I don't mean to give negative feedback, as I don't consider myself a full-blown Python/ML expert, but for someone with passing experience it fails out of the box for me, with and without the typically required 16 kHz sample-rate audio files (in various codecs/formats).
I was really hoping it would be a quick, brilliant solution to something I'm working on now. Perhaps I'll dig in and invest in it, but I'm not sure I have the luxury right now to do the exploratory work... Hope someone else has better luck than I did!
> I don't mean to give negative feedback
I would recommend then to be more specific. Did you have trouble installing it? Did it give you an error? Was there no output? Was the output wrong? Is it not working on your files but working on the example files? Is it solving a different problem than the one you have?
Installing was okay, but it did not run on any of the sample files I had. This is the output I got:

    UserWarning: You are using a softmax over axis 3 of a tensor of shape (1, 8, 1, 1). This axis has size 1. The softmax operation will always return the value 1, which is likely not what you intended. Did you mean to use a sigmoid instead?
      warnings.warn(
I know this isn't the right place for this - the right place is raising an issue on GitHub - but since you asked, I posted...
Moonshine author here: The warning is from the Keras library and is benign. If you didn't get any other output, it was probably because the model thought there was no speech (not saying there really was none). We uploaded an ONNX version that is considerably faster than the Torch/JAX/TF versions and usable with less package bloat. I hope you'll give it another shot.
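If the failures are input-related rather than model-related, normalizing files to 16 kHz mono WAV before transcribing may also help. A small sketch, assuming librosa and soundfile are installed (file names are placeholders):

    # Decode an arbitrary audio file and resample it to 16 kHz mono WAV.
    import librosa
    import soundfile as sf

    audio, sr = librosa.load("input.mp3", sr=16_000, mono=True)  # decode + resample
    sf.write("input_16k.wav", audio, sr)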
Very, very cool. Will have to try it out! It’s all fun and games until a universal translator comes out in glasses or earpiece form…
* Which languages is it available in?
* Does the system automatically detect the language?
* What are the hardware requirements for it to work?
Nice. Looks like a way to get live text transcripts on tiny devices without relying on APIs.
Wonder where the training data for this is.
They supply their paper in the Git repo, here: https://github.com/usefulsensors/moonshine/blob/main/moonshi...
The section "3.2. Training data collection & preprocessing" covers what you're inquiring about: "We train Moonshine on a combination of 90K hours from open ASR datasets and over 100K hours from own internally-prepared dataset, totalling around 200K hours. From open datasets, we use Common Voice 16.1 (Ardila et al., 2020), the AMI corpus (Carletta et al., 2005), Gi- gaSpeech (Chen et al., 2021), LibriSpeech (Panayotov et al., 2015), the English subset of multilingual Lib- riSpeech (Pratap et al., 2020), and People’s Speech (Galvez et al., 2021). We then augment this training corpus with data that we collect from openly-available sources on the web. We discuss preparation methods for our self-collected data in the following."
It does continue...
So they're claiming SOTA because they compare against OpenAI as SOTA. What about Groq or fal.ai?
My take wasn't that they are claiming SOTA, just that they are claiming the same or better accuracy than the whisper "tiny" and "base" models, at higher performance than original whisper.
The WER scores on the larger whisper models are lower (better) than "tiny"/"base", so none of these small models are close to SOTA either way.
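For concreteness, WER is (substitutions + deletions + insertions) divided by the number of reference words, so lower is better; the jiwer package computes it directly (the strings below are made-up examples):

    # Word error rate from a reference transcript and a hypothesis.
    from jiwer import wer

    reference = "turn the living room lights on"
    hypothesis = "turn the living room light on"
    print(wer(reference, hypothesis))  # 1 substitution / 6 words ~= 0.167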
Groq and fal are hosts of third-party models. They cannot be SOTA.
Fal provides Whisper which is OpenAI.
Groq only hosts LLMs.
Kind of a weird name choice, not very searchable and completely unrelated. But the tech looks great.
I agree; it’s also irresponsible to name something that reminds many people of alcohol. I think they were trying to evoke the idea of moonlight reflecting the sun’s light, just as their software reflects the speech of the user.
I think “Artemis” or “Luna” would work better.
As an alcoholic - sober 6 years on January 1, 2025 - I can comfortably say that my addiction is my problem. I do not expect the world to comport to my issues.
If an alcoholic is triggered by this name they need more tools in their toolkit.
The best thing anyone in recovery* can do is build up a toolkit and support system to stay sober. If one expects the world to isolate them from temptation they will never get sober.
* recovery: I loathe this term, but use it because it's familiar. My dislike for it is another conversation and unnecessary here.
Luna stigmatizes people with mental illnesses (lunatic), and Artemis was the goddess of virginity - valorizing feminine virginity is misogynistic.
You can play this game with every possible name.
Offensiveness is certainly a sliding scale, informed by the culture of the recipients. I agree with OP that some thought should be given to a name so as not to inflict hurt unnecessarily, but I also wouldn't have caught moonshine as a term celebrating alcoholism (alcohol is still one of the most damaging drugs available, cf. David Nutt's Drugs Without the Hot Air).
Great example (tutorial) of how to get offended by anything.
How is it irresponsible?
I’m also awaiting an answer for such a preposterous claim. Let’s rename Svelte because it reminds me of Swedish pancakes and I have diabetes.
Now I really want pancakes.
Me too. People should really be more careful.
I guess jaco6 thinks recovering addicts shouldn't be reminded of their addiction. I also guess that is casting quite a wide net on speech.
Isn't it clearly a reference to the idiom "talking moonshine", making it a self-deprecating joke?
You think it’s irresponsible to name something after a drug?
Meta
If somebody thinks only of alcohol when they read "moonshine", that's their problem.
https://en.wikipedia.org/wiki/Moonshine_(disambiguation)