If folks are interested, @antirez has open-sourced a C implementation of Voxtral Mini 4B here: https://github.com/antirez/voxtral.c
I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.
There is also another Mistral implementation: https://github.com/EricLBuehler/mistral.rs Not sure what the difference is, but it seems to just be better received overall.
mistral.rs is more like llama.cpp: it's a full inference library written in Rust that supports a ton of models and many hardware architectures, not just Mistral models.
Kudos, this is where it's at: open models running on-premise. Preferred by users and businesses. Glad Mistral's got that figured out.
I don't know anything about these models, but I've been trying Nvidia's Parakeet and it works great. For a model like this that's 9GB for the full model, do you have to keep it loaded in GPU memory at all times for it to really work in realtime? Or what's the delay like to load all the weights each time you want to use it?
Personally I run an ollama server. Models load pretty quickly.
There's a distinction between tokens per second and time to first token.
Delays come for me when I have to load a new model, or if I'm swapping in a particularly large context.
Most of the time, since the model is already loaded and I'm starting with a small context that builds over time, tokens per second has the biggest impact.
It's worth noting I don't do much fancy stuff beyond a tiny bit of agent work; I mainly use qwen-coder 30a3b or qwen2.5-coder instruct/base 7b.
I'm finding more complex agent setups where multiple agents are used can really slow things down if they're swapping large contexts. ik_llama has prompt caching, which helps speed this up when swapping between agent contexts, up to a point.
tldr: loading weights each time isn't much of a problem, unless you're having to switch between models and contexts a lot, which modern agent stuff is starting to require.
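To make the distinction concrete, here's a rough sketch against a local Ollama server (assuming the default port and a model you've already pulled; the model name below is just an example). The keep_alive field is what keeps the weights resident so you're not reloading them on every request:

    // Rough sketch: measure time-to-first-token vs. steady-state tokens/sec
    // against a local Ollama server (assumes the default port and that the
    // named model has already been pulled; the name is just an example).
    async function measureGeneration(prompt: string): Promise<void> {
      const start = performance.now();
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "qwen2.5-coder:7b", // example model name
          prompt,
          stream: true,
          keep_alive: "10m",         // keep the weights loaded between requests
        }),
      });

      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      let firstTokenAt: number | null = null;
      let chunkCount = 0;

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n"); // stream is newline-delimited JSON
        buffer = lines.pop() ?? "";
        for (const line of lines) {
          if (!line.trim()) continue;
          const msg = JSON.parse(line);
          if (msg.response) {
            if (firstTokenAt === null) firstTokenAt = performance.now();
            chunkCount++; // each streamed chunk is roughly one token
          }
        }
      }

      if (firstTokenAt === null) {
        console.log("no tokens generated");
        return;
      }
      const end = performance.now();
      console.log(`time to first token: ${((firstTokenAt - start) / 1000).toFixed(2)} s`);
      console.log(`~tokens/sec after that: ${(chunkCount / ((end - firstTokenAt) / 1000)).toFixed(1)}`);
    }

With a warm model and a short prompt, the first number stays small; it's the second one that dominates once the context starts growing.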
Look, I think it's great that it runs in the browser and all, but I don't want to live in a world where it's normal for a website to download 2.5 GB in the background to run something.
Awesome work! It would be good to have it work with handy.computer. Also, are there plans to support streaming?
I'm looking into porting this into transcribe-rs so Handy can use it.
The first cut will probably not be a streaming implementation.
okay... so I cannot get this to run on my Mac. Maybe something with the Burn kernels for the quantized weights?
I'll file a GitHub issue.
Just tried out Handy. This is a much better and more lightweight UI than the previous solutions I've tried! I know it wasn't your intention, but thank you for the recommendation!
That said, I now agree with your original statement and really want Voxtral support...
Handy is awesome, and easy to fork. I highly recommend building it from source and submitting PRs if there are any features you want. The author is highly responsive and open to vibe-coded PRs as long as you do a good job. (Obviously you should read the code and stand by it before you submit a PR; I just mean he doesn't flatly reject all AI code like some other projects do.) I recently submitted a PR to add an onboarding flow for Macs that just got merged, so now I'm hooked.
thanks for your contribution :)
I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.
Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?
The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.
It's not fast enough to be realtime, though with a more advanced UI and a ring buffer you could get something like what you describe. (e.g. I do this with Whisper in Flutter, and also run inference on GGUFs via llama.cpp from Dart.)
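A minimal sketch of that kind of chunked loop in the browser, assuming a hypothetical transcribe(blob) wrapper around the in-browser model (not something this demo actually exposes): re-run inference on the accumulated audio every couple of seconds and replace the displayed text on each pass.

    // Pseudo-realtime loop: transcribe() is a hypothetical async wrapper
    // around the in-browser model, passed in by the caller.
    async function pseudoRealtime(
      transcribe: (audio: Blob) => Promise<string>,
      onText: (text: string) => void,
    ): Promise<void> {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const recorder = new MediaRecorder(stream);
      const chunks: Blob[] = [];
      let busy = false;

      recorder.ondataavailable = async (e) => {
        chunks.push(e.data);
        if (busy) return; // previous inference still running; catch up later
        busy = true;
        // Re-transcribe the whole buffer so far. A real ring buffer would cap
        // its length and stitch segment boundaries instead of growing forever.
        const audio = new Blob(chunks, { type: recorder.mimeType });
        try {
          onText(await transcribe(audio));
        } finally {
          busy = false;
        }
      };

      recorder.start(2000); // emit a chunk roughly every 2 seconds
    }

The catch is that inference has to run faster than the audio accumulates; otherwise each pass falls further and further behind.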
This isn't even close to realtime on M4 Max. Whisper's ~realtime on any device post-2022 with an ONNX implementation. The extra inference cost isn't worth the WER decrease on consumer hardware, or at least, wouldn't be worth the time implementing.
For those exploring browser STT, this sits in an interesting space between Whisper.wasm and the Deepgram KC client. The 2.5GB quantized footprint is notably smaller than most Whisper variants — any thoughts on accuracy tradeoffs compared to Whisper base/small?
hm, seems broken on my machine (Firefox, Asahi Linux, M1 Pro). I said hello into the mic, and it churned for a minute or so before giving me:
panorama panorama panorama panorama panorama panorama panorama panorama� molest rist moundothe exh� Invothe molest Yan artist��������� Yan Yan Yan Yan Yanothe Yan Yan Yan Yan Yan Yan Yan
Is this a criticism of a program written in Rust? KILL THE HERETIC!
(no speech detected)
or... not talking at all generates random German sentences.
Man, I'd love to fine-tune this, but alas the Hugging Face implementation isn't out yet, as far as I can tell.
Notably, this isn't even close to realtime on an M4 Max.
I just tried it. I said "what's up buddy, hey hey stop" and it transcribed this for me: "وطبعا هاي هاي هاي ستوب". No, I'm not in any Arabic-speaking or Middle Eastern country. The second test was better; it detected English.
fwiw, that is the right-ish transliteration into Arabic. It just picked the wrong language to transcribe to lol
>init failed: Worker error: Uncaught RuntimeError: unreachable
Anything I can do to fix/try it on Brave?
I would check memory and make sure you have free RAM. Tested here: https://imgur.com/a/3vLJ6no (not perfect dictation, but close enough).
Does disabling shields help?