4 comments

  • guerython 5 hours ago

    Nice to see Diarize lean into CPU-only inference for compliance workloads. We built on the same Silero -> embedding -> spectral stack, and one stabilizer that helped was filtering out Silero segments under ~350 ms and merging anything with cosine distance < 0.25 before the GMM, so the clustering stopped flipping speakers on micro-pauses.
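    A minimal sketch of that stabilizer in numpy (function and variable names are mine; the ~350 ms and 0.25 thresholds are the ones quoted above, not defaults from any library):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def stabilize(segments, embeddings, min_dur=0.35, merge_dist=0.25):
    # Drop VAD segments shorter than min_dur seconds.
    pairs = [(seg, emb) for seg, emb in zip(segments, embeddings)
             if seg[1] - seg[0] >= min_dur]
    if not pairs:
        return [], []
    # Greedily merge adjacent segments whose embeddings are within
    # merge_dist cosine distance; mean-pool the merged embedding.
    merged = [list(pairs[0])]
    for seg, emb in pairs[1:]:
        prev_seg, prev_emb = merged[-1]
        if cosine_distance(prev_emb, emb) < merge_dist:
            merged[-1][0] = (prev_seg[0], seg[1])
            merged[-1][1] = (prev_emb + emb) / 2.0
        else:
            merged.append([seg, emb])
    return [m[0] for m in merged], [m[1] for m in merged]
```

    Run before speaker-count estimation, so the GMM only ever sees segments long enough to carry a reliable embedding.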

    Another lever we added was keeping centroids from the last few calls and biasing the spectral solver toward any prototype with >0.75 similarity, which keeps returning participants from spawning a new SPEAKER label every session. Are you thinking about exposing that kind of anchor_embeddings hook so teams can keep participant IDs consistent across calls?
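    For the cross-call part, the simplest version is a post-clustering relabel against stored centroids rather than biasing the solver itself. A sketch (hypothetical helper, not the diarize API; the 0.75 threshold is the one mentioned above):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relabel_with_anchors(centroids, anchors, threshold=0.75):
    """Map each new cluster centroid to a stored anchor (a prior-call
    centroid in {label: embedding}) when cosine similarity exceeds the
    threshold; otherwise mint a fresh SPEAKER_NN label."""
    labels = {}
    next_id = len(anchors)
    for i, c in enumerate(centroids):
        best_label, best_sim = None, threshold
        for name, emb in anchors.items():
            s = cosine_sim(c, emb)
            if s > best_sim:
                best_label, best_sim = name, s
        if best_label is None:
            best_label = f"SPEAKER_{next_id:02d}"
            next_id += 1
        labels[i] = best_label
    return labels
```

    Post-clustering relabeling is easier to reason about than biasing the solver: the clustering stays unchanged, and anchor matching becomes a pure mapping step you can test in isolation.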

    • loookas 5 hours ago

      Good tips on the pre-clustering filtering. We do something similar with a 0.4 s threshold on short segments, but the cosine distance merge before the GMM is interesting; I'll look into that.

      On the cross-session speaker consistency: yes, that's on the roadmap. The plan is to store speaker embeddings (256-dim vectors) in a vector DB and use them for matching during diarization.

      Something like an anchor_embeddings parameter you can pass in, so the output labels stay consistent across calls.

      Right now every call produces SPEAKER_00, SPEAKER_01, etc. independently. The embedding extraction already works well enough for matching (that's what cosine similarity on WeSpeaker embeddings is good at); the missing piece is the API surface and the matching logic on top of clustering.

      What's your setup for storing/matching the centroids? Curious if you're doing it at inference time or as a post-processing step.

  • loookas 5 hours ago

    I built this because I needed speaker diarization for two things: a meeting summarization script (record → diarize → transcribe → feed to Claude for summaries), and a robotics project where I need real-time speaker identification.

    I started with pyannote, which is the standard tool for this. It worked, but processing a single call took forever on CPU, and the fans on my MacBook sounded like a jet engine. So I decided to build something faster.

    The pipeline: Silero VAD → WeSpeaker ResNet34 embeddings (ONNX Runtime) → GMM+BIC speaker count estimation → spectral clustering. All classical ML after the embedding step — no neural segmentation model like pyannote uses.
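    A sketch of the two classical stages after embedding extraction, assuming L2-normalized embeddings (scikit-learn stand-ins for illustration, not the package's actual code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering

def estimate_speaker_count(embeddings, max_speakers=8):
    # Fit GMMs with increasing component counts and keep the count
    # that minimizes BIC -- the GMM+BIC stage described above.
    best_k, best_bic = 1, np.inf
    for k in range(1, min(max_speakers, len(embeddings)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(embeddings)
        bic = gmm.bic(embeddings)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

def cluster_speakers(embeddings):
    k = estimate_speaker_count(embeddings)
    if k == 1:
        return np.zeros(len(embeddings), dtype=int)
    # Spectral clustering over a cosine-similarity affinity matrix
    # (inner products of normalized embeddings, negatives clipped to 0).
    sim = np.clip(np.inner(embeddings, embeddings), 0.0, None)
    return SpectralClustering(n_clusters=k, affinity="precomputed",
                              random_state=0).fit_predict(sim)
```

    Everything downstream of the embedding model is numpy and scikit-learn territory, which is where the CPU speed advantage over a neural segmentation stack comes from.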

    Results on VoxConverse (216 files, 1–20 speakers):

    DER: ~10.8% (pyannote free models: ~11.2%)
    CPU speed: RTF 0.12 vs 0.86 (pyannote community-1), about 7x faster
    10-min recording: ~1.2 min vs ~8.6 min
    Speaker count: 87–97% within ±1 for 1–5 speakers

    What it doesn't do well: 8+ speakers (count estimation breaks down), overlapping speech (single speaker per frame), and it's only been benchmarked on one dataset so far.

    Usage: pip install diarize

    from diarize import diarize
    result = diarize("meeting.wav")

    No GPU, no API keys, no HuggingFace account. Apache 2.0. Happy to answer questions about the architecture, benchmarks, or tradeoffs.

  • loookas 3 hours ago

    One thing I found surprising during development: the speaker count estimation turned out to be the hardest part of the whole pipeline, not the embeddings or clustering.

    Most diarization papers treat it as a solved problem or skip it entirely ("assume N speakers"). But in real meetings nobody tells you upfront how many people are on the call. GMM+BIC gets you to 51% exact match on VoxConverse, which sounds bad until you look at it per bucket: for 1–4 speakers it's 54–91% exact and 88–97% within ±1. It's 8+ speakers where it completely falls apart (0% exact match).
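    One classical alternative worth comparing against is the eigengap heuristic: look for the largest gap in the eigenvalue spectrum of a normalized graph Laplacian built from the affinity matrix, with no neural model and no GMM fitting. A sketch (my own implementation, not from the package):

```python
import numpy as np

def eigengap_speaker_count(embeddings, max_speakers=8):
    """Estimate speaker count from the largest eigengap of the
    symmetric normalized Laplacian of a cosine-affinity graph."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, 1.0)   # cosine affinities, negatives -> 0
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    # L = I - D^{-1/2} A D^{-1/2}
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    evals = np.sort(np.linalg.eigvalsh(L))
    k_max = min(max_speakers, len(evals) - 1)
    # Number of near-zero eigenvalues before the biggest jump
    # approximates the number of well-separated clusters.
    gaps = np.diff(evals[:k_max + 1])
    return int(np.argmax(gaps)) + 1
```

    It shares the same failure mode, though: once clusters stop being well separated (crowded calls, similar voices), the spectrum flattens and the gap becomes ambiguous, so it may not rescue the 8+ speaker bucket either.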

    Curious if anyone has found better approaches for automatic speaker count estimation that don't require a neural model.