> I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.
This is closer to being the real problem with WebRTC than the whole "it's making decisions about latency that I disagree with".
If you had a way to set up the tracks/channels over UDP connections that didn't involve P2P/STUN/TURN etc. but got to keep all the codec negotiation and things like AEC, that would be awesome. MoQ isn't that though, because it's by people who don't actually see the whole problem end-to-end; just their little piece of it in the middle.
Delivery of first phoneme and delivery of the important information don't have to be coupled. Politicians on TV get very good at this particular trick, they've got a set of stock phrases which basically fill time while their brain gets in gear. We just need something to fill the gap so our System 1 doesn't lose confidence in the interaction.
I guess different approaches could be applicable for client to server vs server to client.
For client to server you want low latency, don't care about pauses introduced by communications (the model doesn't care), and could certainly tolerate a fallback to lower-bandwidth text only (local STT) or more heavily compressed voice.
For server to client it needs to be high quality voice without pauses, but as the parent was suggesting you could potentially hide response latency (whether due to server or communication degradation) by using a human-like conversational "trick" of at least making some sound before the brain is engaged and generating a response. "That's absolutely right! ..." would be a tad annoying, but "Hmm..." might be OK, especially if not done all the time, just as a locally initiated conversational filler when the server is slow to respond.
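A minimal sketch of that locally initiated filler, assuming a pre-recorded clip and the Web Audio API; the 300ms threshold and all the names are made up:

```ts
// Sketch of the client-side "conversational filler" idea, not a real API.
// fillerClip is a short pre-recorded "Hmm..." the client ships with.
async function playResponse(
  firstChunk: Promise<AudioBuffer>, // real response audio from the server
  fillerClip: AudioBuffer,
  ctx: AudioContext,
  thresholdMs = 300,
) {
  const timer = setTimeout(() => {
    // Server is slow: cover the gap with the local filler.
    const filler = ctx.createBufferSource();
    filler.buffer = fillerClip;
    filler.connect(ctx.destination);
    filler.start();
  }, thresholdMs);

  const chunk = await firstChunk;
  clearTimeout(timer); // arrived in time, or after the filler already played
  const src = ctx.createBufferSource();
  src.buffer = chunk;
  src.connect(ctx.destination);
  src.start();
}
```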
I do wonder if you actually need two models here. Audio-to-audio hindbrain on the client, and a beefy text-mode frontal lobe somewhere in the cloud, with the comms between them explicitly trained in as a potentially low-bandwidth steering connection transferring embeddings, not text.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.
Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.
I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.
Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.
I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference-audio STT transcripts.
I have had times where the ~sandbox of those conversations, and their far more limited ability to build a useful corpus of context via web searches or by accessing prior conversation content, got in the way.
The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)
The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.
It was unable to explain why it was acting this way. I suspect the low latency did a disservice there, because when it was paired with something adversarial it was deeply troubling.
> when I'm out and about I want to have a conversation with the Star Trek computer.
But you’re not. And you won’t. You’ll never have a conversation with the Star Trek computer while you continue to place anything else above accuracy. Every time I see someone comparing LLMs to the Star Trek computers, it seems to be someone who doesn’t understand that correctness was their most important feature. I’m starting to get the feeling people making that comparison never actually watched or understood Star Trek.
A computer which gives you constant bullshit is something only the lowest of the Ferengi would try to sell.
> This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
It’s not. It absolutely is not and will never be. Not unless all you’re looking for is affirmation, companionship, titillation. I suggest looking for that outside chat bots.
I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and then kept that open, sending basically HTTP2/SPDY/something like it over the wire after it detected the wake word. This allowed the STT to start processing before you finished talking, so there is only a small delay in processing the last few chunks of your utterance.
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP2 from the phone and both iOS and Android will pretty much take care of that connection magically.
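The general shape (not Alexa's actual wire protocol) looks something like this sketch, using fetch upload streaming, which currently needs HTTP/2+ and a Chromium-based browser; the endpoint is hypothetical:

```ts
// General shape only: one long-lived HTTP request per utterance, with mic
// chunks streamed up as they are captured, so server-side STT can start
// before the user finishes talking.
async function sendUtterance(chunks: AsyncIterable<Uint8Array>): Promise<unknown> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of chunks) controller.enqueue(chunk);
      controller.close(); // user stopped; most of the STT work is already done
    },
  });
  const resp = await fetch("https://example.invalid/utterance", {
    method: "POST",
    body,
    duplex: "half", // required for streaming request bodies
  } as RequestInit);
  return resp.json(); // the answer comes back on the same connection
}
```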
The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
> The user won't even notice a delay until you get over 500ms
I think a lot of comments are getting so laser focused on the transport delays that they’re forgetting that the LLM pipeline isn’t instant.
The transport delays are additive on top of all of the other delays, which are already high.
Which I assume is why they reached for the lowest latency solution they could, because they need every bit of help they can get to start shrinking that end to end delay across the entire pipeline.
Analogies to human voice delay don’t work because in that case we treat the human as having no delay.
And that was the entire point of my comment. That your transport layer isn't your bottleneck. You can start processing before they finish speaking. Your bottleneck will always be what happens after that.
"The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms"
Not my experience, running around 6,000 conversations per day with voice, with webrtc + cascading (stt/llm/tts) architecture.
Maybe I misunderstood your comment, but that 500ms is basically the floor of a state-of-the-art voice implementation these days - if you are lucky and don't skimp, and do various expensive things like speculative decoding and reasoning. 450ms on the LLM pass alone. Every ms counts in commercial applications of voice AI. If you add 200ms or 300ms to that, it really degrades the conversation.
We do a lot of voice stuff to support our business, largely with unsophisticated, non technical users. Last year's attempts, with measured turn-to-turn latencies of around 1200ms-1500ms, led to a lot of user confusion, interruptions, abandoned conversations and generally very unpleasant experiences. We are at around 700ms turn to turn now, depending on tool usage needed, and it's approaching an OK experience, rivalling an interaction with an actual human. We are spending quite a lot to shave another 100ms off that. We do expensive, wasteful things such as speculative LLM passes, we do speculative tool executions (do a few LLM inferences as the user speaks, but don't actually execute non-idempotent tool calls before you know that that LLM pass is usable and the user did not say anything important at the tail end of their sentence) just to shave 100-200ms. When someone says 500ms is irrelevant I am sure they are describing some other use case, not human-to-AI voice interactions.
In my experience with voice AI, the problem is not with some occasional dropped webrtc packets. The real hard problem is with strong background noises, echo, and of course accents. WebRTC with its polished AEC implementations helps quite a lot, at least with echoes. I get the protocol is a major PITA to implement at OpenAI scale, but for anything but hyperscale applications there are lots of good, viable solutions and commercial providers (say, Daily for instance) that make it a non-problem. The real problems to solve are still elsewhere. But boy, add 500ms to my latency budget and you've killed my application.
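A sketch of the speculative-pass pattern described above, where runLLM and executeTool are hypothetical stand-ins for the real inference/tool stack: the partial-transcript pass is aborted whenever the user keeps talking, and non-idempotent tool calls only run once the turn is confirmed final.

```ts
interface LLMResult { text: string; toolCalls: ToolCall[] }
interface ToolCall { name: string; args: unknown }

declare function runLLM(transcript: string, signal: AbortSignal): Promise<LLMResult>;
declare function executeTool(call: ToolCall): Promise<unknown>;

let current: { controller: AbortController; result: Promise<LLMResult> } | null = null;

// Called on every partial transcript while the user is still speaking.
function onPartialTranscript(transcript: string) {
  current?.controller.abort(); // user said more; discard the stale pass
  const controller = new AbortController();
  current = { controller, result: runLLM(transcript, controller.signal) };
}

// Called only once we are confident the turn is over.
async function onTurnConfirmed() {
  if (!current) return;
  const { toolCalls } = await current.result; // speculative pass is now usable
  for (const call of toolCalls) await executeTool(call); // non-idempotent work runs here
}
```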
I agree with everything you've said, I must have written it wrong.
What I was saying is the same as you -- the user will tolerate a total delay of 500ms, and then happiness starts to fall off. We had some Alexa utterances at 500ms, the most basic ones, but most took longer.
However, even with http2 and the like, we could get in that range because of the fact that it was sending data right away, so we were mostly done processing the STT by the time they were done speaking, and we were already working on the answer based on the first part of the utterance.
But I would need to see some really strong evidence to even think about using WebRTC.
As for webrtc - it was mainly for decent support in browsers and built in AEC. I think we will take another look at this design choice if we run out of ways to further optimize.
I am myself working on something similar, but I have noticed that if I try to pass on early speech from the user to the LLM to reduce latency, the chances of interruptions get even higher. For example, the user may say something like "Yes" followed by a brief pause, leading the speech model to count that as a complete turn, triggering the LLM call. But then the user may add something more, so I have to cancel the previous request so that any irreversible state transitions can be avoided. Now due to the lower latency (due to speculative calls), I get an even smaller window to actually cancel the response or even to stop the model from speaking.
Detecting end of turn is a whole other issue. You can do the easy thing, which is just assign some number of milliseconds of silence as the end, or you can spend a lot of money asking the model to figure it out based on context.
Humans actually do the second thing, where we not only use our "model" to figure out end of turn, we actually predict what they are going to say based on context and will sometimes answer before they even finish.
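The easy version really is just a silence timeout. A minimal sketch, with made-up thresholds that would need tuning per application:

```ts
// The "easy thing": declare end of turn after N ms below an energy
// threshold. Both silenceMs and rmsFloor are invented defaults.
function makeEndOfTurnDetector(
  onEndOfTurn: () => void,
  silenceMs = 700,
  rmsFloor = 0.01,
) {
  let silentSince: number | null = null;
  return (samples: Float32Array) => { // called per captured audio frame
    const rms = Math.sqrt(samples.reduce((s, x) => s + x * x, 0) / samples.length);
    if (rms >= rmsFloor) { silentSince = null; return; } // still talking
    silentSince ??= performance.now();
    if (performance.now() - silentSince >= silenceMs) {
      silentSince = null;
      onEndOfTurn(); // the cheap heuristic fires here
    }
  };
}
```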
This is pretty insightful, thank you. Which provider are you guys using? Is it also over the phone or fully web/app based? Do you have any resources you can point me to to learn about this?
> But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.
I felt that comment in my bones. Why would anyone possibly have the need to know the actual presentation timestamp and how that corresponds to actual realtime? Evidently, no one working on WebRTC has had to synchronise data streams from varying sources before with millisecond accuracy.
I was doing a demo of video stabilisation using a webcam and IMU module in the browser. It turns out the latency between video->rtc->browser and sensor->websocket->browser are wildly different and not constant. The obvious solution would be to send UTC timestamps for the sensor data and synchronise in browser. Not possible, the video has no UTC timestamp reference. When you have control of both sides of the WebRTC pipe, you can do fun things like send the UTC timestamp of the start of the stream, but this won't solve browser jitter. It worked well enough for a POC but the entire solution had to be reengineered.
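A sketch of that "send the UTC timestamp of the start of the stream" hack, assuming the t0 arrives out-of-band (e.g. over a data channel); requestVideoFrameCallback is Chromium-only at the time of writing:

```ts
// Map each decoded frame to an estimated wall-clock time using a stream
// start time sent out-of-band. Does not correct for browser jitter, as
// noted above.
function mapFramesToUtc(video: HTMLVideoElement, streamStartUtcMs: number) {
  const rvfc = (video as any).requestVideoFrameCallback?.bind(video);
  if (!rvfc) return; // API not available in this browser
  const onFrame = (_now: number, metadata: { mediaTime: number }) => {
    // mediaTime is seconds since the start of the stream for this frame.
    const frameUtcMs = streamStartUtcMs + metadata.mediaTime * 1000;
    console.log("frame wall-clock estimate:", new Date(frameUtcMs).toISOString());
    rvfc(onFrame);
  };
  rvfc(onFrame);
}
```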
This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ice-candidates, offers, peer-to-peer protocols, and the complex handshake that is implemented from scratch each time. I can't imagine re-writing the whole trenchcoat of protocols and unintended "best-practices".
The first time I was able to get a working webrtc datachannel setup with aiortc was when LLMs became a thing; before that it was pretty much impossible, full stop. Nobody knows what or how, there are no examples. It's a horrible protocol that just needs to die.
This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
You might have noticed that the author started the blog post explaining themselves:
> Like 6 years ago I wrote a WebRTC SFU at Twitch.
> Originally we used Pion (Go) just like OpenAI,
> but forked after benchmarking revealed that it was too slow.
> I ended up rewriting every protocol, because of course I did!
> Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust.
> Because of course I did! You're probably noticing a trend.
> Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s.
> And some de-facto standards that are technically drafts (ex. TWCC, REMB).
> Not a fun fact when you have to implement them all.
> You should consider me a Certified WebRTC Expert.
> Which is why I never, never want to use WebRTC again.
I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?
Right, but they also state they have never implemented TURN, which IMO is a marker of WebRTC expertise. (I haven't btw, just the WebRTC experts I know absolutely have written or worked on a TURN implementation at some point)
It's not that strange. TURN has two main use cases: peer-to-peer when no viable direct path can be found and working around very strict firewalls. Based on the author's experience the first isn't relevant and the second isn't much of a concern for Twitch and Discord. For the latter case HTTP/3 is helping make TURN unnecessary because you can, as the author observes, run UDP over port 443.
It's 2026 and teleconferencing is still such a shit show. There are billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I've never not seen teleconferencing be a ham-handed mess.
The most frustrating thing about FaceTime is it sometimes appears to significantly duck audio in order to avoid echoes. I can't predict on which devices it will happen, but it often does when I call my parents and it absolutely destroys the conversation. If they're telling me something and I make the slightest "uhuh" acknowledgment sound, their mic input gets effectively muted for a second or so and I miss what they say.
> WebRTC is designed to degrade and drop my prompt during poor network conditions
If you want real time, that's what you are going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Hello Mr Author here. Apologies that my comment replies aren't as funny.
Every low-latency application has to decide the user experience trade-off between quality and latency. Congestion causes queuing (aka latency) and to avoid that, something needs to be skipped (lower quality).
The WebRTC latency vs. quality knob is fixed. It's great at minimizing latency, but suffers from a lack of flexibility. We still (try to) use WebRTC anyway, because like you implied, browser support has made it one of the only options.
Until now of course! WebTransport means you can achieve WebRTC-like behavior via a generic protocol. Choose how long you want to wait before dropping/resetting a stream, instead of that decision being made for you.
And yeah my point in the blog is that often the user wants streaming, but not dropping. Obviously you can stream audio input/output without WebRTC. The application should be able to decide when audio packets are lost forever... is it 50ms or 500ms or 5000ms? My argument is that voice AI shouldn't pick the 50ms option.
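To make that concrete, a sketch over the browser WebTransport API (the endpoint is hypothetical): one unidirectional stream per audio segment, and the application, not the protocol, decides when to reset a stream and move on.

```ts
// deadlineMs is exactly the knob WebRTC never exposes: how long to keep
// trying before audio is dropped forever.
async function sendSegments(frames: AsyncIterable<Uint8Array>, deadlineMs: number) {
  const wt = new WebTransport("https://example.invalid/voice");
  await wt.ready;
  for await (const frame of frames) {
    const stream = await wt.createUnidirectionalStream();
    const writer = stream.getWriter();
    const timer = setTimeout(() => {
      // Deadline hit: reset the stream. The app chose 50ms vs 500ms vs
      // 5000ms, not the protocol.
      writer.abort("deadline exceeded").catch(() => {});
    }, deadlineMs);
    try {
      await writer.write(frame); // QUIC retransmits under loss
      await writer.close();
    } catch {
      // stream was reset at the deadline; this segment is dropped
    } finally {
      clearTimeout(timer);
    }
  }
}
```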
Isn’t the point that OpenAI’s use case does not require realtime?
When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.
Yep. Maybe there's some additional configuration I'm missing, but clients don't seem to want to deal with the delay of STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".
I run the gemini live api over a mesh-hosted managed webrtc cloud. Works fantastic, and I've been running it for 2 years. You can try websockets, handle ephemeral keys, etc. etc., but when you speak with people running voice agents at scale in this space, many of the issues are solved with webRTC and pipecat and the many resources allocated to solved problems in this space. It certainly feels overkill, and it probably is, but once a connection is established, it's pretty magical. The startup time and buffering have been solved for quicker voice connections too: https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)
There are tons of ways to fine-tune WebRTC so that it won't corrupt audio on a poor network - it has all of the controls to smoothly trade off latency vs quality. Not just NACKs - FEC, disabling PLC/Acceleration/Deceleration, a larger JB (tons of parameters), etc.
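Two of those knobs as a sketch, assuming the common Chromium Opus payload type (111) and a browser that implements the jitterBufferTarget receiver hint (not all do yet):

```ts
// Enable Opus in-band FEC via SDP munging. A robust version would parse
// the rtpmap to find the Opus payload type instead of assuming 111.
function enableOpusFec(sdp: string): string {
  return sdp.replace(/a=fmtp:111 (.*)/, "a=fmtp:111 $1;useinbandfec=1");
}

// Ask receivers to hold a larger jitter buffer: trade latency for smoothness.
function growJitterBuffer(pc: RTCPeerConnection, targetMs: number) {
  for (const receiver of pc.getReceivers()) {
    if ("jitterBufferTarget" in receiver) {
      (receiver as any).jitterBufferTarget = targetMs;
    }
  }
}
```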
Most of the glitches I heard with OpenAI's Voice were not WebRTC related - but rather, to my ear, they sounded more like realtime issues with their inference - which is a very different component to optimize.
I've been using LiveKit, which is also WebRTC based, and it is super annoying when playback slows down or speeds up at times when the connection is not robust. We were using OpenAI's websocket-based RealTime audio, which was way too slow. So I don't know which one is better. Generally our users like the LiveKit implementation better, so maybe WebRTC with enough clever hacks is the answer.
This blog was super insightful for me to understand what are the root problems in the current implementation though.
There are a lot of extremely smart people that have come back to webRTC time and time again because it continues to solve problems other methods and protocols can't. With that said, quic is certainly interesting going forward, but I primarily stream voice + vision at 1fps so it just makes sense, and websockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/). Also just listen to sean in this thread, dude knows what's up.
Why does the voice need to be sent to the server? Why not perform speech-to-text on-device? Is the p10 phone/laptop not capable of this yet, despite every "dictation" feature I see in every modern OS?
I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating part. Simply backgrounding the app (and the web version too) and resuming it causes the response or the conversation with an assigned ID to disappear. Haha.
Most of the problems happen because we want to simulate human conversations. While that's a good goal to have, another approach is to let the user know clearly they are talking to a bot. You will be surprised at how accommodating users can be when they know they are talking to a bot and want their queries resolved.
My biggest frustration with WebRTC was precisely captured in the article: even if you don't need p2p and your video source is a process on the same host as your browser, you have to dance around connection setup like you're on a different side of the planet.
Exactly what I thought when I read the original article, though to be fair WebTransport is barely now entering the mainstream with Safari shipping support this year.
this misses a few key things but hits on many others
webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech, there is no TTS in voice mode
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further
ultimately though, it comes down to
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced
to be fair, the new models were released the day after this MoQ blog was published
> OpenAI is speech-to-speech, there is no TTS in voice mode
Which results in the interesting situation where the transcript isn't what was said:
Q: Why do the voice transcripts sometimes not match the conversation I had?
A: Voice conversations are inherently multimodal, allowing for direct audio exchange between you and the model. As a result, when this audio is transcribed, the transcription might not always align perfectly with the original conversation.
"WebRTC is the problem" is bait; his real claim is "WebRTC has annoying transport-layer characteristics that hurt cloud Voice AI scaling"...
Having just had to tackle this again for my own startup, I'm reminded about what you would lose by ditching WebRTC - the audio DSP pipeline, transmit side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity etc.
Interesting read, albeit over my head, but I spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2, and while gpt is super good and seemingly more robust, Gemini connects faster.
I've long had the feeling that WebRTC was intentionally over-engineered. Over-engineered and poorly documented.
IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.
It can, in general knowing how to shuffle packets according to RFCs is a pretty decent gig. Pretty much every hyperscaler ends up building various LBs and the learning curve is too steep to just toss randos at it unsupervised, but at the same time it's not necessarily inventing anything new most of the time.
How is OpenAI Voice mode any different than a Whatsapp call? Ignoring the part that there is a GPU on the other side instead of a human. But what is the technical challenge in the voice call portion? It seems like that has been a solved problem for a long time now.
You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.
It is because most of their complexity is in routing packets. With IPv6 you can just have the thing handling the conversation directly addressable by the client. The last 64 bits of a v6 address let you have billions of instances in a region.
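A sketch of that addressing trick; the prefix is a documentation example (2001:db8::/32) and the encoding is made up:

```ts
// Pack a 64-bit session id into the interface-identifier half of a /64,
// so the client can reach the exact instance handling its conversation.
function sessionAddress(prefix: string /* e.g. "2001:db8:1:2" */, sessionId: bigint): string {
  const groups: string[] = [];
  for (let shift = 48n; shift >= 0n; shift -= 16n) {
    groups.push(((sessionId >> shift) & 0xffffn).toString(16));
  }
  return `${prefix}:${groups.join(":")}`; // last 64 bits identify the session
}

// sessionAddress("2001:db8:1:2", 0xdeadbeefcafen)
//   -> "2001:db8:1:2:0:dead:beef:cafe"
```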
I didn't make it all the way through the post, but I have to say I think he fundamentally misunderstands the purpose of WebRTC. He calls himself an expert, and yeah he's written SFUs in Go and Rust at different companies ... but his technical credentials do not mean he's correct.
Maybe it's a comprehension issue on my end, but he seems to treat things like STUN and DTLS as related, compounding issues (particularly in round-trip time), but they are really orthogonal.
Also, he spends too much time talking about how you can't resend packets, and reiterates that point by stating they tried really hard (at Discord?). That's where he lost the plot, imo.
The RTC in WebRTC is about real time communication. Humans will naturally prefer the auditory experience of an occasional dropped packet, vs backed up audio or audio that plays at an uneven rate. To clarify, I'm talking about human speech here.
If you want to tolerate packet loss, use a protocol based on TCP instead of UDP. But you know what happens when you send audio over poor network conditions with TCP? There will be pauses on the receiving end as it waits for the next correct packet. Let's say the delay is multiple seconds. What should the receiving end do when packets start flowing again? Play the clogged audio at its natural rate? Attempt to play the audio back at a higher rate to "catch up" with any other channels? People, humans, do not generally prefer that experience.
Forget about WebRTC for a minute and instead think about TCP vs UDP for voice. VoIP has been based on UDP since the '90s for a reason.
I think you're not really engaging with his point, which is that RTC is a poor fit for communicating with an AI agent. I didn't read the blog as claiming that WebRTC is bad for what it is, only that it's a (very) poor choice for a voice-to-AI application.
Only if you expect to interact with the agent in a turn-taking format, with (possible) pauses between every turn.
ChatGPT’s voice mode is like speaking to someone in real time on a voice call, not input -> output.
> Humans will naturally prefer the auditory experience of an occasional dropped packet, vs backed up audio or audio that plays at an uneven rate
Yes but the difference here is there is only one human in the conversation. The other side can tolerate a 200ms delay in receiving or sending perfectly fine because it is not constrained to run in exactly real time like a human brain is.
I think he is right. This is an interesting point that I haven't considered before. The reason we skip 200ms instead of pausing for 200ms when we get missed packets in a WebRTC call is because we can't pause the human on the other side of the call. But we can pause AI just fine.
i haven't used the openai voice thing
but, if it's trying to respond in a natural way, with interruptions in both directions, it may still be a good idea. if there's a delay between you stopping and it starting talking, it feels weird
(you might be able to fake some of that on the client, but then you need a thicker client)
Which LLM can generate text so quickly a real-time conversation is viable?
Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc is going though.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)
> TTS is faster than real-time
https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes. Audio is trickled in/out in 20ms frames.
> We really hope the user’s source IP/port never changes, because we broke that functionality.
That is supported. When a new IP comes in for a known ufrag, it's handled.
> It takes a minimum of 8* round trips (RTT)
That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/
> I’d just stream audio over WebSockets
You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)
----
I think if I had my choice I would pick the Offer/Answer model and then do QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
HELLO MR SEAN,
1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.
2. I'm obviously not a TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?
3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.
4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again, some can be pipelined so the real number is a bit lower. But the real issue is the mandatory signaling server, which causes a double TLS handshake just in case P2P is being used.
5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.
I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.
The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.
1.) Latency vs quality doesn't come up enough to make people want to A/B test it, unfortunately. At work I would say ~5 people care about WebRTC vs QUIC vs X. All effort is around the models (how can I provide tools to support those doing that work).
2.) The model isn't processing just text anymore. It's also taking into account breathing/emotion etc... not just spitting out big responses anymore. As it generates them it is taking into account the user's response.
3.) It works with the LB setup today. Clients are sending ICE traffic; if a client roams, we look up the ufrag and route appropriately.
4.) With DTLS 1.3 it is 1 RTT with SNAP[0] for WebRTC session. SCTP info goes in Offer/Answer, DTLS is packed into ICE. You are totally right about signaling though! [1] was my answer for doing WebRTC without signaling, couldn't get anyone to care though.
5.) I don't have anything that I need to tune. If I want to increase (or decrease) latency [3] is something I put into Transceiver. Otherwise I can't think of any 'change this WebRTC behavior' that has been asked by users/developers.
[0] https://datatracker.ietf.org/doc/draft-hancke-tsvwg-snap/
[1] https://github.com/pion/offline-browser-communication
[3] https://webrtc.googlesource.com/src/+/refs/heads/main/docs/n...
Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.
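A minimal sketch of that, using MediaRecorder over a WebSocket; note that browsers may clamp very small timeslices, so treat the 20ms here as aspirational:

```ts
// Coded audio goes out the moment it is produced; TCP handles
// retransmission of anything lost along the way.
async function streamMicOverWebSocket(url: string) {
  const ws = new WebSocket(url);
  await new Promise((resolve, reject) => { ws.onopen = resolve; ws.onerror = reject; });
  const media = await navigator.mediaDevices.getUserMedia({ audio: true });
  const rec = new MediaRecorder(media, { mimeType: "audio/webm;codecs=opus" });
  // Small timeslice: each coded chunk is sent immediately, not batched.
  rec.ondataavailable = (e) => { if (e.data.size > 0) ws.send(e.data); };
  rec.start(20);
  return () => { rec.stop(); ws.close(); }; // caller invokes to end the stream
}
```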
You ultimately still need a jitter buffer large enough to absorb retransmissions. Otherwise you've got stuttering audio. And dynamically adjusting this jitter buffer is hard.
I'm not an expert. Can't we exploit the fact that LLMs don't need to receive audio as a continuous, uninterrupted stream? Couldn't we just send data and pipe it into the LLM with deduplication (if resending happens)?
You're absolutely correct. A jitter buffer is necessary for a human listener, but an LLM isn't aware of a time lapse, just like it isn't aware of the time since your last message in the conversation (unless the chat harness explicitly informs it).
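A minimal sketch of what that buys you on the ingest side, assuming sequence-numbered chunks and a hypothetical feedModel callback: duplicates from resends are discarded and gaps simply wait, with no playout clock at all.

```ts
interface AudioChunk { seq: number; data: Uint8Array }

// The LLM side doesn't need smooth playout, just complete in-order bytes,
// so a dedupe/reorder map replaces the jitter buffer.
function makeIngest(feedModel: (data: Uint8Array) => void) {
  const pending = new Map<number, Uint8Array>();
  let next = 0;
  return (chunk: AudioChunk) => {
    if (chunk.seq < next || pending.has(chunk.seq)) return; // duplicate resend
    pending.set(chunk.seq, chunk.data);
    while (pending.has(next)) { // flush any contiguous run
      feedModel(pending.get(next)!);
      pending.delete(next);
      next++;
    }
  };
}
```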
Audio -> ASR: no jitter buffer
TTS -> human: jitter buffer
> And dynamically adjusting this jitter buffer is hard
Unappreciated part of this entire conversation.
Human spoken conversation doesn’t really work like file buffering.
People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.
But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.
For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.
> People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.
LLMs are surprisingly good at this, too.
This entire blog post is based on assumptions that
1) WebRTC garbling is common
2) LLMs fall apart if there are any audio glitches
I would bet money that OpenAI explored and has statistics on both of those and how they impact the service. Far more than this blogger, who heaps snark upon snark to avoid having a realistic conversation about pros and cons.
I think this is mixing domains quite a bit;
If I'm talking to a friend or peer and I'm on a crappy link, we can probably work it out. If I'm calling my lawyer from prison with my "one call" I really want my lawyer to get my instructions clearly and correctly, ideally the first time without a lot of coaching.
Where on this scale does "person talking to LLM" fit?
I believe there's a ton of research into the Shannon limit and human speech. You can trivially observe how much redundancy there is by listening to a podcast at 1x, 1.2x, 1.5x, 2x, etc., and when you can't follow what's going on, you've found the "redundancy" built into that language. This number falls way off when you're listening to a person with an accent or when the recording is noisy or whatever.
You'll also find that your tolerance for lossy media is radically different based on latency and echos and jitter in the audio (which I believe is the point of the original "don't use webrtc" article...)
Finally, people may tolerate this, but the "phoneme to token" thinger may be less tolerant, and will certainly not be able to magic correct meaning from lost packets. And if the resulting exchange is extremely expensive or important (as in the lawyer and the "I'm in jail in Poughkeepsie; I need bail!" exchange), you really want to take the time to get it right, not make things guess.
Misunderstanding the user comes down to understanding how the user prompts and what type of responses the user gets in return. I'm wondering if the LLM could have an intermediate step that first reads what the user wrote, translates it to proper sentence structure, and then returns a truer value. I think the LLM is having an understanding issue because everyone has a unique signature in how they explain something. That signature operates like a personal language: most of us run through different scenarios to come up with a conclusion in our own signature/language. And since LLMs are gamed to get to the answer faster using fewer tokens, they probably pick the average high-level signature that can be used for multiple users.
> > …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
> This is the opposite of the feedback I get. Users want instant responses.
I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.
Deeply skeptical!
Oh, I can absolutely believe it. Humans are deeply irrational, especially about things that mess about in time frames too short for our conscious thought processes to kick in. Instant but confident sounding (and confident sounding because it's instant) will beat slower every time. You don't know which is correct until a long time after you've made a decision to trust it, or whether you like it.
> Instant but confident sounding (and confident sounding because it's instant) will beat slower every time.
Sure, but I am skeptical that users are actually saying "I prefer wrong answers over lag", which is what the post I responded to implied.
This is different from users saying "I prefer quick answers to laggy answers", which is what I presume they may have said.
To actually settle this, the feedback must answer the question "Do you want wrong answers quickly or correct answers with an added 0.2 second delay?" because, well, those are the only two options right now.
Dunno. Feels like stated vs revealed preferences to me. Of course everyone will _say_ they don't want the wrong answers, but I can totally see users getting annoyed at slow responses, thinking that the developers should've traded accuracy for quicker responses. (or not thinking that at all, just demanding quicker responses unconditionally)
No I think they are saying no one would say they want wrong answers. People say they want fast answers and they are implying they should also be correct.
> actually saying
Yeah, I don't think that's the form of the feedback here.
Deeply false dichotomy!
The blog post glosses over the details and implies that 200ms of latency would be a magic solution. They do admit that WebRTC already has provisions for up to 200ms, so I guess they're really implying that 400ms would be the happy-path case for their alternative buffering, which is starting to get into the range where users would probably be annoyed.
Have you tried having conversational speech over a link with almost half a second of delay? It’s bad. You have to work hard to establish a turn taking routine with the other party and do extra mental work to identify your slot to talk.
The other half of this problem requires acknowledging that LLMs are actually pretty decent at interpreting input with gaps. You can drop words or even letters from LLM input and still get surprisingly decent results back. This post acts like a dropped packet means your response is going to send the LLM off on a wrong response or something.
100% agree. Sounds like they're either asking the wrong questions, or quoting answers selectively to suit this argument.
I think as a user I have 2 modes: 1. Q&A mode where it's basically Google search by voice. 2. I'm trying to process an idea I have with an LLM buddy.
My desires are pretty different in the two scenarios. Q&A mode if it's not quick to respond I'll think something is wrong with my phone.
Deep think mode I'm honestly kind of pissed off at how fast it tries to respond. I want it to slow down and give me a chance to process and use extra compute on its side (including newer models) so it doesn't just spew low thought bullshit at me.
It seems like the system could detect which of these two modes was happening and adapt, including protocol.
I haven't tried the voice mode since the new model updates, maybe it's gotten better.
Counter to everything I just said, though, and germane to the topic at hand: when I'm in Q&A mode, that's probably the worst time for it to drop audio, as it changes the query significantly; whereas when I'm talking at it for 2 minutes it could probably throw half away.
Especially when 200ms is the rule of thumb for things still feeling "instant" to users in terms of UX, this is like a rounding error in terms of latency when I regularly wait for actual minutes for an LLM to finish its bloody thinking and have to refresh through several "we're experiencing heavy load" errors.
> I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.
You are skeptical that people would prefer instant responses with 99.99% accuracy to waiting noticeably longer for a higher-accuracy rate?
The Internet, over its entire history, suggests otherwise.
Who claimed 99.99% accuracy?
A single dropped or missed word in a sentence can reverse the meaning.
I am skeptical that people would rather have wrong answers than lag. I am not claiming what the percentage is and neither are you, because no one measured it at the low lag.
I would be punching my phone if the stupid network caused a wrong prompt and the LLM sent me unrelated answers. Correctness should be foundational no matter what; then improve the latency as best as possible. We all understand that if the network is bad then the latency cannot be guaranteed, but correctness should be.
> You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)
You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.
> You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily.
WebRTC is complex, even if it's a library (even if it's a library built into the browser they're already using). For a client/server voice interaction, I don't see why you would willingly use it. Ship voice samples over something else; maybe borrow some jitter buffer logic for playback.
My job currently involves voice and video conferencing and 1:1 calls, and WebRTC is so much complexity... it got our product going quickly, but when it does unreasonable things, it's a challenge to fix it; even though we fork it for our clients.
I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.
[1] TURN should allocate a rendezvous ID rather than an ephemeral port when the TURN client requests an allocation. Then their peer would connect to the TURN server on the service port and request a connection to the rendezvous ID, without needing the client to know the peer address and add a permission. It would require less communication to get to an end-to-end relayed connection. Advanced clusters could encode stuff in the ID so the client and peer could each contact a TURN server local to them and the servers could hook things up; less advanced clusters would need to share the TURN server IP and service port(s) with the ID.
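To make that idea concrete, here is a rough sketch of the message shapes the proposal seems to imply. To be clear, this is not real TURN (RFC 8656); every type and function name below is invented purely for illustration.

```typescript
// Hypothetical message shapes for the rendezvous-style relay proposed above.
// This is NOT the actual TURN protocol; all names here are made up.

interface AllocateRequest {
  kind: "allocate";
}

interface AllocateResponse {
  kind: "allocated";
  rendezvousId: string; // opaque ID instead of an ephemeral relay port
}

interface ConnectRequest {
  kind: "connect";
  rendezvousId: string; // peer presents the ID on the shared service port
}

// Client side: one round trip yields an ID it can hand to the peer over
// whatever signalling channel is already in use.
async function allocate(
  send: (m: AllocateRequest) => Promise<AllocateResponse>
): Promise<string> {
  const res = await send({ kind: "allocate" });
  return res.rendezvousId; // share this instead of ip:port plus a permission
}

// Peer side: connect to the well-known service port and name the ID; the
// relay splices the two flows together, so no CreatePermission step exists.
function connectToPeer(
  sendToServicePort: (m: ConnectRequest) => void,
  id: string
): void {
  sendToServicePort({ kind: "connect", rendezvousId: id });
}
```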
> I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.
This is closer to being the real problem with WebRTC than the whole "it's making decisions about latency that I disagree with".
If you had a way to setup the tracks/channels over UDP connections that didn't involve P2P/STUN/TURN etc. but got to keep all the codec negotiation and things like AEC that would be awesome. MoQ isn't that though, because it's by people that don't actually see the whole problem end-to-end; just their little piece of it in the middle.
Delivery of first phoneme and delivery of the important information don't have to be coupled. Politicians on TV get very good at this particular trick, they've got a set of stock phrases which basically fill time while their brain gets in gear. We just need something to fill the gap so our System 1 doesn't lose confidence in the interaction.
So you could just locally generate the "You're absolutely right! ..." prefix without even waiting for the response to stream in!
Do speech to text on the client and send the text/subtitles along with the audio.
If the connection is truly bad, upload your voice and quantify emotional payload.
I guess different approaches could be applicable for client to server vs server to client.
For client to server you want low latency, don't care about pauses introduced by communications (the model doesn't care), and could certainly tolerate a fallback to lower-bandwidth text only (local STT) or more heavily compressed voice.
For server to client it needs to be high quality voice without pauses, but as the parent was suggesting you could potentially hide response latency (whether due to server or communication degradation) by using a human-like conversational "trick" of at least making some sound before brain is engaged and generating a response. "That's absolutely right! ..." would be a tad annoying, but "Hmm..." might be OK, especially if not done all the time, just as a locally initiated conversational filler when the server is slow to respond.
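A minimal sketch of that locally initiated filler, assuming a hypothetical prerecorded clip and a made-up 600ms threshold:

```typescript
// Play a local filler sound only when the server is slow to respond.
// The clip URL and FILLER_DELAY_MS are assumptions; tune to taste.
const FILLER_DELAY_MS = 600;
const filler = new Audio("/audio/hmm.ogg"); // hypothetical "Hmm..." clip

let fillerTimer: number | undefined;

function onUserTurnEnded(): void {
  fillerTimer = window.setTimeout(() => {
    // Don't do it every time, or it starts to grate.
    if (Math.random() < 0.5) void filler.play();
  }, FILLER_DELAY_MS);
}

function onFirstServerAudioChunk(): void {
  clearTimeout(fillerTimer); // response arrived in time, so no filler needed
  filler.pause();
  filler.currentTime = 0;
}
```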
HarHar, that makes me think of those people who start each sentence with your name.
:) I guess that'd work too if they want to go with the Butt-Head persona!
I do wonder if you actually need two models here. Audio-to-audio hindbrain on the client, and a beefy text-mode frontal lobe somewhere in the cloud, with the comms between them explicitly trained in as a potentially low-bandwidth steering connection transferring embeddings, not text.
FWIW, the getUserMedia() portion of such a setup remains the same, so you don't lose AEC or anything else coupled there.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.
Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.
It is very important, the low latency.
I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.
Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.
I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference audio stt transcripts.
I have had times where the ~sandbox of those conversations, and their far more limited ability to build a useful corpus of context via web searches or by accessing prior conversation content, has been a limitation.
The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)
The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.
It was unable to explain why it was acting this way. I suspect the low latency did a disservice there, because when it was paired with something adversarial it was deeply troubling.
> when I'm out and about I want to have a conversation with the Star Trek computer.
But you’re not. And you won’t. You’ll never have a conversation with the Star Trek computer while you continue to place anything else above accuracy. Every time I see someone comparing LLMs to the Star Trek computers, it seems to be someone who doesn’t understand that correctness was their most important feature. I’m starting to get the feeling people making that comparison never actually watched or understood Star Trek.
A computer which gives you constant bullshit is something only the lowest of the Ferengi would try to sell.
> This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
It’s not. It absolutely is not and will never be. Not unless all you’re looking for is affirmation, companionship, titillation. I suggest looking for that outside chat bots.
> This is the opposite of the feedback I get. Users want instant responses.
Did they really say they prefer a fast response over an accurate response?
This is assuming the LLM can produce an accurate response.
Unlikely if the task gets inaccurately transmitted.
I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and then kept it open, sending basically HTTP2/SPDY/something like it over the wire after it detected the wake word. This allowed the STT to start processing before you finished talking, so there is only a small delay in processing the last few chunks of your utterance.
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP2 from the phone and both iOS and Android will pretty much take care of that connection magically.
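The browser-side version of that pattern is simple enough to sketch. A minimal, hedged example, assuming a hypothetical WebSocket endpoint (an HTTP/2 stream upload would serve the same role):

```typescript
// Stream mic audio in small chunks so server-side STT can start before the
// user stops talking. The endpoint URL is a made-up placeholder.
async function streamUtterance(wsUrl: string): Promise<() => void> {
  const ws = new WebSocket(wsUrl);
  await new Promise((resolve) => (ws.onopen = resolve));

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const rec = new MediaRecorder(mic, { mimeType: "audio/webm;codecs=opus" });

  // Flush a chunk every 100ms; by the time the user finishes, only the
  // tail of the utterance is left to transcribe.
  rec.ondataavailable = (e) => {
    if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) ws.send(e.data);
  };
  rec.start(100);

  // Returns a stop function for when the utterance ends.
  return () => {
    rec.stop();
    mic.getTracks().forEach((t) => t.stop());
    ws.close();
  };
}
```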
The author is absolutely right, a real-time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
> The user won't even notice a delay until you get over 500ms
I think a lot of comments are getting so laser focused on the transport delays that they’re forgetting that the LLM pipeline isn’t instant.
The transport delays are additive on top of all of the other delays, which are already high.
Which I assume is why they reached for the lowest latency solution they could, because they need every bit of help they can get to start shrinking that end to end delay across the entire pipeline.
Analogies to human voice delay don’t work because in that case we treat the human as having no delay.
And that was the entire point of my comment. That your transport layer isn't your bottleneck. You can start processing before they finish speaking. Your bottleneck will always be what happens after that.
"The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms"
Not my experience, running around 6,000 conversations per day with voice, with webrtc + cascading (stt/llm/tts) architecture.
Maybe I misunderstood your comment, but that 500ms is basically the floor of a state-of-the-art voice implementation these days, if you are lucky, don't skimp, and do various expensive things like speculative decoding and reasoning: 450ms on the LLM pass alone. Every ms counts in commercial applications of voice AI. If you add 200ms or 300ms to that, it really degrades the conversation.
We do a lot of voice stuff to support our business, largely with unsophisticated, non-technical users. Last year's attempts, with measured turn-to-turn latencies of around 1200ms-1500ms, led to a lot of user confusion, interruptions, abandoned conversations and generally very unpleasant experiences. We are at around 700ms turn-to-turn now, depending on tool usage needed, and it's approaching an OK experience, rivalling an interaction with an actual human. We are spending quite a lot to shave another 100ms off that. We do expensive, wasteful things such as speculative LLM passes and speculative tool executions (do a few LLM inferences as the user speaks, but don't actually execute non-idempotent tool calls before you know that that LLM pass is usable and the user did not say anything important at the tail end of their sentence) just to shave 100-200ms. When someone says 500ms is irrelevant, I am sure they are describing some other use case, not human-to-AI voice interactions.
In my experience with voice AI, the problem is not the occasional dropped WebRTC packet. The real hard problems are strong background noise, echo, and of course accents. WebRTC with its polished AEC implementations helps quite a lot, at least with echo. I get that the protocol is a major PITA to implement at OpenAI scale, but for anything but hyperscale applications there are lots of good, viable solutions and commercial providers (say, Daily for instance) that make it a non-problem. The real problems to solve are still elsewhere. But boy, add 500ms to my latency budget and you've killed my application.
I agree with everything you've said, I must have written it wrong.
What I was saying is the same as you -- the user will tolerate a total delay of 500ms, and then happiness starts to fall off. We had some Alexa utterances at 500ms, the most basic ones, but most took longer.
However, even with HTTP2 and the like, we could get into that range because we were sending data right away, so we were mostly done processing the STT by the time they were done speaking, and we were already working on the answer based on the first part of the utterance.
But I would need to see some really strong evidence to even think about using WebRTC.
Sorry, I misunderstood your comment.
As for webrtc - it was mainly for decent support in browsers and built in AEC. I think we will take another look at this design choice if we run out of ways to further optimize.
I am myself working on something similar, but I have noticed that if I try to pass early speech from the user to the LLM to reduce latency, the chances of interruptions get even higher. For example, the user may say something like “Yes” followed by a brief pause, leading the speech model to count that as a complete turn and trigger the LLM call. But then the user may add something more, so I have to cancel the previous request so that any irreversible state transitions can be avoided. And with the lower latency from the speculative calls, I get an even smaller window to actually cancel the response or stop the model from speaking.
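A minimal sketch of that cancellation window, assuming a hypothetical /respond endpoint; the pattern itself is just an AbortController superseding stale speculative passes:

```typescript
// Cancel a speculative LLM pass when the user turns out not to be done.
// The /respond endpoint is a made-up placeholder.
let inflight: AbortController | undefined;

async function speculativeRespond(transcriptSoFar: string) {
  inflight?.abort(); // supersede any earlier speculative pass
  inflight = new AbortController();
  try {
    const res = await fetch("/respond", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ text: transcriptSoFar }),
      signal: inflight.signal,
    });
    return await res.json();
  } catch (err) {
    if ((err as Error).name === "AbortError") return undefined; // superseded
    throw err;
  }
}

// Hook this to VAD: the "Yes" plus a breath was not actually end of turn.
function onUserResumedSpeaking(): void {
  inflight?.abort(); // cancelling the request is the easy part; any
  // non-idempotent tool calls must not have run yet for this to be safe
}
```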
Detecting end of turn is a whole other issue. You can do the easy thing, which is just assign some number of milliseconds of silence as the end, or you can spend a lot of money asking the model to figure it out based on context.
Humans actually do the second thing, where we not only use our "model" to figure out end of turn, we actually predict what they are going to say based on context and will sometimes answer before they even finish.
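The "easy thing" is only a few lines with Web Audio. A rough sketch; the 0.01 RMS floor and 700ms window are made-up numbers that would need per-microphone tuning:

```typescript
// Naive end-of-turn detection: N ms of low energy counts as end of turn.
async function watchForEndOfTurn(onEndOfTurn: () => void, silenceMs = 700) {
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(mic).connect(analyser);

  const buf = new Float32Array(analyser.fftSize);
  let silentSince: number | undefined;

  const tick = () => {
    analyser.getFloatTimeDomainData(buf);
    const rms = Math.sqrt(buf.reduce((s, x) => s + x * x, 0) / buf.length);
    if (rms < 0.01) {
      silentSince ??= performance.now();
      if (performance.now() - silentSince > silenceMs) return onEndOfTurn();
    } else {
      silentSince = undefined; // speech resumed, so reset the clock
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```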
This is pretty insightful, thank you. Which provider are you guys using? Is it also over the phone or fully web/app based? Do you have any resources you can point me to for learning about this?
We use a bunch; at the moment we mainly self-host (and use Pipecat), use Daily, and a few niche boutique suppliers who built things for us.
There is a great resource for learning this stuff - the CEO of Daily, Kwindla Kramer, hosted a series of 1hr sessions on low latency voice ai. Here:
https://youtube.com/playlist?list=PLzU2zoMTQIHjMPZ-OnpC3ozZs...
Some of this is a bit outdated but most of it is very valuable.
Kwindla posts a lot of extremely useful stuff on x and linkedin, incl. working, easily replicable sub 500ms setups.
Beautiful, thanks. We are also looking at this, and another complication is that transcripts can get pretty messy: updates, corrections, etc.
> But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.
I felt that comment in my bones. Why would anyone possibly have the need to know the actual presentation timestamp and how that corresponds to actual realtime? Evidently, no one working on WebRTC has had to synchronise data streams from varying sources before with millisecond accuracy.
I was doing a demo of video stabilisation using a webcam and IMU module in the browser. It turns out the latencies between video->rtc->browser and sensor->websocket->browser are wildly different and not constant. The obvious solution would be to send UTC timestamps for the sensor data and synchronise in the browser. Not possible: the video has no UTC timestamp reference. When you have control of both sides of the WebRTC pipe, you can do fun things like send the UTC timestamp of the start of the stream, but this won't solve browser jitter. It worked well enough for a POC but the entire solution had to be reengineered.
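For reference, that stream-start workaround is roughly the following; as noted above, it only anchors the start of the stream and does nothing about per-frame jitter. The message shape is invented for illustration:

```typescript
// Sender: share the UTC epoch of capture start over the data channel.
function shareStreamEpoch(dc: RTCDataChannel, captureStart = Date.now()) {
  dc.onopen = () =>
    dc.send(JSON.stringify({ type: "epoch", t0: captureStart }));
}

// Receiver: map a sensor's UTC stamp onto seconds since the video began.
function makeAligner(dc: RTCDataChannel): (sensorUtcMs: number) => number {
  let t0 = 0;
  dc.onmessage = (e) => {
    const msg = JSON.parse(e.data);
    if (msg.type === "epoch") t0 = msg.t0;
  };
  return (sensorUtcMs) => (sensorUtcMs - t0) / 1000;
}
```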
This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ice-candidates, offers, peer-to-peer protocols, and the complex handshake that is implemented from scratch each time. I can't imagine re-writing the whole trenchcoat of protocols and unintended "best-practices".
Have you attempted to use the Microsoft Graph API to interact with email?
It's way better than the old powershell modules imo. What don't you like?
Ugh. Who's decided to Graph all the things.
The first time I was able to get a working WebRTC data channel setup with aiortc was when LLMs became a thing; before that it was pretty much impossible, full stop. Nobody knows what or how, there are no examples. It's a horrible protocol that just needs to die.
What platforms were you targeting that you found it painful? Sorry it was frustrating.
I hope it’s getting better with education/more libraries. It’s also amazing how easily Codex etc. can burn through it now
i like livekit for this reason and their ceo is cool
This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
You might have noticed that the author started the blog post explaining themselves:
I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?
Yes, agreed. I also found it abundantly obvious that they have proven their expert's worth on this subject matter. Many times, over and over.
But ChatGPT said …
Right, but they also state they have never implemented TURN, which IMO is a marker of WebRTC expertise. (I haven't btw, it's just that the WebRTC experts I know have absolutely all written or worked on a TURN implementation at some point)
It's not that strange. TURN has two main use cases: peer-to-peer when no viable direct path can be found and working around very strict firewalls. Based on the author's experience the first isn't relevant and the second isn't much of a concern for Twitch and Discord. For the latter case HTTP/3 is helping make TURN unnecessary because you can, as the author observes, run UDP over port 443.
> This is frustratingly one-sided writing
Tangential, but by being that, it's also refreshingly human writing, vs the both-sidesy bullet listed AI pablum that's all around us these days.
I have zero take on the subject matter, but I like that the article had a detectably human flair.
And if it was AI written, god help us.
“How hard can it be?” the strawman asked.
It’s 2026 and teleconferencing is still such a shit show. There’s billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I’ve never not seen teleconferencing be a ham handed mess.
Facetime does alright in the consumer segment.
The most frustrating thing about FaceTime is it sometimes appears to significantly duck audio in order to avoid echoes. I can't predict on which devices it will happen, but it often does when I call my parents and it absolutely destroys the conversation. If they're telling me something and I make the slightest "uhuh" acknowledgment sound, their mic input gets effectively muted for a second or so and I miss what they say.
QUIC is also a standard.
> WebRTC is designed to degrade and drop my prompt during poor network conditions
If you want real time, that's what you are going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Hello Mr Author here. Apologies that my comment replies aren't as funny.
Every low-latency application has to decide the user experience trade-off between quality and latency. Congestion causes queuing (aka latency) and to avoid that, something needs to be skipped (lower quality).
The WebRTC latency vs. quality knob is fixed. It's great at minimizing latency, but suffers from a lack of flexibility. We still (try to) use WebRTC anyway, because like you implied, browser support has made it one of the only options.
Until now of course! WebTransport means you can achieve WebRTC-like behavior via a generic protocol. Choose how long you want to wait before dropping/resetting a stream, instead of that decision being made for you.
And yeah my point in the blog is that often the user wants streaming, but not dropping. Obviously you can stream audio input/output without WebRTC. The application should be able to decide when audio packets are lost forever... is it 50ms or 500ms or 5000ms? My argument is that voice AI shouldn't pick the 50ms option.
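A minimal sketch of what "choose your own deadline" can look like over WebTransport; the URL is a placeholder, the 500ms budget is arbitrary, and one stream per chunk is a simplification:

```typescript
// Send each audio chunk on its own WebTransport stream; if it hasn't
// flushed within maxWaitMs, reset the stream and move on. The drop
// threshold is the application's decision, not the protocol's.
async function makeDeadlineSender(
  url = "https://relay.example/voice", // placeholder endpoint
  maxWaitMs = 500
) {
  const wt = new WebTransport(url);
  await wt.ready;

  return async (chunk: Uint8Array): Promise<void> => {
    const stream = await wt.createUnidirectionalStream();
    const writer = stream.getWriter();
    const timer = setTimeout(() => void writer.abort(), maxWaitMs); // RESET_STREAM
    try {
      await writer.write(chunk);
      await writer.close(); // delivered within budget
    } catch {
      // Deadline hit under congestion: the chunk is dropped at our
      // threshold, by our choice, rather than at WebRTC's fixed one.
    } finally {
      clearTimeout(timer);
    }
  };
}
```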
Isn't the jitterBufferTarget [0] the latency vs. quality knob?
[0] https://developer.mozilla.org/en-US/docs/Web/API/RTCRtpRecei...
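For reference, turning that knob looks roughly like this; jitterBufferTarget is a per-receiver hint in milliseconds, and browser support varies, hence the cast:

```typescript
// Ask each receiver to hold roughly `ms` of audio/video in its jitter buffer.
function setJitterTarget(pc: RTCPeerConnection, ms: number): void {
  for (const receiver of pc.getReceivers()) {
    // Not yet in all TypeScript DOM lib versions, so cast for the sketch.
    (receiver as any).jitterBufferTarget = ms;
  }
}
```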
Close, but that's a minimum latency. We want a maximum latency knob.
> You want real time
Isn’t the point that OpenAI’s use case does not require realtime?
When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.
That is not the case. See gpt-realtime-translate [0]; that's doing it as a trickle instead (not turn-based).
[0] https://developers.openai.com/api/docs/models/gpt-realtime-t...
Yep. Maybe there's some additional configuration I'm missing to mitigate the delay but clients don't seem to want to deal with the delay with STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".
>Yep. Maybe there's some [dropped] issues if the conversation feels "real".
Can you repeat that please? It didn't make any sense. This conversation doesn't feel "real".
I run the Gemini Live API over a mesh-hosted managed WebRTC cloud. Works fantastic, and I've been running it for two years. You can try WebSockets, handle ephemeral keys, etc., but when you speak with people running voice agents at scale in this space, many of the issues are solved with WebRTC and Pipecat and the many resources allocated to already-solved problems in this space. It certainly feels like overkill, and it probably is, but once the connection is established, it's pretty magical. The startup time and buffering have been solved for quicker voice connections too: https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)
There are tons of ways to fine-tune WebRTC so that it won't corrupt audio on a poor network; it has all the controls to smoothly trade off latency vs. quality. Not just NACKs: FEC, disabling PLC/acceleration/deceleration, a larger jitter buffer (tons of parameters), etc.
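One of those knobs, sketched: Opus in-band FEC can be enabled by munging the SDP before calling setLocalDescription. The usual munging caveats apply, and this is a sketch rather than production-grade SDP parsing:

```typescript
// Add useinbandfec=1 to the Opus fmtp line so lost packets can be partially
// reconstructed from redundant data carried in later ones.
function enableOpusFec(sdp: string): string {
  const m = sdp.match(/a=rtpmap:(\d+) opus\/48000/);
  if (!m) return sdp; // no Opus in this SDP
  const pt = m[1];
  return sdp.replace(new RegExp(`a=fmtp:${pt} (.*)`), (line, params) =>
    params.includes("useinbandfec")
      ? line
      : `a=fmtp:${pt} ${params};useinbandfec=1`
  );
}
```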
Most of the glitches I heard with OpenAI's Voice were not WebRTC related - but rather, to my ear, they sounded more like realtime issues with their inference - which is a very different component to optimize.
If you're just doing STT and TTS why would you not do that locally and steam text?
Because local STT and TTS is not good enough and LLMs understand it much better?
I've been using LiveKit, which is also WebRTC based, and it is super annoying when playback slows down or speeds up at times when the connection is not robust. We were using OpenAI's websocket-based Realtime audio, which was way too slow. So I don't know which one is better. Generally our users like the LiveKit implementation better, so maybe WebRTC with enough clever hacks is the answer.
This blog was super insightful for me to understand what are the root problems in the current implementation though.
there are a lot of extremely smart people who have come back to WebRTC time and time again because it continues to solve problems other methods and protocols can't. with that said, QUIC is certainly interesting going forward, but I primarily stream voice + vision at 1fps so it just makes sense, and websockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/). also, just listen to Sean in this thread, dude knows what's up.
Amazing read. Blog posts rarely keep my attention like this one.
Refreshing to read something not in the voice of an llm.
Why does the voice need to be sent to the server? Why not perform speech-to-text on-device? Is the p10 phone/laptop not capable of this yet, despite every "dictation" feature I see in every modern OS?
An eventual goal is likely to allow interacting with the LLM directly via audio tokens in input/output skipping tts and stt completely.
Excellent writeup. I wish we had awards for blog posts when the person is a domain expert in the post's subject.
Browser API reliability in general has a lot of undocumented edge cases — WebRTC isn't alone there.
I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating part. Simply backgrounding the app (and the web version too) and resuming it causes the response or the conversation with an assigned ID to disappear. Haha.
I believe Gemini is Websockets? I have the same experience with heavy/custom applications that try to roll their own media stuff.
You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(
I didn't understand - why is WebRTC good for Google Meet and not good for all other conferencing apps?
Most of the problems happen because we want to simulate human conversations. While that's a good goal to have, another approach is to let the user know clearly that they are talking to a bot. You would be surprised at how accommodating users can be when they know they are talking to a bot and want their queries resolved.
Nice fun article. Gives me Why The Lucky Stiff vibes.
My biggest frustration with WebRTC was precisely captured in the article: even if you don't need P2P and your video source is a process on the same host as your browser, you have to dance around connection setup like you're on the other side of the planet.
Why worry about OpenAI? Their product will fail if it doesn't work. Then they will figure it all out later.
>> ... I say hi to <strike>Scarlett Johansson</strike>
Had a nice chuckle.
Exactly what I thought when I read the original article, though to be fair WebTransport is barely now entering the mainstream with Safari shipping support this year.
I remember using webrtc data channel for p2p video. Browser to browser UDP is neat :) fun memories. Thank you for the read
this misses a few key things but hits on many others
webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech, there is no TTS in voice mode
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further
ultimately though, it comes down to
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced
to be fair, the new models were released the day after this MoQ blog was published
> OpenAI is speech-to-speech, there is no TTS in voice mode
Which results in the interesting situation where the transcript isn't what was said:
Q: Why do the voice transcripts sometimes not match the conversation I had?
A: Voice conversations are inherently multimodal, allowing for direct audio exchange between you and the model. As a result, when this audio is transcribed, the transcription might not always align perfectly with the original conversation.
"WebRTC is the problem" is bait; his real claim is "WebRTC has annoying transport-layer characteristics that hurt cloud Voice AI scaling"...
Having just had to tackle this again for my own startup, I'm reminded about what you would lose by ditching WebRTC - the audio DSP pipeline, transmit side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity etc.
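As a sibling comment notes, much of that DSP hangs off getUserMedia rather than RTCPeerConnection itself, so some of it survives a transport swap. A minimal sketch of requesting it explicitly:

```typescript
// The browser's echo cancellation, noise suppression and AGC are requested
// via getUserMedia constraints, independent of whether the transport is
// RTCPeerConnection, WebSocket, or WebTransport.
async function getProcessedMic(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true, // the AEC you'd otherwise worry about losing
      noiseSuppression: true,
      autoGainControl: true,
    },
  });
}
```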
You don't need NAT traversal when talking to a cloud service.
We have browser-based HW used inside a construction site manager’s site office behind a random FW
Just use UDP
interesting read, albeit over my head, but i spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2, and while gpt is super good and seemingly more robust, Gemini connects faster.
Just give me mpegts in <video> element, I'm dying.
Probably because WebTransport is the lesser known alternative to WebRTC.
WebTransport requires some specific server setup.
Cloudflare doesn't support WebTransport well.
I've long had the feeling that WebRTC was intentionally over-engineered. Over-engineered and poorly documented.
IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.
This is interesting. Does niche knowledge in this area command $1mn salary?
It can, in general knowing how to shuffle packets according to RFCs is a pretty decent gig. Pretty much every hyperscaler ends up building various LBs and the learning curve is too steep to just toss randos at it unsupervised, but at the same time it's not necessarily inventing anything new most of the time.
> “Here’s a million dollars to implement WebRTC for the fourth time”
“Hell no”
> “Umm…”
How is OpenAI Voice mode any different than a Whatsapp call? Ignoring the part that there is a GPU on the other side instead of a human. But what is the technical challenge in the voice call portion? It seems like that has been a solved problem for a long time now.
Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.
IPv4 support is necessary, but IPv6 isn't
How would ipv6 handle it
You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.
It's not really relevant in this case since one endpoint is a massive server farm.
It is, because most of their complexity is in routing packets. With IPv6 you can just have the thing handling the conversation directly addressable by the client. The last 64 bits of a v6 address let you have billions of instances in a region.
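A toy sketch of that point, using an RFC 3849 documentation prefix; the addressing scheme itself is hypothetical:

```typescript
// Pack a 64-bit instance ID into the interface-identifier half of an IPv6
// address under a per-region /64. Purely illustrative.
function instanceAddress(regionPrefix: string, instanceId: bigint): string {
  const groups: string[] = [];
  for (let shift = 48n; shift >= 0n; shift -= 16n) {
    groups.push(((instanceId >> shift) & 0xffffn).toString(16));
  }
  return `${regionPrefix}:${groups.join(":")}`;
}

// instanceAddress("2001:db8:aaaa:1", 12345678901n)
//   -> "2001:db8:aaaa:1:0:2:dfdc:1c35"
```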