Streaming AI agent desktops with gaming protocols

(blog.helix.ml)

83 points | by quesobob 12 days ago

51 comments

  • vladgur 2 days ago

    I’m curious how far we are from giving coding agents access to these desktop agents, so that when we’re using, say, Claude Code to build a native desktop app, the coding agent can actually see and act on the desktop UI it is building.

    • ErikBjare 2 hours ago

      I did that a year ago; I imagine it would work better today.

    • talking_penguin 21 hours ago

      The streaming architecture is designed for exactly this - we originally built it for autonomous agents that need persistent development environments. The missing pieces are mostly integration work (mapping Claude's tool use format to our desktop APIs). Would be very interested to hear if others are working on similar integrations - the combination of LLM coding agents + real desktop environments feels like it unlocks a lot of interesting workflows.
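
      For a sense of what that integration work could look like, here is a minimal sketch of mapping an Anthropic-style computer-use tool call onto a desktop session. Everything here (the action names, the `desktop` object's methods) is an illustrative assumption, not Helix's actual API:

```python
# Hypothetical sketch: translate an Anthropic-style computer-use tool
# call into desktop input events. The dispatch table, the action names,
# and the `desktop` object's methods are illustrative assumptions.
def dispatch(tool_use: dict, desktop):
    """Map one tool_use block onto a desktop session object."""
    args = tool_use["input"]
    action = args["action"]
    if action == "screenshot":
        return desktop.screenshot()      # image goes back to the model
    if action == "left_click":
        x, y = args["coordinate"]
        desktop.click(x, y)
    elif action == "type":
        desktop.type_text(args["text"])
    elif action == "key":
        desktop.press(args["text"])      # e.g. "ctrl+s"
    else:
        raise ValueError(f"unsupported action: {action}")
    return "ok"
```

      The interesting work is in the long tail: scroll, drag, modifier chords, and deciding when to send a fresh screenshot back to the model.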

    • drphilwinder 2 days ago

      This is a great point. Not that far. We also snapshot the desktop for "slow" non-streaming updates to the UI. We could push these into Claude itself to act on or describe or whatever.

    • jsight 2 days ago

      For web apps, I'd guess that many of us already do that via Playwright or other MCPs. I'd bet there are people doing something similar with desktop apps too.

    • lewq 2 days ago

      That's the next move :-D

  • theelix 2 days ago

    Moonlight-Web? I guess it's https://github.com/MrCreativ3001/moonlight-web-stream but there's no information in the article

    > Moonlight expects: Each client connects to start their own private game session

    Nope, it's a Wolf design choice; e.g. Sunshine allows users to concurrently connect to the same instance/game.

    • lewq 2 days ago

      Wolf now supports multiple clients connecting to the same session via the wolf-ui branch that landed recently. After lots of stability work, we are now running that mode in production (and in the latest release): https://github.com/helixml/helix/releases/tag/2.5.3

  • lewq 2 days ago

    Author of Helix Code here. Here's a demo of the full system working: https://youtu.be/vVmnpcnLDGM?si=b6LxW6lmM7843LY0

    We're opening the private beta, where we provide a hosted environment for testing; or you can install the latest Helix release and run the installer with --code to try it on your own GPUs.

  • _pdp_ 2 days ago

    IMHO, the goal is not to have to watch what agents do, but to let them do the work.

    I would personally invest in making agents more autonomous (yes, a hard problem today) rather than in building a desktop video session protocol to watch them do the work.

    • majormajor 2 days ago

      > I would personally invest in making agents more autonomous (yes, a hard problem today) rather than in building a desktop video session protocol to watch them do the work.

      Seems difficult to research better autonomy without extensive monitoring. You need specific data on before/after effects of changes, for instance.

      • talking_penguin 21 hours ago

        You're right. Monitoring is critical for this. You need to know exactly what changed and why.

        This is actually where streaming, plus the architecture behind it, becomes valuable beyond just "letting users watch." In Helix, every agent interaction is persisted with timing data (DurationMs), tool calls, error states, and even user feedback (thumbs up/down). Sessions track the full conversation history bidirectionally, so you can reconstruct exactly what the agent saw and did at each step.
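
        As a rough illustration of what such a record might look like (DurationMs is the only field name taken from this thread; everything else below is a guess, not Helix's real schema):

```python
# Hypothetical per-interaction record. Only "DurationMs" comes from the
# discussion above; the other field names are illustrative guesses.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Interaction:
    step: int                        # ordering within the session
    prompt: str
    response: str
    duration_ms: int                 # "DurationMs"
    tool_calls: list = field(default_factory=list)
    error: Optional[str] = None
    feedback: Optional[bool] = None  # thumbs up / thumbs down

def replay(interactions):
    """Reconstruct what the agent saw and did, in order."""
    return sorted(interactions, key=lambda i: i.step)

def failed_steps(interactions):
    """Pull out errored steps, for before/after comparisons."""
    return [i for i in interactions if i.error is not None]
```

        With records like these you can diff agent behaviour across prompt or model changes, which is exactly the before/after data the parent comment asks for.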

  • asmor 2 days ago

    > because we’re all going to become managers of coding agents whether we like it or not

    I will join the woodworking people before that happens, thanks.

    • DrewADesign 2 days ago

      A career change left me as a recent graduate in a decimated marketplace, missing the bottom ten rungs on the ladder, with no interest in getting back into the software world; it led me to advanced manufacturing as a metal worker. I code a little; I move heavy steel pieces periodically, which is a nice way to break up the standing/sitting (though not nearly as much as a general laborer); I solve lots of problems, keep my trigonometry muscles toned, am forced to take breaks, and get paid for my overtime. There’s a union that the company ownership is totally willing to work with, and when I’m not at work, work isn’t with me. There’s something very satisfying about leaving work with exercised muscles, smelling slightly of cutting oil. The money sucks comparatively so early in my career, but the rate increases more for performance than seniority, so it’s rising quickly; the benefits are good, the career trajectory is pointing upwards, and longevity-wise it’s certainly a whole lot better than gig work.

      There’s a huge crisis in US manufacturing: we’re bleeding craft knowledge because off-shoring let companies hire existing experienced workers for decades, so they never had to train a new generation of tradespeople. Now all those folks are dying and retiring and they need people to pick up that deep knowledge quickly. Codifying and automating is going to kill jobs either way, but one factory employing a few people making things for other factories with local materials is better than everything perpetually shifting to the cheap labor market du jour. I’m feeling much more optimistic about the future of this than the future of tech careers.

      I think over the next few years, a very large percentage of folks in tech will find themselves on the other side of the fence, quickly realize that their existing expertise doesn’t qualify them for any other white collar jobs where vibe coding experience is a bullet point in the pluses section, that tech consulting is declining even faster than salaried jobs, and that they’re vastly less qualified than the competition for blue collar jobs. Gonna be a rough road for a lot of folks. I wouldn’t invest in SF real estate any time soon.

      • seemaze 2 days ago

        There’s 1,000 established industries that don’t offer the rapid growth and pay outs of the modern tech ecosystem. I’m excited to see some of the current industrial backwaters soak up technical talent freed up by the SV AI brain drain.

        To think we’ve handsomely paid our best and brightest the last few decades in pursuit of.. advertising?

        • majormajor 2 days ago

          > To think we’ve handsomely paid our best and brightest the last few decades in pursuit of.. advertising?

          I think "efficiency" is more accurate there. Even post-Google/ad-tech-boom the overall trends that started decades earlier continued to be: (1) faster turnaround time on communications, (2) faster delivery of result artifacts, (3) faster knowledge of changes in the market and faster response.

          Advertising is a particularly visible field with lots of money to throw at those things (active investment trading is another). But practically every other industry has chased those same things as well, all the way down to things like parking meters.

          Personally I'm not convinced that this is such a great thing anyway - does anyone enjoy their boss messaging them at 11PM whenever the fancy strikes? - but that's the larger reason so much brainpower has been invested into it.

        • DrewADesign 2 days ago

          Hopefully the exodus from the tech industry won't kill demand in too many job markets that are close, comfortable cousins to the tech world.

        • tomnipotent 2 days ago

          > don’t offer the rapid growth

          How are these industries going to absorb new headcount without the revenue to support it?

      • R_D_Olivaw a day ago

        I'm curious, hoping you might shed some insight.

        I'm not in tech per se, but in education with a seasoning of tech.

        Would you have any insight into what the field for education or training might look like for your new field?

        I think (perhaps feel is better) that manufacturing will make some sort of comeback and that the gratification from it is actually quite beneficial for the human condition.

        So, I'm interested in trying to glean what I might pivot into from the perspective of training and education.

        • DrewADesign 21 hours ago

          I don’t know exactly what educators’ roles are, but there are lots of training programs out there, sponsored either by government job placement programs or industry associations. In Chicago, JARC is one example. Also, a lot of the larger companies have big training programs to onboard inexperienced workers into skilled trades. Most of the classes themselves seem to be taught by people in those trades rather than teachers, but surely that isn’t always the case, and the curriculum development probably involves professional educators. Some of the things are pretty techy (automating industrial robots, for example), but I don’t have any direct knowledge of that.

    • danielbln 2 days ago

      You will always be able to produce artisanal hand-set code, same as how artisanal woodworking exists alongside industrial manufacturing. There will be a lot less demand for it, and compensation will align accordingly, but it won't go away.

    • dude250711 2 days ago

      Either the crap truly works and nobody is needed or it does not work. Where is this half-arsed human-agent hybrid vision coming from? The land of plateaued LLM gains?

      • lewq 2 days ago

        Yes

  • LunicLynx 2 days ago

    I find this very curious. I don’t think agents care about UI; humans do. So in the end the UI is not required. As soon as the AI can get into the physical world, the whole IT world is done for. All of this will be automated away. IT and CS only ever existed to make us more productive, more connected, to improve our physical well-being. When we don’t need to touch computers anymore, there is no need for …

    • lewq 2 days ago

      Vision language models have been trained on how to operate human UIs though, so at least for a while, computer use will be an interesting area to explore. I think debugging web apps and building UIs is a particularly fruitful area for this.

  • kaspermarstal 2 days ago

    > The Wolf maintainer has done heroic work ...

    I commend the fact they acknowledge the maintainer's work, but seeing the singular 'maintainer', I can't help but notice the weight on that one person's shoulders.

    • lewq 2 days ago

      I should have said creator. He seems to have a healthy community backing him, but we should ask him!

  • jarym 2 days ago

    Trying to do something similar but using kasm[0] as the backend.

    [0] https://kasm.com

    • lewq 2 days ago

      Fascinating, wanna compare notes on a call some time?

      • jarym a day ago

        Sure thing!!

  • reactordev 2 days ago

    Whilst impressive to “bend a protocol to your will”, why did you not just take Moonlight and build on top of it, making your own?

    No shoehorns needed. Just take what you like and build what you need.

    • lewq 2 days ago

      It's nice for unmodified Moonlight clients to be able to connect - there are tons of them; you can even run one on a Nintendo DS.

      • reactordev 2 days ago

        But is the ability to run it on the DS a feature? I highly doubt it.

        I’m not trashing anything, I’m just saying that if they focused on what their market is, it would be clear no one is going to be coding/working on a Nintendo DS.

        • lewq 2 days ago

          I suppose, but we got it working, and the primary interface is WebRTC in the browser; going via Moonlight internally is just an implementation detail that got us here quickly. We are open to refactoring in the future, of course :)

  • momocowcow 2 days ago

    What’s the most intricate system that’s been written with this?

    • lewq 2 days ago

      Itself, more recently

  • luizfwolf 2 days ago

    Hi, quite interesting project, but I have a hard time understanding why you would stream a desktop.

    From my (ignorant) understanding, the important part is the context the LLM has for the task. Some conversations need visuals, some don't. What's the advantage of streaming a full desktop instead of using integrations?

    • lewq 2 days ago

      There's also value in being able to run multiple agents in parallel, each with its own isolated filesystem and runtime. One agent won't tread on the toes of another, whatever they do. You can let one loose and it doesn't matter if it breaks something; you can just spin up another.

    • lewq 2 days ago

      Mainly so you can give the agent access to the desktop as well. Then it can debug your web app in Chrome DevTools, but you can also pair with it over streaming that is so good it feels local.

  • eisbaw 2 days ago

    xpra has video streaming and allows for sharing

    • lewq 2 days ago

      Interesting, thanks!

  • vladgur 2 days ago

    Another question regarding Helix: it’s being built as a platform for private, air-gap-ready AI agents that can work against private LLM models.

    Are there appliances or easy-to-deploy hardware that allow one to run these private models on-premise vs in the cloud?

    • lewq 2 days ago

      Hey! Yeah, we are working with partners on a fully integrated hardware + software stack for this. We particularly like the RTX PRO 6000 Blackwell GPUs for it.

  • CuriouslyC 2 days ago

    This is beautiful madness.

    • talking_penguin 21 hours ago

      Ha, we'll take it! It's definitely ambitious - we're essentially building a full remote desktop streaming stack on top of an AI stack on private data.

      The 'madness' part is probably where we said 'let's do hardware-accelerated H.264 encoding and real-time agent control, and make it all work with sub-100ms latency.' We think the problem space is worth the complexity.
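
      For intuition, a back-of-envelope budget for that sub-100ms target might look like the following; every number below is an illustrative assumption, not a measured Helix figure:

```python
# Illustrative latency budget for a 60 fps hardware-encoded stream.
# All figures are assumptions for the sake of the arithmetic.
budget_ms = {
    "capture (one frame @ 60 fps)": 16.7,
    "H.264 hardware encode": 5.0,
    "network (one way)": 20.0,
    "jitter buffer": 15.0,
    "decode + present": 10.0,
}
total = sum(budget_ms.values())
print(f"{total:.1f} ms")  # prints "66.7 ms", comfortably under 100 ms
```

      Under these assumptions the network and jitter buffer dominate, which is why the transport choice (gaming protocols, WebRTC) matters more than raw encode speed.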

  • mxkopy 2 days ago

    I’ve also independently concluded that Moonlight was the best way to go, after trying my hand at a very similar task. I didn’t want to dig through Moonlight’s source, but I’m sure that if you’re dedicated enough it would pay dividends later on; it basically does everything you’d need for real-time control when simulating human input.