Async/Await on the GPU

(vectorware.com)

107 points | by Philpax 5 hours ago

34 comments

Arifcodes 22 minutes ago

The interesting challenge with async/await on GPU is that it inverts the usual concurrency mental model. CPU async is about waiting efficiently while I/O completes. GPU async is about managing work distribution across warps that are physically executing in parallel. The futures abstraction maps onto that, but the semantics are different enough that you have to be careful not to carry over intuitions from tokio/async-std.
The comparison to NVIDIA's stdexec is worth looking at. stdexec uses a sender/receiver model which is more explicit about the execution context. Rust's Future trait abstracts over that, which is ergonomic but means you're relying on the executor to do the right thing with GPU-specific scheduling constraints.
Practically, the biggest win here is probably for the cases shayonj mentioned: mixed compute/memory pipelines where you want one warp loading while another computes. That's exactly where the warp specialization boilerplate becomes painful. If async/await can express that cleanly without runtime overhead, that is a real improvement.
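
To make the "one warp loading while another computes" pattern concrete, here is a minimal CPU-only sketch in plain Rust. It is not the article's executor or any real GPU API: `Slot`, `Put`, `Take`, `loader`, `computer`, and `run_pipeline` are hypothetical names, the one-slot buffer merely stands in for a tile staged in shared memory, and the loop simply re-polls both tasks in turn.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical one-slot buffer standing in for a tile staged in shared memory.
type Slot = Rc<RefCell<Option<u32>>>;

// Waker that does nothing; the loop in `run_pipeline` just re-polls.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Resolves once the slot is empty, then stores a value in it.
struct Put(Slot, u32);
impl Future for Put {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let mut slot = self.0.borrow_mut();
        if slot.is_none() {
            *slot = Some(self.1);
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// Resolves once the slot holds a value, taking it out.
struct Take(Slot);
impl Future for Take {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        let taken = self.0.borrow_mut().take();
        match taken {
            Some(value) => Poll::Ready(value),
            None => Poll::Pending,
        }
    }
}

// "Loader" role: stages one tile at a time.
async fn loader(slot: Slot, tiles: Vec<u32>) {
    for tile in tiles {
        Put(slot.clone(), tile).await;
    }
}

// "Compute" role: consumes `count` tiles and accumulates doubled values.
async fn computer(slot: Slot, count: usize) -> u32 {
    let mut acc = 0;
    for _ in 0..count {
        acc += Take(slot.clone()).await * 2;
    }
    acc
}

// Alternates between the two roles until the compute side finishes.
pub fn run_pipeline(tiles: Vec<u32>) -> u32 {
    let slot: Slot = Rc::new(RefCell::new(None));
    let count = tiles.len();
    let mut load = Box::pin(loader(slot.clone(), tiles));
    let mut compute = Box::pin(computer(slot, count));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut load_done = false;
    loop {
        if !load_done {
            load_done = load.as_mut().poll(&mut cx).is_ready();
        }
        if let Poll::Ready(sum) = compute.as_mut().poll(&mut cx) {
            return sum;
        }
    }
}
```

Calling `run_pipeline(vec![1, 2, 3])` interleaves the two roles through the shared slot. Because the futures hold an `Rc`, they are `!Send`, which loosely mirrors the "local executor with local futures" idea raised elsewhere in the thread.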
zozbot234 4 hours ago

I'm not quite seeing the real benefit of this. Is the idea that warps will now be able to do work-stealing and continuation-stealing when running heterogeneous parallel workloads? But that requires keeping the async function's state in GPU-wide shared memory, which is generally a scarce resource.

- nxobject an hour ago
  
  God, as someone who took their elective on graphics programming when GPGPU and compute shaders first became a thing, reading this makes me realize I definitely need an update on what modern GPU uarchs are like now.
  Re: heterogeneous workloads: I'm told by a friend in HPC that the old advice about avoiding diverging branches within warps is no longer much of an issue. Is that true?
  
  - zozbot234 an hour ago
    
    That advice applies within warps, to single 'threads' (effectively SIMD lanes), whereas the article is consistently about running heterogeneous tasks on different warps.
- LegNeato 3 hours ago
  
  Yes, that's the idea.
  GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and place them in a faster address space.
- pjmlp 2 hours ago
  
  This is already happening in C++: NVIDIA is the one pushing the senders/receivers proposal, which is one of the possible coroutine runtimes to be added to the C++ standard library.
- jmalicki 2 hours ago
  
  A ton of GPU workloads require leaving large amounts of RAM resident on the GPU and running computation with some new data from the CPU.
xiphias2 2 hours ago

Really cool experiment (the whole company).
Training pipelines are full of data preparation steps that are first written for the CPU and then moved to the GPU, and you are always thinking about what to keep on the CPU and what to put on the GPU, when it is worth creating a tensor, or whether it should be tiling instead. I guess your company is betting on solving problems like this (and async/await is needed for serving inference requests directly on the GPU, for example).
My question is a little bit different: how do you want to handle the SIMD question? Should a Rust function run on the warp as a machine with 32-long arrays as data types, or always "hope" for autovectorization to work (especially with Rust's iterator library helpers)?

- Cieric 2 hours ago
  
  I'm not even sure a 32-wide array would be good either, since on AMD warps are 64 wide. I wouldn't go fully towards auto-vectorization though.
  
  - zozbot234 an hour ago
    
    Warp SIMD-width should be a build-time constant. You'd be using a variable-length vector-like interface that gets compiled down to a specified length as part of building the code.
    
    - Cieric an hour ago
      
      Now that I could agree with. The only place where hiccups have started to occur is with wave intrinsics, where you can share data between threads in a wave without halting execution. I'm not sure disallowing them would be the best idea, as it cuts out possible optimizations, but outright allowing them without the user knowing the number of lanes can cause its own problems. My job is the fun time of fixing issues in other people's code related to all of this. I have no stake in Rust though; I'd rather write a custom SPIR-V compiler.
      
      - zozbot234 an hour ago
        
        A compile-time constant can still be surfaced to the user though. The code would simply be written to take the actual value into account, and this would be reflected during the build.
        
        - Cieric an hour ago
          
          I don't have a lot of faith there, but that's mainly because my experience consists of correcting people's assumption that all GPU waves are 32 lanes. I might be biased, though, since it's my job to fix those issues.
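
The "build-time constant surfaced to the user" idea from this subthread can be sketched with Rust const generics. This is a CPU-side simulation under stated assumptions, not rust-gpu's actual interface: `warp_sum` and `shuffle_steps` are hypothetical names, and the inner loop only imitates a butterfly (xor-shuffle) reduction of the kind wave intrinsics provide.

```rust
// CPU simulation of a butterfly reduction across the lanes of one warp/wave.
// `LANES` is a compile-time constant, so the same source can be built for
// 32-lane or 64-lane hardware without hard-coding either width.
pub fn warp_sum<const LANES: usize>(mut lanes: [u32; LANES]) -> u32 {
    assert!(LANES.is_power_of_two(), "lane count must be a power of two");
    let mut offset = LANES / 2;
    while offset > 0 {
        // Snapshot: every lane reads its partner's value from before this step.
        let prev = lanes;
        for lane in 0..LANES {
            lanes[lane] = prev[lane].wrapping_add(prev[lane ^ offset]);
        }
        offset /= 2;
    }
    // After the last step every lane holds the full sum; return lane 0's copy.
    lanes[0]
}

// Number of shuffle steps the reduction needs for a given lane count.
pub fn shuffle_steps<const LANES: usize>() -> u32 {
    LANES.trailing_zeros()
}
```

Code written this way never assumes 32 lanes: `warp_sum([1u32; 32])` and `warp_sum([1u32; 64])` both compile from the same function, and `shuffle_steps::<64>()` reports one more step than `shuffle_steps::<32>()`.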
ismailmaj 40 minutes ago

Warp specialization is an abomination that should be killed and I'm glad this could be an alternative.
I hope they can minimize the bookkeeping costs, because I don't see it gaining traction in AI if it hurts the performance of big kernels.
GZGavinZhao 2 hours ago

One concern I have is that this async/await approach is not as "AOT" as the Triton approach, where you know how to schedule the computations most efficiently across warps because you know at compile time exactly which operations you'll be performing.
Here with the async/await approach, it seems like there needs to be manual bookkeeping at runtime to know what has finished and what has not, and _then_ to decide which warp the new computation should go on. Do you anticipate that there will be a measurable performance difference?

- zozbot234 an hour ago
  
  If you can tell deterministically whether an 'async' computation is going to be finished, you can most likely use a type-system-like static analysis to ensure that programs are scheduled correctly and avoid any reference to values that are yet to be computed. But this is not possible in many cases, where dynamic scheduling will be preferable.
- LegNeato an hour ago
  
  Doing things at compile time / AOT is almost always better for perf. We believe async/await and futures enable more complex programs and let you do things you couldn't easily do on the GPU before. It's less about performance and more about capability (though we believe async/await perf will be better in some cases; time will tell).
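
The runtime bookkeeping this subthread worries about can be made visible with a tiny round-robin executor. This is a CPU-only sketch under stated assumptions, not the project's scheduler: `Rounds`, `Task`, `task`, `Stats`, and `run_round_robin` are hypothetical names, and the poll counter simply measures how much dynamic scheduling work happened.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that does nothing; the executor re-polls everything in queue order.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical stand-in for work that is not ready for `n` scheduling rounds.
pub struct Rounds(pub u32);
impl Future for Rounds {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        if this.0 == 0 {
            Poll::Ready(())
        } else {
            this.0 -= 1;
            Poll::Pending
        }
    }
}

pub type Task = Pin<Box<dyn Future<Output = ()>>>;

// Boxes a future so tasks of different types can share one queue.
pub fn task(fut: impl Future<Output = ()> + 'static) -> Task {
    Box::pin(fut)
}

// What the executor had to track at runtime.
pub struct Stats {
    pub order: Vec<usize>, // task ids in completion order
    pub polls: u32,        // total number of poll calls issued
}

pub fn run_round_robin(tasks: Vec<Task>) -> Stats {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut queue: VecDeque<(usize, Task)> = tasks.into_iter().enumerate().collect();
    let mut stats = Stats { order: Vec::new(), polls: 0 };
    while let Some((id, mut current)) = queue.pop_front() {
        stats.polls += 1;
        let state = current.as_mut().poll(&mut cx);
        match state {
            Poll::Ready(()) => stats.order.push(id),
            // Not finished: requeue it behind the other pending tasks.
            Poll::Pending => queue.push_back((id, current)),
        }
    }
    stats
}
```

For example, `run_round_robin(vec![task(Rounds(2)), task(Rounds(0)), task(Rounds(1))])` finishes the tasks in a different order than they were submitted, which is exactly the information an ahead-of-time schedule would have fixed at compile time.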
the__alchemist 2 hours ago

Et tu, GPU?
I am, bluntly, sick of Async taking over Rust ecosystems. Embedded and web/HTTP have already fallen. I'm optimistic this won't take hold in GPU; we'll see. Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
I use Rust on the GPU for the following: 3D graphics via WGPU, cuFFT via FFI, custom kernels via Cudarc, and ML via Burn and Candle. Thankfully these are all Async-free.

- notnullorvoid 3 minutes ago
  
  I don't see the utility of async on the GPU.
  > Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
  Someone somewhere convinced you there is an async coloring problem. That person was wrong: async is an inherent property of some operations. Adding it as a type-level construct gives visibility to those inherent behaviors, and with that more freedom in how you compose them.
shayonj 4 hours ago

Very cool to see this. It's something I have been curious about myself, and I've been exploring the space as well. I'd be curious what the parallels and differences are between this and NVIDIA's stdexec (outside of it being in Rust and using Future, which is also cool).
textlapse 3 hours ago

What's the performance like? What would the benefits be of converting a streaming multiprocessor programming model to this?

- LegNeato 3 hours ago
  
  We aren't focused on performance yet (it is often workload- and executor-dependent, and as the post says we currently do some inefficient polling), but Rust futures compile down to state machines, so they are a zero-cost abstraction.
  The anticipated benefits are similar to the benefits of async/await on CPU: better ergonomics for the developer writing concurrent code, better utilization of shared/limited resources, fewer concurrency bugs.
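
To illustrate the "futures compile down to state machines" point, here is a small sketch in plain `std` Rust, run on the CPU. It is illustrative only: `Countdown`, `load_then_add`, `LoadThenAdd`, and `poll_to_completion` are hypothetical names, and the hand-written enum is a simplified picture of what the compiler generates for the `async fn`, not its exact layout.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical leaf future: reports Pending a fixed number of times, then Ready.
pub struct Countdown(pub u32);
impl Future for Countdown {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        if this.0 == 0 {
            Poll::Ready(())
        } else {
            this.0 -= 1;
            Poll::Pending
        }
    }
}

// The async version: one suspension point, then a result.
pub async fn load_then_add(a: u32, b: u32) -> u32 {
    Countdown(2).await;
    a + b
}

// Roughly the same logic written by hand as an explicit state machine.
pub enum LoadThenAdd {
    Waiting { wait: Countdown, a: u32, b: u32 },
    Done,
}

impl Future for LoadThenAdd {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // All fields are Unpin, so a plain mutable reference is fine here.
        let this = self.get_mut();
        let sum = match &mut *this {
            LoadThenAdd::Waiting { wait, a, b } => {
                if Pin::new(wait).poll(cx).is_pending() {
                    return Poll::Pending; // stay in the Waiting state
                }
                *a + *b
            }
            LoadThenAdd::Done => panic!("polled after completion"),
        };
        *this = LoadThenAdd::Done; // state transition
        Poll::Ready(sum)
    }
}

// Waker that does nothing; the loop below just polls again.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Busy-polls a future and returns its output plus the number of polls used.
pub fn poll_to_completion<F: Future>(fut: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

Both `poll_to_completion(load_then_add(2, 3))` and `poll_to_completion(LoadThenAdd::Waiting { wait: Countdown(2), a: 2, b: 3 })` yield the same result after the same number of polls. The busy loop also shows the kind of inefficient polling the comment mentions; a real executor would rely on wakers instead.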
  
  - textlapse 2 hours ago
    
    warp is expensive - essentially it's running a 'don't run code' to maintain SIMT.
    GPUs are still not practically-Turing-complete in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make it pretend it's not a functional programming model).
    So I am not sure retrofitting a Ferrari to cosplay an Amazon delivery van is useful other than for tech showcase?
    Good tech showcase though :)
    
    - zozbot234 2 hours ago
      
      I think you're conflating GPU 'threads' and 'warps'. GPU 'threads' are SIMD lanes that are all running with the exact same instructions and control flow (only with different filtering/predication), whereas GPU warps are hardware-level threads that run on a single compute unit. There's no issue with adding extra "don't run code" when using warps, unlike GPU threads.
      
      - textlapse an hour ago
        
        My understanding of warp (https://docs.nvidia.com/cuda/cuda-programming-guide/01-intro...) is that you are essentially paying the cost of taking both the branches.
        I understand that with newer GPUs you have clever partitioning / pipelining such that block A takes branch A while block B takes branch B, with sync/barrier essentially relying on some smart 'oracle' to schedule these in a way that still fits in the SIMT model.
        It still doesn't feel Turing complete to me. Is there an NVIDIA doc you can refer me to?
        
        - rowanG077 35 minutes ago
          
          That applies inside a single warp; notice the wording:
          > In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.
          This doesn't say anything about dependencies between multiple warps.
          
          - textlapse 18 minutes ago
            
            It's definitely possible; I am not arguing against that.
            I am just saying it's not as flexible/cost-free as it would be on a 'normal' von Neumann-style CPU.
            I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to use/introduce things like async/await in the context of a GPU programming model, which is very different from a traditional Rust programming model.
            You still have to worry about different architectures and the streaming nature at the end of the day.
            I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.
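
The "paying the cost of both branches" point applies to lanes within one warp, and a toy CPU simulation in Rust can show it. This is a deliberately simplified model under stated assumptions, not how any particular GPU schedules instructions: `LANES` and `simt_branch` are hypothetical, and a "pass" here just means one branch side issued for the whole warp with inactive lanes masked off.

```rust
// Illustrative lane count; real warps/waves are typically 32 or 64 lanes.
pub const LANES: usize = 8;

// Simulates `if x is even { x * 2 } else { x + 1 }` on one warp.
// Returns the per-lane results and how many branch sides had to be issued.
pub fn simt_branch(input: [i32; LANES]) -> ([i32; LANES], u32) {
    // Per-lane predicate: which lanes take the "then" side.
    let mask: [bool; LANES] = input.map(|x| x % 2 == 0);
    let mut out = input;
    let mut passes = 0;

    // "Then" side runs only if at least one lane needs it.
    if mask.iter().any(|&m| m) {
        passes += 1;
        for lane in 0..LANES {
            if mask[lane] {
                out[lane] = input[lane] * 2;
            }
        }
    }
    // "Else" side runs only if at least one lane needs it.
    if mask.iter().any(|&m| !m) {
        passes += 1;
        for lane in 0..LANES {
            if !mask[lane] {
                out[lane] = input[lane] + 1;
            }
        }
    }
    (out, passes)
}
```

With a uniform input such as `[2; LANES]` only one side is issued, while a mixed input such as `[1, 2, 3, 4, 5, 6, 7, 8]` needs both. Separate warps each run this logic on their own, which is why tasks that differ per warp do not pay this particular cost.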
Arch485 3 hours ago

Very cool!
Is the goal with this project (generally, not specifically async) to have an equivalent to e.g. CUDA, but in Rust? Or is there another intended use-case that I'm missing?

- zozbot234 an hour ago
  
  The closest equivalent to that is rust-gpu, which this project is pretty closely involved with.
bionhoward an hour ago

Genius. Great idea and follow-through; please keep it up. This could improve the ML industry tremendously. Maybe some einops-inspired interface for this would be good?
firefly2000 3 hours ago

Is this Nvidia-only or does it work on other architectures?

- LegNeato 3 hours ago
  
  Currently NVIDIA-only, but we're cooking up some Vulkan stuff in rust-gpu.
  
  - monster_truck 3 hours ago
    
    I don't have anything to offer but my encouragement, but there are _dozens_ of ROCm enjoyers out there.
    In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on Windows. Anything is possible :)
  - firefly2000 2 hours ago
    
    Does the lack of forward progress guarantees (ITS) on other architectures pose challenges for async/await?