Faster sorting with SIMD CUDA intrinsics (2024)

(winwang.blog)

91 points | by winwang 3 days ago

11 comments

  • ashvardanian 3 days ago

    The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

    Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them in StringZilla last year, but the gains aren't huge.
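
    For illustration, a minimal sketch of what those packed-subword intrinsics look like (the kernel and names here are just an example, not from the article; __vadd4 adds four 8-bit lanes packed into one 32-bit register):

        // Each unsigned int carries four 8-bit lanes; __vadd4 adds them lane-wise,
        // wrapping within each byte instead of carrying into the next lane.
        __global__ void add_packed_u8(const unsigned int* a, const unsigned int* b,
                                      unsigned int* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                out[i] = __vadd4(a[i], b[i]);
        }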

    • winwang 3 days ago

      Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions resembling how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.

      Also, StringZilla looks amazing -- I just became your 1000th GitHub follower :)
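
      As a rough sketch of that analogy (illustrative only, not from the article): a warp-wide butterfly reduction written with __shfl_xor_sync reads much like a horizontal reduction over CPU SIMD lanes, except each "lane" is a thread in the warp:

          // Minimal sketch: sum a value across the 32 lanes of a warp.
          // Assumes the full warp is active (hence the 0xFFFFFFFF mask).
          __device__ int warp_sum(int v) {
              for (int offset = 16; offset > 0; offset >>= 1)
                  v += __shfl_xor_sync(0xFFFFFFFFu, v, offset);
              return v;  // every lane ends up with the warp-wide total
          }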

      • ashvardanian 3 days ago

        Thanks, appreciate the gesture :)

        Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?
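
        For context, a minimal sketch of the DP4A side of that comparison (assumes sm_61 or newer; the kernel is illustrative and not taken from the linked PR):

            // __dp4a treats each int as four packed signed 8-bit values and
            // returns their dot product plus the accumulator argument.
            __global__ void dp4a_dot(const int* a, const int* b, int* acc, int n) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                if (i < n)
                    atomicAdd(acc, __dp4a(a[i], b[i], 0));
            }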

        • winwang 3 days ago

          I don't -- unfortunately I'm not too well-versed in this field! But I was a bit fascinated by SWAR after I randomly thought of how to compute a prefix sum with integer multiplication, later finding out that it is indeed an old trick, as I suspected (I'm definitely not on this thread, btw): https://mastodon.social/@dougall/109913251096277108
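
          (For reference, a minimal sketch of that multiplication trick over four packed 8-bit lanes, assuming no running total overflows a byte:)

              // Inclusive prefix sum across the bytes of x in a single multiply:
              // byte k of the product is byte_0 + ... + byte_k of x.
              // e.g. prefix_sum_u8x4(0x04030201u) == 0x0A060301u
              __host__ __device__ unsigned int prefix_sum_u8x4(unsigned int x) {
                  return x * 0x01010101u;
              }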

          As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 arithmetic is just emulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies on GPUs, and division was somewhere above 32x the cost of multiplication (...or I could be misremembering).

  • DennisL123 3 days ago

    Interesting stuff. Not sure if I read this right that it's 16- and 32-bit integer values that get sorted. If so, I'd love to see if the GPU implementation can beat a competitive radix sort implementation on a CPU.

    • winwang 3 days ago

      It's 32 32-bit values that get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512. But if we're talking about a larger amount of data, that would be a different story, e.g. when this is used as part of a normal GPU mergesort.
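
      For readers curious what sorting 32 values across a warp can look like, here is a hedged sketch of a bitonic network built on __shfl_xor_sync (one key per lane, full warp assumed active; not necessarily the article's exact kernel):

          // Bitonic sort of 32 keys, one per lane. Each lane trades its key with a
          // partner lane (lane ^ j) and keeps either the min or the max, depending
          // on its position within the current ascending/descending block.
          __device__ unsigned int warp_bitonic_sort(unsigned int v) {
              const unsigned lane = threadIdx.x & 31u;
              for (unsigned k = 2; k <= 32; k <<= 1) {
                  for (unsigned j = k >> 1; j > 0; j >>= 1) {
                      unsigned int partner = __shfl_xor_sync(0xFFFFFFFFu, v, j);
                      bool ascending = (lane & k) == 0;
                      bool keep_min  = ((lane & j) == 0) == ascending;
                      v = keep_min ? min(v, partner) : max(v, partner);
                  }
              }
              return v;  // lane i now holds the i-th smallest key
          }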

      • maeln 3 days ago

        It is also useful if your data already lives in GPU memory, for example when you need to z-sort a bunch of particles in a 3D renderer's particle system.

      • exDM69 3 days ago

        A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility-buffer renderer I'm working on.

        Thanks for sharing.
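
        A possible follow-on sketch for the dedup part (assuming the keys were already sorted within the warp so duplicates sit in adjacent lanes; the function name and output buffer are illustrative, not an existing API):

            // After an in-warp sort, drop duplicates: a lane keeps its key only if
            // it differs from the previous lane's key, then unique keys are
            // compacted using a ballot plus a popcount-based prefix rank.
            __device__ int warp_unique(unsigned int key, unsigned int* out /* >= 32 slots */) {
                const unsigned lane = threadIdx.x & 31u;
                unsigned int prev = __shfl_up_sync(0xFFFFFFFFu, key, 1);
                bool is_unique = (lane == 0) || (key != prev);
                unsigned int ballot = __ballot_sync(0xFFFFFFFFu, is_unique);
                if (is_unique)
                    out[__popc(ballot & ((1u << lane) - 1u))] = key;  // rank among unique lanes
                return __popc(ballot);  // number of unique keys written
            }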

        • winwang 2 days ago

          As someone who doesn't know very much about graphics (ironically): you're welcome, and I hope it helps!

  • fourseventy 2 days ago

    What are the biggest use cases for GPU-accelerated sorting?