Show HN: FlashAttention-2 in CuTe, from Scratch

(blog.echen.io)

3 points | by echen314 11 hours ago

1 comment

echen314 11 hours ago

Author here. Some context on why I wrote this and how it's different from other public FA-2 content.
Most FA-2 walkthroughs cover either the algorithm or a Triton implementation. Both abstract away the parts that take the most time to understand if you're trying to read the original CUDA source: the swizzling, the LDSM/LDSM_T atoms, the V-transpose, the SMEM layouts, the fragment/register layout, partition_fragment behavior, and the async pipelining. This post walks through the production CUTLASS 3.x (CuTe) kernel line by line on Ampere. I've added diagrams that visualize the odd layout machinery going on in the background, so the post should be approachable even for beginner CUDA devs.
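For readers who haven't met swizzling before, here is a minimal Python sketch of the idea. It is not code from the post or from the FA-2 source. It assumes an XOR swizzle in the style of CuTe's `Swizzle<3, 3, 3>` applied to a hypothetical 8x64 fp16 row-major tile; the names `swizzle`, `chunk_index`, `ROWS`, and `COLS` are illustrative only.

```python
# Assumption: XOR swizzle modeled on CuTe's Swizzle<B, M, S> with positive S.
# It takes B bits starting at bit M+S and XORs them into the B bits at bit M.
def swizzle(offset, bits=3, base=3, shift=3):
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)

# Hypothetical shared-memory tile: 8 rows x 64 fp16 elements, row-major,
# so each row spans 128 bytes (one pass over all 32 four-byte banks).
ROWS, COLS = 8, 64

def chunk_index(row, col, swizzled):
    """Return which 16-byte chunk (0..7) of a 128-byte line an element lands in."""
    off = row * COLS + col
    if swizzled:
        off = swizzle(off)
    return (off % COLS) // 8  # 8 fp16 elements = 16 bytes

# Eight threads each read the first 16-byte chunk of a different row,
# roughly the access pattern of one ldmatrix phase.
plain_chunks = {chunk_index(r, 0, swizzled=False) for r in range(ROWS)}
swizzled_chunks = {chunk_index(r, 0, swizzled=True) for r in range(ROWS)}

print(len(plain_chunks), len(swizzled_chunks))
```

Under these assumptions, the unswizzled reads all land in the same chunk (the same four banks, so they serialize), while the swizzled reads spread across all eight chunks.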
Repro details: the kernel hits 88–105% of production FA-2's throughput on A100 across hdim=64/128 at sequence lengths up to 64K, peaking at ~63% of fp16 tensor-core peak (the algorithm is just a stripped-down version of the original, so there's no novelty on that front). Output is bitwise-identical to the production reference.
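To put the ~63% figure in absolute terms, here is a rough back-of-the-envelope sketch. It assumes the A100's dense fp16 tensor-core peak of 312 TFLOPS and the common non-causal forward FLOP count of 4 * batch * heads * seqlen^2 * hdim; the comment doesn't say which FLOP formula or problem shape was used, so the batch and head counts below are hypothetical.

```python
# Assumption: A100 dense fp16 tensor-core peak, in TFLOPS.
A100_FP16_PEAK_TFLOPS = 312.0

# Figure quoted in the comment: peak utilization of about 63%.
utilization = 0.63
achieved_tflops = utilization * A100_FP16_PEAK_TFLOPS

# Assumption: non-causal forward pass, counting the two matmuls
# (Q @ K^T and P @ V) at 2 FLOPs per multiply-add.
def attention_fwd_flops(batch, heads, seqlen, hdim):
    return 4 * batch * heads * seqlen * seqlen * hdim

# Hypothetical shape: batch and heads are made up, seqlen and hdim
# match the upper end of the range mentioned in the comment.
flops = attention_fwd_flops(batch=1, heads=16, seqlen=64 * 1024, hdim=128)
est_seconds = flops / (achieved_tflops * 1e12)

print(f"{achieved_tflops:.2f} TFLOPS, ~{est_seconds * 1e3:.0f} ms per forward pass")
```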
Happy to dig into specifics in the comments. I'm particularly interested if anyone has counterexamples to the sVtNoSwizzle no-op claim, or has done the equivalent investigation on Hopper.