2 points | by arun99-99 11 hours ago
1 comment
If you're preparing for systems or performance-engineering roles, this repo shows how a simple matmul evolves into a high-performance kernel.
It demonstrates:
why loop order matters
how cache locality dominates performance
how tiling and register reuse cut memory traffic
how multithreading scales
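To make the first three points concrete, here is a minimal sketch in C. It is not taken from the repo: the function names (matmul_ijk, matmul_ikj, matmul_tiled), the row-major square-matrix layout, and the tile size of 32 are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tile size; a real kernel would tune this per cache level. */
#define TILE 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Naive i-j-k order: the inner loop reads B with stride n (down a column),
   so consecutive iterations usually land on different cache lines. */
void matmul_ijk(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Reordered i-k-j: the inner loop walks a row of B and a row of C
   contiguously, and A[i][k] is loaded once and reused across the row. */
void matmul_ikj(const float *A, const float *B, float *C, size_t n) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Tiled i-k-j: work on TILE x TILE blocks so the pieces of A, B and C
   being touched can stay in cache while they are reused. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
        }
    }
}
```

All three functions compute the same product; only the memory access pattern differs. Each row of C depends only on the matching row of A, so the outermost i loop is also the natural place to split work across threads.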
You can run all benchmarks with one script and see a roughly 100× speedup from the naive version to the optimized one.
Good practice for:
low-level optimization
ML systems
HPC
performance interviews
Repo: https://github.com/arun-reddy-a/matmul-cpu