Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

(pythongiant.github.io)

18 points | by pythongiant 9 hours ago

22 comments

pythongiant 4 hours ago

Here's the repository in case anyone wants to have a look at the code. Leave a star if you find it interesting :P https://github.com/pythongiant/KVBoost
hexnuts 8 hours ago

Bad site design, if I can't scroll to see the next slide, that's just broken.

- pythongiant 6 hours ago
  
  Makes sense, fixing that. Thanks!
stpedgwdgfhgdd 7 hours ago

I just don't get why people choose Python and not e.g. Go for high performance problems.

- Yoric 6 hours ago
  
  Go is pretty good at performance, but pretty bad at expressing domain-specific logic. Python is the opposite, but once you have isolated the parts that need to be optimized, it's quite easy to rewrite them in a native language (in particular, the Rust-Python bindings are really good, although in this project, it's C++).
- sigmoid10 7 hours ago
  
  Python is a very convenient skeleton for gluing together high performance modules that were written in C or CUDA. Writing boilerplate code in those to adapt them to your project is much more inconvenient.
- larme 6 hours ago
  
  Go is not high performance enough. As others said, you implement the high performance part in C++ and use Python to glue the pieces together.
- pythongiant 6 hours ago
  
  My initial choice was actually to use Rust for this (probably should've, too :P), but I went with Python for an initial MVP/skeleton ahead of a future rewrite.
  
  - pythongiant 5 hours ago
    
    [flagged]
x0ruman 7 hours ago

The functionality is impressive, but the website needs some work.

- pythongiant 5 hours ago
  
  Thanks! This is a weekend project that I'm working on on the side, just to learn more about ML engineering and custom CUDA kernels. Didn't think much about the website.
pythongiant 9 hours ago

KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured 47.9x TTFT speedup on an 8-turn conversation, 21x on code context reuse, 100–743x faster than MLX, and 3–41x faster than vLLM-MLX — including interior chunk reuse where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. Works best on 3B+ models with 500+ token shared context. GitHub: https://github.com/pythongiant/KVBoost
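
To make "interior chunk reuse" concrete, here is a minimal sketch of the lookup side only. It is not KVBoost's code: `CHUNK_SIZE`, `split_chunks`, `content_key`, `prefix_keys`, and `count_hits` are hypothetical names, and the tiny chunk size is chosen just for illustration. It compares keying each chunk by its own tokens against a prefix-chained key (roughly how prefix caches are keyed), for two prompts that share a document but start differently.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; real systems would use much larger chunks


def split_chunks(token_ids, size=CHUNK_SIZE):
    """Split token ids into full fixed-size chunks (a trailing partial chunk is dropped)."""
    full = len(token_ids) - len(token_ids) % size
    return [tuple(token_ids[i:i + size]) for i in range(0, full, size)]


def content_key(chunk):
    """Key that depends only on the chunk's own tokens."""
    return hashlib.sha256(repr(chunk).encode()).hexdigest()


def prefix_keys(chunks):
    """Keys that depend on the chunk and on every chunk before it."""
    keys, running = [], hashlib.sha256()
    for chunk in chunks:
        running.update(repr(chunk).encode())
        keys.append(running.copy().hexdigest())
    return keys


def count_hits(cached_keys, new_keys):
    """Number of new keys already present in the cache."""
    return sum(1 for key in new_keys if key in cached_keys)


# Two prompts: different first chunk, identical 8-token document afterwards.
document = [10, 11, 12, 13, 14, 15, 16, 17]
first = split_chunks([1, 2, 3, 4] + document)
second = split_chunks([5, 6, 7, 8] + document)

content_hits = count_hits({content_key(c) for c in first},
                          [content_key(c) for c in second])
prefix_hits = count_hits(set(prefix_keys(first)), prefix_keys(second))

print(content_hits, prefix_hits)
```

With content-only keys the two document chunks are found again even though the prompt starts differently; with prefix-chained keys nothing matches once the first chunk differs. A key match is only the lookup half: KV tensors cached for an interior chunk were computed with a different preceding context, which is presumably why the post pairs reuse with a recompute strategy (selective boundary or CacheBlend).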

- snovv_crash 8 hours ago
  
  Even the things that should be normal dashes are em-dashes
  
  - mrob 7 hours ago
    
    En-dashes are not em-dashes, and they're standard typography for numeric ranges.
    https://en.wikipedia.org/wiki/Dash#Ranges_of_values
- arjie 8 hours ago
  
  I don't get it. The output of the CacheBlend paper is in LMCache. Did you compare against vLLM with LMCache? This is confusing.
  
  - pythongiant 6 hours ago
    
    [flagged]
- pferdone 8 hours ago
  
  slop
sakex 7 hours ago

Is this based on paged attention with hashing of the pages?

- pythongiant 5 hours ago
  
  [flagged]
npodbielski 3 hours ago

Drop-in replacement for what exactly? Can I use it with llama.cpp and Vulkan? Or vLLM and ROCm?

- pythongiant 3 hours ago
  
  KVBoost is a drop-in replacement for AutoModelForCausalLM. Same API surface (KVBoost.from_pretrained(...), engine.generate(...)), but with cross-request KV reuse, FlashAttention-2, AWQ layer streaming, and speculative decoding bolted on.