Here's the repository in case anyone wants to have a look at the code. Leave a star if you find it interesting :P https://github.com/pythongiant/KVBoost
Bad site design: if I can't scroll to see the next slide, that's just broken.
Makes sense, fixing that. Thanks!
I just don't get why people choose Python and not, e.g., Go for high-performance problems.
Go is pretty good at performance, but pretty bad at expressing domain-specific logic. Python is the opposite, but once you have isolated the parts that need to be optimized, it's quite easy to rewrite them in a native language (in particular, the Rust-Python bindings are really good, although in this project, it's C++).
Python is a very convenient skeleton for gluing together high-performance modules written in C or CUDA. Writing boilerplate code in those languages to adapt them to your project is much more inconvenient.
Go is not high performance enough. As others said, you implement the high-performance part in C++ and use Python to glue the pieces together.
My initial choice was actually to use Rust for this (probably should've, too :P), but I went with Python for an initial MVP/skeleton ahead of a future rewrite.
The functionality is impressive, but the website needs some work.
Thanks! This is a weekend project that I'm working on on the side, just to learn more about ML engineering and custom CUDA kernels. Didn't think much about the website.
KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured 47.9x TTFT speedup on an 8-turn conversation, 21x on code context reuse, 100–743x faster than MLX, and 3–41x faster than vLLM-MLX — including interior chunk reuse where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. Works best on 3B+ models with 500+ token shared context. GitHub: https://github.com/pythongiant/KVBoost
Even the things that should be normal dashes are em-dashes
En-dashes are not em-dashes, and they're standard typography for numeric ranges.
https://en.wikipedia.org/wiki/Dash#Ranges_of_values
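On the "2–4x RAM reduction" figure in the description: that range is what you'd expect from the bit widths alone if the baseline cache is fp16, which is my assumption here, not something the post states. A back-of-the-envelope sketch, where the model dimensions are made-up round numbers rather than Qwen2.5-3B's real config, and per-group scale/zero-point overhead is ignored:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Size of a KV cache in bytes, ignoring quantization metadata."""
    # Two tensors (K and V) per layer, each n_kv_heads * head_dim values per token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return n_values * bits_per_value // 8


# Hypothetical model shape and context length, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=4096)

fp16_bytes = kv_cache_bytes(**cfg, bits_per_value=16)  # assumed baseline
int8_bytes = kv_cache_bytes(**cfg, bits_per_value=8)
int4_bytes = kv_cache_bytes(**cfg, bits_per_value=4)

for name, size in [("fp16", fp16_bytes), ("int8", int8_bytes), ("int4", int4_bytes)]:
    print(f"{name}: {size / 2**20:.0f} MiB, {fp16_bytes / size:.1f}x smaller than fp16")
```

Real savings would land a bit under 2x and 4x once the scales stored alongside the quantized values are counted.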
I don't get it. The output of the CacheBlend paper is in LMCache. Did you compare against vLLM with LMCache? This is confusing.
slop
Is this based on paged attention with hashing of the pages?
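Not the author, so I can't say how KVBoost keys its cache, but the "interior chunk reuse" claim in the description suggests a difference from prefix-chained page hashing. Here is a toy sketch of the two keying schemes. Every name in it is hypothetical and none of it is taken from KVBoost or vLLM:

```python
import hashlib

CHUNK = 4  # tokens per chunk; tiny so the example stays readable


def split_chunks(tokens, size=CHUNK):
    """Split a token list into full fixed-size chunks (a trailing partial chunk is dropped)."""
    usable = len(tokens) - len(tokens) % size
    return [tuple(tokens[i:i + size]) for i in range(0, usable, size)]


def prefix_keys(tokens):
    """Prefix-chained keys: each key covers everything from the start up to and including that chunk."""
    running = hashlib.sha256()
    keys = []
    for chunk in split_chunks(tokens):
        running.update(repr(chunk).encode())
        keys.append(running.hexdigest())
    return keys


def content_keys(tokens):
    """Content-only keys: each key depends on that chunk's tokens alone, not on what precedes it."""
    return [hashlib.sha256(repr(chunk).encode()).hexdigest() for chunk in split_chunks(tokens)]


def count_hits(cached_keys, request_keys):
    """Number of chunks in a request whose key is already cached."""
    return sum(1 for key in request_keys if key in cached_keys)


shared_doc = list(range(9, 17))              # 8 tokens of shared context
request_1 = list(range(1, 9)) + shared_doc   # some prefix, then the shared document
request_2 = [100, 101, 102, 103] + shared_doc  # different prefix, same document

prefix_hits = count_hits(set(prefix_keys(request_1)), prefix_keys(request_2))
content_hits = count_hits(set(content_keys(request_1)), content_keys(request_2))
print(prefix_hits, content_hits)  # 0 2
```

With chained keys, a different first chunk changes every later key, so the shared document never hits. With content-only keys, the document's chunks hit even though they sit in the middle of the prompt. The catch is that KV entries computed under one prefix are not exact under another, which is presumably where the recompute strategies mentioned in the description (selective boundary, CacheBlend) come in.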
Drop in replacement for what exactly? Can I use it with llama.cpp and Vulkan? Or vLLM and ROCm?
KVBoost is a drop-in replacement for AutoModelForCausalLM. Same API surface (KVBoost.from_pretrained(...), engine.generate(...)), but with cross-request KV reuse, FlashAttention-2, AWQ layer streaming, and speculative decoding bolted on.