
[Note] FreeKV: Boosting KV Cache Retrieval For Efficient LLM Inference

Summary

FreeKV is a training-free algorithm-system co-optimization framework that boosts the efficiency of KV retrieval while maintaining near-lossless model accuracy across diverse scenarios.

Why KV retrieval

For complex tasks involving long generation, KV dropping methods lose substantial accuracy, as the figure below shows.

image-20260316201517926

KV retrieval challenge

Retrieval methods require the complete KV cache to be retained, so they often offload the KV cache to CPU memory to circumvent GPU memory limitations.

  • No Offload (Quest) → out-of-memory errors are inevitable for long contexts;
  • Offload (ArkVale, ShadowKV, InfiniGen) → high latency:
    • Low bandwidth of the CPU-GPU connection → $recalling$ the selected KV tuples from CPU to GPU incurs long latency;
    • Selecting KV tuples from the entire context → considerable $selection$ overhead.
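To get a feel for both sides of this trade-off, here is a rough back-of-envelope sketch. The model shape, the per-step recall budget, and the link bandwidth are all hypothetical numbers chosen for illustration, not values from the paper; only the size formula (keys plus values, per layer, per KV head) is standard.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold keys AND values for one sequence (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time over a link with the given bandwidth (GB/s, 1 GB = 1e9 B)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
full_cache = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)
print(full_cache / 2**30, "GiB for the full 128K-token cache")  # 16.0 GiB

# Hypothetical recall budget of 2048 tokens per decoding step,
# moved over an assumed 32 GB/s CPU-GPU link.
per_step = kv_cache_bytes(32, 8, 128, seq_len=2048)
print(transfer_seconds(per_step, 32) * 1e3, "ms per step if nothing is reused")
```

Keeping the full cache on the GPU costs memory that grows linearly with context length, while recalling even a small budget at every step adds a transfer cost to every decoding step.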

image-20260316203135733

image-20260316204341708

FreeKV Algorithm design

Speculative Retrieval → The similarity between the query vectors of adjacent generated tokens is quite high (mean 0.84).

Fine-Grained Correction → While the mean similarity remains high, certain decoding steps exhibit outliers with significantly lower similarity.

Hybrid Layouts and Streamed Recall → …

Speculative Retrieval

The attention computation of step $i$ is launched by reusing the KV tuples recalled during step $i-1$.

image-20260316205411214
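The reuse pattern can be sketched with a toy single-head loop. This is only an illustration under simplifying assumptions: the context is fixed (no newly generated KV is appended), selection is token-level top-k rather than page-wise, and the function names are made up for this note. The point is the data dependency: attention at step $i$ consumes the selection made with the query of step $i-1$.

```python
import numpy as np


def top_k_indices(query, keys, k):
    """Indices of the k keys with the highest dot-product score, in ascending order."""
    scores = keys @ query
    return np.sort(np.argsort(scores)[-k:])


def attend(query, keys, values):
    """Plain softmax attention over the given (already selected) KV tuples."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values


def speculative_decode(queries, keys, values, k):
    """Run attention for each step using the KV tuples selected by the previous step."""
    # Step 0 has no previous selection, so it selects with its own query.
    selected = top_k_indices(queries[0], keys, k)
    outputs, used = [], []
    for q in queries:
        used.append(selected)
        # Attention starts immediately with the tuples recalled earlier.
        outputs.append(attend(q, keys[selected], values[selected]))
        # Selection with the current query is only consumed by the NEXT step,
        # so in a real system it can run off the critical path.
        selected = top_k_indices(q, keys, k)
    return np.stack(outputs), used


rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 16))
values = rng.standard_normal((64, 16))
queries = rng.standard_normal((5, 16))
outputs, used = speculative_decode(queries, keys, values, k=8)
print(outputs.shape)
```

Because the selection for the next step does not block the current attention, selection and recall can in principle be hidden behind other computation.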

FreeKV adopts page-wise selection, utilizing the min-max pooled keys within each page as the page summary, similar to Quest.
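A minimal sketch of min-max page summaries follows. The scoring rule used here is the Quest-style upper bound (per channel, take the larger of `q * min` and `q * max`, then sum); this note does not spell out FreeKV's exact scoring, so treat that part, the function names, and the handling of a trailing partial page as assumptions.

```python
import numpy as np


def page_summaries(keys, page_size):
    """Element-wise min and max of the keys in each page: two (num_pages, d) arrays."""
    n, d = keys.shape
    num_pages = n // page_size  # assumption: a trailing partial page is ignored
    paged = keys[: num_pages * page_size].reshape(num_pages, page_size, d)
    return paged.min(axis=1), paged.max(axis=1)


def page_scores(query, k_min, k_max):
    """Upper bound on q.k for any key in each page (Quest-style criterion)."""
    return np.maximum(query * k_min, query * k_max).sum(axis=1)


def select_pages(query, k_min, k_max, num_selected):
    """Indices of the highest-scoring pages, in ascending order."""
    scores = page_scores(query, k_min, k_max)
    return np.sort(np.argsort(scores)[-num_selected:])


rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 32))
query = rng.standard_normal(32)
k_min, k_max = page_summaries(keys, page_size=16)
print(select_pages(query, k_min, k_max, num_selected=4))
```

Selection then touches only two summary vectors per page instead of every key, which is what keeps the selection overhead manageable for long contexts.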

Fine-Grained Correction

  • Correction is triggered only if $C_i\lt \tau$, where $C_i$ is the cosine similarity between the query vectors of steps $i-1$ and $i$, and $\tau$ is a predefined threshold.
  • Once the KV heads requiring correction are identified, FreeKV initiates selection and recall for these KV heads before the attention computation. For KV heads that do not need correction, recall is deferred and overlapped with other operations.
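The trigger check can be sketched as below. It assumes one query vector per KV head is already available (how query heads within a group are combined is not covered in this note), and the threshold value in the usage example is arbitrary.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (num_heads, d) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, eps)


def heads_needing_correction(q_prev, q_curr, tau):
    """Boolean mask over KV heads: True where C_i < tau, i.e. correction is triggered."""
    return cosine_similarity(q_prev, q_curr) < tau


# Two toy heads: the first query is unchanged, the second rotated by 90 degrees.
q_prev = np.array([[1.0, 0.0], [1.0, 0.0]])
q_curr = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = heads_needing_correction(q_prev, q_curr, tau=0.8)  # tau chosen arbitrarily
print(mask)  # [False  True]
```

Heads where the mask is `True` would run selection and recall before attention; the remaining heads keep the speculative path and defer their recall.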

image-20260316211115015

Hybrid Layouts and Streamed Recall

This post is licensed under CC BY 4.0 by the author.