Much of the work in large language model inference is recomputation of tokens the system has already processed. The key-value (KV) cache, the intermediate attention state that lets a model avoid reprocessing the full prompt for every generated token, is usually ephemeral. It lives in GPU memory, dies when the request ends, and is rebuilt from scratch for the next conversation.

LMCache treats that assumption as a bug.

The open-source project, stewarded by researchers at the University of Chicago and the startup Tensormesh, has grown from a research prototype into a production layer integrated with vLLM, SGLang, NVIDIA Dynamo, and PyTorch. Its core insight is simple: the KV cache should be persistent, tiered, and engine-independent. The cache should survive a crash. It should be shareable across serving instances. And it should be reusable for any prompt position, not just exact prefix matches.

That last point is what separates LMCache from older caching approaches. Most inference frameworks support prefix caching: if a new request starts with the same tokens as a previous one, the system skips recomputation for the matching prefix. LMCache extends that to non-prefix reuse using a technique called CacheBlend, which caches KV blocks at any position in the prompt and selectively recomputes tokens where quality would degrade. In practice, this means a RAG pipeline that retrieves different document chunks per query can still reuse cached representations from earlier retrievals, even when the exact prefix changes.
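The difference between prefix matching and position-independent reuse can be shown with a toy lookup. The sketch below is an illustration under stated assumptions, not LMCache's or CacheBlend's implementation: the chunk size, the hashing scheme, and every function name are hypothetical, and it only covers the lookup step. It leaves out the selective recomputation that CacheBlend relies on to preserve quality when a chunk is reused in a new position.

```python
import hashlib

CHUNK = 4  # tokens per chunk (hypothetical; chosen small for readability)


def _digest(parent: str, tokens: list[int]) -> str:
    """Hash a chunk of token ids, optionally chained to a parent hash."""
    payload = parent + ":" + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


def chunks(tokens: list[int]) -> list[list[int]]:
    """Split tokens into full chunks; a trailing partial chunk is ignored."""
    return [tokens[i:i + CHUNK] for i in range(0, len(tokens) - CHUNK + 1, CHUNK)]


def prefix_keys(tokens: list[int]) -> list[str]:
    """Chained keys: each key depends on every token that came before it."""
    keys, parent = [], ""
    for chunk in chunks(tokens):
        parent = _digest(parent, chunk)
        keys.append(parent)
    return keys


def content_keys(tokens: list[int]) -> list[str]:
    """Unchained keys: each key depends only on the chunk's own tokens."""
    return [_digest("", chunk) for chunk in chunks(tokens)]


def prefix_hits(cached: set[str], tokens: list[int]) -> int:
    """Tokens covered by the leading run of cached chunks."""
    hits = 0
    for key in prefix_keys(tokens):
        if key not in cached:
            break
        hits += CHUNK
    return hits


def content_hits(cached: set[str], tokens: list[int]) -> int:
    """Tokens covered by cached chunks at any position in the prompt."""
    return sum(CHUNK for key in content_keys(tokens) if key in cached)


# Two RAG-style prompts that retrieve the same documents in a different order.
doc_a, doc_b = [1, 2, 3, 4], [5, 6, 7, 8]
first = doc_a + doc_b + [9, 10, 11, 12]
second = doc_b + doc_a + [13, 14, 15, 16]

reused_by_prefix = prefix_hits(set(prefix_keys(first)), second)
reused_by_content = content_hits(set(content_keys(first)), second)
```

In this toy case the prefix scheme reuses 0 of the second prompt's 12 tokens, because the very first chunk differs, while the content-addressed scheme finds both document chunks and reuses 8. A real system cannot stop there: KV values computed for a chunk in one context are not exact in another, which is why the selective recomputation step matters.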

The project is vendor-neutral, a design choice that has accelerated adoption. LMCache runs on NVIDIA, AMD, Arm, and Ascend hardware. It offloads KV cache to CPU memory, local SSD, Redis, Valkey, S3-compatible object storage, and InfiniStore. It transfers cache between prefill and decode workers using NVLink, RDMA, or plain TCP. LMCache's daemon process runs independently of the inference engine, so a crash in vLLM or SGLang does not wipe the cache. That is a meaningful operational win: in production, engine restarts are common, and losing the entire cached context on restart wastes the prefill compute of every active session.
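To make the tiering idea concrete, here is a minimal sketch of a multi-level store in which entries evicted from a fast tier are demoted to a slower one, and a hit in a slow tier is promoted back to the fastest. Everything in it is an assumption for illustration: the `Tier` and `TieredStore` classes, the tier names, and the LRU policy are hypothetical, in-memory dictionaries stand in for real backends such as CPU memory, SSD, or Redis, and nothing here reflects LMCache's actual API or eviction logic.

```python
from collections import OrderedDict


class Tier:
    """One storage level with LRU eviction; capacity is counted in entries."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        """Store an entry; return the evicted (key, value) pair, or None."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)  # least recently used
        return None


class TieredStore:
    """Tiers ordered fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = tiers

    def put(self, key, value, level: int = 0):
        # Evicted entries cascade down; past the last tier they are dropped.
        while level < len(self.tiers):
            evicted = self.tiers[level].put(key, value)
            if evicted is None:
                return
            key, value = evicted
            level += 1

    def get(self, key):
        """Return (value, name of the tier that served it), or (None, None)."""
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:
                    del tier.items[key]
                    self.put(key, value)  # promote to the fastest tier
                return value, tier.name
        return None, None


store = TieredStore([Tier("gpu", 1), Tier("cpu", 1), Tier("disk", 2)])
for name in ("a", "b", "c"):
    store.put(name, f"kv-{name}".encode())

value, served_from = store.get("a")  # "a" was demoted twice, so disk serves it
```

The design point is that a miss in the fastest tier is not a full miss. As long as some tier still holds the entry, the cost is a transfer, not a fresh prefill.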

Reported results back the approach. In benchmarks on AMD MI300X for agentic workloads, the LMCache team reported lower time-to-first-token and higher throughput. Cohere uses LMCache with CoreWeave for efficient inference. Redis published a case study showing faster responses and lower cost per request. The project crossed 5,000 GitHub stars in August 2025 and joined the PyTorch Foundation in October 2025.

What makes LMCache more than a caching library is the observability stack. The project exposes Kubernetes-standard health metrics alongside KV-cache-specific counters: request-level and token-level prefix cache hit rates, lifecycle tracking, and per-user usage. For teams running long-context agents or multi-turn chatbots, those metrics turn the cache from a black box into a tunable resource. You can see exactly where cache misses cost latency and adjust the tiering policy accordingly.
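Request-level and token-level hit rates can tell quite different stories, which a small calculation makes clear. The definitions below are assumptions chosen for illustration, and LMCache's own metric definitions may differ: here a request counts as a hit if it reused at least one cached token, while the token-level rate is the share of all prompt tokens served from cache. The `RequestRecord` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    prompt_tokens: int  # total tokens in the prompt
    cached_tokens: int  # tokens whose KV state came from the cache


def request_hit_rate(records: list[RequestRecord]) -> float:
    """Fraction of requests that reused at least one cached token."""
    if not records:
        return 0.0
    return sum(r.cached_tokens > 0 for r in records) / len(records)


def token_hit_rate(records: list[RequestRecord]) -> float:
    """Fraction of all prompt tokens that were served from the cache."""
    total = sum(r.prompt_tokens for r in records)
    if total == 0:
        return 0.0
    return sum(r.cached_tokens for r in records) / total


# One long-context request with heavy reuse, three short cold requests.
records = [
    RequestRecord(prompt_tokens=4000, cached_tokens=3000),
    RequestRecord(prompt_tokens=100, cached_tokens=0),
    RequestRecord(prompt_tokens=100, cached_tokens=0),
    RequestRecord(prompt_tokens=100, cached_tokens=0),
]

per_request = request_hit_rate(records)  # 1 of 4 requests hit the cache
per_token = token_hit_rate(records)      # 3000 of 4300 prompt tokens were cached
```

In this example only a quarter of requests hit the cache, yet roughly 70 percent of prompt tokens avoided prefill. Tracking both numbers shows whether the cache is helping many requests a little or a few requests a lot.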

The research lineage matters here. The lead author, Yihua Cheng, and the team behind LMCache also published CacheGen (SIGCOMM 2024), which compressed and streamed KV cache for fast serving, and CacheBlend (EuroSys 2025), which enabled cached knowledge fusion for RAG. LMCache is the productionization of that research line. The paper, published on arXiv in 2025, frames the project as an enterprise-scale KV cache layer, but the architecture is equally relevant for a single-instance deployment running a 7B model on a consumer GPU.

The biggest open question is whether persistent KV cache will change how applications are built. If the cache lives across sessions and survives engine restarts, developers can treat the model’s context window as a durable store rather than a scratchpad. That has implications for agentic systems that accumulate state over hours or days. It also raises privacy and security questions: if cached KV blocks contain user data, the cache itself becomes a data store that needs access controls, encryption at rest, and lifecycle management.

LMCache does not solve those problems yet. The project provides the plumbing. The policy layer is still the application’s responsibility.

For AI builders, the takeaway is practical. LMCache removes a structural inefficiency in LLM inference that most teams have accepted as inevitable. The KV cache is not a temporary byproduct. It is reusable infrastructure, and treating it as such cuts latency and cost without changing the model.