llama.cpp released b9568 on June 8, adding multi-turn processing (MTP) support for Gemma 4’s E2B and E4B assistant variants. The update is small in lines changed but large in what it signals: the open-source inference engine is now tracking Google’s most architecturally interesting model family, and doing so on hardware that fits in a pocket.
The release notes are terse. Four commits. A converter update to handle smaller assistant models. New masked_embd tensors for the gemma4-assist architecture. A debug-removal cleanup. A filter for masked_embedding tensors during conversion. Nothing flashy. But the underlying work is about making Gemma 4’s assistant models runnable, and that matters more than the commit count suggests.
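To make the conversion-time filter concrete, here is a minimal sketch of the idea. It is not the actual llama.cpp converter code. The tensor names, the `SKIP_SUBSTRINGS` constant, and the `filter_tensors` helper are all hypothetical, and the sketch assumes that "filtering" means skipping tensors whose names contain `masked_embedding`, which the release notes do not spell out.

```python
from typing import Dict, Tuple

# Assumption: tensors whose names contain this substring are skipped.
# The real converter's matching rule may differ.
SKIP_SUBSTRINGS = ("masked_embedding",)

def filter_tensors(
    tensors: Dict[str, Tuple[int, ...]],
    skip: Tuple[str, ...] = SKIP_SUBSTRINGS,
) -> Dict[str, Tuple[int, ...]]:
    """Return only the tensors whose names match none of the skip substrings."""
    return {
        name: shape
        for name, shape in tensors.items()
        if not any(s in name for s in skip)
    }

# Hypothetical checkpoint contents, mapping tensor name to shape.
checkpoint = {
    "model.embed_tokens.weight": (1000, 64),
    "model.masked_embedding.weight": (1000, 64),
    "model.layers.0.attn.q_proj.weight": (64, 64),
}

kept = filter_tensors(checkpoint)
```

In this toy checkpoint, `kept` retains the ordinary embedding and attention weights and drops the one tensor matched by the filter.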
Gemma 4, released by Google in April 2026, is not a single model. It is a family that includes base models, instruction-tuned variants, and what Google calls “assistants” — smaller, specialized models designed to be called by a larger “anchor” model during multi-turn interactions. The E2B and E4B variants are the two- and four-billion-parameter assistants. They handle sub-tasks like rewriting, fact-checking, or formatting while the anchor model maintains the overall conversation context.
This is a departure from the typical single-model inference pattern that llama.cpp has optimized for since its inception. Most local inference today runs one model end-to-end. The model sees a prompt, generates a response, and the loop resets. Gemma 4’s assistant architecture breaks that pattern. The anchor model can call an assistant mid-generation, wait for its output, and incorporate that output into the ongoing response. That is a fundamentally different execution model.
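The pause-and-resume pattern described above can be sketched as a toy control loop. This is an illustration under stated assumptions, not how Gemma 4 or llama.cpp implements it: the `Delegate` request type, the `run_anchor` driver, and the stand-in "models" (a plain Python generator and a lambda) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generator, List, Optional, Union

@dataclass
class Delegate:
    """Hypothetical request from the anchor to hand a sub-task to an assistant."""
    task: str
    payload: str

Step = Union[str, Delegate]

def run_anchor(
    anchor: Generator[Step, Optional[str], None],
    assistants: Dict[str, Callable[[str], str]],
) -> str:
    """Drive the anchor; when it delegates, run the assistant and feed the result back."""
    out: List[str] = []
    reply: Optional[str] = None
    try:
        while True:
            step = anchor.send(reply)  # the first send(None) starts the generator
            reply = None
            if isinstance(step, Delegate):
                # The anchor is paused here while the assistant runs.
                reply = assistants[step.task](step.payload)
                out.append(reply)
            else:
                out.append(step)
    except StopIteration:
        pass
    return "".join(out)

def toy_anchor() -> Generator[Step, Optional[str], None]:
    """Stand-in for an anchor model: emits text, delegates once, then continues."""
    yield "Summary: "
    formatted = yield Delegate(task="format", payload="hello world")
    yield f" ({len(formatted)} chars)"

# Stand-in for an assistant model that handles a formatting sub-task.
ASSISTANTS = {"format": lambda text: text.title()}

result = run_anchor(toy_anchor(), ASSISTANTS)
print(result)  # Summary: Hello World (11 chars)
```

The point of the generator is that the anchor's state survives the pause: it resumes with the assistant's output in hand, rather than restarting from a fresh prompt.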
llama.cpp’s b9568 does not yet support the full anchor-assistant pipeline. What it adds is the MTP infrastructure for the assistant models themselves. The converter now knows how to handle the masked_embd tensors that distinguish assistant models from base models. The gemma4-assist architecture is recognized at load time. The debug scaffolding from the initial implementation is gone. These are the plumbing changes that make the assistant models load and run correctly.
The broader implication is that llama.cpp is preparing for agentic workflows on local hardware. Multi-turn processing is the technical foundation for models that can iterate, call tools, and delegate sub-tasks. Gemma 4’s assistant architecture is one version of that pattern. Others will follow. OpenAI’s GPT-4o has tool-use baked in. Anthropic’s Claude has a tool-use API. Google’s own Gemini models support function calling. All of these rely on some form of multi-turn processing where the model can pause, receive new input, and continue.
Until now, running those patterns locally has taken extra work. llama.cpp’s core inference loop is fast and memory-efficient, but orchestrating one model that pauses while another runs has largely been left to the application around it. The b9568 release is a step toward closing that gap.
The release also carries a notable absence: the macOS Apple Silicon (arm64, KleidiAI enabled) build is disabled, linked to pull request 23780. KleidiAI is Arm’s library of optimized micro-kernels for Arm CPUs, which llama.cpp can use on Apple Silicon. The release notes do not say why the build is disabled; it may reflect a compatibility issue with the new Gemma 4 assistant code or a broader refactor of the KleidiAI integration. Either way, it is a reminder that adding architectural complexity has downstream costs. Every new model variant, every new tensor type, every new execution pattern increases the surface area for build failures.
The Windows x64 (SYCL) build is also disabled, linked to pull request 23705. SYCL is a Khronos Group standard for cross-platform parallel programming; in llama.cpp, the SYCL backend primarily targets Intel GPUs. Disabling the build suggests ongoing churn in that backend. For a project that ships 23 platform-specific builds, managing compatibility across all of them is a constant tax.
What matters for AI builders is the direction of travel. llama.cpp is the most widely used local inference engine in the open-source ecosystem. It runs on everything from a Raspberry Pi to a 64-core Threadripper. Its maintainers, led by Georgi Gerganov, have historically been conservative about adding architectural complexity. The project’s success comes from doing one thing well: running transformer models fast on commodity hardware.
Release b9568 signals that the project is now willing to take on the complexity of multi-model, multi-turn workflows. Gemma 4’s assistant architecture is the first test case. If it works, expect support for other multi-turn patterns to follow. Tool use, function calling, and agentic loops are all variations on the same theme.
The practical impact is modest today. Most users running llama.cpp will not notice the difference between b9567 and b9568. The Gemma 4 assistant models are not yet widely used. The anchor-assistant pipeline is not yet implemented. But the infrastructure is being laid.
For the local AI community, this is a sign that the era of single-model inference is ending. The future is models that call models. llama.cpp is getting ready for that future, one commit at a time.