Key points

llama.cpp b9585 fixes Granite Speech inference by applying an embedding scale when deepstack is not used.

The fix ensures Granite Speech works correctly on edge hardware like phones and laptops for local voice interfaces.

This open-source community maintenance is critical for making specialized models usable outside data centers.

llama.cpp b9585: Fixing Granite Speech on the Edge

llama.cpp b9585 fixes Granite Speech model inference on edge devices. A small commit with big implications for on-device AI.

Tessera Newsroom · 4 min read · June 10, 2026

Source: ggerganov/llama.cpp b9585 (github.com)

The latest release of llama.cpp, tagged b9585 on June 9, does one thing. It fixes inference for IBM’s Granite Speech model when the deepstack architecture is not in use. The fix is a single line: applying an embedding scale factor. Yet this tiny commit tells a larger story about where AI inference is heading and who is doing the work to get it there.

llama.cpp is the open-source inference engine that brought large language models to consumer hardware. Created by Georgi Gerganov, the project now has over 116,000 GitHub stars. It supports everything from Apple Silicon to Android phones to IBM mainframes. The b9585 release ships binaries for 23 platform variants, including Ubuntu s390x, Windows ARM64, and Android arm64.

The fix targets Granite Speech, a family of speech recognition and speech translation models IBM open-sourced in early 2025. Granite Speech is not a general-purpose chatbot. It is a specialized model for voice interfaces and transcription. Running such a model locally on a phone or laptop avoids sending audio data to a cloud API. That matters for latency, for privacy, and for offline use.

The bug was subtle. Granite Speech uses a “deepstack” architecture where multiple transformer layers are stacked. When deepstack is active, the model applies an embedding scale internally. When deepstack is not used, the scale must be applied explicitly. The llama.cpp graph implementation was skipping that step. The result: degraded inference quality on any configuration that did not use deepstack.
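The shape of the bug can be sketched in a few lines. This is an illustration only, not llama.cpp's code: the function name, the parameters, and the scale value of 12.0 are all hypothetical, and the deepstack branch simply takes the description above at face value (the scale is handled elsewhere on that path).

```python
# Illustrative sketch only. All names and values here are hypothetical;
# this is not llama.cpp's graph code or its API.

def build_input_embeddings(token_embeddings, embedding_scale, uses_deepstack):
    """Return the embeddings that would feed the first transformer layer."""
    if uses_deepstack:
        # Assumption from the article: on this path the scale is applied
        # elsewhere, so the embeddings pass through unchanged here.
        return [list(row) for row in token_embeddings]
    # Non-deepstack path: nothing else applies the scale, so it has to
    # happen explicitly. Skipping this multiplication is the kind of
    # omission the article describes.
    return [[value * embedding_scale for value in row] for row in token_embeddings]


embeddings = [[0.5, -1.0], [2.0, 0.25]]
scaled = build_input_embeddings(embeddings, embedding_scale=12.0, uses_deepstack=False)
unscaled = build_input_embeddings(embeddings, embedding_scale=12.0, uses_deepstack=True)
```

In this toy version, forgetting the multiplication on the non-deepstack path would hand the first layer values at the wrong magnitude. The model would still produce output, just worse output, which is consistent with the degraded quality described above.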

Pull request 24357, authored by gabe-l-hart with a suggestion from Xuan Son Nguyen of Hugging Face, fixed the issue. The commit message is direct: “apply embedding scale when deepstack is not used.” No fanfare. No blog post. Just a correction that makes a model work correctly on edge hardware.

This is the kind of work that does not make headlines. It does not involve a new model release, a funding round, or a policy debate. It is engineering maintenance. But it is the maintenance that determines whether a model is usable outside a data center.

The edge inference pipeline

llama.cpp occupies a unique position in the AI stack. It is not a training framework. It is not a model hub. It is the runtime that translates model weights into usable output on whatever hardware is available. The project supports CUDA, Vulkan, ROCm, OpenVINO, SYCL, and plain CPU inference. The b9585 release includes binaries for ROCm 7.2, OpenVINO 2026.0, and CUDA 12 and 13.

The breadth of the build matrix is itself a statement. The project ships for IBM’s s390x architecture, a mainframe platform. It ships for Android ARM64. It ships for Windows x64 with HIP support. Each binary represents a decision to support a hardware platform that a user might actually own.

Two build targets are disabled in b9585: macOS Apple Silicon with KleidiAI, and Ubuntu x64 with SYCL FP32. The KleidiAI disable links to pull request 23780, which is not yet resolved. The SYCL disable links to pull request 23705. These are temporary. The project ships what works.

Why this matters for AI builders

The Granite Speech fix is a reminder that on-device AI is not just about model quality. It is about the quality of the inference engine. A model can be state-of-the-art, but if the runtime mishandles a scale factor, output quality degrades no matter how good the weights are.

For builders deploying voice interfaces on phones, the difference between working and broken inference is the difference between a product that ships and one that does not. Granite Speech is not the only speech model in the ecosystem, but it is one of the few designed for local execution. IBM published the model under an open license. The llama.cpp community made it run on a phone.

The fix also highlights the importance of the open-source inference layer. No single company owns llama.cpp. The project is maintained by a distributed group of contributors. The Granite Speech fix came from gabe-l-hart, with input from Nguyen at Hugging Face. The review process is public. The commit is signed with a GPG key. The release is automated via GitHub Actions.

This is not a polished commercial product. It is a community engine that ships 23 platform variants on a Tuesday afternoon.

The quiet frontier

The AI industry obsesses over frontier models. GPT-5, Claude 4, Gemini Ultra. These models require clusters of GPUs and millions of dollars in compute. They are impressive. They are also inaccessible to most users.

The frontier that matters for adoption is the one that runs on the device in your pocket. That frontier is built in projects like llama.cpp, in pull requests that fix embedding scales, in binary releases that support mainframes and phones alike.

b9585 is not a landmark release. It does not add a new model architecture or double inference speed. It fixes a bug in a speech model that most users have never heard of. But it makes that model work. And that is the entire point.

The next time a voice assistant responds instantly on a phone without internet, the chain of software that made it possible includes a commit from June 9, 2026, that applied an embedding scale when deepstack was not used. That is the work. That is the frontier.
