Key points

llama.cpp b9803 fixes an OpenCL profiling bug in which an incomplete profiling batch lost its timing data at shutdown.

The release lists 27 prebuilt binaries spanning an unusually broad range of operating systems, CPU architectures, and GPU backends.

The KleidiAI and openEuler builds are marked DISABLED, which suggests the maintainers would rather ship without a backend than ship a broken one.


llama.cpp b9803: The quiet infrastructure of local AI

llama.cpp b9803 ships with a minor OpenCL fix and 27 platform binaries. The release tells a bigger story about where local AI is headed.

Tessera Newsroom · 5 min read · June 26, 2026

Source ggerganov/llama.cpp b9803 (github.com)


The llama.cpp b9803 release landed on June 26 with a single patch note: “opencl: flush profiling batch at shutdown for incomplete batches.” The fix, tied to pull request #25016, addresses a case where OpenCL profiling data goes unrecorded because the program shuts down while a profiling batch is still incomplete. It is a narrow bug, the kind that surfaces only when a run ends early, for example when a user stops a model mid-generation.

But the release is not really about the bug. It is about the 27 downloadable binaries that accompany it.

llama.cpp now ships prebuilt binaries for macOS on Apple Silicon (arm64) and Intel (x64); iOS; Ubuntu on x64, arm64, and s390x; Android on arm64; and Windows on x64 and arm64. The accelerator backends covered include Vulkan, ROCm 7.2, OpenVINO 2026.2, SYCL (both FP32 and FP16), CUDA 12, CUDA 13, HIP, and OpenCL for Adreno. The release also lists openEuler builds for Ascend 310p and 910b accelerators, though those are marked DISABLED. The KleidiAI-enabled macOS build, which taps Arm’s KleidiAI microkernel library for matrix multiplication, is also marked DISABLED in this release, with a link to pull request #23780.

The project, maintained by Georgi Gerganov and a rotating cast of contributors, now compiles and distributes builds for a wider range of hardware than commercial AI inference stacks typically attempt. Few other projects build from a single codebase for an Android phone, an IBM Z mainframe (s390x), a Windows laptop with an AMD GPU via HIP, and an Intel Arc GPU via SYCL.

The disabled KleidiAI build is worth watching. Arm’s KleidiAI microkernel library, announced in 2024, promises significant matmul performance gains on Arm CPUs. The fact that the build is disabled suggests either an integration issue or a regression. For a project that cares about token throughput, a disabled optimization suggests that the maintainers value correctness over peak benchmark numbers. That is the right call for a project used in production by developers who need predictable behavior.

The broader picture is that llama.cpp has become the de facto reference implementation for local inference. It is not the fastest on any single piece of hardware. NVIDIA’s TensorRT-LLM and the open-source vLLM project beat it on H100 clusters. Apple’s MLX is faster on Apple Silicon for specific model architectures. But no other project covers the long tail of hardware that llama.cpp does.

This matters for AI builders in a specific way. The industry narrative around inference has focused on hyperscaler deployment: serving models from massive GPU clusters in data centers. That is the business model for OpenAI, Anthropic, and Google. But the most interesting applications of large language models will not run in data centers. They will run on devices: a farmer in rural India querying a crop-disease model on a mid-range Android phone, a field medic running a diagnostic model on a laptop with no internet connection, a factory floor where a vision-language model runs on an edge server with an Intel Arc GPU because the data cannot leave the premises.

llama.cpp is the infrastructure that makes those use cases possible. The project’s ggml tensor library abstracts away the hardware details so that a developer can write inference code once and run it on any backend. The b9803 release ships 27 prebuilt combinations because the project treats platform support as a first-class feature, not an afterthought.
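The "write once, run on any backend" idea can be illustrated with a toy dispatch table. This is only a sketch of the general pattern, not ggml's actual API: the registry, the backend names `cpu` and `fake_gpu`, and the `run_layer` function are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]

# Hypothetical registry mapping a backend name to its matmul implementation.
BACKENDS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {}


def register_backend(name: str):
    """Decorator that adds a matmul implementation to the registry."""
    def decorator(fn: Callable[[Matrix, Matrix], Matrix]):
        BACKENDS[name] = fn
        return fn
    return decorator


@register_backend("cpu")
def cpu_matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain triple loop, standing in for a portable CPU kernel.
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                out[i][j] += a[i][k] * b[k][j]
    return out


@register_backend("fake_gpu")
def fake_gpu_matmul(a: Matrix, b: Matrix) -> Matrix:
    # A different implementation of the same operation (row by column),
    # standing in for a device-specific kernel.
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def run_layer(x: Matrix, w: Matrix, backend: str = "cpu") -> Matrix:
    """Application code: written once, unaware of which kernel runs."""
    return BACKENDS[backend](x, w)


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[5.0, 6.0], [7.0, 8.0]]
cpu_result = run_layer(x, w, backend="cpu")
gpu_result = run_layer(x, w, backend="fake_gpu")
```

The calling code in `run_layer` never changes when a backend is added; only the registry grows. That is the property the article attributes to ggml, shown here in miniature.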

The OpenCL fix is a small example of a larger philosophy. OpenCL is not a glamorous backend. It is the lowest-common-denominator GPU compute API, supported on everything from Intel integrated graphics to Qualcomm Adreno to AMD Radeon. It is slow, verbose, and unloved by most AI engineers. But it is everywhere. By flushing profiling data on shutdown for incomplete batches, the maintainers ensure that developers debugging performance on obscure hardware get accurate timing information. That is the kind of engineering that does not make headlines but makes production systems reliable.
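The flush-on-shutdown pattern behind the patch note can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from pull request #25016: the `BatchedProfiler` class, its batch size, and the list used as a sink are hypothetical stand-ins.

```python
class BatchedProfiler:
    """Hypothetical profiler that buffers timing records and writes them in batches."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink      # a list standing in for a profiling log
        self.pending = []     # records not yet written to the sink

    def record(self, kernel_name, elapsed_ms):
        self.pending.append((kernel_name, elapsed_ms))
        # A full batch is written out immediately.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.extend(self.pending)
            self.pending.clear()

    def shutdown(self):
        # Without this flush, a partially filled batch would be dropped.
        self.flush()


sink = []
profiler = BatchedProfiler(batch_size=4, sink=sink)
for i in range(6):
    profiler.record(f"kernel_{i}", 0.5 * i)

written_before_shutdown = len(sink)  # only the one full batch so far
profiler.shutdown()
written_after_shutdown = len(sink)   # the two leftover records are now included
```

Six records with a batch size of four leave two records stranded in the buffer; the shutdown flush is what rescues them.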

The release also highlights a tension in the open-source AI ecosystem. llama.cpp is maintained by a small group of volunteers. The project has 118,000 stars on GitHub and 19,900 forks. It is used by companies like GitHub (Copilot used llama.cpp in early prototypes), by startups like Ollama and LM Studio, and by thousands of individual developers. Yet the maintainers do not have a dedicated QA team, a hardware lab, or a budget for CI runners across all 27 target platforms. The fact that the builds exist at all is a testament to the project’s community and to Gerganov’s insistence on broad compatibility.

The KleidiAI disable and the openEuler DISABLED status are reminders that platform support is not free. Each backend requires maintenance, testing, and debugging. When a dependency changes or a hardware vendor updates its SDK, something breaks. The project’s approach is to disable the broken backend rather than hold the release. That is pragmatic, but it also means that some users will find their hardware unsupported until the next release cycle.
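The "disable rather than delay" approach can be pictured as a release matrix in which each target carries an enabled flag. The sketch below is an assumption about the general shape of such a matrix, not the project's actual CI configuration; the target names are illustrative labels rather than real asset names, and only the PR number comes from the release notes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BuildTarget:
    name: str             # illustrative label, not a real asset name
    enabled: bool = True
    note: str = ""


# A small, hypothetical excerpt of a release matrix.
TARGETS = [
    BuildTarget("macos-arm64"),
    BuildTarget("macos-arm64-kleidiai", enabled=False, note="DISABLED, see PR #23780"),
    BuildTarget("ubuntu-x64-vulkan"),
    BuildTarget("openeuler-ascend-310p", enabled=False, note="DISABLED"),
    BuildTarget("openeuler-ascend-910b", enabled=False, note="DISABLED"),
]


def plan_release(targets: List[BuildTarget]) -> Tuple[List[str], List[str]]:
    """Split targets into those that ship and those skipped for this release."""
    shipped = [t.name for t in targets if t.enabled]
    skipped = [t.name for t in targets if not t.enabled]
    return shipped, skipped


shipped, skipped = plan_release(TARGETS)
```

A disabled target stays in the matrix with a note explaining why, so the release goes out on schedule and the gap remains visible to users of that hardware.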

For AI builders, the takeaway is straightforward. If you are building an application that runs a language model on a device, llama.cpp is your best bet for broad hardware compatibility. The tradeoff is that you will not get the absolute best performance on any single platform. If you need maximum throughput on an H100, use TensorRT-LLM. If you need maximum throughput on an M4 Mac, use MLX. But if you need to ship a product that works on a customer’s machine, whatever that machine is, llama.cpp is the only option that covers the field.

The b9803 release does not contain a new model architecture or a breakthrough in inference speed. It contains a bug fix for OpenCL profiling and 27 platform binaries. That is the quiet work of making local AI real.
