Software / T-2026-3986

llama.cpp b9637 adds native Cohere2MoE support, shrinking the gap between open-source and proprietary inference

Key point 1: llama.cpp release b9637 adds a dedicated parser for Cohere2MoE (North Code) models, fixing crashes and silent failures when loading Cohere weights.

Key point 2: Cohere's MoE architecture uses a learned gating mechanism and a different expert-count configuration, unlike the Mixtral-style top-k selection the generic parser assumed.

Key point 3: The parser, contributed by developer CISC, lets North Code run locally on consumer hardware without cloud dependency, expanding llama.cpp's native architecture support.


Tessera Newsroom · 4 min read · June 15, 2026

Source ggerganov/llama.cpp b9637 (github.com)


The latest release of llama.cpp, tagged b9637, ships a dedicated parser for Cohere2MoE models, also known as North Code. The change, merged in pull request #24615, is a narrow technical fix. It adds a specialized code path for parsing the architecture that Cohere uses in its latest generation of models. But the narrowness is the point.

For months, running Cohere’s Command R+ or North Code on llama.cpp required workarounds. The models use a variant of mixture-of-experts (MoE) that does not map cleanly onto the generic MoE parser that the project has carried since early 2024. Users reported silent failures, suboptimal performance, or outright crashes when loading Cohere weights. The new parser, contributed by a developer named CISC, fixes that.

The release notes are terse. “chat: add dedicated Cohere2MoE (North Code) parser,” reads the commit message, with a follow-up line: “Some renames to make @CISC happy :>”. The tone is casual. The effect is structural.

Cohere’s MoE architecture differs from the Mixtral-style MoE that most open-weight models use. In a Mixtral model, each token activates two experts out of eight, and the router is a simple top-k selection. Cohere’s implementation uses a learned gating mechanism and a different expert-count configuration. The generic parser in llama.cpp assumed the Mixtral pattern. The new parser handles Cohere’s pattern natively, without a translation layer.
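The Mixtral-style routing described above can be sketched in a few lines of Python. This is a minimal illustration of one common top-k formulation (pick the k highest router scores, then softmax over just those), not code from llama.cpp or from either model. The function names `top_k_route` and `moe_layer` and the toy experts are hypothetical, and nothing here attempts to model Cohere's gating, whose details the release notes do not describe.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, shifted by the max for stability.
    peak = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the chosen experts' outputs for a single token."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        for j, value in enumerate(experts[idx](token)):
            out[j] += weight * value
    return out

# Eight toy experts (hypothetical): expert s simply scales the token by s.
experts = [lambda t, s=s: [s * x for x in t] for s in range(8)]
logits = [0.0, 1.0, 4.0, 2.0, 3.0, -1.0, 0.5, 0.2]

print(top_k_route(logits))                 # experts 2 and 4 are selected
print(moe_layer([1.0, 2.0], experts, logits))
```

Only two of the eight experts run for the token, which is where the compute saving comes from. An inference engine that hard-codes this pattern will mishandle any model whose router selects or weights experts differently.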

This matters because Cohere has been pushing its North Code model as a developer-facing coding assistant, competing with models like DeepSeek Coder and Code Llama. Cohere offers an API, but the company also releases weights under a permissive license. Until b9637, running those weights locally on llama.cpp required patching or tolerating degraded behavior.

The b9637 release also ships binaries across 22 platforms, including Apple Silicon arm64, Intel x64, Ubuntu with ROCm 7.2, Vulkan, OpenVINO, SYCL, and Windows with CUDA 12 and CUDA 13. The KleidiAI-enabled build for Apple Silicon is marked as DISABLED, with a link to an open pull request. The project maintains a build matrix that covers everything from s390x mainframes to Android phones.

For the local AI community, b9637 closes a specific gap. The number of architectures that llama.cpp supports natively is growing. The project started with a single transformer decoder. It now handles LLaMA, Mistral, Mixtral, Qwen, DeepSeek, Phi, Gemma, Falcon, Command R, and Cohere2MoE. Each new parser is a signal that the model ecosystem is diversifying, and that the open-source inference stack is keeping pace.

The implication for AI builders is practical. If you are deploying a coding assistant on consumer hardware, you can now evaluate North Code without a cloud dependency. The model runs on a MacBook, a Linux workstation, or a Windows gaming machine with a CUDA GPU. The parser change is invisible to the end user. The model just works.

This is the quiet work that makes local AI viable. Not a new model release, not a benchmark claim, not a funding round. A parser. A fix for a specific architecture. A commit that makes a developer’s life easier.

The broader pattern is worth watching. MoE architectures are proliferating because they offer better performance per unit of inference compute. Because only a few experts run for each token, an MoE model with roughly 7B active parameters can rival a 13B dense model on many tasks while using less compute at inference time. But MoE adds complexity to the inference stack. Each implementation has its own routing logic, expert count, and gating mechanism. The generic parser cannot keep up.
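The compute argument comes down to simple arithmetic: every expert has to be stored, but only the routed ones run for each token. The sketch below assumes a simplified model in which parameters split into a shared part (attention, embeddings) and identical experts. The function name `moe_param_counts` and all of the numbers are hypothetical and do not describe North Code or any other real model.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE."""
    if not 0 < top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared + n_experts * per_expert   # everything that must be stored
    active = shared + top_k * per_expert      # what actually runs per token
    return total, active

# Hypothetical sizes chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(
    shared=2_000_000_000, per_expert=1_000_000_000, n_experts=8, top_k=2
)
print(f"total={total / 1e9:.1f}B active={active / 1e9:.1f}B "
      f"ratio={active / total:.2f}")
# prints: total=10.0B active=4.0B ratio=0.40
```

The gap between the two figures matters on consumer hardware: per-token compute tracks the active count, but memory still has to hold the total.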

llama.cpp’s approach has been to add dedicated parsers as new architectures emerge. The project now has parsers for Mixtral, DeepSeek MoE, Qwen MoE, and Cohere2MoE. Each one is a bet that the architecture will have staying power. The Cohere2MoE parser is a bet that North Code is not a one-off.
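The general shape of that approach, dedicated handlers with a generic fallback, can be sketched as a small registry. This is an illustration of the design pattern only and is not llama.cpp's code, which is written in C++. Every name below (`register`, `parse`, `parse_generic`, `parse_cohere2moe`) and the metadata fields `arch`, `n_experts`, and `top_k` are hypothetical.

```python
from typing import Callable, Dict

Parser = Callable[[dict], dict]
_PARSERS: Dict[str, Parser] = {}

def register(arch: str):
    """Decorator that registers a dedicated parser for one architecture."""
    def wrap(fn: Parser) -> Parser:
        _PARSERS[arch] = fn
        return fn
    return wrap

def parse_generic(meta: dict) -> dict:
    # Fallback that assumes a Mixtral-style layout and fills in defaults,
    # which is how a mismatched model can end up silently misconfigured.
    return {
        "arch": meta["arch"],
        "handler": "generic",
        "n_experts": meta.get("n_experts", 8),
        "top_k": meta.get("top_k", 2),
    }

@register("cohere2moe")
def parse_cohere2moe(meta: dict) -> dict:
    # The dedicated path reads its fields explicitly, so a missing value
    # raises KeyError instead of being replaced by a wrong default.
    return {
        "arch": meta["arch"],
        "handler": "dedicated",
        "n_experts": meta["n_experts"],
        "top_k": meta["top_k"],
    }

def parse(meta: dict) -> dict:
    """Use the dedicated parser when one is registered, else the generic one."""
    return _PARSERS.get(meta["arch"], parse_generic)(meta)

print(parse({"arch": "cohere2moe", "n_experts": 16, "top_k": 4}))
print(parse({"arch": "some-new-moe"}))
```

The trade-off is the one the article describes: each dedicated entry is extra code to maintain, while the fallback stays convenient only as long as new models resemble the pattern it assumes.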

The release also highlights the role of individual contributors in open-source infrastructure. CISC, the developer who wrote the parser, is not a Cohere employee. The work was done on their own time, submitted as a pull request, and merged by the maintainers. This is how the open-source AI stack gets built: one parser at a time, by people who need the tool to work.

For the industry, the takeaway is that the gap between proprietary and open-source inference is shrinking. Not because open-source models are catching up to GPT-5 or Claude 4, but because the software stack is becoming more capable. A model is only as useful as the software that runs it. With b9637, North Code is now runnable on one of the most widely deployed local inference engines.

The KleidiAI build being disabled is a reminder that not everything works. The project is transparent about failures. The link to the open pull request tells users exactly where the problem is and what the status is. That transparency is a feature of the open-source model, not a bug.

The next milestone for llama.cpp will be support for the next generation of MoE architectures. Cohere is already working on North Code 2. DeepSeek has a new MoE variant. Mistral is rumored to be shipping a larger MoE model. Each new architecture will require a new parser, or a generalization of the existing one.

For now, b9637 is a small release with a specific fix. It makes one more model work on one more piece of hardware. That is the kind of progress that does not make headlines but does make local AI real. The parser is live. The model runs. The developer moves on to the next problem.
