Software / T-2026-5908

llama.cpp b9837 adds a reasoning-preserve flag, and that matters more than it sounds

llama.cpp b9837 adds a single CLI flag, --reasoning-preserve, that keeps reasoning tokens such as <think> blocks in model output.

The flag serves agent builders who need to see chain-of-thought tokens; stripping them by default breaks multi-step systems.

The release also disables KleidiAI on macOS Apple Silicon, highlighting ongoing rough edges on non-NVIDIA hardware.

llama.cpp b9837 ships a `--reasoning-preserve` flag for chat templates. The feature is small. The problem it solves is not.

Tessera Newsroom · 6 min read · June 29, 2026

Source ggerganov/llama.cpp b9837 (github.com)

The llama.cpp b9837 release ships exactly one user-facing change. The release notes are a single line: “jinja, chat: add --reasoning-preserve flag.” That is it. No model support bump. No new backend. No quantization format change. A single CLI flag.

That flag is worth paying attention to.

llama.cpp is the most widely deployed open-source inference engine for large language models. The project, started by Georgi Gerganov in March 2023, has accumulated 119,000 stars on GitHub and ships prebuilt binaries for macOS, Linux, Windows, Android, and iOS. It runs on CPUs, on GPUs and other accelerators through backends including CUDA, Vulkan, ROCm, SYCL, and OpenVINO, and on Arm CPUs with optimized kernels from Arm’s KleidiAI library. The b9837 release alone offers 27 downloadable asset variants across architectures and backends. This is not a niche project. It is the default runtime for tens of thousands of developers running local models.

What the --reasoning-preserve flag actually does is subtle. It modifies how llama.cpp’s Jinja-based chat template engine handles the reasoning tokens that models like DeepSeek-R1, Qwen3, and certain fine-tuned Llama variants emit during chain-of-thought generation. Many reasoning models wrap their internal monologue in special tokens: <think> and </think> in the case of DeepSeek-R1, or [Reasoning] and [/Reasoning] in others. The default behavior in llama.cpp has been to strip these tokens from the output, because most chat templates treat them as internal metadata, not as part of the final response. The --reasoning-preserve flag tells the engine to keep them.
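The difference between the two behaviors can be sketched in a few lines of Python. This illustrates the stripping described above and is not llama.cpp's implementation: the function name `render_output` and the `preserve` parameter are hypothetical, and only DeepSeek-R1-style <think> delimiters are handled.

```python
import re

# Hypothetical sketch: matches one DeepSeek-R1-style reasoning block.
# DOTALL lets the block span multiple lines; trailing whitespace is consumed.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def render_output(raw: str, preserve: bool = False) -> str:
    """Return model output with the reasoning block kept or stripped.

    `preserve` stands in for the effect of the flag. It is an assumption
    about observable behavior, not llama.cpp's actual code path.
    """
    if preserve:
        return raw
    return THINK_BLOCK.sub("", raw)


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
print(render_output(raw))                 # reasoning stripped
print(render_output(raw, preserve=True))  # reasoning kept
```

With `preserve=False` the caller sees only the final answer; with `preserve=True` the raw text passes through untouched and the caller decides what to do with it.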

That sounds like a minor formatting preference. It is not.

The tension here is between two competing design philosophies. The first says that reasoning tokens are implementation details. The model’s chain-of-thought is scaffolding. The user wants the answer, not the draft. This is the default in most hosted APIs. OpenAI’s reasoning models do not return their raw chain-of-thought through the chat completions endpoint. Anthropic’s Claude returns thinking content only when extended thinking is explicitly turned on. The second philosophy says that reasoning tokens are valuable output. They let users inspect the model’s logic. They enable debugging. They make it possible to build tools that audit or steer the reasoning process. They are, for some applications, the product.

The --reasoning-preserve flag is an explicit choice for the second philosophy. It is also a signal about where the open-source inference ecosystem is heading.

Consider what happens without the flag. A developer running DeepSeek-R1 locally through llama.cpp gets back clean responses. The model thinks internally, produces a chain-of-thought, and llama.cpp’s chat template engine discards it. The developer sees only the final answer. That is fine for a chatbot. It is not fine for a research tool that needs to audit reasoning, for an agent framework that wants to extract intermediate steps, or for a user who wants to understand why the model produced a given output.

The flag exists because developers asked for it. The pull request that introduced it, numbered 25105, was merged on June 29. The commit message is as sparse as the release notes: “jinja, chat: add --reasoning-preserve flag.” The discussion around it, visible in the pull request thread, centers on a practical problem: models that emit reasoning tokens break downstream parsers when those tokens are silently removed. A developer building a multi-step agent that chains calls to a reasoning model cannot afford to lose the intermediate reasoning. The chain-of-thought is the only signal the agent has about why a sub-task succeeded or failed.

This is the deeper story. The --reasoning-preserve flag is a small piece of plumbing, but it sits at the intersection of three trends that define the current moment in AI software.

The first trend is the normalization of chain-of-thought reasoning. DeepSeek-R1, released in January 2025, popularized the idea of visible reasoning tokens. Qwen3 followed. Several fine-tuned Llama variants adopted the pattern. The technique is no longer experimental. It is a standard feature of the model landscape. Inference engines have to handle it.

The second trend is the shift from chat to agents. A chatbot that strips reasoning tokens is fine. An agent that strips reasoning tokens is broken. The agent needs to see the model’s reasoning to decide whether to trust the output, whether to retry, whether to escalate. The --reasoning-preserve flag is a direct response to agent builders who found that the default behavior made their systems unreliable.
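A minimal sketch of how an agent loop might use preserved reasoning to decide whether to retry a step. Everything here is hypothetical: the `StepResult` type, the `DOUBT_MARKERS` list, and the retry policy are illustrative assumptions, and the splitter handles only <think>-style delimiters.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    reasoning: str  # text inside the <think> block, empty if none was found
    answer: str     # text after the closing tag


def split_step(raw: str) -> StepResult:
    """Split a response into reasoning and answer at the closing </think>."""
    head, sep, tail = raw.partition("</think>")
    if not sep:
        # No reasoning block: either stripped upstream or never emitted.
        return StepResult(reasoning="", answer=raw.strip())
    reasoning = head.replace("<think>", "", 1).strip()
    return StepResult(reasoning=reasoning, answer=tail.strip())


# Hypothetical phrases a toy policy treats as signs of low confidence.
DOUBT_MARKERS = ("not sure", "cannot verify", "guess")


def should_retry(step: StepResult) -> bool:
    """Toy policy: retry when reasoning is missing or expresses doubt."""
    if not step.reasoning:
        return True  # nothing to audit, so the step cannot be checked
    text = step.reasoning.lower()
    return any(marker in text for marker in DOUBT_MARKERS)
```

Under this policy a stripped response always triggers a retry, which is one way to see why silent stripping makes an agent wasteful or unreliable: the signal it needs is gone before it can look.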

The third trend is the fragmentation of the model ecosystem. There is no single standard for how reasoning tokens are formatted. DeepSeek-R1 and Qwen3 use <think> and </think>. Other models use bracketed markers such as [Reasoning] and [/Reasoning]. Some fine-tunes use XML-style tags. Others use markdown comments. The Jinja template engine in llama.cpp has to handle all of them. The --reasoning-preserve flag is a stopgap. It tells the engine: do not strip anything. Let the developer decide.
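Once reasoning is preserved, the developer inherits the format problem. One way to handle it, sketched below, is a small table of delimiter pairs tried in order. The `DELIMITERS` table and `extract_reasoning` function are hypothetical helpers, not part of llama.cpp, and the table covers only the two formats named in this article.

```python
import re

# Delimiter pairs named in the article; extend as new formats appear.
# This table is an illustrative convenience, not a llama.cpp structure.
DELIMITERS = [
    ("<think>", "</think>"),
    ("[Reasoning]", "[/Reasoning]"),
]


def extract_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) using the first delimiter pair that matches."""
    for open_tag, close_tag in DELIMITERS:
        pattern = re.compile(
            re.escape(open_tag) + r"(.*?)" + re.escape(close_tag),
            re.DOTALL,  # reasoning blocks usually span several lines
        )
        match = pattern.search(raw)
        if match:
            # Everything outside the matched block is treated as the answer.
            answer = (raw[:match.start()] + raw[match.end():]).strip()
            return match.group(1).strip(), answer
    # No known delimiters: treat the whole response as the answer.
    return "", raw.strip()
```

The design choice is deliberate: the table lives in application code, so supporting a new fine-tune's format is a one-line change on the developer's side.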

There is a parallel here with an earlier moment in the llama.cpp project. In late 2023, the project added support for GGUF metadata fields that let model authors specify chat templates directly in the model file. Before that, every inference engine had its own hardcoded template logic. The GGUF metadata approach won because it pushed the complexity to the model author. The --reasoning-preserve flag is the same pattern in reverse. It pushes the complexity to the developer. The model author can emit reasoning tokens however they want. The developer decides whether to keep them.

The release also includes a notable absence. The macOS Apple Silicon build with KleidiAI enabled is listed as “DISABLED” in the release assets, with a link to pull request 23780. KleidiAI is Arm’s library of optimized matrix-multiplication kernels for Arm CPUs, aimed at mobile and edge hardware. Its disablement suggests ongoing integration work or a regression. For developers targeting Apple Silicon, this means the prebuilt binaries fall back to the standard CPU kernels. The performance gap between the standard CPU path and the KleidiAI-accelerated path can be significant on devices like the M4 iPad Pro or the iPhone 16 Pro Max. The disablement is a reminder that even the most mature open-source inference project still has rough edges on non-NVIDIA hardware.

What does this mean for AI builders?

If you are building a product that uses a reasoning model, the default behavior of most inference engines will strip your reasoning tokens. You need to explicitly opt in to keeping them. The --reasoning-preserve flag in llama.cpp b9837 is one such opt-in. Expect similar flags to appear in other engines. Expect hosted API providers to add analogous parameters. The demand is there.

If you are building an agent framework, you should test with the flag enabled. Your agent’s ability to debug its own reasoning depends on it. If you are building a fine-tuned model that emits reasoning tokens, you should document the token format clearly. The Jinja template engine can handle custom formats, but only if the model author specifies them in the GGUF metadata.

The most interesting implication is for the frontier labs. DeepSeek, Qwen, and the Llama fine-tune ecosystem have made reasoning tokens a standard feature. The labs that keep reasoning hidden — OpenAI, Anthropic, Google DeepMind — are now the outliers. The --reasoning-preserve flag is a small piece of open-source infrastructure that embodies a bet: that visible reasoning is not a bug, but a feature. That users want to see the draft. That agents need the chain.

The bet may be wrong. But it is the bet that the most widely deployed open-source inference engine is making. And the engine has 119,000 stars.
