Liquid AI's 8B-A1B MoE: 38T Tokens, 128K Context, and a Bet on Reasoning

Liquid AI released LFM2.5-8B-A1B, a MoE model with 8B total parameters but only 1B active per token, pretrained on 38T tokens with a 128K context window.

The model uses a reasoning-only architecture, producing explicit chain of thought before every answer, leveraging MoE's cheap compute for long reasoning traces on edge hardware.

Targeted reinforcement learning raised the non-hallucination rate on the AA-Omniscience benchmark from 7.46% to 63.47%, and a preference optimization stage addresses the “doom loop” problem of repeated phrases.

Liquid AI's new 8B-A1B MoE, trained on 38T tokens with a 128K context window, challenges assumptions about on-device model capability.

Tessera Newsroom · 4 min read · May 30, 2026

Source: Liquid AI reveals 8B-A1B MoE trained on 38T (liquid.ai)

Liquid AI released LFM2.5-8B-A1B on May 28, a mixture-of-experts model with 8 billion total parameters but only 1 billion active per token. The headline numbers are easy to miss in a week full of frontier-model news, but they deserve attention: 38 trillion tokens of pretraining, a 128,000-token context window, and a deliberate shift to a reasoning-only architecture.

The model is the sequel to LFM2-8B-A1B, which Liquid shipped in October 2025. The differences are striking. Pretraining scale roughly tripled from 12T to 38T tokens. Context window quadrupled from 32K to 128K. Vocabulary nearly doubled from 65,536 to 128,000 tokens, with particular gains in non-Latin scripts: Hindi tokenization efficiency improved 120%, Thai 238%, Vietnamese 118%, and Arabic 39%. The old model was a general-purpose dense-plus-MoE hybrid. The new one is reasoning-only, producing an explicit chain of thought before every answer.

This is the bet. Liquid argues that MoE models run in a compute-bound regime on edge hardware, where a small number of active parameters makes each reasoning token cheap. So they lean into that cheapness, generating long chains of thought without the latency penalty that would cripple a dense model of similar total size. The result, they claim, is “competitive with much larger dense and MoE models on instruction following and agentic tasks.”
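The arithmetic behind that argument can be sketched in a few lines. The figures below are back-of-envelope assumptions, not Liquid's numbers: the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token (which ignores attention and routing overhead), the nominal 8B and 1B parameter counts, a hypothetical 2,000-token reasoning trace, and 4-bit weights.

```python
# Back-of-envelope only: every constant here is an assumption, not a Liquid AI figure.

def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token (about 2 per active weight)."""
    return 2.0 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold all weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 8e9    # nominal "8B" total parameters
ACTIVE_PARAMS = 1e9   # nominal "1B" parameters active per token
TRACE_TOKENS = 2_000  # hypothetical chain-of-thought length

moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
compute_ratio = dense_flops / moe_flops

# Cost of one full reasoning trace on the MoE, under the same rule of thumb.
moe_trace_flops = moe_flops * TRACE_TOKENS

# All experts must stay resident even though only a few fire per token.
resident_gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=0.5)  # assumes 4-bit weights

print(f"dense/MoE compute per token: {compute_ratio:.1f}x")
print(f"MoE FLOPs for a {TRACE_TOKENS}-token trace: {moe_trace_flops:.2e}")
print(f"resident weight memory: {resident_gb:.1f} GB")
```

Under these assumptions, per-token compute tracks the active parameter count while the memory footprint tracks the total, which is why a sparse model can afford long reasoning traces but still needs room for all 8B weights.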

The benchmarks back parts of that claim. LFM2.5-8B-A1B scores 91.84 on IFEval, up from 79.44. MATH500 jumps from 74.80 to 88.76. AIME25 goes from 20.00 to 42.53. The Berkeley Function Calling Leaderboard v3 score moves from 45.07 to 64.36. The AA-Omniscience Index, which rewards correct answers and penalizes hallucinations, improves from -78.42 to -24.70. The non-hallucination rate on that benchmark climbs from 7.46% to 63.47%.

That last number is the most interesting. Edge models hallucinate more because they have less parametric knowledge. Liquid addressed this with a targeted reinforcement learning stage that uses an avg@k-based reward over a diverse knowledge dataset. The goal is to reinforce abstention on queries beyond reliable knowledge while preserving existing knowledge. The result is a sharper knowledge boundary and clearer expression of uncertainty. This is a practical innovation that matters more for deployed products than many capability improvements.
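Liquid has not published the reward itself, so the following is only an illustrative sketch of how an avg@k signal could drive abstention. It assumes avg@k means the fraction of k sampled answers that are correct, and that this fraction decides whether a question counts as inside the model's reliable knowledge. The `Outcome` enum, the 0.5 threshold, and the reward values are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    WRONG = "wrong"
    ABSTAIN = "abstain"


def avg_at_k(samples_correct: list[bool]) -> float:
    """Fraction of k sampled answers that were graded correct."""
    if not samples_correct:
        raise ValueError("need at least one sample")
    return sum(samples_correct) / len(samples_correct)


def reward(outcome: Outcome, known_rate: float, threshold: float = 0.5) -> float:
    """Hypothetical reward: values and threshold are assumptions, not Liquid's.

    Correct answers are always rewarded and wrong answers always penalized.
    Abstaining pays off only when avg@k says the model rarely knows the answer.
    """
    if outcome is Outcome.CORRECT:
        return 1.0
    if outcome is Outcome.WRONG:
        return -1.0
    return 0.5 if known_rate < threshold else -0.5


# A question the model gets right in 1 of 4 samples: abstention is encouraged.
shaky = avg_at_k([True, False, False, False])
# A question it gets right in 4 of 4 samples: abstention is discouraged.
solid = avg_at_k([True, True, True, True])

print(reward(Outcome.ABSTAIN, shaky), reward(Outcome.ABSTAIN, solid))
```

The asymmetry is the point of the design: penalizing abstention on well-known questions is one way to preserve existing knowledge while still rewarding “I don't know” where sampled answers are unreliable.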

Liquid also tackled the doom loop problem. Long reasoning traces in small models tend to get stuck repeating phrases like “Wait…” The team added a preference optimization stage that identifies tokens triggering looping behavior in specific contexts and redistributes probability mass toward plausible alternatives. During RL, a lightweight shaping reward discourages excessive use of common restart words. The company promises a dedicated blog post with full details on the pipeline, objective, and empirical results.
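Until that post appears, the two ideas can only be illustrated generically. The sketch below is not Liquid's pipeline: it shows a simple detector for a phrase repeating back-to-back at the end of a trace, and a shaping penalty that grows with each restart word beyond a free allowance. The function names, the word list, the allowance, and the weight are all hypothetical.

```python
def trailing_repeats(tokens: list[str], n: int) -> int:
    """Count how many times the final n-gram repeats back-to-back at the end."""
    if n <= 0 or len(tokens) < n:
        return 0
    tail = tokens[-n:]
    count = 1
    pos = len(tokens) - 2 * n
    while pos >= 0 and tokens[pos:pos + n] == tail:
        count += 1
        pos -= n
    return count


def restart_penalty(
    tokens: list[str],
    restart_words: frozenset[str] = frozenset({"wait", "hmm"}),  # assumed list
    free_uses: int = 2,    # assumed allowance before any penalty applies
    weight: float = 0.1,   # assumed penalty per excess use
) -> float:
    """Return a non-positive shaping reward for overusing restart words."""
    uses = sum(1 for t in tokens if t.strip(".,!?…").lower() in restart_words)
    excess = max(0, uses - free_uses)
    return -weight * excess if excess else 0.0


trace = ["so", "x", "Wait", "let", "Wait", "let", "Wait", "let"]
loop_count = trailing_repeats(trace, n=2)   # the pair "Wait let" closes the trace
penalty = restart_penalty(trace)            # three uses of "Wait", two are free

print(loop_count, penalty)
```

A detector like this would only flag candidate loops; the preference optimization step the article describes, which shifts probability mass at the triggering token, is a training-time change that this sketch does not attempt to reproduce.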

The model is available on Hugging Face and the Liquid Playground, with day-one support for llama.cpp, MLX, vLLM, and SGLang. That matters. Sparse MoE inference is notoriously difficult to make fast on consumer hardware. Liquid claims the fastest throughput in its size class on both CPU and GPU. If that holds, it changes what is possible on an entry-level laptop or a phone.

What is missing from the announcement is equally revealing. Liquid does not claim that LFM2.5-8B-A1B beats GPT-4o or Claude 4.5 on general knowledge. It does not claim state-of-the-art on MMLU or HumanEval. The benchmarks it publishes are weighted toward instruction following, function calling, math reasoning, and agentic workflows. This is a tool-calling model for on-device agents, not a general-purpose chatbot.

The competitive landscape is shifting. Microsoft Phi-4, Apple’s on-device models, and Google’s Gemma series all target similar use cases. But most of those are dense models. Liquid is betting that MoE with reasoning is the right architecture for edge deployment, where memory bandwidth is the bottleneck and compute is cheap. A 1B active parameter model that can reason through a math problem or chain multiple tool calls is a different proposition from a 1B dense model that cannot.

The practical implication for builders is straightforward. If you are deploying AI on consumer hardware, the tradeoff between model size and capability just shifted. LFM2.5-8B-A1B suggests that a small active-parameter count, when combined with massive pretraining, explicit reasoning, and targeted RL for hallucination reduction, can produce a model that feels much larger than it is. The 38T token pretraining budget is the key enabler. That is more tokens than many 70B dense models see.

The open question is whether the reasoning-only design limits the model in practice. Liquid made an explicit choice to produce a chain of thought before every answer. That works well for math, tool calling, and instruction following. It may be wasteful for simple factual queries where a direct answer is sufficient. The company is betting that the latency cost of reasoning is worth the quality gain. On edge hardware, where every millisecond counts, that bet is not trivial.

LFM2.5-8B-A1B is a reminder that the frontier is not the only place where interesting AI research happens. The edge is getting serious attention, and the architectures that win there may look very different from the dense transformers that dominate the data center.
