Step 3.7 Flash: A 198B-Parameter MoE Model Built for the Agentic Workflow

Key points

Step 3.7 Flash is a 198B-parameter sparse MoE vision-language model engineered for agentic workflows, not just chat.

It scores 56.3 on SWE-Bench Pro and leads ClawEval-1.1 at 67.1, signaling production reliability for multi-turn tool orchestration.

Advisor Mode achieves 97% of Claude Opus 4.6's SWE-Bench Verified performance at roughly one-ninth the per-task cost.

StepFun's Step 3.7 Flash is a 198B-parameter MoE model designed for high-frequency agentic workloads, offering up to 400 tokens per second and competitive SWE-Bench scores.

Tessera Newsroom · 4 min read · June 1, 2026

Source: Step 3.7 Flash (producthunt.com)

StepFun, the Beijing-based AI lab behind the Step series of models, released Step 3.7 Flash on May 29. The model is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that activates about 11B parameters per token. It delivers up to 400 tokens per second and supports a 256k context window.

The headline is not the raw throughput. It is what StepFun built the model to do. Step 3.7 Flash is engineered for agentic workflows — the kind where a model does not just answer a question but parses a financial report, runs multi-step search loops, and calls tools across terminals, browsers, and APIs without drifting off course. StepFun calls it “a high-efficiency Flash model for real-world agents.”

That framing matters. The generative AI market is moving past chat. The next frontier is autonomous agents that act on behalf of users, and the models that power them need to be fast, cheap, and reliable at executing long-horizon tasks. Step 3.7 Flash is a bet that the winning architecture for that market is a sparse MoE model optimized for tool orchestration, not a monolithic dense model.

What the benchmarks show

Step 3.7 Flash posts competitive numbers across several agentic and coding benchmarks. On SWE-Bench Pro, it scores 56.3, placing it second behind Claude Opus 4.7 at 64.3 and ahead of DeepSeek V4 Flash at 55.6 and Gemini 3.5 Flash at 55.1. On Terminal-Bench 2.1, it scores 59.5, behind DeepSeek V4 Flash at 62.0 and Gemini 3.5 Flash at 76.2.

The model leads ClawEval-1.1 with a score of 67.1, 7.3 points ahead of the next closest competitor at 59.8. ClawEval-1.1 measures resistance to adversarial traps and adherence to system policies during multi-turn orchestration, which makes it a direct signal for production reliability.

On Toolathlon, Step 3.7 Flash scores 49.5, behind DeepSeek V4 Flash at 52.8 and Gemini 3.5 Flash at 56.5. On HLE with Tool, it scores 47.2, ahead of DeepSeek V4 Flash at 45.1 and Gemini 3.5 Flash at 40.2.

The multimodal results are strong. Step 3.7 Flash scores 79.2 on SimpleVQA (Search), first place among models listed, and 95.3 on V* (Python), behind only Kimi K2.6 at 96.9 and Gemini 3 Flash at 96.3.

The pricing play

StepFun priced Step 3.7 Flash aggressively. Input tokens cost $0.20 per million on a cache miss and $0.04 per million on a cache hit. Output tokens cost $1.15 per million. That puts it in the same range as other Flash-class models from DeepSeek and the Gemini family.

The pricing matters because agentic workflows burn tokens fast. A model that runs a multi-step search loop, calls several tools, and iterates on code can consume millions of tokens per task. At Step 3.7 Flash’s output price, a task that generates 100,000 output tokens costs $0.115. That is cheap enough to run thousands of agentic tasks per day without breaking a budget.
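The arithmetic above can be reproduced with a short cost estimator. This is a minimal sketch, not a StepFun tool: it uses only the per-million-token prices quoted in this article, and the function name, the cache hit rate parameter, and the second workload are hypothetical values chosen for illustration.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_CACHE_MISS = 0.20
INPUT_CACHE_HIT = 0.04
OUTPUT = 1.15


def task_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float = 0.0) -> float:
    """Estimate the USD cost of one task.

    cache_hit_rate is the assumed fraction of input tokens billed at the
    cache-hit price; it is an illustrative knob, not a published figure.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_rate
    miss_tokens = input_tokens - hit_tokens
    total = (
        hit_tokens * INPUT_CACHE_HIT
        + miss_tokens * INPUT_CACHE_MISS
        + output_tokens * OUTPUT
    )
    return total / 1_000_000


if __name__ == "__main__":
    # The article's example: 100,000 output tokens, input cost ignored.
    print(f"${task_cost(0, 100_000):.3f}")  # $0.115

    # Hypothetical agent run: 1M input tokens at an assumed 80% cache hit rate.
    print(f"${task_cost(1_000_000, 100_000, cache_hit_rate=0.8):.3f}")  # $0.187
```

In the hypothetical second run, input tokens add about seven cents to the bill, which is why the cache-hit discount matters for agents that resend long contexts on every turn.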

The advisor mode

Step 3.7 Flash supports what StepFun calls Advisor Mode. The small executor model drives the trajectory end-to-end — calling tools, reading results, iterating — and consults a larger advisor model only at inflection points where its own judgment falls short. StepFun says this is their implementation of the advisor strategy described by Anthropic.
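The control flow described here can be sketched in a few lines. This is an illustration of the general executor-and-advisor pattern under stated assumptions, not StepFun's or Anthropic's implementation: the article does not say how inflection points are detected, so the sketch assumes a self-reported confidence score and a fixed threshold. Every name in it (Step, run_with_advisor, toy_executor, toy_advisor) is hypothetical, and the two toy functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str         # what the executor proposes to do next
    confidence: float   # assumed self-reported confidence in [0, 1]
    done: bool = False  # True when the executor considers the task finished


def run_with_advisor(
    task: str,
    executor: Callable[[str, List[str]], Step],
    advisor: Callable[[str, List[str], Step], str],
    threshold: float = 0.5,
    max_steps: int = 20,
) -> Tuple[List[str], int]:
    """Let the executor drive; consult the advisor only on low-confidence steps."""
    history: List[str] = []
    consultations = 0
    for _ in range(max_steps):
        step = executor(task, history)
        if step.confidence < threshold:
            # Assumed inflection point: escalate, record the guidance,
            # and let the executor re-plan on the next iteration.
            history.append(f"advisor: {advisor(task, history, step)}")
            consultations += 1
            continue
        history.append(f"executor: {step.action}")
        if step.done:
            break
    return history, consultations


def toy_executor(task: str, history: List[str]) -> Step:
    """Stand-in for the small model: hesitates once, then finishes in three steps."""
    taken = sum(1 for h in history if h.startswith("executor:"))
    advised = any(h.startswith("advisor:") for h in history)
    if taken == 1 and not advised:
        return Step("unsure which file to patch", confidence=0.2)
    return Step(f"step {taken + 1}", confidence=0.9, done=(taken == 2))


def toy_advisor(task: str, history: List[str], step: Step) -> str:
    """Stand-in for the larger model."""
    return "patch the parser module first"


if __name__ == "__main__":
    history, consultations = run_with_advisor("fix bug", toy_executor, toy_advisor)
    print(consultations, history)
```

The design point is that the advisor sits off the hot path: in the toy run it is called once across four recorded steps, so most tokens are billed at the executor's price.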

With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance on SWE-Bench Verified at roughly one-ninth the per-task cost: $0.19 versus $1.76 per task. That is a dramatic cost reduction for production coding agents.

The advisor mode points to a broader architectural pattern. The market is converging on a two-tier model stack: a fast, cheap executor for the hot path and a slower, more capable advisor for escalations. Step 3.7 Flash is built for that stack from the ground up.

Availability and ecosystem

Step 3.7 Flash is available on the StepFun Open Platform, OpenRouter, and NVIDIA NIM. StepFun is partnering with DeepInfra, Fireworks AI, and Modal to expand availability. The model supports vLLM, SGLang, Hugging Face Transformers, and llama.cpp for local deployment.

For local inference, Step 3.7 Flash requires significant hardware. The Q4_K_S quantized GGUF weights are 111.5 GB, and StepFun recommends at least 128 GB of unified memory for Mac Studio or MacBook Pro devices. That limits local deployment to high-end workstations, but the cloud API makes it accessible for most developers.

What it means for the AI market

Step 3.7 Flash is a credible competitor in the Flash-class model segment. It beats DeepSeek V4 Flash and Gemini 3.5 Flash on SWE-Bench Pro and HLE with Tool and leads ClawEval-1.1, though it trails both on Terminal-Bench 2.1 and Toolathlon. Its pricing is competitive. Its advisor mode offers a practical path to near-frontier coding performance at a fraction of the cost.

The model also signals that the Chinese AI ecosystem is not slowing down. StepFun, DeepSeek, and others are shipping models that compete directly with frontier labs on agentic capabilities. The gap between open-weight models and closed-source frontier models is narrowing, especially on the tasks that matter for production agents.

The open question is adoption. StepFun is less well-known than DeepSeek or Mistral in Western markets. The model’s availability on OpenRouter and NVIDIA NIM helps, but building developer trust and ecosystem integrations takes time. Step 3.7 Flash has the benchmarks and the pricing. Now it needs the users.

Step 3.7 Flash is a concrete example of where the market is going: fast, cheap, agent-first models that can see, reason, and act. The labs that win the agentic workflow market will be the ones that deliver reliability at scale. StepFun just made its case.
