StepFun, the Beijing-based AI lab behind the Step series of models, released Step 3.7 Flash on May 29. The model is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that activates about 11B parameters per token. It delivers up to 400 tokens per second and supports a 256k context window.
The headline is not the raw throughput. It is what StepFun built the model to do. Step 3.7 Flash is engineered for agentic workflows — the kind where a model does not just answer a question but parses a financial report, runs multi-step search loops, and calls tools across terminals, browsers, and APIs without drifting off course. StepFun calls it “a high-efficiency Flash model for real-world agents.”
That framing matters. The generative AI market is moving past chat. The next frontier is autonomous agents that act on behalf of users, and the models that power them need to be fast, cheap, and reliable at executing long-horizon tasks. Step 3.7 Flash is a bet that the winning architecture for that market is a sparse MoE model optimized for tool orchestration, not a monolithic dense model.
What the benchmarks show
Step 3.7 Flash posts competitive numbers across several agentic and coding benchmarks. On SWE-Bench Pro, it scores 56.3, placing it second behind Claude Opus 4.7 at 64.3 and ahead of DeepSeek V4 Flash at 55.6 and Gemini 3.5 Flash at 55.1. On Terminal-Bench 2.1, it scores 59.5, behind DeepSeek V4 Flash at 62.0 and Gemini 3.5 Flash at 76.2.
The model leads ClawEval-1.1 with a score of 67.1, significantly ahead of the next closest competitor at 59.8. ClawEval-1.1 measures resistance to adversarial traps and adherence to system policies during multi-turn orchestration. That is a direct signal for production reliability.
On Toolathlon, Step 3.7 Flash scores 49.5, behind DeepSeek V4 Flash at 52.8 and Gemini 3.5 Flash at 56.5. On HLE with Tool, it scores 47.2, ahead of DeepSeek V4 Flash at 45.1 and Gemini 3.5 Flash at 40.2.
The multimodal results are strong. Step 3.7 Flash scores 79.2 on SimpleVQA (Search), first place among models listed, and 95.3 on V* (Python), behind only Kimi K2.6 at 96.9 and Gemini 3 Flash at 96.3.
The pricing play
StepFun priced Step 3.7 Flash aggressively. Input tokens cost $0.20 per million on a cache miss and $0.04 per million on a cache hit. Output tokens cost $1.15 per million. That puts it in the same range as other Flash-class models from DeepSeek and the Gemini family.
The pricing matters because agentic workflows burn tokens fast. A model that runs a multi-step search loop, calls several tools, and iterates on code can consume millions of tokens per task. At Step 3.7 Flash’s output price, a task that generates 100,000 output tokens costs $0.115. That is cheap enough to run thousands of agentic tasks per day without breaking a budget.
The advisor mode
Step 3.7 Flash supports what StepFun calls Advisor Mode. The small executor model drives the trajectory end-to-end — calling tools, reading results, iterating — and consults a larger advisor model only at inflection points where its own judgment falls short. StepFun says this is their implementation of the advisor strategy described by Anthropic.
With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance on SWE-Bench Verified at roughly one-ninth the per-task cost: $0.19 versus $1.76 per task. That is a dramatic cost reduction for production coding agents.
The advisor mode points to a broader architectural pattern. The market is converging on a two-tier model stack: a fast, cheap executor for the hot path and a slower, more capable advisor for escalations. Step 3.7 Flash is built for that stack from the ground up.
Availability and ecosystem
Step 3.7 Flash is available on the StepFun Open Platform, OpenRouter, and NVIDIA NIM. StepFun is partnering with DeepInfra, Fireworks AI, and Modal to expand availability. The model supports vLLM, SGLang, Hugging Face Transformers, and llama.cpp for local deployment.
For local inference, Step 3.7 Flash requires significant hardware. The Q4_K_S quantized GGUF weights are 111.5 GB, and StepFun recommends at least 128 GB of unified memory for Mac Studio or MacBook Pro devices. That limits local deployment to high-end workstations, but the cloud API makes it accessible for most developers.
What it means for the AI market
Step 3.7 Flash is a credible competitor in the Flash-class model segment. It matches or beats DeepSeek V4 Flash and Gemini 3.5 Flash on several key agentic benchmarks, particularly ClawEval-1.1 and HLE with Tool. Its pricing is competitive. Its advisor mode offers a practical path to frontier-level coding performance at a fraction of the cost.
The model also signals that the Chinese AI ecosystem is not slowing down. StepFun, DeepSeek, and others are shipping models that compete directly with frontier labs on agentic capabilities. The gap between open-weight models and closed-source frontier models is narrowing, especially on the tasks that matter for production agents.
The open question is adoption. StepFun is less well-known than DeepSeek or Mistral in Western markets. The model’s availability on OpenRouter and NVIDIA NIM helps, but building developer trust and ecosystem integrations takes time. Step 3.7 Flash has the benchmarks and the pricing. Now it needs the users.
Step 3.7 Flash is a concrete example of where the market is going: fast, cheap, agent-first models that can see, reason, and act. The labs that win the agentic workflow market will be the ones that deliver reliability at scale. StepFun just made its case.