General Instinct shrinks a 245 GB model to fit on a single GPU

General Instinct compressed Qwen3.5-122B-A10B (245 GB BF16) into a 48 GiB GGUF file, outperforming Google's Gemma-4-26B-A4B on MMLU-Pro and GPQA-D.

InstinctRazor preserves always-active layers (router, norms, SSM) while aggressively quantizing routed experts, then uses on-policy distillation to recover lost capability.

The compressed model fits on a single RTX 4090 or 16 GB MacBook Pro, but HN users noted saturated benchmarks and missing comparisons against Unsloth's mixed-quantization method.

YC P26 startup General Instinct open-sources InstinctRazor, compressing a 245 GB MoE model into 48 GiB while beating Google's Gemma-4-26B on MMLU-Pro and GPQA-D.

Tessera Newsroom · 4 min read · June 6, 2026

Source: Launch HN: General Instinct (YC P26) – Frontier models on edge devices (news.ycombinator.com)

The best models assume datacenter hardware. Most physical systems have the opposite constraints.

That is the problem General Instinct set out to solve. The YC P26 startup, founded by Guanming and Bill, launched on Hacker News this week with an open-source tool called InstinctRazor. The headline result: they compressed Qwen3.5-122B-A10B, a roughly 245 GB BF16 mixture-of-experts model, into a 48 GiB GGUF file. The compressed model is smaller than Google’s Gemma-4-26B-A4B while outperforming it on MMLU-Pro and GPQA-D.
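The headline sizes imply an average bit width, which is worth working out. The sketch below is back-of-envelope arithmetic only. It assumes the full 245 GB is BF16 weights at 2 bytes per parameter and ignores GGUF metadata and any tensors kept at higher precision. The variable names are illustrative and do not come from InstinctRazor.

```python
GB = 10**9    # decimal gigabyte, used for the original checkpoint
GIB = 2**30   # binary gibibyte, used for the GGUF file

original_bytes = 245 * GB      # BF16 size quoted in the article
compressed_bytes = 48 * GIB    # GGUF size quoted in the article

# Assumption: BF16 stores every parameter in 2 bytes.
n_params = original_bytes / 2

# How many times smaller the file is, and how many bits each weight gets on average.
ratio = original_bytes / compressed_bytes
avg_bits = compressed_bytes * 8 / n_params

print(f"parameters: {n_params / 1e9:.1f}B")
print(f"compression ratio: {ratio:.2f}x")
print(f"average bits per weight: {avg_bits:.2f}")
```

Under those assumptions the file averages a little under 3.4 bits per weight, roughly a 4.75x reduction from BF16.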

That is not a small delta. Gemma-4-26B is Google’s latest edge-targeted MoE, released with much fanfare in April. General Instinct’s compressed model beats it on those two benchmarks while taking up less space on disk. The comparison is not apples-to-apples, since one is a quantized Qwen variant and the other a native BF16 Google model, but the benchmark numbers suggest the quantization pipeline is doing something right.

The technique is worth understanding. Most quantization schemes apply roughly the same bit width to every weight. InstinctRazor does not. It preserves the parts of the model that run on every token (the router, norms, Gated-DeltaNet and SSM layers, and the vision pathway) and quantizes the routed experts much more aggressively. The routed experts are the parts of an MoE model that only activate for specific tokens. They account for most of the parameter count, but only a few of them are used on any given forward pass. By compressing them harder, the team keeps the model’s core reasoning intact while shedding most of the weight.
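To see why this kind of split pays off, consider a rough size budget. The sketch below is not InstinctRazor's actual allocation. The parameter split between always-active tensors and routed experts, and the bit widths assigned to each, are hypothetical numbers chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class TensorGroup:
    name: str
    n_params: float  # number of parameters in the group
    bits: float      # bits per weight after quantization

def total_gib(groups):
    """File size in GiB implied by each group's parameter count and bit width."""
    total_bits = sum(g.n_params * g.bits for g in groups)
    return total_bits / 8 / 2**30

def average_bits(groups):
    """Parameter-weighted average bit width across all groups."""
    total_bits = sum(g.n_params * g.bits for g in groups)
    return total_bits / sum(g.n_params for g in groups)

# Hypothetical split: neither the counts nor the bit widths come from the article.
groups = [
    TensorGroup("always_active", 6e9, 8.0),    # router, norms, SSM, vision
    TensorGroup("routed_experts", 116e9, 3.0), # bulk of the parameters
]

print(f"size: {total_gib(groups):.1f} GiB")
print(f"average bits per weight: {average_bits(groups):.2f}")
```

Because the routed experts dominate the parameter count, their bit width sets the file size almost on its own. Keeping the small always-active group at higher precision costs comparatively little.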

Then comes on-policy distillation. The team uses the original model to generate training data for the compressed version, recovering capability lost during aggressive quantization. This is not new in isolation — distillation is standard practice — but applying it to a sub-4-bit MoE quantization pipeline is less common. The blog post claims the distillation step recovers “significant” benchmark performance, though the team has not published ablation numbers showing how much each component contributes.
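The article does not say which loss the team uses or how sequences are sampled. As a minimal sketch of what a distillation objective can look like, the toy code below computes the mean per-token KL divergence between a student's and a teacher's next-token distributions. The choice of KL(student || teacher) and the function names are assumptions for illustration, not the team's method.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over one sequence.

    Each argument is a list of per-position logit vectors.
    """
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy sequence of two positions over a three-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.5, 0.7, -0.5], [0.2, 0.8, 0.1]]
print(distillation_loss(student, teacher))
```

In a real pipeline the logits would come from the full-precision and quantized models scoring the same tokens, and the loss would be minimized by updating the compressed model's trainable parameters.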

The practical implications are immediate. With an 8k context window, peak VRAM usage sits around 7.6 to 8 GB. That fits on a single RTX 4090 or a MacBook Pro with 16 GB of unified memory. The model can also run in a “small GPU” configuration where experts are streamed from system RAM, further reducing VRAM requirements at the cost of latency. For anyone deploying models on robots, drones, or consumer devices, that changes the calculus.
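The article does not describe how streamed experts are scheduled. One simple way to think about the latency cost is a fixed-size cache of experts resident in VRAM, where every miss stands in for a copy from system RAM. The sketch below uses least-recently-used eviction. The ExpertCache class and the LRU policy are assumptions for illustration, not part of InstinctRazor.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache tracking which experts are resident in VRAM."""

    def __init__(self, capacity):
        self.capacity = capacity       # how many experts fit in VRAM at once
        self.resident = OrderedDict()  # expert_id -> placeholder, oldest first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            # Already in VRAM: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
        else:
            # Not resident: a real system would copy weights from system RAM here.
            self.misses += 1
            self.resident[expert_id] = True
            if len(self.resident) > self.capacity:
                self.resident.popitem(last=False)  # evict least recently used

# Simulate the router selecting experts for five consecutive tokens.
cache = ExpertCache(capacity=2)
for expert_id in [1, 2, 1, 3, 2]:
    cache.fetch(expert_id)
print(cache.hits, cache.misses)
```

The miss count is what matters for latency. The more evenly the router spreads tokens across experts, the more often weights have to cross from system RAM to the GPU.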

The HN comments surfaced the obvious skepticism. User BoorishBears pointed out that MMLU-Pro and GPQA-D are nearly saturated benchmarks: by that argument, you could erase the gains from half the compute that went into recent models and the scores would barely move. That is a fair critique. General Instinct’s benchmark claims may not hold up on harder, less saturated evaluations like SWE-bench or MATH-500. The team did not publish results on those.

User XenophileJKO raised a structural point: MoE models optimize for computation cost at the expense of memory efficiency. Edge devices need the opposite. Compressing an MoE model to fit on edge hardware is fighting the architecture’s natural tradeoff. The team’s approach — streaming experts from system RAM — mitigates this, but it adds latency that may matter for real-time robotics applications.

User rohansood15 asked for direct comparisons against existing 3-bit quantization methods from Unsloth and Bartowski. Guanming responded by pointing to the blog post, which compares against HQQ and AWQ but not Unsloth’s mixed-quantization approach. That is a gap. Unsloth’s method is widely used in the open-source community and represents the current practical state of the art for MoE quantization. A head-to-head comparison would be useful.

The bigger picture is what General Instinct represents. The company is not selling a model. It is selling a pipeline — InstinctRazor is open-source, and the team is positioning itself as the infrastructure layer for edge AI. The Y Combinator backing signals that investors see a market here. The question is whether the pipeline generalizes beyond Qwen models and whether the distillation step can scale to larger frontier models.

For AI builders, the takeaway is straightforward. The assumption that frontier models require datacenter hardware is breaking down. General Instinct is not the only team working on this — Apple’s on-device models, Google’s Gemma, and Microsoft’s Phi series all target edge deployment — but the compression ratios are striking. A 245 GB model fitting in 48 GiB while beating a Google edge model on standard benchmarks is the kind of result that makes you re-examine what “frontier” means on device.

The open question is whether the capability holds up in deployment. Benchmarks are not robotics. A model that scores well on MMLU-Pro may still fail on a real-time perception task with a 50 millisecond latency budget. General Instinct is asking for feedback from people deploying models on robots and edge devices. That is the right audience to test against.

The company’s name is General Instinct. The product is InstinctRazor. The bet is that the industry has been over-indexing on model size and under-indexing on model efficiency. If that bet is right, the next generation of edge devices will run models that today require a rack of GPUs. That is not a small thing.
