Kimi K2 Thinking, Six Months Later: What the Moment Actually Was

2025.11.09 · 7 MIN READ · AI · MACHINE LEARNING · OPEN SOURCE · LLM · REASONING · RETROSPECTIVE

When Moonshot AI released Kimi K2 Thinking on 2025-11-06, I wrote that it beat GPT-5 and Claude Sonnet 4.5 across most major benchmarks, that the open-weight frontier had arrived, and that enterprises paying $10 per million tokens should be asking why.

Six months on, two of those three claims need correction, and the third was mostly right for the wrong reasons. I'm going to walk through what I got wrong, what the landscape looks like from April 2026, and what the "open-weight moment" actually was. It was a real moment, just not the one I described.

What I got wrong

The benchmark framing. The numbers I quoted at the time were Moonshot's own published scores: 44.9% on Humanity's Last Exam, 60.2% on BrowseComp, 71.3% on SWE-Bench Verified. I noted they were self-reported; I didn't treat that caveat as seriously as I should have.

In December 2025, NIST's Center for AI Standards and Innovation (CAISI, formerly the US AI Safety Institute) published an independent evaluation of K2 Thinking. Their numbers weren't close to Moonshot's. On CVE-Bench (cyber reasoning), K2 scored 50.5 against GPT-5's 65.6. On SWE-Bench Verified, which Moonshot had reported at 71.3%, CAISI measured 56.2% against Claude Opus 4's 66.7%. Their summary described K2 Thinking as "only a modest improvement over DeepSeek V3.1." Nathan Lambert, writing at Interconnects at the time of launch, had already flagged that the benchmark numbers didn't reproduce in the public Kimi chat interface. That behaviour is the hallmark of targeting benchmarks rather than training the underlying capability.

The honest framing, knowing what I know now: K2 Thinking posted strong self-reported numbers at release, led on a handful of axes where the evaluation methodology lined up with how the model had been trained, and trailed the US frontier models by 10-15 points on independent benchmarks the lab hadn't optimised for. That's still a good open-weight model. It isn't GPT-5 with the serial numbers filed off.

The "Kimi moment" framing. I described K2 Thinking as the moment the closed-open gap collapsed. It turned out to be more like the opening act. By the time the CAISI report landed in mid-December, Zhipu had already shipped GLM-5 (which went on to top the BenchLM open-weight leaderboard); Alibaba shipped Qwen3.5 in February with 1M context and performance in the Sonnet 4.5 range on local hardware; DeepSeek announced V4 and R2 in late February with claimed 81% on SWE-Bench Verified (still not third-party verified, as of writing); Meta shipped Llama 5 in April with 600B parameters and a 5M-token context.

Through all of that, K2 Thinking's own HuggingFace download count ran at about 10% of DeepSeek R1's and under 5% of gpt-oss's. It wasn't the model people actually reached for. The trajectory was real, but the specific model I wrote about was a four-to-six-week incumbent that got rotated out before the ink dried.

What the moment actually was

Here's what I think the November 2025 moment really was, with six months of hindsight.

The capability gap on knowledge benchmarks is now effectively zero. On straightforward reasoning, coding, and retrieval evals, the spread between the best open-weight model and the best closed model is in the low single digits. Qwen3.5-397B running on a local Blackwell card answers the same questions as Claude Sonnet 5, modulo a couple of points on the hardest tasks.

On agentic workloads, the closed models still lead, but by less than the cost difference implies. GPT-5.4 Pro posts 89.3% on BrowseComp; Gemini 3.1 Pro hits 85.9%; Claude Opus 4.7 lands at 79.3%. The best open-weight models are in the 60-70% range on the same benchmark. If your workload is agentic (tool use, multi-step browsing, long-horizon tasks), the closed models are still the default. If it's a chat-shaped knowledge task, open-weight is now a defensible choice.

The cost side is what actually changed the math. NVIDIA's Blackwell Ultra platform brought inference costs down to about $0.02 per million tokens for frontier-class open-weight serving, roughly a 15x reduction versus Hopper-generation hardware. SemiAnalysis's InferenceX numbers have Blackwell Ultra at up to 50x the performance and a 35x cost reduction on agentic workloads versus H100-class. For an enterprise running 5-10 million tokens a month, running an open-weight model on Azure or AWS Bedrock flipped from "academic exercise" to "obviously cheaper than the API." That's the structural change, and it's what makes the open-weight story material even when no specific Chinese model is sitting on top of the leaderboard.
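The break-even arithmetic is worth making explicit, because it's the one part of this you can check against your own bill. Here's a back-of-the-envelope sketch: the per-token rates echo the figures above, and the fixed monthly cost is a placeholder for whatever your reserved serving capacity actually costs.

```python
# Back-of-the-envelope break-even between a hosted API and open-weight serving.
# All numbers are assumptions to replace with your own quotes: the API rate
# echoes the ~$10/M figure from the original post, the marginal serving rate
# echoes the ~$0.02/M Blackwell-era figure, and the fixed monthly costs swept
# below are hypothetical reserved-capacity prices.

API_RATE = 10.00    # $ per million tokens, hosted API (assumed)
SELF_RATE = 0.02    # $ per million tokens, marginal open-weight serving (assumed)


def break_even_volume(fixed_monthly: float) -> float:
    """Monthly volume (millions of tokens) above which self-serving is cheaper."""
    return fixed_monthly / (API_RATE - SELF_RATE)


if __name__ == "__main__":
    # Sweep a few hypothetical fixed monthly costs for reserved capacity.
    for fixed in (50, 500, 5_000, 50_000):
        volume = break_even_volume(fixed)
        print(f"fixed ${fixed:>6,}/month -> break-even at ~{volume:,.1f}M tokens/month")
```

The point of the sweep is that the break-even volume is almost entirely a function of your fixed costs; the per-token spread is now so wide that once volume clears that line, the savings compound quickly.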

The open-weight crown stopped staying put. This is the secondary story worth internalising. Between November 2025 and April 2026, the top-of-leaderboard open-weight model was, in rough order: Kimi K2 Thinking (for 4-6 weeks), GLM-5 / GLM-5.1, Qwen3.5-397B, Qwen3.6-Plus. The velocity is such that "which Chinese lab currently leads" is a bad question to ask; the honest framing is "the Chinese open-weight labs collectively are on a weekly cadence of competitive releases." Zhipu, Alibaba, Moonshot, and DeepSeek are trading leads. Betting on any one is unwise; betting on the category is reasonable.

What the closed labs did in the same period

For context, because the "closed is stagnant" framing was also wrong.

OpenAI absorbed GPT-5.3-Codex into GPT-5.4 (2026-03-05), unified their coding and reasoning SKUs, and dropped prices to $2.50 / $15 per million tokens on the main model, which is competitive with self-hosting Qwen3.5 if you're under the 5-10M token/month break-even. GPT-5.4 was the first model past the human baseline on OSWorld. A successor (codenamed "Spud", probably GPT-5.5 or GPT-6) finished pretraining in late March.

Anthropic shipped Opus 4.6 (2026-02-05), Sonnet 5 (2026-04-01, with 2M context out of beta and 92.4% on SWE-Bench Verified), and Opus 4.7 (2026-04-16). Opus 4.7 was a step-change in agentic coding that put clear distance between the closed frontier and the open one on long-horizon tasks.

Google shipped Gemini 3 Pro in late November, Gemini 3.1 Pro in February (which leads on 13 of 16 benchmarks Google measures and hit 77.1% on ARC-AGI-2), and Gemini 3 Flash in April. The Gemini 3 family is genuinely good at agentic workloads, and Google is the frontier lab that gets talked about least at the moment.

The "closed labs were going to lose to open" framing was directional truth on cost and stagnant on capability. Six months later, the capability floor rose on both sides; the cost ceiling dropped on both sides. The gap compressed. Neither side stopped shipping.

What to watch now

The EU AI Act's full applicability kicks in on 2026-08-02. Chinese frontier models clearly fall under the Article 53 obligations for GPAI providers. Open-source exemptions hinge on licence terms: permissive licences (MIT, Apache 2.0, the standard Chinese open-source licences) clear the bar; restrictive commercial licences don't. DeepSeek has no disclosed EU or US corporate presence, which complicates enforcement in ways that may matter more by autumn.

CCP-alignment evaluation is now first-class. The CAISI methodology measured K2 Thinking at roughly 26% alignment with CCP talking points when prompted in Chinese, versus roughly 7% when prompted in English. That's a signal enterprise buyers now weigh alongside capability. Regulated industries in the US and Europe have explicit guidance against shipping customer-facing tools that produce CCP-aligned output on sensitive queries.

DeepSeek V4 and R2 are the biggest open-weight wildcards. They were announced in late February with a claimed 81% on SWE-Bench Verified, 90% on HumanEval, and pricing at $0.30 per million tokens, none of it third-party verified yet. If those numbers hold up under independent evaluation, they move the frontier materially. If they don't, we'll know the pattern (self-reported numbers from Chinese labs that don't replicate in independent evaluation) is durable.

Agentic benchmarks are maturing: SWE-Bench Pro, Terminal-Bench 2.0, OSWorld, APEX-Agents, BrowseComp. Humanity's Last Exam has split by methodology enough that different leaderboards report meaningfully different winners: Artificial Analysis has Gemini 3.1 Pro at 44.7%; BenchLM has an Anthropic preview at 64.7%. The leaderboard isn't a ranking anymore; it's a question of whose methodology you trust.

What to do with all of this

If you're making a real decision (a platform bet, a contract, a compliance posture), the practical read from six months of hindsight is:

Don't trust first-week benchmark numbers on any release, from any lab. Wait for independent evaluation. CAISI takes about a month; third-party aggregators take a few weeks. If the question is urgent, use the model for your specific workload for a week before committing.
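The minimal version of that week-long test doesn't need an eval framework: replay a handful of tasks you actually run in production against the candidate model and score them with your own checks. A sketch, assuming an OpenAI-compatible chat-completions endpoint (vLLM and most hosted providers expose one); the URL, model name, key, and checks are placeholders for your own workload.

```python
# A minimal sketch of "run the model on your own workload before committing".
# Assumes an OpenAI-compatible chat-completions endpoint; BASE_URL, MODEL,
# API_KEY, and the task checks are placeholders, not recommendations.
import requests

BASE_URL = "https://your-endpoint.example.com/v1"  # placeholder
MODEL = "candidate-model"                          # placeholder
API_KEY = "sk-..."                                 # placeholder

# (prompt, check) pairs drawn from the tasks you actually run in production.
TASKS = [
    ("Summarise this support ticket in two sentences: ...", lambda out: len(out.split()) <= 60),
    ("Extract the total from: 'Invoice total due: $1,284.50'", lambda out: "1,284.50" in out),
]


def ask(prompt: str) -> str:
    # One chat-completion round trip; no retries or streaming, on purpose.
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    passed = sum(check(ask(prompt)) for prompt, check in TASKS)
    print(f"{passed}/{len(TASKS)} workload checks passed for {MODEL}")
```

A week of this against your own tasks tells you more than any launch-day leaderboard will.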

Don't pick a single open-weight model and call it your platform. The leader rotates faster than any procurement cycle. Pick the category ("our platform can swap the backing model within a week") and invest in the pipeline that makes swaps cheap.
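Concretely, "the pipeline that makes swaps cheap" is less about tooling than about having exactly one seam where the backing model is chosen. A minimal sketch, with backend names, endpoints, and model identifiers as illustrative placeholders; the real work is keeping everything downstream ignorant of which entry is active.

```python
# A minimal sketch of the "swaps are cheap" seam: every caller goes through one
# config-driven lookup, so changing the backing model is a config edit plus a
# re-run of the workload checks above, not a code change. All names, URLs, and
# model identifiers are illustrative placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str   # any OpenAI-compatible endpoint
    model_id: str


BACKENDS = {
    "open-weight": Backend("https://self-hosted.example.com/v1", "current-open-leader"),
    "closed-api": Backend("https://api.example.com/v1", "frontier-closed-model"),
}


def active_backend(config: dict) -> Backend:
    # The single line of config that decides which model serves traffic.
    return BACKENDS[config["backend"]]


if __name__ == "__main__":
    backend = active_backend({"backend": "open-weight"})
    print(f"routing inference to {backend.model_id} at {backend.base_url}")
```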

Do take the cost story seriously. Blackwell-era inference prices mean open-weight serving is a real option for any workload running above single-digit millions of tokens per month. That's a structural change that persists regardless of which specific model is in vogue this week.

And for what it's worth, if you read my November post: the direction was right; the Kimi-was-the-moment framing was not. Benchmark numbers are a marketing surface. Treat them accordingly.

Primary source for this retrospective: NIST/CAISI Evaluation of Kimi K2 Thinking, which is the correction worth bookmarking alongside any post that cited the Moonshot launch numbers at face value.