they let the model run Kaggle competitions alone for 24 hours. it kept getting better.
MiniMax M2.7: open weights, $0.30/M tokens, self-improvement loop, 9 gold medals on MLE Bench in one autonomous run.
April 13, 2026

Not "it performed well." Not "it achieved a competitive score." It improved its own approach, round by round, without anyone directing it, for the entire 24 hours.
That is the part of the MiniMax M2.7 release that I cannot stop thinking about.
The benchmark story is fine and you can find it anywhere. 56.22% on SWE-Pro, approaching Claude Opus's best level. 55.6% on VIBE-Pro for end-to-end project delivery. 66.6% medal rate on MLE Bench Lite, second only to Opus 4.6 at 75.7% and GPT-5.4 at 71.2%.
These numbers are impressive for an open-weights model at $0.30 per million input tokens. That's the paragraph everyone wrote.
Here is the paragraph nobody wrote: those MLE Bench Lite numbers were achieved by running M2.7 on 22 machine learning competitions on a single A30 GPU, over three separate 24-hour trials, using a simple harness built around three components -- short-term memory, self-feedback, and self-optimization -- and letting it run without human direction.
After each round, the model generated a markdown file containing what it had learned. It then wrote a self-criticism of its own current results, identifying where it went wrong and what it might try differently. The next round started from that memory and criticism chain. Over 100 iterations within each 24-hour window.
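The mechanics of that loop are easy to state in code. Here is a toy sketch of the three components -- short-term memory, self-feedback, self-optimization -- with a stub standing in for the model whose score improves as its notes accumulate. Every name here is illustrative; this is not MiniMax's published harness.

```python
import random

class ToyModel:
    """Stub standing in for M2.7: more accumulated lessons -> better score."""

    def attempt(self, notes):
        # stand-in for the model actually conditioning on its notes
        return min(1.0, 0.3 + 0.05 * len(notes) + random.uniform(-0.02, 0.02))

    def critique(self, score):
        # self-feedback: a written criticism of this round's result
        return f"scored {score:.3f}; next round: refine feature engineering"

def self_evolve(model, rounds=10):
    memory, best = [], 0.0                    # short-term memory as notes
    for _ in range(rounds):
        score = model.attempt(memory)         # 1. attempt with current notes
        best = max(best, score)               # leaderboard score = ground truth
        memory.append(model.critique(score))  # 2. critique -> 3. next round
    return best, memory

random.seed(0)
best, memory = self_evolve(ToyModel())
```

The real loop swaps in M2.7 and a Kaggle-style scorer, runs for a 24-hour wall-clock budget instead of a fixed round count, and persists the notes as markdown files.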
The medal rate kept going up.
Not in aggregate across the three trials -- within each individual trial. The model kept finding better approaches the longer it ran. By hour 24, its best run had accumulated 9 gold medals, 5 silver medals, and 1 bronze across 22 competitions. The graph MiniMax published shows a consistent upward slope within each trial. It did not plateau. It did not oscillate. It improved.
I have been watching the "AI will improve itself" conversation for three years and it has mostly been either vaporware or academic demos that don't transfer to production. This is neither. This is a research team handing a production model a harness with a memory mechanism and a self-criticism loop and asking it to work on real ML competition problems -- not synthetic tasks designed to make the loop look good -- and watching it get better over a day without touching it.
The architecture underneath this is a 230 billion parameter MoE model that activates 10 billion parameters per token. 256 local experts, 8 activated per token. That 4.3% activation rate keeps inference at a price point ($0.30 input / $1.20 output per million tokens) that makes it deployable as infrastructure rather than as an occasional research call.
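For the routing mechanics, a minimal sketch of top-k expert selection, with shapes loosely matching the reported config (256 routed experts, top-8 per token). This is generic top-k softmax gating, not MiniMax's actual code, and the hidden size is made up.

```python
import numpy as np

NUM_EXPERTS, TOP_K, D_MODEL = 256, 8, 1024   # D_MODEL is a placeholder

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) / np.sqrt(D_MODEL)

def route(token_vec):
    """Return the ids and normalized weights of the top-k experts for one token."""
    logits = token_vec @ router_w
    top = np.argpartition(logits, -TOP_K)[-TOP_K:]   # top-8 expert ids
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                          # softmax over the selected 8

token = rng.standard_normal(D_MODEL)
experts, weights = route(token)
# Only these 8 of 256 expert FFNs run for this token -- roughly 10B of
# the 230B total parameters, which is where the ~4.3% activation rate
# (10/230) and the low per-token inference cost come from.
```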
200K context window. 62 layers. NVIDIA's team spent one month post-release optimizing two kernel changes -- a fused QK RMS Norm kernel and FP8 MoE integration from TensorRT-LLM -- and got 2.5x throughput improvement in vLLM and 2.7x in SGLang on Blackwell Ultra. From two kernel patches. In one month.
The open weights landed on HuggingFace yesterday. NVIDIA NIM has free API access right now.
What MiniMax actually did to build M2.7 is worth understanding specifically, because it changes how you should think about what model iteration means.
After the previous M2-series releases, MiniMax used an early version of M2.7 internally to run its own ML research workflow. The model updated memory, built skills for reinforcement learning experiments, and improved its own learning process based on results it generated. The self-evolution loop they demonstrated publicly on MLE Bench is not a demo built for the release. It is the same loop they ran internally to accelerate their own model development.
MiniMax used M2.7 to help build M2.7.
The release blog says this plainly: "With human productivity already fully unleashed, the natural next step was to initiate self-evolution of both the model and the organization." That sentence is either corporate spin or one of the more honest descriptions of where frontier AI labs are actually operating. Given that they published a working implementation of the self-evolution harness alongside the model weights, I am inclined toward the latter.
Here is what I find genuinely hard to reason about.
The self-improvement loop works because the model can evaluate its own outputs against ground truth -- in ML competitions, the ground truth is the competition leaderboard. The model submits, gets a score, updates its memory, adjusts its approach. The feedback signal is unambiguous.
This only works when there is an objective ground truth to measure against. ML competitions have that. Code either passes tests or it doesn't. Math proofs are either correct or not. The class of problems where this loop is applicable -- where the model can get unambiguous feedback and iterate -- turns out to be almost exactly the class of problems that matters most for software engineering and research automation.
The loop does not generalize to everything. Design decisions, product strategy, communication -- anywhere the feedback signal is noisy or delayed or subjective, the loop breaks. But for the class of technical tasks that constitute most of what high-value engineering work actually is, it's close enough to applicable that the MLE Bench result is not an artifact of the benchmark. It is a preview of how model-driven technical work is about to change.
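What "unambiguous feedback" means for code, concretely: the signal is just the test suite's exit code. A hedged sketch using Python's stdlib test runner (nothing here is from MiniMax's tooling):

```python
import subprocess, sys, tempfile, pathlib

def passes_tests(repo_dir: str) -> bool:
    """Exit code 0 means every test passed; anything else is a clean retry signal."""
    result = subprocess.run(
        [sys.executable, "-m", "unittest", "discover", "-q"],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

# demo: a throwaway repo with one passing test
repo = pathlib.Path(tempfile.mkdtemp())
(repo / "test_demo.py").write_text(
    "import unittest\n"
    "class T(unittest.TestCase):\n"
    "    def test_ok(self):\n"
    "        self.assertEqual(1 + 1, 2)\n"
)
ok = passes_tests(str(repo))
```

A binary pass/fail like this is exactly the kind of signal the self-improvement loop can iterate against; a design review is not.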
The number that I think about more than the medal rate: under three minutes.
That is the production incident recovery time that MiniMax reports M2.7 achieved on multiple occasions internally, running live production troubleshooting -- monitoring metrics, trace analysis, database verification, SRE-style decision-making -- as an autonomous agent. Under three minutes for the kind of incident that a human SRE team typically resolves in fifteen to forty-five.
This is a specific, falsifiable, real-world claim about production performance, not a benchmark score. I cannot verify it independently. But MiniMax has little incentive to publish it unless it is at least directionally true: anyone deploying this in an SRE context will test it immediately.
If it holds under testing -- if M2.7 running in a simple harness with production tooling access actually reduces incident MTTR to under three minutes reliably -- the implications for infrastructure teams are more significant than any benchmark number.
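The shape of the agent setup that claim implies is a plain tool loop: the model picks a diagnostic action, reads the result, and iterates until it can act. A generic sketch, with invented tool names and a scripted stub standing in for the model's decision-making:

```python
# Fake diagnostic tools returning canned observations (invented for
# illustration; real tools would hit monitoring, tracing, and the DB)
TOOLS = {
    "check_metrics":   lambda: "p99 latency 4.2s on service A",
    "trace_analysis":  lambda: "slow span: db query in service A",
    "verify_database": lambda: "missing index on orders.user_id",
}

def triage(decide, max_steps=10):
    """Run the observe-decide loop until the policy says it is done."""
    observations = []
    for _ in range(max_steps):
        action = decide(observations)         # in production: the model decides
        if action == "done":
            break
        observations.append((action, TOOLS[action]()))
    return observations

def scripted_policy(obs):
    # stub policy: walk a fixed diagnostic order, then stop
    order = ["check_metrics", "trace_analysis", "verify_database", "done"]
    return order[len(obs)]

log = triage(scripted_policy)
```

The sub-three-minute claim amounts to saying the model traverses a loop like this, with real tools, faster and with fewer dead ends than a human on-call rotation.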
the model ran 24 hours on kaggle competitions.
it improved every round.
it published its own self-criticism after each one and used it to do better next time.
that is not a research paper. that is a shipped model available on huggingface today with open weights.
the self-improvement loop is not coming. it is here, for the class of problems where feedback is unambiguous.
which is most of engineering.
the $0.30 per million tokens matters too. frontier agentic capability at sub-frontier price means the roi threshold for running this on real tasks collapses. that is how adoption actually happens.
P.S. The vLLM chunked prefill interaction is clean for M2.7 -- standard MoE transformer, no SSM layers, no correctness landmines. The two kernel patches NVIDIA shipped (fused QK RMS Norm, FP8 MoE from TensorRT-LLM) are already in vLLM main. If you are deploying on Blackwell hardware, pull the latest vLLM nightly before benchmarking. The 2.5x improvement is real and you are leaving it on the table if you're on an older build.
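If you want to act on that, a minimal sketch of the deploy path. The nightly wheel index is vLLM's published one, but verify it against current vLLM docs; the HuggingFace repo id below is a guess, so substitute whatever MiniMax actually published.

```shell
# install a recent vLLM build with the new kernels
pip install -U --pre vllm --extra-index-url https://wheels.vllm.ai/nightly

# serve the open weights (repo id is a placeholder; check HuggingFace)
vllm serve MiniMaxAI/MiniMax-M2.7 --tensor-parallel-size 8
```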
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.