90% of Meta's model parameters are embeddings. they've been running them on GPUs for years.

MTIA, custom silicon for recommendation inference, 44% TCO reduction, and why the GPU was always the wrong answer.

April 8, 2026

That sentence is the reason Meta has shipped six custom AI chips in 24 months.

Let me back up.

When people talk about GPU inference, they usually mean transformer inference. Attention. GEMM. The operations H100 tensor cores were designed for. The matrix multiplications that dominate GPT-4, Claude, Llama. That workload is real and it is genuinely hard and NVIDIA is genuinely good at it.

It is not Meta's main workload.

Meta's main workload is ranking and recommendation. Every time 3 billion people open Facebook or Instagram, a model runs to decide which posts to show them, which ads to serve, what order everything appears in. That model is not a transformer doing attention over tokens. It is a Deep Learning Recommendation Model doing embedding table lookups over sparse categorical features -- post IDs, user IDs, page IDs, ad IDs -- followed by some MLP layers.

90% of the parameters in those models are embeddings. Not weights. Embeddings. Giant lookup tables.

Embedding lookup is not matrix multiplication. It is random memory access. You take a user ID, you look up their embedding vector in a 64GB table, you retrieve it. The GPU's tensor cores -- the specialized matrix multiply units that NVIDIA has been iterating on for seven generations, the hardware that justifies the H100's existence -- are completely idle during that lookup. You are paying $3/hr for tensor core capacity you are not using, to do a memory access that any chip with sufficient DRAM could do.
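The split is easy to see in a toy sketch. The shapes and sizes below are invented for illustration, not Meta's actual model dimensions -- the point is which step is a gather and which step is a matmul:

```python
import numpy as np

# Illustrative shapes, not Meta's actual model sizes.
NUM_IDS, DIM = 1_000_000, 64          # one embedding table: ~256 MB at fp32
table = np.random.rand(NUM_IDS, DIM).astype(np.float32)
w1 = np.random.rand(DIM, 32).astype(np.float32)   # a small MLP layer

ids = np.random.randint(0, NUM_IDS, size=256)     # sparse feature IDs in a batch

# Step 1: the "90% of parameters" step -- a pure gather.
# No multiply-accumulate happens here; matrix-multiply units sit idle.
vectors = table[ids]                  # random DRAM reads, shape (256, 64)

# Step 2: the small dense part that actually exercises matmul hardware.
hidden = np.maximum(vectors @ w1, 0)  # (256, 32) -- tiny next to the lookup
```

Step 1 is bound by random-access memory latency and capacity; step 2 is the only part the tensor cores ever see.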

Meta figured this out in 2020 and started building a different chip.

The Meta Training and Inference Accelerator -- MTIA -- is not a GPU. It is not trying to be a GPU. It does not have HBM. It does not have tensor cores optimized for dense matrix math at scale. It has 256MB of shared on-chip SRAM, LPDDR5 DRAM at 204.8 GB/s across 16 channels, and 64 processing elements arranged in an 8x8 grid, all tuned for the memory access patterns of recommendation model inference.

LPDDR instead of HBM is the design decision that tells you everything. HBM is expensive, high-bandwidth, designed for dense compute. LPDDR is cheap, lower-bandwidth, designed for capacity and power efficiency. For embedding lookup -- random access into giant tables, not sequential streaming of weight matrices -- LPDDR is the right call. You need capacity and fast random access. You do not need 3.35 TB/s of HBM bandwidth that your workload is never going to saturate.
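Back-of-envelope arithmetic makes the point. The per-lookup size, lookups per request, and request rate below are my assumptions, not figures from the paper; only the two bandwidth numbers come from the text above:

```python
# Assumed workload numbers (not from the paper):
# 128-dim fp16 embedding rows, ~100 table lookups per ranking request.
bytes_per_row = 128 * 2               # 256 B per embedding fetch
lookups_per_request = 100
requests_per_sec = 500_000            # hypothetical per-chip serving rate

needed = bytes_per_row * lookups_per_request * requests_per_sec  # B/s
lpddr5 = 204.8e9                      # MTIA's quoted LPDDR5 bandwidth, B/s
hbm3 = 3.35e12                        # H100's quoted HBM bandwidth, B/s

print(f"needed: {needed / 1e9:.1f} GB/s")            # 12.8 GB/s
print(f"LPDDR5 utilization: {needed / lpddr5:.0%}")  # ~6%
print(f"HBM3 utilization: {needed / hbm3:.1%}")      # ~0.4%
```

Random access means effective bandwidth falls well short of these peaks on both parts, but the shape of the conclusion survives: the workload never comes close to saturating HBM, so you are paying for bandwidth you cannot use.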

MTIA 200 in production: 44% lower total cost of ownership than GPUs. Not by outperforming GPUs on the workload. By being architecturally correct for the workload while the GPU is architecturally wrong for it.

The paper Meta published at ISCA 2025 is one of the most honest production engineering documents I have read in years. They describe not just the chip but the productionization experience -- the part that always gets left out of research papers because it is embarrassing.

24% of their initial MTIA servers had ECC memory errors.

Here is why that happened: LPDDR does not have built-in Error Correcting Code support the way HBM or server DRAM does. The memory controller has to implement ECC instead. During design, Meta did not have production-scale error rate data for LPDDR in data center conditions, so they had to decide without knowing: enable inefficient controller-based ECC, or run without ECC and handle occasional errors differently?

They ran without ECC on part of the fleet. Their reasoning, stated plainly in the paper: "inference results are inherently statistical." If a bit flips during an ad ranking operation and one user gets a slightly wrong ad recommendation, the impact is unmeasurable against the noise of normal recommendation variance. You do not need perfect numerical fidelity for a workload where the correct answer is "approximately the right ad."
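You can illustrate "inherently statistical" directly. Here is a toy ranking with invented shapes, where one embedding takes the kind of single-bit error ECC exists to catch -- and the top-10 ranking comes out the same:

```python
import numpy as np

rng = np.random.default_rng(0)
user = rng.standard_normal(64).astype(np.float32)
ads = rng.standard_normal((1000, 64)).astype(np.float32)  # 1000 candidate ads

scores = ads @ user                        # clean ranking scores
top_clean = np.argsort(scores)[::-1][:10]

# Flip the lowest mantissa bit of one element of one ad's embedding --
# the kind of single-bit error ECC would have silently corrected.
corrupted = ads.copy()
raw = corrupted[42].view(np.uint32)        # reinterpret the row's bits
raw[7] ^= 1                                # flip one LSB (~1e-7 relative change)

top_flipped = np.argsort(corrupted @ user)[::-1][:10]
print((top_clean == top_flipped).all())    # True: top-10 unchanged
```

A flip in a high exponent bit would be a different story, which is why the fleet still needs the monitoring mentioned below -- but the common case disappears into the noise of normal recommendation variance.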

That is not a compromise. That is correct reasoning about what the workload actually requires. GPUs run ECC by default and pay the power and bandwidth overhead for it on every operation. MTIA ran without it on inference workloads where it doesn't matter, found the error rate acceptable, and added monitoring to catch servers where it wasn't.

They also found a deadlock in 0.1% of servers under high load -- the Control Core waiting for the host, the host waiting for the NoC, the NoC waiting for the Control Core. A subtle PCIe transaction ordering bug that only surfaced at production scale. They found it, fixed it in firmware, and documented it in a paper that most chip companies would have quietly buried.
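The three-way wait they describe is a textbook circular wait. A toy wait-for graph makes the structure visible -- this is an illustration of the failure mode, not Meta's actual debugging tooling:

```python
# Toy wait-for graph for the three-way deadlock described above.
# Each component waits on the next; a cycle means no one can make progress.
waits_on = {
    "control_core": "host",
    "host": "noc",
    "noc": "control_core",
}

def find_cycle(graph):
    """Follow wait edges from each node; revisiting a node on the
    current path means a circular wait, i.e. a deadlock."""
    for start in graph:
        seen, node = [], start
        while node in graph:
            if node in seen:
                return seen[seen.index(node):]  # the deadlocked cycle
            seen.append(node)
            node = graph[node]
    return None

print(find_cycle(waits_on))  # ['control_core', 'host', 'noc']
```

The fix, per the paper, was breaking one edge of that cycle in firmware -- removing any single edge leaves a chain that can drain.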

Six chips in 24 months.

The industry cadence is one chip every one to two years. A chip design takes three to four years from architecture to silicon in traditional cycles. Meta is shipping one every six months.

The mechanism: modular chiplets. MTIA 400, 450, and 500 share the same chassis, rack, and network infrastructure. You change the chiplet, drop it into the existing physical footprint, and go. No new data center buildout. No new rack configuration. No new power distribution. The hardware ecosystem is already deployed. You are only changing the compute and memory dies.

MTIA 450 is MTIA 400 with doubled HBM bandwidth -- because by the time 450 was designed, GenAI inference had grown large enough that the recommendation-only chip wasn't the only thing Meta needed anymore. They added HBM for the transformer workloads. Same chassis. Six months later.

MTIA 500 follows. Then a chip every six months after that.

This is not a research program. Meta has deployed hundreds of thousands of MTIA chips in production. They are serving billions of users on them right now. They target 35% of Meta's total inference fleet on MTIA hardware by end of 2026.

The thing I keep sitting with: the GPU was always the wrong answer for recommendation inference. It was the available answer. Every company that runs recommendation at scale -- Meta, TikTok, Google, Amazon -- has known for years that GPUs are a poor fit for embedding lookup workloads. They ran on GPUs because custom silicon takes years to build and the scale required to justify it is enormous.

Meta reached the scale in 2020 and started building. It took four years to get to 44% TCO reduction. It is now shipping a new generation every six months and expanding from recommendation to GenAI inference.

Google did the same thing in 2016 with TPUs. They had the workload, they had the scale, they built the chip. Nearly a decade later, Ironwood is the first TPU Google describes as "purpose-built for inference," and Anthropic is committed to 3.5 gigawatts of TPU capacity starting 2027.

AWS has shipped Inferentia since 2019. Microsoft has Maia 200. Every hyperscaler with sufficient inference volume has concluded the same thing: the GPU is the wrong shape for the inference workload, and at sufficient scale, paying a 44-100% TCO premium for the wrong shape becomes the largest line item in the infrastructure budget.

NVIDIA knows this. The Groq LPU acquisition -- $20 billion for a chip that does inference via SRAM with no HBM -- is NVIDIA buying the answer to the problem before someone else's answer takes market share.

The question is not whether GPU-first inference economics hold. They don't, at scale, for anyone with enough volume to justify custom silicon.

The question is how long it takes for the rest of the market to reach that scale.

At the token volumes Anthropic, OpenAI, Google, and Meta are serving in 2026 -- the answer is: now.

90% of the parameters are embeddings.

the tensor cores were idle the whole time.

it took four years and hundreds of thousands of custom chips in production to say that out loud in a peer-reviewed paper.

the gpu was the answer to a question that kept changing. the companies that noticed the question changed first are the ones building the next decade's infrastructure.

P.S. The MTIA paper's section on "safe overclocking" is worth reading separately. They found unused frequency headroom in production silicon -- the chip was hitting its power limits before its thermal limits -- and pushed the clock speed up in firmware after deployment. Not in the design phase. After the chips were in the field. Hardware optimization via software update, in production, on a fleet of hundreds of thousands of chips. That is the kind of thing that only happens when you own the full stack from silicon to serving framework. No GPU vendor gives you that lever.
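The control loop behind that kind of overclocking is simple to sketch. Every number below is invented -- MTIA's real limits, telemetry, and step sizes are not public -- but the shape of the logic is: step the clock up while both the power limit and the thermal limit leave headroom, and stop at whichever binds first.

```python
# Toy version of the idea. All numbers are invented; MTIA's real
# limits and telemetry curves are not public.
POWER_LIMIT_W = 90.0
THERMAL_LIMIT_C = 95.0

def power_at(freq_mhz):   # stand-in telemetry: power grows with frequency
    return 60.0 + 0.02 * (freq_mhz - 800)

def temp_at(freq_mhz):    # stand-in telemetry: temperature grows more slowly
    return 70.0 + 0.005 * (freq_mhz - 800)

freq = 800                # shipped clock, MHz
while (power_at(freq + 25) < POWER_LIMIT_W
       and temp_at(freq + 25) < THERMAL_LIMIT_C):
    freq += 25            # step up only while both limits leave headroom

print(freq)  # stops at 2275 MHz here: power binds before thermal
```

In this sketch the power limit binds first while thermal headroom goes unused, mirroring the situation the paper describes -- and because the loop is just firmware policy over telemetry, it can run on silicon already deployed in the field.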

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.