How to Calculate Home Inference Speed for MoE Models
Also, adding a second GPU is (Mostly) useless for MoE models when the experts live in system RAM
1. Background Information
Mixture‑of‑Experts (MoE) architectures are the AI community’s answer to “more parameters without more slowdown.” Instead of running every weight of a trillion-parameter model for each token, MoEs dynamically select only a handful of “experts” per token to perform the computation.
Most people running an AI model at home can’t fit an entire MoE model into GPU VRAM, but they may have a workstation or server that can take a lot of system RAM. The trick is to split the model: keep the shared/common weights in fast GPU VRAM and offload the experts to slower system RAM. This is a lot faster than running everything from RAM with no GPU, but in this setup, RAM memory bandwidth still becomes your throttle.
2. The Simplified Formula
Here’s a formula that captures the pace of MoE inference:
\(\begin{align*} \text{Time per token} &\approx \Bigl(\text{Active params} \times \tfrac{\text{Quantization bits}}{8}\Bigr) \times \Bigl(\tfrac{\text{Common weight percentage}}{\text{GPU bandwidth}} + \tfrac{\text{Expert percentage}}{\text{RAM bandwidth}}\Bigr) \\[6pt] \text{Tokens/sec} &\approx \tfrac{1}{\text{Time per token}} \end{align*} \)
Where:
Active params = number of parameters involved in each token (e.g. 32B for Kimi K2). This covers both the common weights and the experts selected for that token.1
Quantization bits = approx 4 bits if you're using Q4.
Common weight percentage = fraction of those active params that are common weights, shared by every token and kept in GPU VRAM.2
Expert percentage = fraction of those active params that are MoE expert weights, kept in system RAM.
GPU bandwidth = GPU memory bandwidth (usually in GB/sec).
RAM bandwidth = system RAM memory bandwidth (usually in GB/sec).
This model ignores prompt-prefill time and assumes compute isn’t the bottleneck, which is generally true for most computers.3
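If you’d rather poke at this in code, here’s a minimal Python sketch of the same formula (the function name, argument names, and the optional expert_frac_on_gpu knob are my own illustrative choices, not from any library):

```python
def moe_tokens_per_sec(
    active_params: float,             # parameters touched per token, e.g. 32e9 for Kimi K2
    quant_bits: float,                # bits per weight, e.g. 4 for Q4
    common_frac: float,               # fraction of active params that are common weights (in VRAM)
    expert_frac: float,               # fraction of active params that are MoE experts (in RAM)
    gpu_bw_gbs: float,                # GPU VRAM bandwidth, GB/s
    ram_bw_gbs: float,                # system RAM bandwidth, GB/s
    expert_frac_on_gpu: float = 0.0,  # share of expert reads served from VRAM (0 = all experts in RAM)
) -> float:
    """Rough tokens/sec estimate: memory-bandwidth bound, ignores prefill and compute."""
    gb_per_token = active_params * quant_bits / 8 / 1e9  # GB of weights read per generated token
    seconds_per_token = gb_per_token * (
        common_frac / gpu_bw_gbs
        + expert_frac * (1.0 - expert_frac_on_gpu) / ram_bw_gbs
        + expert_frac * expert_frac_on_gpu / gpu_bw_gbs
    )
    return 1.0 / seconds_per_token


# Kimi K2 at Q4 on one RTX 3090 + 8-channel DDR5-6400 (the example in the next section):
print(round(moe_tokens_per_sec(32e9, 4, 0.36, 0.64, gpu_bw_gbs=935, ram_bw_gbs=410), 1))  # ~32 tokens/s
```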
3. A Boring Concrete Example: Kimi K2 Setup
Let’s break it down with an example using real specs:
Active parameters: Kimi K2 has 32B4 active parameters
Expert share: 64% experts, 36% common
Quantization: Q4 = ~4 bits per parameter
System with an AMD Epyc 9005 CPU and 8‑channel DDR5‑6400 RAM: ~409 GB/s system RAM bandwidth (see the arithmetic just below)
RTX 3090 GPU: ~935 GB/s VRAM memory bandwidth
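Both bandwidth figures are just the theoretical peaks from the memory specs (real-world numbers run a bit lower):

\(\begin{align*} \text{System RAM: } & 8\ \text{channels} \times 6400\ \text{MT/s} \times 8\ \text{bytes/transfer} \approx 409.6\ \text{GB/s} \\[4pt] \text{RTX 3090 VRAM: } & 384\ \text{bits} \div 8 \times 19.5\ \text{Gbps (GDDR6X)} \approx 936\ \text{GB/s} \end{align*} \)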
Remember, 32B active means we have to load each of those 32 billion model weights to do math on them for every token. That’s 32 billion numbers streamed through the CPU/GPU for every single token of output. No wonder AI is so reliant on memory bandwidth! Fortunately, we can load those model weights in parallel with compute, which is why we can ignore CPU and GPU processing time, as long as your CPU/GPU is fast enough.
Plug those values in:
\(\begin{align*} \text{Time per token} &\approx \bigl(32\,\text{B} \times \tfrac{4\,\text{bits}}{8\,\text{bits/byte}}\bigr) \times \Bigl(\tfrac{0.36}{935\ \text{GB/s}} + \tfrac{0.64}{410\ \text{GB/s}}\Bigr) \\[6pt] &\approx 31.1\,\text{ms} \;\Rightarrow\;\sim 32.2\,\text{tokens/s} \end{align*} \)
You can also use this equation to calculate the speed you’d be generating tokens at if you didn’t have a GPU5. Just set “GPU bandwidth” to whatever your system memory speed is:
\(\begin{align*} \text{Time per token} &\approx \bigl(32\,\text{B} \times \tfrac{4\,\text{bits}}{8\,\text{bits/byte}}\bigr) \times \Bigl(\tfrac{0.36}{410\ \text{GB/s}} + \tfrac{0.64}{410\ \text{GB/s}}\Bigr) = \tfrac{16\,\text{GB}}{410\ \text{GB/s}} \\[6pt] &\approx 39\,\text{ms} \;\Rightarrow\;\sim 25.6\,\text{tokens/s} \end{align*} \)
4. The Interesting Part: A Second GPU Doesn’t Help Much
That’s honestly an understatement. A second GPU is basically useless.
The common weights that are active for every token (approximately 1/3 of the active weights, which are 32b for Kimi K2, 22b for Qwen 3, and 37b for Deepseek R1) fit into a single 3090 (with 24GB of memory), or most mid-to-high-end GPUs, at Q4 quantization. For Kimi K2, 11.7b of the active params (roughly 36%) are not local-expert MoE weights; these fit on a 16GB GPU with plenty of room to spare.6 At 4-bit quantization, Kimi K2’s common weights are just ~6GB; Deepseek’s aren’t much bigger. For all practical purposes, even with a 16GB GPU, you can treat the first GPU as “space for the common weights and context”. Even at native 8 bits, the 11.7b common weights (~11.7 GB) plus context fit on one 24GB RTX 3090.
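To put a number on “plenty of room to spare”, the footprint of the common weights is simply:

\(11.7\,\text{B params} \times \tfrac{4\,\text{bits}}{8\,\text{bits/byte}} \approx 5.9\,\text{GB at Q4} \quad (\approx 11.7\,\text{GB at 8 bits})\)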
The bulk of Kimi K2, ~600 GB of it (at Q4), is the non-common expert weights. The other 21.1b7 parameters used when generating a token are randomly8 selected from those MoE experts.
So even if you offload 24 GB of expert weights to a second 3090, you’re only fast-tracking 24/600 ≈ 4% of expert lookups. Calculating it:
\(\begin{align*} \text{Time per token} &\approx \bigl(32\,\text{B} \times \tfrac{4\,\text{bits}}{8\,\text{bits/byte}}\bigr) \times \Bigl(\tfrac{0.36}{935\ \text{GB/s}} + \tfrac{0.64\times0.96}{410\ \text{GB/s}} + \tfrac{0.64\times0.04}{935\ \text{GB/s}}\Bigr) \\[6pt] &\approx 30.6\,\text{ms} \;\Rightarrow\;\sim 32.7\,\text{tokens/s} \end{align*} \)
Your 2nd GPU speeds up only 4% of expert traffic.
Main bottleneck remains: 96% of experts still fetched from RAM.
You go only 0.5 tokens/sec faster (from 32.2 to 32.7 tokens/sec) by adding the 2nd GPU.
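Plugging both scenarios into the Python sketch from section 2 (same illustrative moe_tokens_per_sec function) tells the same story:

```python
# One RTX 3090, all experts in system RAM:
one_gpu = moe_tokens_per_sec(32e9, 4, 0.36, 0.64, gpu_bw_gbs=935, ram_bw_gbs=410)

# Two RTX 3090s: ~24 GB of the ~600 GB of experts (~4%) now served from the second card's VRAM:
two_gpus = moe_tokens_per_sec(32e9, 4, 0.36, 0.64, gpu_bw_gbs=935, ram_bw_gbs=410,
                              expert_frac_on_gpu=0.04)

print(f"{one_gpu:.1f} -> {two_gpus:.1f} tokens/s")  # roughly 32.1 -> 32.7: about half a token/s gained
```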
5. Why That Insight Matters
This example demonstrates clearly what matters when running MoE models:
If you are running MoE models with experts offloaded to system RAM, adding GPUs isn’t the solution.
Faster DDR, or better yet, more channels of DDR, makes more sense (see the worked example at the end of this section).
Switching to a GPU with more VRAM and faster memory (e.g. a 32GB RTX 5090 at ~1792 GB/s) doesn’t help as much as you’d expect either: after the ~6GB of common weights and ~10GB of context, the leftover VRAM holds only ~16GB of experts, roughly 2% of the ~600GB expert pool:
\(\begin{align*} \text{Time per token} &\approx \bigl(32\,\text{B} \times \tfrac{4\,\text{bits}}{8\,\text{bits/byte}}\bigr) \times \Bigl(\tfrac{0.36}{1792\ \text{GB/s}} + \tfrac{0.64\times0.98}{410\ \text{GB/s}} + \tfrac{0.64\times0.02}{1792\ \text{GB/s}}\Bigr) \\[6pt] &\approx \frac{16\,\text{GB}\times0.36}{1792\ \text{GB/s}} + \frac{16\,\text{GB}\times0.64\times0.98}{410\ \text{GB/s}} + \frac{16\,\text{GB}\times0.64\times0.02}{1792\ \text{GB/s}} \\[6pt] &\approx 27.8\,\text{ms} \;\Rightarrow\;\sim 36.0\,\text{tokens/s} \end{align*} \)
Getting an RTX 5090 nets you just +3.8 tokens/sec (from 32.2 to 36.0) compared to a 3090 alone.
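For comparison, here’s what “more RAM channels” buys you. This assumes a hypothetical fully populated 12-channel DDR5-6400 setup (~614 GB/s theoretical) alongside the same single 3090:

\(\begin{align*} \text{Time per token} &\approx 16\,\text{GB} \times \Bigl(\tfrac{0.36}{935\ \text{GB/s}} + \tfrac{0.64}{614\ \text{GB/s}}\Bigr) \\[6pt] &\approx 22.8\,\text{ms} \;\Rightarrow\;\sim 43.8\,\text{tokens/s} \end{align*} \)

That’s over 11 tokens/sec of headroom from RAM bandwidth alone, versus +0.5 from a second 3090 and +3.8 from a 5090.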
6. Obligatory ChatGPT Generated TL;DR
1. Use the formula above to understand your token/sec rate for MoE models.
2. Put as much of your model into VRAM as you can, but keep expectations realistic; VRAM beyond the common weights + context doesn’t do much.
3. If you want faster performance running big MoE models, upgrade RAM or choose GPUs with faster VRAM—not more GPUs.
By focusing on where the math (and memory) actually lies, you make smarter hardware decisions for MoE inference—getting more tokens per dollar, not just more GPUs on your desk.
This is a bit of an oversimplification. Models like Kimi K2 activate 8 “local” MoE experts per token, plus 1 “shared” expert that always runs for every single token. Here, we count that 1 shared expert as part of the “common weights”, since its weights are used for every token.
As an example: Kimi K2 is 1T params in size, with 32B active. That means approximately 32 billion parameters are needed for every token. For Kimi K2, about 36% of those active parameters are the same for every token, and 64% are experts that differ from token to token.
Actually not true at all, but if you have an RTX 3090 then you’re definitely fine on compute GPU-wise. CPU-wise, well, if you’re running on a Core 2 Duo then you’re going to be compute bound. Calculating the amount of compute an MoE model uses deserves another article, though.
Actually about 32.86b, but who’s counting.
Assuming your system CPU is fast enough.
There’s room for all the context too! Deepseek and Kimi use MLA, so the KV cache at full 128k token context fits in ~10GB. If you’re at Q4, then Kimi’s common weights are just ~6GB, so you have ~10GB left for context on a mid-tier 16GB GPU. You might not be able to fit all 128k tokens, but it’s definitely enough for regular use.
Again, people with sharp eyes may notice that this adds up to 32.8b, not 32b exact. Moonshot AI rounded the numbers a bit on Kimi K2’s marketing materials.
Mostly randomly to an external observer. Obviously not completely randomly.