Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases.
In this post, you will learn about Mixture of Experts architecture in transformer models. In particular, you will learn about:
- Why MoE architecture is needed for efficient transformer scaling
- How MoE works and its key components
- How to implement MoE in transformer models
Let’s get started.

Mixture of Experts Architecture in Transformer Models
Photo by realfish. Some rights reserved.
Overview
This post covers three main areas:
- Why Mixture of Experts is Needed in Transformers
- How Mixture of Experts Works
- Implementation of MoE in Transformer Models
Why Mixture of Experts is Needed in Transformers
The Mixture of Experts (MoE) concept was first introduced in 1991 by Jacobs et al. It uses multiple “expert” models to process the input, with a “gate” mechanism selecting which experts to use. MoE experienced a revival with the Switch Transformer and Mixtral models, in 2021 and 2024 respectively. In transformer models, MoE activates only a subset of parameters for each input, allowing very large models to be defined while only a portion is used for each computation.
Consider the Mixtral model architecture:

Mixtral Model Architecture
As covered in the previous post, the MLP block introduces non-linearity to transformer layers. The attention block only shuffles information from the input sequence using linear combinations. The “intelligence” of transformer models primarily resides in the MLP block.
This explains why MLP blocks typically contain the most parameters and computational load in transformer models. Training MLP blocks to perform well across diverse tasks is challenging because different tasks may require contradictory behaviors.
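To get a feel for where the parameters sit, consider a rough back-of-the-envelope count for a single transformer layer. The dimensions below are illustrative assumptions, not taken from any particular model, and biases and normalization weights are ignored:

# Rough parameter count for one transformer layer (illustrative numbers only)
dim = 4096                # model (hidden) dimension -- assumed
intermediate_dim = 14336  # MLP intermediate dimension -- assumed

# Attention: four dim x dim projections (Q, K, V, output), ignoring biases
attn_params = 4 * dim * dim

# Gated MLP (SwiGLU-style): gate, up, and down projections
mlp_params = 3 * dim * intermediate_dim

print(f"attention: {attn_params/1e6:.0f}M, MLP: {mlp_params/1e6:.0f}M")
# attention: 67M, MLP: 176M -- the MLP dominates

Under these assumptions, the MLP block holds roughly two to three times as many parameters as the attention block, which is why it is the natural target for scaling tricks like MoE.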
One solution is creating specialized models for each task with a router to select the appropriate model. Alternatively, you can combine multiple models and the router into a single model and train everything together. This is the essence of MoE.
MoE introduces sparsity by having multiple experts, of which only a small subset is activated at a time. The MoE architecture modifies only the MLP block; all experts in a layer share the same attention block. Each transformer layer has its own independent set of experts, so experts can be mixed and matched across layers. This yields a combinatorially large number of expert combinations without drastically expanding the parameter count, scaling the model’s capacity while keeping computational costs low.
The key insight is that different inputs benefit from different specialized computations. By having multiple expert networks with a routing mechanism to select which experts to use, the model achieves better performance with fewer computational resources.
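To see the “more parameters, almost the same compute” trade-off in numbers, the following sketch compares total versus per-token active MLP parameters for a hypothetical layer with 8 experts and top-2 routing; all sizes are assumptions chosen for illustration:

dim, intermediate_dim = 4096, 14336   # assumed sizes for illustration
num_experts, top_k = 8, 2

expert_params = 3 * dim * intermediate_dim   # one gated MLP expert
router_params = dim * num_experts            # the router is a small linear layer

total_params = num_experts * expert_params + router_params
active_params = top_k * expert_params + router_params   # touched per token

print(f"total MoE params:      {total_params/1e9:.2f}B")
print(f"active per token:      {active_params/1e9:.2f}B")
# ~1.41B parameters stored, but only ~0.35B used per token:
# 8x the capacity of a single dense MLP at roughly 2x its per-token compute

The exact numbers depend on the model configuration, but the pattern is general: capacity grows with the number of experts, while per-token compute grows only with $k$.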
How Mixture of Experts Works
MoE architecture consists of three key components:
- Expert Networks: Multiple independent neural networks (experts) that process input, similar to MLP blocks in other transformer models.
- Router: A mechanism that decides which experts should process each input. Typically a linear layer followed by softmax, producing a probability distribution over $N$ experts. The router output selects the top-$k$ experts through a “gating mechanism.”
- Output combination: The top-$k$ experts process the input, and their outputs are combined as a weighted sum using normalized probabilities from the router.
The basic MoE operation works as follows. For each vector $x$ in the attention block’s output sequence, the router multiplies it with a matrix to produce logits (the gate layer in the figure above). After a softmax transformation, the resulting probabilities are filtered by a top-$k$ operation, producing $k$ indices and $k$ probabilities. The indices activate the corresponding experts (the MLP blocks in the figure), which process the original attention block output. The expert outputs are combined as a weighted sum using the renormalized router probabilities.
Conceptually, the MoE block computes:
$$
\text{MoE}(x) = \sum_{i \in \text{TopK}(p)} p_i \cdot \text{Expert}_i(x)
$$
The value of $k$ is a model hyperparameter. Even $k=2$ has been found sufficient for good performance.
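To ground the formula, here is a tiny standalone sketch of the routing step for a single token, using made-up logits and stand-in experts. It is not part of the implementation in the next section; it only illustrates the softmax, top-$k$ selection, renormalization, and weighted sum:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Router logits for one token over N = 4 experts (made-up values)
logits = torch.tensor([1.2, -0.3, 0.8, 0.1])
probs = F.softmax(logits, dim=-1)            # roughly [0.45, 0.10, 0.30, 0.15]

top_probs, top_idx = torch.topk(probs, k=2)  # picks experts 0 and 2
weights = top_probs / top_probs.sum()        # renormalize so the weights sum to 1

# Each selected expert processes the same token vector x;
# plain linear layers stand in for the experts here
x = torch.randn(8)
experts = [nn.Linear(8, 8) for _ in range(4)]
moe_out = sum(w * experts[int(i)](x) for w, i in zip(weights, top_idx))
print(top_idx.tolist(), weights.tolist(), moe_out.shape)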
Implementation of MoE in Transformer Models
Below is a PyTorch implementation of a transformer layer with MoE replacing the traditional MLP block:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    def __init__(self, dim, intermediate_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, intermediate_dim)
        self.up_proj = nn.Linear(dim, intermediate_dim)
        self.down_proj = nn.Linear(intermediate_dim, dim)
        self.act = nn.SiLU()

    def forward(self, x):
        gate = self.gate_proj(x)
        up = self.up_proj(x)
        swish = self.act(gate)
        output = self.down_proj(swish * up)
        return output


class MoELayer(nn.Module):
    def __init__(self, dim, intermediate_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.dim = dim
        # Create expert networks and the router
        self.experts = nn.ModuleList([
            Expert(dim, intermediate_dim) for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, hidden_states):
        batch_size, seq_len, hidden_dim = hidden_states.shape

        # Reshape for expert processing, then compute routing probabilities
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        router_logits = self.router(hidden_states_reshaped)  # (batch_size * seq_len, num_experts)
        routing_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts, and rescale their probabilities to sum to 1
        # output shape: (batch_size * seq_len, k)
        top_k_probs, top_k_indices = torch.topk(routing_probs, self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Process through the selected experts, one routing slot at a time
        # (the per-token loop is written for clarity, not efficiency)
        num_tokens = hidden_states_reshaped.shape[0]
        output = []
        for i in range(self.top_k):
            expert_idx = top_k_indices[:, i]   # which expert handles each token in this slot
            expert_probs = top_k_probs[:, i]   # the corresponding routing weight
            # Process each vector in the batch and sequence with the expert selected for it
            expert_output = torch.stack([
                self.experts[int(expert_idx[t])](hidden_states_reshaped[t])
                for t in range(num_tokens)
            ], dim=0)
            # Scale by the routing probability
            output.append(expert_probs.unsqueeze(-1) * expert_output)

        # Sum the scaled expert outputs and reshape back to the original shape
        output = sum(output).view(batch_size, seq_len, hidden_dim)
        return output


class MoETransformerLayer(nn.Module):
    def __init__(self, dim, intermediate_dim, num_experts, top_k=2, num_heads=8):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.moe = MoELayer(dim, intermediate_dim, num_experts, top_k)
        self.norm1 = nn.RMSNorm(dim)
        self.norm2 = nn.RMSNorm(dim)

    def forward(self, x):
        # Attention sublayer with pre-norm and residual connection
        input_x = x
        x = self.norm1(x)
        attn_output, _ = self.attention(x, x, x)
        input_x = input_x + attn_output

        # MoE sublayer with pre-norm and residual connection
        x = self.norm2(input_x)
        moe_output = self.moe(x)
        return input_x + moe_output
A complete MoE transformer model consists of a sequence of transformer layers. Each layer contains an attention sublayer and an MoE sublayer, with the MoE sublayer operating like the MLP sublayer in other transformer models.
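As a rough illustration of how such a model could be assembled, the sketch below stacks several MoETransformerLayer modules behind an embedding and adds an output head. The vocabulary size, layer count, and language-model head are assumptions added for completeness, not part of the layer implementation above:

class MoETransformerModel(nn.Module):
    """Minimal sketch: token embedding followed by a stack of MoE transformer layers."""
    def __init__(self, vocab_size, dim, intermediate_dim, num_experts,
                 num_layers=4, top_k=2, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([
            MoETransformerLayer(dim, intermediate_dim, num_experts, top_k, num_heads)
            for _ in range(num_layers)
        ])
        self.norm = nn.RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        for layer in self.layers:   # each layer has its own, independent set of experts
            x = layer(x)
        return self.lm_head(self.norm(x))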
In the MoELayer class, the forward() method expects input of shape (batch_size, seq_len, hidden_dim). Since each sequence vector is processed independently, the input is first reshaped to (batch_size * seq_len, hidden_dim). The router and softmax function produce routing_probs of shape (batch_size * seq_len, num_experts), indicating each expert’s contribution to the output.
The top-$k$ operation selects the experts and their corresponding probabilities, which are renormalized to sum to one. In the for-loop, each vector is processed by the expert selected for it, and the results are stacked together. The loop produces a list of scaled tensors in output, which are summed for the final output. This output is then reshaped back to the original (batch_size, seq_len, hidden_dim) shape.
The Expert class is identical to the MLP block from the previous post, but the MoE sublayer holds multiple instances of it rather than the transformer layer holding a single one.
You can test the transformer layer with this code:
batch_size = 4
seq_len = 10
dim = 16
intermediate_dim = 72
num_experts = 8

x = torch.randn(batch_size, seq_len, dim)
model = MoETransformerLayer(dim, intermediate_dim, num_experts)
y = model(x)
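If everything is wired correctly, the output keeps the shape of the input. A quick optional check (not part of the original listing):

print(y.shape)            # expected: torch.Size([4, 10, 16])
assert y.shape == x.shape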
Further Readings
Below are some resources that you may find useful:
- What is mixture of experts?
- Adaptive Mixtures of Local Experts
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Mixtral of Experts
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Mixture of Experts with Expert Choice Routing
Summary
In this post, you learned about Mixture of Experts architecture in transformer models. Specifically, you learned about:
- Why MoE is needed for efficient scaling of transformer models
- How MoE works with expert models, routers, and gating mechanisms
- How to implement MoE layers that can replace traditional MLP layers in transformer models