Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases.
In this post, you will learn about Mixture of Experts architecture in transformer models. In particular, you will learn about:
- Why MoE architecture is needed for efficient transformer scaling
- How MoE works and its key components
- How to implement MoE in transformer models
Let’s get started.

Mixture of Experts Architecture in Transformer Models
Photo by realfish. Some rights reserved.
Overview
This post covers three main areas:
- Why Mixture of Experts is Needed in Transformers
- How Mixture of Experts Works
- Implementation of MoE in Transformer Models
Why Mixture of Experts is Needed in Transformers
The Mixture of Experts (MoE) concept was first introduced in 1991 by Jacobs et al. It uses multiple “expert” models to process the input, with a “gate” mechanism selecting which experts to use. MoE experienced a revival with the Switch Transformer and Mixtral models, in 2021 and 2024 respectively. In transformer models, MoE activates only a subset of parameters for each input, allowing very large models to be defined while only a portion is used for each computation.
Consider the Mixtral model architecture:

Mixtral Model Architecture
As covered in the previous post, the MLP block introduces non-linearity to transformer layers. The attention block only shuffles information from the input sequence using linear combinations. The “intelligence” of transformer models primarily resides in the MLP block.
This explains why MLP blocks typically contain the most parameters and computational load in transformer models. Training MLP blocks to perform well across diverse tasks is challenging because different tasks may require contradictory behaviors.
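To get a feel for where the parameters sit, consider a rough back-of-the-envelope count for a single transformer layer. The dimensions below are illustrative assumptions, not taken from any particular model, and biases and normalization weights are ignored:

# Rough parameter count for one transformer layer (illustrative numbers only)
dim = 4096                # model (hidden) dimension -- assumed
intermediate_dim = 14336  # MLP intermediate dimension -- assumed

# Attention: four dim x dim projections (Q, K, V, output), ignoring biases
attn_params = 4 * dim * dim

# Gated MLP (SwiGLU-style): gate, up, and down projections
mlp_params = 3 * dim * intermediate_dim

print(f"attention: {attn_params/1e6:.0f}M, MLP: {mlp_params/1e6:.0f}M")
# attention: 67M, MLP: 176M -- the MLP dominates

Under these assumptions, the MLP block holds roughly two to three times as many parameters as the attention block, which is why it is the natural target for scaling tricks like MoE.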
One solution is creating specialized models for each task with a router to select the appropriate model. Alternatively, you can combine multiple models and the router into a single model and train everything together. This is the essence of MoE.
MoE introduces sparsity by having multiple experts, of which only a small subset is activated at a time. The MoE architecture modifies only the MLP block; all experts in a layer share the same attention block. Each transformer layer has its own independent set of experts, so experts can be mixed and matched across layers. This yields a combinatorially large number of expert combinations without drastically expanding the parameter count, scaling the model’s capacity while keeping computational costs low.
The key insight is that different inputs benefit from different specialized computations. By having multiple expert networks with a routing mechanism to select which experts to use, the model achieves better performance with fewer computational resources.
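To see the “more parameters, almost the same compute” trade-off in numbers, the following sketch compares total versus per-token active MLP parameters for a hypothetical layer with 8 experts and top-2 routing; all sizes are assumptions chosen for illustration:

dim, intermediate_dim = 4096, 14336   # assumed sizes for illustration
num_experts, top_k = 8, 2

expert_params = 3 * dim * intermediate_dim   # one gated MLP expert
router_params = dim * num_experts            # the router is a small linear layer

total_params = num_experts * expert_params + router_params
active_params = top_k * expert_params + router_params   # touched per token

print(f"total MoE params:      {total_params/1e9:.2f}B")
print(f"active per token:      {active_params/1e9:.2f}B")
# ~1.41B parameters stored, but only ~0.35B used per token:
# 8x the capacity of a single dense MLP at roughly 2x its per-token compute

The exact numbers depend on the model configuration, but the pattern is general: capacity grows with the number of experts, while per-token compute grows only with $k$.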
How Mixture of Experts Works
MoE architecture consists of three key components:
- Expert Networks: Multiple independent neural networks (experts) that process input, similar to MLP blocks in other transformer models.
- Router: A mechanism that decides which experts should process each input. Typically a linear layer followed by softmax, producing a probability distribution over $N$ experts. The router output selects the top-$k$ experts through a “gating mechanism.”
- Output combination: The top-$k$ experts process the input, and their outputs are combined as a weighted sum using normalized probabilities from the router.
The basic MoE operation works as follows. For each vector $x$ in the attention block’s output sequence, the router multiplies it with a matrix to produce logits (the gate layer in the figure above). After a softmax transformation, the resulting probabilities are filtered by a top-$k$ operation, producing $k$ indices and $k$ probabilities. The indices activate the corresponding experts (the MLP blocks in the figure), which process the original attention block output. The expert outputs are combined as a weighted sum using the renormalized router probabilities.
Conceptually, the MoE block computes:
$$
\text{MoE}(x) = \sum_{i \in \text{TopK}(p)} p_i \cdot \text{Expert}_i(x)
$$
The value of $k$ is a model hyperparameter. Even $k=2$ has been found sufficient for good performance.
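To ground the formula, here is a tiny standalone sketch of the routing step for a single token, using made-up logits and stand-in experts. It is not part of the implementation in the next section; it only illustrates the softmax, top-$k$ selection, renormalization, and weighted sum:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Router logits for one token over N = 4 experts (made-up values)
logits = torch.tensor([1.2, -0.3, 0.8, 0.1])
probs = F.softmax(logits, dim=-1)            # roughly [0.45, 0.10, 0.30, 0.15]

top_probs, top_idx = torch.topk(probs, k=2)  # picks experts 0 and 2
weights = top_probs / top_probs.sum()        # renormalize so the weights sum to 1

# Each selected expert processes the same token vector x;
# plain linear layers stand in for the experts here
x = torch.randn(8)
experts = [nn.Linear(8, 8) for _ in range(4)]
moe_out = sum(w * experts[int(i)](x) for w, i in zip(weights, top_idx))
print(top_idx.tolist(), weights.tolist(), moe_out.shape)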
Implementation of MoE in Transformer Models
Below is a PyTorch implementation of a transformer layer with MoE replacing the traditional MLP block:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    def __init__(self, dim, intermediate_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, intermediate_dim)
        self.up_proj = nn.Linear(dim, intermediate_dim)
        self.down_proj = nn.Linear(intermediate_dim, dim)
        self.act = nn.SiLU()

    def forward(self, x):
        gate = self.gate_proj(x)
        up = self.up_proj(x)
        swish = self.act(gate)
        output = self.down_proj(swish * up)
        return output


class MoELayer(nn.Module):
    def __init__(self, dim, intermediate_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.dim = dim
        # Create expert networks and the router
        self.experts = nn.ModuleList([
            Expert(dim, intermediate_dim) for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, hidden_states):
        batch_size, seq_len, hidden_dim = hidden_states.shape

        # Reshape for expert processing, then compute routing probabilities
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        router_logits = self.router(hidden_states_reshaped)  # (batch_size * seq_len, num_experts)
        routing_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts, and rescale their probabilities to sum to 1
        # output shape: (batch_size * seq_len, k)
        top_k_probs, top_k_indices = torch.topk(routing_probs, self.top_k, dim=-1)
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Process through the selected experts, one routing slot at a time
        # (the per-token loop is written for clarity, not efficiency)
        num_tokens = hidden_states_reshaped.shape[0]
        output = []
        for i in range(self.top_k):
            expert_idx = top_k_indices[:, i]   # which expert handles each token in this slot
            expert_probs = top_k_probs[:, i]   # the corresponding routing weight
            # Process each vector in the batch and sequence with the expert selected for it
            expert_output = torch.stack([
                self.experts[int(expert_idx[t])](hidden_states_reshaped[t])
                for t in range(num_tokens)
            ], dim=0)
            # Scale by the routing probability
            output.append(expert_probs.unsqueeze(-1) * expert_output)

        # Sum the scaled expert outputs and reshape back to the original shape
        output = sum(output).view(batch_size, seq_len, hidden_dim)
        return output


class MoETransformerLayer(nn.Module):
    def __init__(self, dim, intermediate_dim, num_experts, top_k=2, num_heads=8):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.moe = MoELayer(dim, intermediate_dim, num_experts, top_k)
        self.norm1 = nn.RMSNorm(dim)
        self.norm2 = nn.RMSNorm(dim)

    def forward(self, x):
        # Attention sublayer with pre-norm and residual connection
        input_x = x
        x = self.norm1(x)
        attn_output, _ = self.attention(x, x, x)
        input_x = input_x + attn_output

        # MoE sublayer with pre-norm and residual connection
        x = self.norm2(input_x)
        moe_output = self.moe(x)
        return input_x + moe_output
A complete MoE transformer model consists of a sequence of transformer layers. Each layer contains an attention sublayer and an MoE sublayer, with the MoE sublayer operating like the MLP sublayer in other transformer models.
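As a rough illustration of how such a model could be assembled, the sketch below stacks several MoETransformerLayer modules behind an embedding and adds an output head. The vocabulary size, layer count, and language-model head are assumptions added for completeness, not part of the layer implementation above:

class MoETransformerModel(nn.Module):
    """Minimal sketch: token embedding followed by a stack of MoE transformer layers."""
    def __init__(self, vocab_size, dim, intermediate_dim, num_experts,
                 num_layers=4, top_k=2, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([
            MoETransformerLayer(dim, intermediate_dim, num_experts, top_k, num_heads)
            for _ in range(num_layers)
        ])
        self.norm = nn.RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        for layer in self.layers:   # each layer has its own, independent set of experts
            x = layer(x)
        return self.lm_head(self.norm(x))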
In the MoELayer class, the forward() method expects input of shape (batch_size, seq_len, hidden_dim). Since each sequence vector is processed independently, the input is first reshaped to (batch_size * seq_len, hidden_dim). The router and softmax function produce routing_probs of shape (batch_size * seq_len, num_experts), indicating each expert’s contribution to the output.
The top-$k$ operation selects the experts and their corresponding probabilities, which are renormalized to sum to one. In the for-loop, each vector is processed by the expert selected for it, and the results are stacked together. The loop produces a list of scaled tensors in output, which are summed for the final output. This output is then reshaped back to the original (batch_size, seq_len, hidden_dim) shape.
The Expert class is identical to the MLP block from the previous post, but the MoE sublayer holds multiple instances of it rather than the transformer layer holding a single one.
You can test the transformer layer with this code:
batch_size = 4
seq_len = 10
dim = 16
intermediate_dim = 72
num_experts = 8

x = torch.randn(batch_size, seq_len, dim)
model = MoETransformerLayer(dim, intermediate_dim, num_experts)
y = model(x)
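If everything is wired correctly, the output keeps the shape of the input. A quick optional check (not part of the original listing):

print(y.shape)            # expected: torch.Size([4, 10, 16])
assert y.shape == x.shape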
Further Readings
Below are some resources that you may find useful:
- What is mixture of experts?
- Adaptive Mixtures of Local Experts
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Mixtral of Experts
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- Mixture of Experts with Expert Choice Routing
Summary
In this post, you learned about Mixture of Experts architecture in transformer models. Specifically, you learned about:
- Why MoE is needed for efficient scaling of transformer models
- How MoE works with expert models, routers, and gating mechanisms
- How to implement MoE layers that can replace traditional MLP layers in transformer models