Mixture of Experts Architecture in Transformer Models

Transformer models have proven highly effective for many NLP tasks. While scaling up with larger dimensions and more layers can increase their power, this also significantly increases computational complexity. Mixture of Experts (MoE) architecture offers an elegant solution by introducing sparsity, allowing models to scale efficiently without proportional computational cost increases.

In this post, you will learn about Mixture of Experts architecture in transformer models. In particular, you will learn about:

  • Why MoE architecture is needed for efficient transformer scaling
  • How MoE works and its key components
  • How to implement MoE in transformer models

Let’s get started.


Overview

This post covers three main areas:

  • Why Mixture of Experts is Needed in Transformers
  • How Mixture of Experts Works
  • Implementation of MoE in Transformer Models

Why Mixture of Experts is Needed in Transformers

The Mixture of Experts (MoE) concept was first introduced in 1991 by Jacobs et al. It uses multiple “expert” models to process input, with a “gate” mechanism selecting which expert to use. MoE experienced a revival with the Switch Transformer and Mixtral models, published in 2021 and 2024 respectively. In transformer models, MoE activates only a subset of parameters for each input, allowing a very large model to be defined while only a portion of it is used for each computation.

Consider the Mixtral model architecture:

Mixtral Model Architecture

As covered in the previous post, the MLP block introduces non-linearity to transformer layers. The attention block only shuffles information from the input sequence using linear combinations. The “intelligence” of transformer models primarily resides in the MLP block.

This explains why MLP blocks typically contain the most parameters and computational load in transformer models. Training MLP blocks to perform well across diverse tasks is challenging because different tasks may require contradictory behaviors.

One solution is creating specialized models for each task with a router to select the appropriate model. Alternatively, you can combine multiple models and the router into a single model and train everything together. This is the essence of MoE.

MoE introduces sparsity: many experts are defined, but only a small subset is activated for each input. The MoE architecture replaces only the MLP block; all experts within a layer share the same attention block. Each transformer layer has its own independent set of experts, so expert choices can be mixed and matched across layers. This yields a large number of effective expert combinations without duplicating the entire model, scaling the model’s capacity while keeping the computational cost per token low.

The key insight is that different inputs benefit from different specialized computations. By having multiple expert networks with a routing mechanism to select which experts to use, the model achieves better performance with fewer computational resources.

How Mixture of Experts Works

MoE architecture consists of three key components:

  1. Expert Networks: Multiple independent neural networks (experts) that process input, similar to MLP blocks in other transformer models.
  2. Router: A mechanism that decides which experts should process each input. Typically a linear layer followed by softmax, producing a probability distribution over $N$ experts. The router output selects the top-$k$ experts through a “gating mechanism.”
  3. Output combination: The top-$k$ experts process the input, and their outputs are combined as a weighted sum using normalized probabilities from the router.

The basic MoE operation works as follows. For each vector $x$ in the attention block’s output sequence, the router multiplies it with a weight matrix to produce logits (the gate layer in the figure above). After a softmax transformation, the resulting probabilities are filtered by a top-$k$ operation, producing $k$ indices and $k$ probabilities. The indices activate the corresponding experts (the MLP blocks in the figure), which process the original attention block output. The expert outputs are then combined as a weighted sum using the normalized router probabilities.

Conceptually, the MoE block computes:

$$
\text{MoE}(x) = \sum_{i \in \text{TopK}(p)} p_i \cdot \text{Expert}_i(x)
$$

The value of $k$ is a model hyperparameter. Even $k=2$ has been found sufficient for good performance.
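
To make the routing step concrete, here is a small self-contained sketch (the sizes and variable names are arbitrary): it computes router probabilities for a few token vectors, keeps the top two experts per token, and renormalizes the selected probabilities so that each token’s chosen expert weights sum to one.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, hidden_dim, num_experts, k = 4, 8, 4, 2

x = torch.randn(num_tokens, hidden_dim)        # vectors from the attention block
router = torch.nn.Linear(hidden_dim, num_experts)

logits = router(x)                             # (num_tokens, num_experts)
probs = F.softmax(logits, dim=-1)              # probability distribution over experts
top_probs, top_idx = probs.topk(k, dim=-1)     # keep the k most probable experts
top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)  # renormalize to sum to 1

print(top_idx)    # which experts each token would be routed to
print(top_probs)  # the weights used to combine those experts' outputs
```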

Implementation of MoE in Transformer Models

Below is a PyTorch implementation of a transformer layer with MoE replacing the traditional MLP block:
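
The listing below is a minimal sketch consistent with the description in the rest of this section. The class names (Expert, MoELayer, TransformerLayer) follow that description, while the GELU activation, the pre-norm layout, and the use of PyTorch’s built-in nn.MultiheadAttention for the attention sublayer are simplifying assumptions; the defaults of 8 experts with top-2 routing mirror Mixtral-style settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One feed-forward (MLP) block; the MoE sublayer holds several of these."""

    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, intermediate_dim)
        self.fc2 = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class MoELayer(nn.Module):
    """Mixture-of-experts sublayer that stands in for the usual MLP sublayer."""

    def __init__(self, hidden_dim, intermediate_dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            [Expert(hidden_dim, intermediate_dim) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape
        # Each sequence position is routed independently
        x = x.reshape(-1, hidden_dim)                       # (batch * seq, hidden)
        routing_probs = F.softmax(self.router(x), dim=-1)   # (batch * seq, num_experts)
        top_probs, top_idx = routing_probs.topk(self.top_k, dim=-1)
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)  # renormalize

        output = []
        for i in range(self.top_k):
            # Run each vector through the expert chosen in its i-th top-k slot,
            # then scale the stacked outputs by the corresponding probabilities
            expert_out = torch.stack([
                self.experts[int(idx)](vec) for idx, vec in zip(top_idx[:, i], x)
            ])
            output.append(top_probs[:, i].unsqueeze(-1) * expert_out)
        output = sum(output)                                # weighted sum over the k experts

        return output.reshape(batch_size, seq_len, hidden_dim)


class TransformerLayer(nn.Module):
    """Pre-norm transformer layer with an MoE sublayer in place of the MLP sublayer."""

    def __init__(self, hidden_dim, num_heads, intermediate_dim, num_experts=8, top_k=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.moe = MoELayer(hidden_dim, intermediate_dim, num_experts, top_k)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.moe(self.norm2(x))
        return x
```

Looping over the $k$ routing slots keeps the code close to the math above; production implementations usually group tokens by expert instead so that each expert runs only once per batch.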

A complete MoE transformer model consists of a sequence of transformer layers. Each layer contains an attention sublayer and an MoE sublayer, with the MoE sublayer operating like the MLP sublayer in other transformer models.

In the MoELayer class, the forward() method expects input of shape (batch_size, seq_len, hidden_dim). Since each sequence vector is processed independently, the input is first reshaped to (batch_size * seq_len, hidden_dim). The router and softmax function produce routing_probs of shape (batch_size * seq_len, num_experts), indicating each expert’s contribution to the output.

The top-$k$ operation selects the experts and their corresponding probabilities. Inside the for-loop, each vector is processed by its selected expert and the per-vector results are stacked together. The loop builds up a list of probability-scaled tensors, output, which is summed to produce the final output. This output is then reshaped back to the original (batch_size, seq_len, hidden_dim) shape.

The Expert class is identical to the MLP block from the previous post; the difference is that multiple instances are created inside the MoE sublayer rather than a single one inside the transformer layer.

You can test the transformer layer with this code:
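
For instance, with the sketch above and some arbitrary sizes (batch size 2, sequence length 10, hidden dimension 64), you can verify that the output shape matches the input shape:

```python
import torch

# Quick smoke test with illustrative hyperparameters, using the classes defined above
layer = TransformerLayer(hidden_dim=64, num_heads=4, intermediate_dim=256,
                         num_experts=8, top_k=2)
x = torch.randn(2, 10, 64)   # (batch_size, seq_len, hidden_dim)
y = layer(x)
print(y.shape)               # expected: torch.Size([2, 10, 64])
```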

Further Readings

Below are some resources that you may find useful:

Summary

In this post, you learned about Mixture of Experts architecture in transformer models. Specifically, you learned about:

  • Why MoE is needed for efficient scaling of transformer models
  • How MoE works with expert models, routers, and gating mechanisms
  • How to implement MoE layers that can replace traditional MLP layers in transformer models
