Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Draw an overview of the Sparsely-Gated Mixture-of-Experts (MoE) layer.


![Overview of the Sparsely-Gated MoE layer](moe.png)

The Mixture-of-Experts layer consists of $n$ experts (simple feed-forward layers in the original paper) and a gating network $G$ whose output is a sparse $n$-dimensional vector.

Let us denote by $G(x)$ and $E_i(x)$ the output of the gating network and the output of the $i$-th expert network for a given input $x$. The output $y$ of the MoE module can be written as:

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$$

Whenever $G(x)_i = 0$, the corresponding expert $E_i$ does not need to be evaluated, which is how the sparsity of $G(x)$ saves computation.
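As an illustration, here is a minimal, dense PyTorch sketch of this weighted combination (the class name `SimpleMoE` and the single-layer `nn.Linear` experts are assumptions for brevity; a real sparse implementation would evaluate only the experts with non-zero gates):

```python
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    """Dense reference implementation of y = sum_i G(x)_i * E_i(x)."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Experts: simple feed-forward layers, as in the original paper.
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        # Trainable gating weights W_g (noise-free, non-sparse gating).
        self.w_gate = nn.Parameter(torch.zeros(d_model, n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(x @ self.w_gate, dim=-1)  # (batch, n_experts)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # A sparse implementation would skip experts whose gate is zero;
            # for clarity, every expert is evaluated here.
            y = y + gates[:, i : i + 1] * expert(x)
        return y
```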

Give the structure of the gating network $G$ that is used in the Sparsely-Gated MoE.


A simple choice of $G$ is to multiply the input with a trainable weight matrix $W_g$ and then apply a $\operatorname{softmax}$: $G_\sigma(x) = \operatorname{softmax}(x W_g)$.

However, we want a sparsely gated MoE where only the top-$k$ experts are evaluated; the sparsity serves to save computation. Thus the MoE layer keeps only the top-$k$ values:

$$G(x) = \operatorname{softmax}(\operatorname{KeepTopK}(H(x), k))$$

where the keep-top-$k$ operation is:

$$\operatorname{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$$

Here $H(x)$ is a linear layer with added tunable Gaussian noise, so that each expert sees enough training data and we avoid favouring only a few experts for all inputs:

$$H(x)_i = (x W_g)_i + \epsilon \cdot \operatorname{softplus}\!\big((x W_{\text{noise}})_i\big), \quad \epsilon \sim \mathcal{N}(0, 1)$$
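A sketch of this noisy top-$k$ gating, assuming plain PyTorch tensors for $W_g$ and $W_{\text{noise}}$ (the helper name `noisy_top_k_gating` and the `train` flag that disables noise at inference are implementation choices, not from the flashcard):

```python
import torch
import torch.nn.functional as F


def noisy_top_k_gating(x, w_gate, w_noise, k, train=True):
    """G(x) = softmax(KeepTopK(H(x), k)) with tunable Gaussian noise.

    x: (batch, d_model); w_gate, w_noise: (d_model, n_experts).
    """
    h = x @ w_gate  # clean logits (x W_g)
    if train:
        # Add noise with per-component scale softplus((x W_noise)_i).
        h = h + torch.randn_like(h) * F.softplus(x @ w_noise)
    # KeepTopK: keep the k largest logits and set the rest to -inf ...
    top_vals, top_idx = h.topk(k, dim=-1)
    kept = torch.full_like(h, float("-inf")).scatter_(-1, top_idx, top_vals)
    # ... so the softmax assigns exactly zero weight to all other experts.
    return torch.softmax(kept, dim=-1)  # sparse (batch, n_experts) gates
```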

What is the shrinking batch problem in MoEs?


If an MoE uses only $k$ out of $n$ experts, then for a batch of size $b$, each expert only receives approximately $\frac{kb}{n} \ll b$ examples.

This problem can be mitigated by combining data and model parallelism: each expert receives the relevant examples from all $d$ data-parallel input batches combined, increasing its effective batch size to roughly $\frac{kbd}{n}$.
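As an illustrative calculation (the numbers are not from the paper): with $b = 1024$, $n = 256$, and $k = 4$, each expert naively receives only $\frac{4 \cdot 1024}{256} = 16$ examples, but with $d = 64$ data-parallel replicas the combined expert batch grows to $\frac{kbd}{n} = 1024$ examples.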

How does the Sparsely-Gated MoE paper prevent the gating network $G$ from always favoring the same few strong experts?


They softly constrain the learning with an additional importance loss $\mathcal{L}_{\text{importance}}$ that encourages all experts to have equal importance, where the importance of an expert is its batchwise sum of gate values:

$$\operatorname{Importance}(X) = \sum_{x \in X} G(x)$$

The importance loss is then defined as the squared coefficient of variation of the importance values, scaled by a hand-tuned factor $\lambda$:

$$\mathcal{L}_{\text{importance}} = \lambda \cdot \operatorname{CV}(\operatorname{Importance}(X))^2$$

where the coefficient of variation $\operatorname{CV}$ is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$: $\operatorname{CV} = \frac{\sigma}{\mu}$.
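A short PyTorch sketch of both pieces (the function names, the $\epsilon$ guard against division by zero, and the value of $\lambda$ are placeholders; `gates` is assumed to hold the stacked $G(x)$ for a batch $X$):

```python
import torch


def cv_squared(v: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # Squared coefficient of variation (sigma / mu)^2, with eps for stability.
    return v.var() / (v.mean() ** 2 + eps)


def importance_loss(gates: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # gates: (batch, n_experts) gate values G(x) for every x in the batch X.
    importance = gates.sum(dim=0)  # Importance(X), one value per expert
    return lam * cv_squared(importance)
```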

In the Sparsely-Gated MoE paper, what is the load loss $\mathcal{L}_{\text{load}}$ and what does it encourage?


The load loss $\mathcal{L}_{\text{load}}$ encourages experts to receive an equal number of examples. This matters because experts can have equal importance but very unequal load (a few examples with large gate values versus many examples with small ones). Since the number of examples per expert is a discrete quantity, the paper uses a smooth estimator $\operatorname{Load}(X)$ of the number of examples per expert and minimizes:

$$\mathcal{L}_{\text{load}}(X) = \lambda_{\text{load}} \cdot \operatorname{CV}(\operatorname{Load}(X))^2$$

For the full derivation of the smooth load estimator, see the original paper.
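Given such a smooth per-expert load estimate, the loss itself mirrors the importance loss; a sketch reusing the hypothetical `cv_squared` helper from above (the estimator $\operatorname{Load}(X)$, based on the probability that each example lands in an expert's top $k$ under re-sampled noise, is derived in the paper):

```python
import torch


def load_loss(load: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # load: (n_experts,) smooth estimate Load(X) of examples per expert.
    return lam * cv_squared(load)
```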
