Github:quantumxiaol/NEU_NLP_AccelerateviaMoE:
Design and Implementation of Model Inference Acceleration via Mixture-of-Experts
Sen Yang, Xiaolong Liu, Ye Ji, Aobo Guo
Abstract
With a Mixture-of-Experts design, a wide layer of the original model is split into several expert networks with smaller dimensions, which effectively reduces the number of parameters that must be evaluated per input and thus improves inference efficiency.
1 Introduction
Inference acceleration is an important way to improve model performance and efficiency: it saves time and energy, reduces cost, and improves user experience.
There are many ways to accelerate model inference, such as model compression, quantization, pruning and distillation, but these methods usually sacrifice some model accuracy or generalization ability.
Inference acceleration based on a mixture of experts is a newer approach: it can increase model capacity by several orders of magnitude while preserving inference speed, thereby improving the expressive and predictive power of the model.
Transformer-based models have revolutionized the field of natural language processing and demonstrate excellent performance in tasks such as machine translation, sentiment analysis and information retrieval. [1] However, the traditional Transformer model computes all neurons for every input, which leads to high computational cost. Observing the FFN in the Transformer, we found a sparse activation phenomenon: only a small number of neurons are activated for a single input. With a mixture of experts, a wide layer of the original model is divided into several expert networks with smaller dimensions, which effectively reduces the number of parameters evaluated per input and improves efficiency.
The purpose of this project is to divide the feed-forward network (FFN) layer into several experts, each of which is responsible for a group of neurons that are often activated together. We expect this “division of labor” to improve the computational efficiency and performance of the model.
We explore a Transformer model trained on a German-to-English translation dataset, in which the encoder and decoder each have three layers. In the multi-head attention mechanism, the number of heads is set to 4. The dimensions of the word embedding and position embedding are both 256. The dimension of the feed-forward layer is four times the word embedding dimension.
We partition the experts at the decoder level. Our results show that when only half of the experts are used in the decoder layers, inference is about 20% faster than the original model, with an accuracy loss of about 10%.
2 Approaches
The mixture-of-experts approach to inference acceleration was first proposed by Shazeer et al. in 2017. They called it the Sparsely-Gated Mixture-of-Experts Layer and applied it to language modeling tasks. [2]
They split a large-scale language model into several small expert networks and use a gating network to dynamically select which experts to activate. They address the shrinking-batch problem with data parallelism and single-step splitting, and address the load-balancing problem by adding noise and a penalty term to the loss function. They trained a language model with 13.7 billion parameters on the 1B word corpus and achieved a perplexity of 23.7 on the test set.
Lepikhin et al. further extended this approach in 2020 with GShard, applying it to machine translation tasks. [3]
They split a large-scale Transformer model into several smaller expert networks and use a gating network to dynamically select which experts to activate, using an adaptive gating network and expert-pooling techniques to improve the efficiency and flexibility of the model. They trained a machine translation model with 170 billion parameters on the WMT ’14 English-German translation task and achieved a BLEU score of 28.92 on the test set.
Zhang et al. observed that neurons in the FFN are only sparsely activated for most inputs, which resembles the functional partitioning of the human brain. [4] They hypothesized that functional partitions also exist in the FFN, i.e., different neurons are responsible for processing different types of inputs. To test this hypothesis, they proposed MoEfication, a method that transforms an ordinary Transformer model into an MoE version. MoEfication has two stages: first, divide the FFN parameters into multiple functional partitions that serve as experts; then, build an expert router that decides which experts to use for each input.
They experimented on different models and downstream tasks and found that MoEfication can use only 10% to 30% of the FFN parameters while maintaining more than 95% of the original performance. MoEfication brings two further advantages: it significantly reduces inference FLOPs (for example, a 2x speedup when using 25% of the FFN parameters), and it provides a fine-grained perspective for studying the internal mechanism of the FFN, which enhances the interpretability of the model.
The Transformer model we chose already implements the basic functionality. It is applied to a German-to-English machine translation task, whose main goal is to translate input German sentences into English. We use Beam Search, a heuristic search strategy: at each step it keeps the k partial hypotheses with the highest probability (k is the beam width), expands each of them, and again keeps the k best expansions, and so on. This alleviates, to some extent, the problem that greedy search only finds a locally optimal solution, and thus improves translation accuracy. For each German sentence, we use Beam Search to find the k most likely English translations. We then compute the BLEU score of each translation and output the average BLEU score to evaluate the translation quality of the model.
Algorithm 1 BeamSearch
Input: model, source_sentence, k: integer (beam width)
Output: final_results
1: Convert source_sentence into a numerical representation
2: Initialize candidates with the start-token id and score 0
3: Initialize final_results as an empty list
4: While length of final_results < k do:
   4.1: For each candidate in candidates do:
      4.1.1: If the candidate is complete (ends with the end token or reaches max length), move it to final_results
      4.1.2: Else, get the next possible tokens from the model and add the extended candidates to new_candidates
   4.2: Select the top k candidates from new_candidates by their scores
5: Return final_results
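To make the procedure concrete, here is a minimal Python sketch of this beam search. It assumes a seq2seq model that exposes a `decode(src_ids, tgt_ids)` method returning next-token log-probabilities over the vocabulary; the method name and special-token ids are illustrative assumptions rather than the exact interface of our code.

```python
import torch

def beam_search(model, src_ids, k=3, max_len=50, bos_id=1, eos_id=2):
    # Each candidate is (token_id_list, cumulative_log_prob).
    candidates = [([bos_id], 0.0)]
    finished = []

    while len(finished) < k and candidates:
        expanded = []
        for tokens, score in candidates:
            # Complete hypotheses are moved to the final result list.
            if tokens[-1] == eos_id or len(tokens) >= max_len:
                finished.append((tokens, score))
                continue
            # Assumed interface: log-probabilities of the next token, shape (1, vocab).
            log_probs = model.decode(src_ids, torch.tensor([tokens]))
            top_lp, top_ids = log_probs[0].topk(k)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                expanded.append((tokens + [tok], score + lp))
        # Keep only the k best partial hypotheses for the next step.
        candidates = sorted(expanded, key=lambda c: c[1], reverse=True)[:k]

    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]
```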
Here is our analysis of the complexity of the BeamSearch algorithm. Suppose V is the size of the vocabulary, n is the length of the source sentence, and k is the number of candidates (the beam width).
Step | Description | Time complexity | Space complexity |
1 | Convert source_sentence into numerical representation | O(n) | O(n) |
2 | Initialize candidates with the start character id, scores with 0 | O(1) | O(k) |
3 | Initialize final_results as empty list | O(1) | O(k) |
4 | While length of final_results < k do | – | – |
4.1 | For each candidate in candidates do | – | – |
4.1.1 | If candidate is complete (ends with end token or reaches max length) | O(1) | O(1) |
4.1.2 | get next possible tokens from the model, add to the new candidates | O(V) | O(V) |
4.2 | Select top k candidates from new candidates by their scores | O(k log k) | O(1) |
5 | Return final_results | O(1) | O(1) |
Time complexity: the main loop of the algorithm runs about k times, so the overall time complexity is about O(k * V + k^2 log k).
Space complexity: because each iteration generates a new candidate list, the space complexity is O(k + V), where V bounds the number of candidates the model can propose in one step. In addition, the final result list takes O(k) space.
First, we studied the sparsity of the FFN layer and found that most neurons are not activated. We then followed the MoEfication approach and conditionally use only a portion of the FFN parameters while preserving the original performance. Our main work is to construct the expert networks, which group together neurons that are usually activated at the same time. We use two methods to construct the experts: a co-activation graph method and a cluster-splitting method. The former is based on graph partitioning, while the latter is based on the assumption that similar neurons are activated at the same time. We then construct a gating network to select experts, and the output of the gating function determines the final output of the MoE model. Finally, we compare the performance of the original model and the mixture-of-experts accelerated model, and use inter-thread communication to work around Tkinter's lack of thread safety.
3 Experiments
We use a German-English translation dataset for our experiments. A Transformer model is used for sequence-to-sequence (seq2seq) learning, and the raw German and English text is tokenized with the Byte Pair Encoding (BPE) algorithm.
Transformer’s encoder and decoder are both set to 3 layers. The number of heads in the multi-head attention mechanism is set to 4. The dimension of word embedding and position embedding is set to 256. The hidden layer dimension in the Feed Forward network is set to 4 times d_model, that is, 1024.
We have done the following work in turn.
3.1 Sparsity Verification
First, we modified the forward method of the FFN layer to record neurons whose pre-activation values are negative and computed the ratio: about 90% of the neurons in the encoder are not activated, and about 80% to 90% of the neurons in the decoder are not activated.
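As a rough illustration of how this measurement can be done, the sketch below registers forward hooks on the first linear layer of each FFN and counts non-positive pre-activations. The `ffn_linear_layers` list and the `(src, tgt)` batch format are assumptions for illustration, not the exact structure of our model.

```python
import torch

def measure_ffn_sparsity(model, data_loader, ffn_linear_layers):
    # For each first FFN linear layer (w_1), count how often its pre-ReLU
    # outputs are non-positive, i.e. how many neurons are not activated.
    stats = {id(m): [0, 0] for m in ffn_linear_layers}   # [inactive, total]

    def make_hook(module):
        def hook(mod, inputs, output):
            stats[id(module)][0] += (output <= 0).sum().item()
            stats[id(module)][1] += output.numel()
        return hook

    handles = [m.register_forward_hook(make_hook(m)) for m in ffn_linear_layers]
    with torch.no_grad():
        for src, tgt in data_loader:       # assumed (src, tgt) batches
            model(src, tgt)
    for h in handles:
        h.remove()

    # Fraction of inactive neurons per FFN layer, in the order given.
    return [inactive / total for inactive, total in stats.values()]
```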
For comparison, the MoEfication paper reports that MoEfication can conditionally use only 10% to 30% of the FFN parameters while maintaining more than 95% of the original performance; that is, their expert system selects the neurons that actually fire (10% to 30% of the original) and still achieves similar performance.
3.2 Constructing the expert networks
We want to divide a layer of FFN in a Transformer into several experts. The core idea is to combine neurons that are usually activated at the same time. In this way, we can choose a small number of experts to cover all activated neurons as much as possible.
The first is the co-activation graph method. We construct a co-activation graph for each FFN, in which the nodes are neurons and the edge weight is set to the number of times the two neurons are co-activated. Then, we divide the graph into several subgraphs with strong internal connections (Karypis and Kumar, 1998), and regard each subgraph as an expert. [5]
Karypis and Kumar’s approach consists of three phases:
Coarsening phase: reduce the graph to a manageable size by repeatedly collapsing nodes and edges.
Initial partitioning phase: apply a heuristic algorithm (such as spectral partitioning or the Kernighan-Lin algorithm) to the coarsest graph to obtain an initial partition.
Refinement (uncoarsening) phase: project the partition back through the levels of the graph and improve it at each level with local optimization techniques (such as the Kernighan-Lin algorithm or its variants).
They also propose a coarsening heuristic, heavy-edge matching, which selects the node pairs to be collapsed according to the edge weights, and show that good partitions of the coarsest graph project back to good partitions of the original graph.
We call their method through the open-source METIS tool. METIS is a popular graph-partitioning package that implements the multilevel partitioning scheme; it handles large-scale graphs well and provides a variety of options to adapt to different graphs and application scenarios.
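The sketch below shows how such a partition might be produced, assuming the Python `metis` bindings (which accept networkx graphs) and activation records collected as in the sparsity check above; the activation threshold and graph-construction details are illustrative.

```python
import torch
import networkx as nx
import metis  # assumed: the PyPI "metis" bindings for the METIS library

def build_coactivation_graph(activations):
    # activations: (num_samples, d_ff) pre-ReLU outputs of one FFN layer.
    active = (activations > 0).float()           # 1 where a neuron fires
    cooc = active.t() @ active                   # (d_ff, d_ff) co-activation counts
    graph = nx.Graph()
    d_ff = cooc.size(0)
    graph.add_nodes_from(range(d_ff))
    for i in range(d_ff):
        for j in range(i + 1, d_ff):
            w = int(cooc[i, j].item())
            if w > 0:
                graph.add_edge(i, j, weight=w)   # edge weight = co-activation count
    graph.graph["edge_weight_attr"] = "weight"   # tell the metis wrapper to use the weights
    return graph

def split_into_experts(graph, num_experts=8):
    # Multilevel partitioning via METIS; returns one neuron-id list per expert.
    _, parts = metis.part_graph(graph, nparts=num_experts)
    experts = [[] for _ in range(num_experts)]
    for neuron, expert_id in enumerate(parts):
        experts[expert_id].append(neuron)
    return experts
```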
We also tried dividing the graph with a spanning tree, as sketched below. The goal is the same: split the graph into several subgraphs (clusters) whose internal connections are as strong as possible while the connections between subgraphs are as weak as possible. The minimum spanning tree connects all nodes with minimal total edge weight; removing the n-1 edges with the greatest weight from this tree splits the graph into n parts.
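A small networkx sketch of this spanning-tree split, reusing the co-activation graph built above (build a spanning tree, then delete the n-1 heaviest remaining edges as described), might look like this:

```python
import networkx as nx

def mst_split(graph, num_experts=8):
    # Spanning tree over the co-activation graph.
    tree = nx.minimum_spanning_tree(graph, weight="weight")
    # Remove the (num_experts - 1) heaviest tree edges, as described in the text.
    edges = sorted(tree.edges(data="weight"), key=lambda e: e[2], reverse=True)
    for u, v, _ in edges[: num_experts - 1]:
        tree.remove_edge(u, v)
    # Each remaining connected component becomes one expert.
    return [sorted(component) for component in nx.connected_components(tree)]
```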
Fig. 2 Co-activation graph (heat map)
We compared the two graph-partitioning methods on our large, dense co-activation graphs. Karypis and Kumar’s multilevel approach (as implemented in METIS), which coarsens the graph before refining the partition, handles such graphs better than our spanning-tree method: on dense graphs the spanning-tree split does not fully account for the global structure of the graph, and its trade-off between partition quality and execution efficiency is not as good as that of the multilevel method.
Fig. 3 Co-activation graph (undirected graph)
The second method is cluster splitting. Based on the mathematical intuition that similar vectors produce similar results, we assume that similar neurons are activated at the same time. Using the parameter information, the columns of the first FFN weight matrix are treated as a set of d_model-dimensional vectors; balanced k-means is then applied to this set, and the resulting k clusters are used as experts. However, our results show that this split is less effective: the internal connections within each expert are not strong.
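A simplified sketch of this clustering split is shown below. It uses plain scikit-learn k-means as a stand-in for the balanced k-means mentioned above, and assumes the first FFN linear layer is passed in as `ffn_w1`.

```python
from sklearn.cluster import KMeans

def cluster_split(ffn_w1, num_experts=8):
    # ffn_w1 is nn.Linear(d_model, d_ff); its weight has shape (d_ff, d_model),
    # so each row is the d_model-dimensional input vector of one hidden neuron
    # (the column of W in the usual mathematical notation).
    vectors = ffn_w1.weight.detach().cpu().numpy()
    labels = KMeans(n_clusters=num_experts, n_init=10).fit_predict(vectors)
    experts = [[] for _ in range(num_experts)]
    for neuron, expert_id in enumerate(labels):
        experts[expert_id].append(neuron)
    return experts
```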
Fig. 4 Co-activation graph (divided subgraph)
3.3 Establishing the gating network
The output of the gate function determines whether the final output of the MoE model is an ensemble of all experts or the output of only a few experts, i.e., whether the MoE model is a dense ensemble or a sparse model. The core idea is to use a limited number of experts to cover as many activated neurons as possible. To achieve this, we first find the best expert selections on the training data with a greedy algorithm (which effectively constructs a dataset), and then use them to train a shallow network as the expert router, which selects x experts based on the hidden-layer input. We feed in test samples and record their word vectors, together with the ids of the experts activated at this layer, as labels to build the dataset.
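One possible form of such a router is sketched below: a two-layer MLP that scores the experts from the FFN input and is trained with a multi-label objective on the greedily constructed dataset. The hidden size, the BCE loss, and the tensor names `x_train`/`y_train` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ExpertRouter(nn.Module):
    # Shallow expert router: scores each expert from the FFN input.
    def __init__(self, d_model=256, num_experts=8, d_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, num_experts),
        )

    def forward(self, x):                 # x: (..., d_model)
        return self.net(x)                # unnormalized expert scores

    def select(self, x, num_selected=4):
        # Return the ids of the top-x experts for this input.
        return self.forward(x).topk(num_selected, dim=-1).indices

def train_router(router, x_train, y_train, epochs=10, lr=1e-3):
    # x_train: hidden-layer inputs; y_train: multi-hot labels of activated experts.
    opt = torch.optim.Adam(router.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(router(x_train), y_train.float())
        loss.backward()
        opt.step()
    return router
```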
Fig. 5 Gate network architecture
Algorithm 2 Transformer with Expert System Acceleration
1. Initialize the Transformer model with pre-defined parameters
2. Prepare input data for the Encoder and Decoder
3. Initialize the expert system (ExpertModel and ExpertList)
4. Run the forward pass of the Transformer model
5. Return the final result
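To illustrate step 4, the sketch below shows one way the accelerated FFN forward pass could look: the router picks a few experts for the current sentence, and only the corresponding rows of W1 and columns of W2 are evaluated. Sentence-level routing and the constructor arguments are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    # FFN whose neurons are grouped into experts; only selected experts are computed.
    def __init__(self, ffn_w1, ffn_w2, expert_groups, router, num_selected=4):
        super().__init__()
        self.w1, self.w2 = ffn_w1, ffn_w2   # the original nn.Linear layers of the FFN
        self.groups = expert_groups          # list of neuron-id lists (from the graph/cluster split)
        self.router = router
        self.num_selected = num_selected

    def forward(self, x):                    # x: (seq_len, d_model), one sentence
        # Route once per sentence (an assumption) and gather the chosen neurons.
        chosen = self.router.select(x.mean(dim=0), self.num_selected)
        neuron_ids = torch.tensor(
            sorted(n for e in chosen.tolist() for n in self.groups[e]),
            device=x.device)
        # Evaluate only the selected rows of W1 and columns of W2.
        h = torch.relu(x @ self.w1.weight[neuron_ids].t() + self.w1.bias[neuron_ids])
        return h @ self.w2.weight[:, neuron_ids].t() + self.w2.bias
```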
3.4 Visualization
In the visual interface, for the same test cases we run both the original model and the mixture-of-experts accelerated model, record the time and score, and compute the speedup ratio. Because Tkinter is not thread-safe, the GUI cannot be updated from a different thread. Our solution is to use an inter-thread communication mechanism such as queue.Queue, which is thread-safe, so it is safe to put data into it in one thread and fetch it in another. The worker thread puts the data that needs to appear in the GUI into the queue, and Tkinter's main thread periodically checks the queue and updates the interface accordingly. This avoids updating the GUI directly from other threads and thus avoids the thread-safety problem.
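A minimal sketch of this queue-based pattern is shown below. The benchmark results pushed by the worker are placeholders, and the code that would actually run the two models is omitted.

```python
import queue
import threading
import tkinter as tk

result_queue = queue.Queue()          # thread-safe channel between worker and GUI

def run_benchmark():
    # Worker thread: run both models, then push the results into the queue.
    origin_time, origin_bleu = 4.0, 0.21   # placeholder results
    moe_time, moe_bleu = 2.2, 0.20
    result_queue.put((origin_time, origin_bleu, moe_time, moe_bleu))

def poll_queue():
    # Runs in Tkinter's main thread: drain the queue and update the label.
    try:
        while True:
            o_time, o_bleu, m_time, m_bleu = result_queue.get_nowait()
            label.config(text=f"speedup: {o_time / m_time:.2f}x, "
                              f"BLEU {o_bleu:.2f} -> {m_bleu:.2f}")
    except queue.Empty:
        pass
    root.after(100, poll_queue)       # check the queue again in 100 ms

root = tk.Tk()
label = tk.Label(root, text="running...")
label.pack()
threading.Thread(target=run_benchmark, daemon=True).start()
root.after(100, poll_queue)
root.mainloop()
```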
Fig. 6 UI
4 Analysis
We test several sets of input data of different sizes and observe their inference time and scores.
It can be seen that when four of the eight experts are used, the mixture-of-experts accelerated model takes less time on these samples, although accuracy decreases.
First we explore the influence of the number of selected experts on the model.
Table 1 shows the effect of different numbers of experts. First, the mixture-of-experts model (MoE) beats the original model in processing time regardless of the amount of data, and the processing time is shortest when 2 or 4 experts are used. However, when the number of experts increases to 8, the processing time begins to increase and even exceeds that of the original model. This may be because the complexity of the model grows with the number of experts, which increases processing time.
Second, in terms of accuracy, the original model is mostly more accurate than the MoE model, but with 4 experts the accuracy of the MoE model is close to that of the original model, and with 8 experts it matches the original model. This shows that with a proper number of experts the MoE model can achieve accuracy similar to the original model.
Table 1: Number of experts and model performance.
Samples | 10 | 15 | 20 | 25 | 30 | 40 | 50 |
Origin-Time | 4.00 | 4.16 | 4.82 | 5.50 | 6.73 | 7.50 | 10.71 |
Origin-Acc | 0.21 | 0.20 | 0.25 | 0.24 | 0.25 | 0.24 | 0.27 |
1MoE-Time | 2.67 | 3.29 | 4.08 | 4.72 | 5.91 | 6.94 | 10.85 |
1MoE-Acc | 0.15 | 0.15 | 0.15 | 0.15 | 0.16 | 0.14 | 0.16 |
2MoE-Time | 2.28 | 2.71 | 3.15 | 4.41 | 5.48 | 6.61 | 9.63 |
2MoE-Acc | 0.15 | 0.16 | 0.17 | 0.18 | 0.18 | 0.18 | 0.21 |
4MoE-Time | 2.19 | 2.82 | 3.35 | 4.09 | 5.15 | 6.17 | 9.21 |
4MoE-Acc | 0.20 | 0.19 | 0.22 | 0.22 | 0.22 | 0.21 | 0.25 |
8MoE-Time | 2.68 | 3.18 | 4.30 | 5.23 | 6.40 | 7.52 | 11.48 |
8MoE-Acc | 0.21 | 0.20 | 0.25 | 0.24 | 0.25 | 0.24 | 0.27 |
Fig. 7 Number of experts and model performance
Table 2: Mixed expert acceleration effect.
Sample set | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Origin-Time | 4.00 | 4.16 | 4.82 | 5.50 | 6.73 | 7.50 | 10.71 | 16.74 | 21.63 | 30.62 |
Origin-Accuracy | 0.21 | 0.20 | 0.25 | 0.24 | 0.25 | 0.24 | 0.27 | 0.28 | 0.27 | 0.30 |
MoE-Time | 2.19 | 2.82 | 3.35 | 4.09 | 5.15 | 6.17 | 9.21 | 14.71 | 19.51 | 29.78 |
MoE-Accuracy | 0.20 | 0.19 | 0.22 | 0.22 | 0.21 | 0.25 | 0.26 | 0.25 | 0.27 | 0.28 |
To sum up, an appropriate number of experts (for example, 4) can significantly reduce the processing time of the mixture-of-experts model while largely maintaining accuracy. However, too many experts may increase the processing time, even beyond that of the original model. Choosing the right number of experts is therefore key to the performance of the mixture-of-experts model.
We then explore the performance of the hybrid expert model with four experts on more samples.
Compared with the original model, the mixture-of-experts model with four experts (MoE) has a clear advantage in processing time: it is faster than the original model on both small and large sample sets. For example, for 10 samples the MoE model takes 2.19 versus 4.00 for the original model; for 100 samples it takes 29.78 versus 30.62. This shows that the MoE model performs well in processing efficiency.
In terms of accuracy, the MoE model and the original model perform similarly, and the accuracy of the MoE model improves as the number of samples grows. For example, on the sixth sample set the MoE model reaches an accuracy of 0.25 versus 0.24 for the original model; on the largest sample set it reaches 0.28 versus 0.30. This shows that the MoE model largely preserves the original accuracy and still has room for improvement.
To sum up, the mixture-of-experts model with four experts greatly reduces processing time and improves efficiency while keeping accuracy close to the original model. With larger numbers of samples its accuracy tends to improve, showing good scalability and robustness.
5 Conclusion
We use a mixture-of-experts model to reduce the computational cost of the model. When only half of the experts are used in the decoder layers, inference is about 20% faster than the original model, with an accuracy loss of about 10%.
References
[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
[2] Shazeer, Noam, et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).
[3] Lepikhin, Dmitry, et al. “GShard: Scaling giant models with conditional computation and automatic sharding.” arXiv preprint arXiv:2006.16668 (2020).
[4] Zhang, Zhengyan, et al. “MoEfication: Transformer feed-forward layers are mixtures of experts.” arXiv preprint arXiv:2110.01786 (2021).
[5] Karypis, George, and Vipin Kumar. “A fast and high-quality multilevel scheme for partitioning irregular graphs.” SIAM Journal on Scientific Computing 20.1 (1998): 359-392.