LLaMA-MoE: Building Mixture-of-Experts with Continual Pre-training
LLaMA-MoE is an open-source project that builds Mixture-of-Experts (MoE) models from the LLaMA model and the SlimPajama dataset. The effort focuses on creating small, affordable MoE models that are friendly for deployment and research use.
Introduction to LLaMA-MoE
LLaMA-MoE distinguishes itself by following a two-step process:
- Partitioning LLaMA's FFNs: LLaMA's Feed-Forward Networks (FFNs) are split into sparse experts, and a top-K gate is inserted in front of the experts at each MoE layer.
- Continual Pre-training: The initialized MoE model is then continually pre-trained, using data sampling weights optimized following Sheared LLaMA together with filtered datasets from SlimPajama.
These steps keep the resulting model efficient while preserving strong performance across a range of tasks; a minimal sketch of the FFN split follows.
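To make the first step concrete, below is a minimal, illustrative sketch of the neuron-independent "Random" split: the FFN's intermediate neurons are shuffled and divided evenly, and each expert keeps only its slice of the projection weights. The helper name, tensor shapes, and expert count are assumptions made for illustration, not the project's actual construction script.

```python
# Illustrative sketch of a random neuron-independent FFN split (hypothetical helper).
import torch

def random_split_ffn(gate_proj, up_proj, down_proj, num_experts, seed=0):
    """Split one SwiGLU FFN into `num_experts` smaller FFNs by neuron index."""
    inter_dim = up_proj.shape[0]            # up_proj: (intermediate_dim, hidden_dim)
    assert inter_dim % num_experts == 0
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(inter_dim, generator=g)
    experts = []
    for idx in perm.chunk(num_experts):     # disjoint neuron index sets
        experts.append({
            "gate_proj": gate_proj[idx],     # rows of the gate projection
            "up_proj": up_proj[idx],         # rows of the up projection
            "down_proj": down_proj[:, idx],  # matching columns of the down projection
        })
    return experts

# Example with LLaMA-2-7B-like shapes: hidden 4096, intermediate 11008, 8 experts.
hidden, inter, n_experts = 4096, 11008, 8
experts = random_split_ffn(
    torch.randn(inter, hidden),  # gate_proj.weight
    torch.randn(inter, hidden),  # up_proj.weight
    torch.randn(hidden, inter),  # down_proj.weight
    n_experts,
)
print(len(experts), experts[0]["up_proj"].shape)  # 8, torch.Size([1376, 4096])
```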
Key Features
LLaMA-MoE boasts several exciting features:
- Lightweight Models: With 3.0 to 3.5 billion activated parameters, the models are small enough to deploy easily.
- Multiple Expert Construction Methods:
  - Neuron-Independent: Random partitioning, Clustering, Co-activation Graph, and Gradient-based methods.
  - Neuron-Sharing: Inner and Inter (residual) sharing.
- Multiple Gating Strategies (see the gate sketch after this list):
  - TopK Noisy Gate: routes each token to its top-scoring experts for efficient routing.
  - Switch Gating: an alternative gating mechanism offering flexibility.
- Fast Continual Pre-training: FlashAttention-v2 integration and efficient dataset loading speed up training.
- Comprehensive Monitoring Tools: gate load, loss on steps and tokens, TGS (tokens/GPU/second), MFU (model FLOPs utilization), and other visualization utilities.
- Dynamic Weight Sampling: supports both self-defined static sampling weights and Sheared LLaMA's dynamic batch loading.
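As referenced in the gating item above, here is a minimal sketch of a noisy top-K gate in the spirit of the TopK Noisy Gate feature. The module name, shapes, and the softplus noise scale are choices made here for illustration, not the repository's actual gate implementation.

```python
# Minimal sketch of a noisy top-K gate (illustrative, not the repo's exact module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.w_gate = nn.Linear(hidden_dim, num_experts, bias=False)   # clean routing logits
        self.w_noise = nn.Linear(hidden_dim, num_experts, bias=False)  # learned noise scale

    def forward(self, x):
        logits = self.w_gate(x)
        if self.training:
            noise_std = F.softplus(self.w_noise(x))
            logits = logits + torch.randn_like(logits) * noise_std  # noisy logits in training
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        # Renormalize over the selected experts only; unselected experts get weight 0.
        weights = torch.zeros_like(logits).scatter(-1, topk_idx, topk_vals.softmax(dim=-1))
        return weights, topk_idx

gate = NoisyTopKGate(hidden_dim=4096, num_experts=8, top_k=2)
w, idx = gate(torch.randn(4, 4096))   # per-expert weights and the chosen expert indices
print(w.shape, idx.shape)             # torch.Size([4, 8]) torch.Size([4, 2])
```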
Quick Start Guide
Getting started with LLaMA-MoE takes only a few lines of Python. Python 3.10 or above is required; using the Hugging Face libraries, users load the tokenizer and model and run text generation. The project's demonstration generates a continuation about Suzhou, showing how the model extends a prompt with meaningful content.
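Based on that description, a hedged quick-start example might look like the following. The model identifier, dtype, and generation settings are assumptions to be checked against the repository's own quick-start snippet.

```python
# Hedged quick-start sketch: load a LLaMA-MoE checkpoint via Hugging Face and generate text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"  # assumption: one of the released checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval().to("cuda:0")

inputs = tokenizer("Suzhou is famous for", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```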
Installation Instructions
Setting up LLaMA-MoE involves creating a Python environment with Conda, setting environment variables, and installing necessary dependencies including PyTorch with CUDA support and the FlashAttention library. The repository can be cloned from GitHub for immediate use.
Expert Construction
LLaMA-MoE provides various expert construction scripts, supporting both Neuron-Independent and Neuron-Sharing methods, so users can experiment with different ways of splitting the dense FFNs into experts.
Continual Pre-training
Continual pre-training starts by tokenizing the SlimPajama data; LLaMA-MoE offers a step-by-step guide for preparing the individual domain datasets and feeding them back into training for ongoing model improvement.
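To make the idea of data sampling weights concrete, here is a toy, self-contained sketch of static domain-weighted sampling over SlimPajama-style domains. The weights below are illustrative placeholders, not the values used by the project or produced by Sheared LLaMA's dynamic batch loading.

```python
# Toy sketch of static domain-weighted sampling over SlimPajama-style domains.
# The weights are illustrative placeholders, not the project's actual values.
import random
from collections import Counter

domain_weights = {
    "CommonCrawl": 0.52, "C4": 0.27, "GitHub": 0.05, "Books": 0.04,
    "ArXiv": 0.05, "Wikipedia": 0.04, "StackExchange": 0.03,
}

rng = random.Random(0)
names = list(domain_weights)
probs = [domain_weights[n] for n in names]

# Each training step draws the next batch's domain according to the weights.
draws = Counter(rng.choices(names, weights=probs, k=10_000))
print(draws.most_common())  # counts roughly proportional to the weights
```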
Model Evaluation and Performance
The LLaMA-MoE models are thoroughly evaluated and compared against other foundation models. Results across a range of benchmarks show competitive performance on both open-ended language tasks and structured evaluations.
Supervised Fine-Tuning (SFT)
For users looking to build chatbots or fine-tune models for specific tasks, the project provides guidelines and scripts for supervised fine-tuning, augmenting the model's conversational capabilities.
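As a rough illustration of what SFT data preparation can look like, the snippet below formats instruction/response pairs into a single training string. The template and markers are purely hypothetical; the project's SFT scripts define their own conversation format.

```python
# Toy sketch of turning instruction/response pairs into SFT training text.
# The template below is hypothetical, not the project's actual chat format.
def format_example(instruction, response, system="You are a helpful assistant."):
    return (
        f"<s>[SYSTEM] {system}\n"
        f"[USER] {instruction}\n"
        f"[ASSISTANT] {response}</s>"
    )

pairs = [
    ("Introduce Suzhou in one sentence.",
     "Suzhou is a city in eastern China known for its classical gardens and canals."),
]
for instruction, response in pairs:
    print(format_example(instruction, response))
```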
Conclusion
LLaMA-MoE stands out as a flexible and accessible platform for anyone interested in exploring and deploying Mixture-of-Experts models. The project’s focus on affordable, efficient model creation and continual pre-training makes it a powerful tool for furthering natural language processing research and applications.