Tutel MoE: An Optimized Mixture-of-Experts Implementation
Tutel is an optimized implementation of mixture-of-experts (MoE) layers for machine learning frameworks, designed to accelerate both training and inference. It is the first to propose a "No-penalty Parallelism/Sparsity/Capacity Switching" approach, which addresses the dynamic behavior of modern training and inference workloads.
Supported Platforms
Tutel is compatible with PyTorch (recommended version 1.10 or higher). It supports a variety of GPU setups, including NVIDIA CUDA architectures with precision options like fp64, fp32, fp16, and bfp16, as well as AMD's ROCm environments. CPU support is also available for fp64 and fp32 precision.
Features of Tutel
Newest Updates
Tutel v0.3.3
- Introduced an all-to-all benchmark, providing tools for bandwidth testing across GPUs. This can be executed with PyTorch's distributed run command.
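The benchmark itself ships with the Tutel repository; purely as an illustration of what such a bandwidth test measures, the sketch below uses plain torch.distributed rather than Tutel's entry point, and the message size and launch command are assumptions:

```python
# Minimal all-to-all bandwidth sketch using plain torch.distributed
# (illustrative only; this is not Tutel's own benchmark).
# Example launch: python -m torch.distributed.run --nproc_per_node=8 this_script.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

numel = 64 * 1024 * 1024                      # assumed message size; must divide evenly by world_size
send = torch.randn(numel, device='cuda')
recv = torch.empty_like(send)

for _ in range(5):                            # warm-up iterations
    dist.all_to_all_single(recv, send)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_to_all_single(recv, send)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

if rank == 0:
    gb = send.element_size() * numel / 1e9
    print(f'approx. per-rank all-to-all bandwidth: {gb / elapsed:.2f} GB/s')
dist.destroy_process_group()
```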
Tutel v0.3.2
- Added a tensorcore option for the benchmarks, extended examples for custom expert layers, and an option to adjust NCCL timeout settings.
Tutel v0.3.1
- Added NCCL all_to_all_v and all_gather_v operations for transferring variable-length messages, improving communication efficiency among devices.
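Tutel ships its own optimized wrappers for these primitives; as a purely conceptual sketch (built on standard torch.distributed rather than Tutel's API, with the helper name all_to_all_v chosen here for illustration), a variable-length exchange can be expressed with per-rank split sizes:

```python
# Conceptual variable-length all-to-all using standard torch.distributed.
# Tutel's all_to_all_v/all_gather_v are its own optimized implementations;
# this helper only illustrates the idea of exchanging unevenly sized chunks.
import torch
import torch.distributed as dist

def all_to_all_v(send_buf, send_counts, group=None):
    """send_counts[i] elements of the flat send_buf are routed to rank i
    (send_counts must have one entry per rank)."""
    send_counts_t = torch.tensor(send_counts, dtype=torch.long, device=send_buf.device)

    # Exchange counts first so every rank knows how much data it will receive.
    recv_counts_t = torch.empty_like(send_counts_t)
    dist.all_to_all_single(recv_counts_t, send_counts_t, group=group)
    recv_counts = recv_counts_t.tolist()

    recv_buf = send_buf.new_empty(sum(recv_counts))
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts,
        input_split_sizes=list(send_counts),
        group=group,
    )
    return recv_buf, recv_counts
```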
Tutel v0.3
- The Megablocks solution improves single-GPU inference performance when using multiple local experts.
Tutel v0.2
- Allows dynamic switching of most configurations at no additional cost.
Tutel v0.1
- Reduced the encoding/decoding complexity of data dispatch and introduced the 2DH option for scaling all-to-all to large numbers of devices.
Quick Setup and Usage
Setting Up Tutel with PyTorch 2
To integrate Tutel with PyTorch, follow these steps:
- Install PyTorch (a quick sanity check for this step is sketched after the list):
  - a build for NVIDIA CUDA 11.7 or higher,
  - a build for AMD ROCm 5.4.2,
  - or a CPU-only build.
- Install Tutel:
  - Uninstall any existing versions of Tutel.
  - Use pip to install from the GitHub repository.
- Build from Source (Optional):
  - Clone the GitHub repository.
  - Follow the installation commands to set up Tutel manually.
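Before installing Tutel, it can help to confirm that the local PyTorch build matches one of the supported backends listed above; the check below uses only standard PyTorch introspection (nothing Tutel-specific):

```python
# Quick sanity check of the PyTorch build before installing Tutel.
import torch

print('PyTorch version :', torch.__version__)
print('CUDA build      :', torch.version.cuda)   # e.g. '11.7', or None for CPU/ROCm builds
print('ROCm (HIP) build:', torch.version.hip)    # e.g. '5.4.2', or None
print('GPU available   :', torch.cuda.is_available())
if torch.cuda.is_available():
    print('bf16 supported  :', torch.cuda.is_bf16_supported())
```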
Testing Tutel
Quick tests are available through examples such as HelloWorld on a single GPU; more complex scenarios, such as MoE layers trained on datasets like MNIST or CIFAR10, can be explored in distributed environments.
Distributed Mode Execution
Tutel supports both single-node and multi-node setups for distributed machine learning:
- Torch Launcher: Suitable for multi-node, multi-GPU configurations using PyTorch's distributed launcher (torch.distributed.run).
- Tutel Launcher: An alternative that requires the openmpi-bin package and allows flexible distribution strategies, including CPU-based testing.
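Whichever launcher is used, the training script typically just initializes a process group from the rank information the launcher provides; the sketch below assumes the environment variables set by PyTorch's launcher (RANK, WORLD_SIZE, LOCAL_RANK) and contains nothing Tutel-specific:

```python
# Launcher-agnostic distributed setup sketch (standard PyTorch, not Tutel-specific).
# torch.distributed.run / torchrun exports RANK, WORLD_SIZE and LOCAL_RANK;
# other launchers may use different variable names.
import os
import torch
import torch.distributed as dist

def init_distributed():
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    backend = 'nccl' if torch.cuda.is_available() else 'gloo'   # gloo covers CPU-only testing
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

if __name__ == '__main__':
    rank, world_size, local_rank = init_distributed()
    print(f'rank {rank}/{world_size} ready on local device {local_rank}')
    dist.destroy_process_group()
```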
Advanced Usage
Tutel allows converting checkpoint files to suit different distributed configurations. It also seamlessly integrates with PyTorch through an easy-to-use API, enabling the addition of MoE layers to existing models.
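The exact constructor arguments are documented in the Tutel repository and may differ between versions; the sketch below only illustrates the general shape of dropping an MoE layer into an existing PyTorch module, and the parameter names used here (gate_type, model_dim, experts and their keys) should be treated as assumptions rather than a guaranteed interface:

```python
# Hypothetical sketch of embedding a Tutel MoE layer in an existing model.
# The moe_layer keyword arguments below are assumptions; consult the
# repository's examples for the exact, version-specific signature.
import torch
from tutel import moe as tutel_moe

class BlockWithMoE(torch.nn.Module):
    def __init__(self, model_dim=512, hidden_size=1024, num_local_experts=2):
        super().__init__()
        self.norm = torch.nn.LayerNorm(model_dim)
        self.moe = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2},   # top-2 gating (assumed key names)
            model_dim=model_dim,
            experts={
                'type': 'ffn',
                'count_per_node': num_local_experts,
                'hidden_size_per_expert': hidden_size,
            },
        )

    def forward(self, x):
        # x: [batch, seq_len, model_dim]; the MoE layer stands in for a dense FFN.
        return x + self.moe(self.norm(x))

# Single-GPU usage sketch; multi-GPU runs go through one of the launchers above,
# and some setups may additionally require Tutel's own initialization helpers.
block = BlockWithMoE().cuda()
y = block(torch.randn(4, 128, 512, device='cuda'))
print(y.shape)
```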
Conclusion
Tutel is a sophisticated tool designed for developers and researchers working in large-scale machine learning, particularly those utilizing MoE models. Its diverse set of features and ease of integration make it a valuable asset in optimizing model training and inference.
For technical details, refer to the Tutel paper, and remember to adhere to the contributor guidelines if you plan to contribute to the project.