Trainer: A Comprehensive Introduction
Trainer is an opinionated, general-purpose model trainer built on PyTorch, designed to streamline the training process while remaining flexible enough for advanced configurations. With a simple code base, it offers features that cater to both beginners and experts in machine learning.
Installation
Trainer can be installed in two main ways:
- From GitHub: This method is preferred due to its stability. Clone the repository and install it with the following commands:
git clone https://github.com/coqui-ai/Trainer
cd Trainer
make install
- From PyPI: For convenience, Trainer is also available on PyPI. Run the following command:
pip install trainer
Implementing a Model
To use Trainer, developers subclass TrainerModel and overload its functions. This modular approach allows users to tailor the training process to their specific requirements.
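As a rough sketch, a subclass might look like the following. The method names and return conventions mirror the repository's MNIST example, but the exact signatures (and which methods are required, such as get_data_loader or get_optimizer) should be verified against the TrainerModel base class for the installed version:

from torch import nn
from trainer import TrainerModel

class MyModel(TrainerModel):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    def forward(self, x):
        return self.net(x)

    def train_step(self, batch, criterion):
        # Compute the loss for one training batch; the MNIST example returns
        # (outputs, losses) dictionaries, so that convention is assumed here.
        x, y = batch
        logits = self.forward(x)
        return {"model_outputs": logits}, {"loss": criterion(logits, y)}

    def eval_step(self, batch, criterion):
        return self.train_step(batch, criterion)

    @staticmethod
    def get_criterion():
        return nn.CrossEntropyLoss()

    def get_data_loader(self, *args, **kwargs):
        # Typically returns a torch.utils.data.DataLoader over your dataset;
        # check the base class for the exact signature Trainer expects.
        ...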
Training a Model with Auto-Optimization
Trainer simplifies the model training process by providing tools for auto-optimization. An example is provided using the MNIST dataset, demonstrating how straightforward it is to leverage Trainer for basic training tasks.
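A minimal training entry point might then look like the sketch below. The class names (TrainerArgs, TrainerConfig) and constructor arguments follow the repository's examples but are not reproduced verbatim, and load_samples() is a hypothetical helper standing in for your own data preparation:

from trainer import Trainer, TrainerArgs, TrainerConfig

model = MyModel()  # the sketch from the previous section
config = TrainerConfig(epochs=5, batch_size=64, print_step=25)  # illustrative fields

# Hypothetical helper returning your train/eval splits.
train_samples, eval_samples = load_samples()

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path="runs/mnist",
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()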
Advanced Optimization
Trainer offers advanced optimization capabilities, enabling users to fully customize the optimization cycle. This flexibility is exemplified in the GAN (Generative Adversarial Network) training example, where mixed precision training is handled using the scaled_backward() function and the generator and discriminator updates are managed explicitly.
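Trainer's GAN example drives this loop through the model itself; for orientation only, the sketch below shows an equivalent alternating generator/discriminator update with mixed precision in plain PyTorch. The torch.cuda.amp scaler bookkeeping shown here is roughly what scaled_backward() takes care of, and none of the Trainer-side signatures are reproduced:

import torch

scaler = torch.cuda.amp.GradScaler()

def gan_step(generator, discriminator, opt_g, opt_d, real, criterion, latent_dim=100):
    z = torch.randn(real.size(0), latent_dim, device=real.device)

    # Discriminator update: real images vs. detached fakes.
    opt_d.zero_grad()
    with torch.cuda.amp.autocast():
        fake = generator(z)
        pred_real = discriminator(real)
        pred_fake = discriminator(fake.detach())
        loss_d = criterion(pred_real, torch.ones_like(pred_real)) + \
                 criterion(pred_fake, torch.zeros_like(pred_fake))
    scaler.scale(loss_d).backward()
    scaler.step(opt_d)

    # Generator update: push fakes toward the "real" label.
    opt_g.zero_grad()
    with torch.cuda.amp.autocast():
        pred = discriminator(fake)
        loss_g = criterion(pred, torch.ones_like(pred))
    scaler.scale(loss_g).backward()
    scaler.step(opt_g)

    # One scaler update per iteration covers both optimizers.
    scaler.update()
    return loss_d.item(), loss_g.item()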
Batch Size Finder
Trainer includes a batch size finder. Starting from a user-adjustable batch size (2048 by default), it searches for the largest batch size that fits in GPU memory and trains with it. Instead of calling trainer.fit(), users call trainer.fit_with_largest_batch_size(starting_batch_size=2048).
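Assuming a trainer built as in the earlier sketch, the change is a single line at the call site:

# Instead of trainer.fit(), let Trainer search for the largest batch size that fits in memory.
trainer.fit_with_largest_batch_size(starting_batch_size=2048)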
Distributed Data Parallel (DDP) Training
Trainer supports multi-GPU training from the command line, without the limitations of using .spawn(). This feature offers a flexible approach to scaling up model training across multiple GPUs.
python -m trainer.distribute --script path/to/your/train.py --gpus "0,1"
Training with Accelerate
Trainer integrates with Accelerate to support multi-GPU or distributed training. Setting use_accelerate in TrainingArgs to True enables this feature.
CUDA_VISIBLE_DEVICES="0,1,2" accelerate launch --multi_gpu --num_processes 3 train_recipe_autoregressive_prompt.py
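In code, this amounts to flipping one flag when building the argument object. The field name follows the section above; note that recent versions expose the dataclass as TrainerArgs, so check the exact class and field names for your installed version:

from trainer import TrainerArgs

# Enable the Accelerate integration, then build the Trainer as usual and
# launch the script with `accelerate launch` as shown above.
args = TrainerArgs(use_accelerate=True)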
Callback Support
Trainer includes support for callbacks, allowing users to customize their training runs. Callbacks can be set in model implementations or explicitly provided to the Trainer, offering enhanced control over the training process.
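A minimal sketch of the model-side variant follows; the hook name on_epoch_end and its (trainer) signature are assumptions based on common callback conventions, so consult Trainer's callback documentation for the exact event names it dispatches:

class MyModelWithCallbacks(MyModel):
    def on_epoch_end(self, trainer):
        # Hypothetical hook: runs once an epoch finishes.
        print("epoch finished")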
Profiling
Trainer facilitates profiling, allowing users to generate and analyze profiling data using TensorBoard. Users create a PyTorch profiler, pass it to Trainer, and run TensorBoard to visualize performance insights.
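The PyTorch side of this is standard torch.profiler usage; the hand-off to Trainer is shown below as profile_fit(), following the repository's profiling example, but the exact entry point should be verified for the installed version:

import torch

# Standard torch.profiler setup writing TensorBoard traces into ./profiler/.
torch_profiler = torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler/"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)

# Hand the profiler to Trainer; profile_fit and its arguments follow the repository's
# example and may differ between versions.
trainer.profile_fit(torch_profiler, epochs=1, small_run=64)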
tensorboard --logdir="./profiler/"
Supported Experiment Loggers
Trainer supports a variety of experiment loggers, including TensorBoard, ClearML, MLflow, Aim, and WandB. New loggers can be added by subclassing BaseDashboardLogger.
Anonymized Telemetry
To help improve Trainer for the community, it collects anonymized usage statistics. Users can opt out by setting the environment variable TRAINER_TELEMETRY=0.
In summary, Trainer offers a robust yet flexible platform for model training in PyTorch, catering to a wide range of user needs from simple training tasks to complex optimization cycles.