Introduction to PatrickStar
Meeting PatrickStar
PatrickStar is a groundbreaking system designed to democratize the training of Pre-Trained Models (PTMs), which are increasingly pivotal in natural language processing (NLP) research and industry applications. Traditionally, training these models requires substantial hardware resources, accessible to only a limited segment of the AI community. PatrickStar changes this landscape by making PTM training feasible for a broader audience.
One of the primary obstacles in training PTMs is the out-of-memory (OOM) error. To work around it, teams typically add more GPUs to hold the model's parameters, gradients, and optimizer states. PatrickStar offers a smarter solution: heterogeneous training, which uses both CPU and GPU memory so that larger models can be trained with fewer GPUs.
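The scale of the problem is easy to see from a rough, back-of-the-envelope estimate (the ~16 bytes-per-parameter figure below is the usual rule of thumb for mixed-precision Adam, not a PatrickStar-specific number): parameters, gradients, and optimizer states together quickly add up to hundreds of gigabytes, far more than any single GPU holds.

# Rough estimate of training-state memory for mixed-precision Adam:
# fp16 params (2 B) + fp16 grads (2 B) + fp32 master params (4 B)
# + fp32 momentum (4 B) + fp32 variance (4 B) = ~16 bytes per parameter.
def training_state_gb(num_params, bytes_per_param=16):
    return num_params * bytes_per_param / 1024**3

print(f"{training_state_gb(18e9):.0f} GB")  # ~268 GB of state for an 18B-parameter model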
System Design
The innovation behind PatrickStar lies in its dynamic memory scheduling. Unlike approaches that statically partition model data between CPU and GPU memory, PatrickStar uses a chunk-based memory management module that manages memory dynamically, keeping only the parts of the model needed for the current computation on the GPU and offloading everything else to the CPU to conserve GPU resources. The same chunk-based design also makes collective communication more efficient when scaling up to multiple GPUs, outperforming static solutions.
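To illustrate the idea, here is a simplified sketch, not PatrickStar's actual internals; the names Chunk and ChunkManager are hypothetical. Parameters are grouped into fixed-size chunks that live on the CPU by default and are uploaded to the GPU only while the layers they back are being computed.

import torch

class Chunk:
    """A fixed-size slab of parameter storage that can migrate between devices."""
    def __init__(self, num_elements, device="cpu"):
        self.payload = torch.empty(num_elements, dtype=torch.float16, device=device)

    def move_to(self, device):
        # .to() is a no-op when the payload is already on the target device
        self.payload = self.payload.to(device)

class ChunkManager:
    """Keeps chunks on the CPU and brings in only what the current computation needs."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = []

    def new_chunk(self):
        chunk = Chunk(self.chunk_size)  # parameters start offloaded on the CPU
        self.chunks.append(chunk)
        return chunk

    def access(self, chunk, compute_device):
        chunk.move_to(compute_device)   # upload just before the layer runs

    def release(self, chunk):
        chunk.move_to("cpu")            # offload again to free GPU memory

Because whole chunks, rather than individual tensors, are the unit of movement, transfers are large and contiguous, which is also friendlier to collective communication when the chunks are distributed across multiple GPUs.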
Performance Results
PatrickStar's capabilities have been demonstrated in several experimental setups. For example, version 0.4.3 successfully trained an 18-billion-parameter model on 8 Tesla V100 GPUs in a WeChat data center, handling larger models than competitors like DeepSpeed while also delivering superior performance for models of comparable size.
The software has also been tested on a single node of an NVIDIA A100 SuperPod, training a 68-billion-parameter model across 8 A100 GPUs, a model over six times larger than what DeepSpeed (v0.5.7) can handle.
Moreover, PatrickStar has succeeded in training a 175-billion-parameter GPT-3 model on a relatively small cluster of 32 GPUs. For comparison, Microsoft used 10,000 V100 GPUs to train GPT-3; PatrickStar thus enables users to fine-tune or even pretrain a GPT-3-scale model with considerably fewer resources.
Installation and Usage
Installing PatrickStar is straightforward with pip: clone the repository and run the following from its root directory:
pip install .
It requires GCC version 7 or higher and works well with NVIDIA NGC images, in particular the PyTorch image.
PatrickStar is designed to be easily integrated into existing PyTorch projects, using a configuration format similar to DeepSpeed's. Here is a basic usage example:
from patrickstar.runtime import initialize_engine

config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": (0.9, 0.999),
            "eps": 1e-6,
            "weight_decay": 0,
            "use_hybrid_adam": True,
        },
    },
    "fp16": {  # loss scaler params
        "enabled": True,
        "loss_scale": 0,  # 0 selects dynamic loss scaling, DeepSpeed-style
        "initial_scale_power": 2 ** 3,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "default_chunk_size": 64 * 1024 * 1024,
    "release_after_init": True,
    "use_cpu_embedding": False,
    "client": {
        "mem_tracer": {
            # originally args.with_async_mem_monitor; set directly here so the example is self-contained
            "use_async_mem_monitor": True,
        }
    },
}

def model_func():
    # MyModel is your torch.nn.Module subclass
    return MyModel(...)

model, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)
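After initialization, training looks much like a regular PyTorch loop. The sketch below assumes a train_loader that yields batches and a model whose forward pass returns the loss; note that the backward pass goes through the engine rather than loss.backward(), following the DeepSpeed-style API used in the project's examples.

for data in train_loader:
    optimizer.zero_grad()
    loss = model(data)     # forward pass returns the loss in this sketch
    model.backward(loss)   # backward goes through the engine rather than loss.backward()
    optimizer.step()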
Further details on configurations and examples can be found in the project's documentation and guides.
Conclusion
PatrickStar is a sophisticated tool that advances beyond traditional PTM training methods by optimizing memory management and easing hardware constraints. It is a crucial development for individuals and organizations looking to engage with PTM training without investing in extensive GPU resources.
Further Information
For those interested in exploring PatrickStar further, the code, along with guides, examples, and benchmarking scripts, is available in the project's repository, giving users what they need to get the most out of PTM training with PatrickStar.