Makani: Massively Parallel Training of Machine-Learning Based Weather and Climate Models
Overview
Makani, a word meaning wind in Hawaiian, is an experimental library for the research and development of machine-learning-based weather and climate models in PyTorch. It is used in NVIDIA's research on data-driven weather and climate prediction, and stable features are periodically integrated into NVIDIA Modulus, a framework for training physics-ML models across scientific and engineering domains.
Makani was initially developed by engineers and researchers at NVIDIA and the National Energy Research Scientific Computing Center (NERSC) to train FourCastNet, a deep-learning weather prediction model. The library is built for high-performance, massively parallel training and can scale to more than 100 GPUs, which makes it well suited to developing the next generation of weather and climate models, such as Spherical Fourier Neural Operators (SFNO) and Adaptive Fourier Neural Operators (AFNO), on datasets like ERA5.
Getting Started
Getting started with Makani is straightforward: clone the GitHub repository and install the package. Training a model involves running a training script with configuration arguments that determine how the session proceeds. The library supports optimizations such as automatic mixed precision, just-in-time compilation, and several forms of parallelism, which together improve memory efficiency and computational speed.
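As a rough illustration of what automatic mixed precision and just-in-time compilation look like in a PyTorch training step (this is a generic sketch, not Makani's actual training code; the model, data, and hyperparameters are placeholders):

```python
import torch

# Placeholder model and data; Makani's real models (e.g. SFNO) and ERA5
# dataloaders are considerably more involved. Requires a CUDA-capable GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.GELU(), torch.nn.Linear(128, 128)
).cuda()
model = torch.compile(model)                       # just-in-time compilation
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()               # keeps fp16 gradients numerically stable

x = torch.randn(16, 128, device="cuda")
y = torch.randn(16, 128, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # automatic mixed precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                  # scale the loss, then backpropagate
    scaler.step(optimizer)
    scaler.update()
```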
For instance, a large-scale run might use 256 GPUs with automatic mixed precision enabled to substantially reduce per-GPU memory requirements. These optimizations make it practical to train models that would otherwise exceed available memory or take prohibitively long.
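Scaling to many GPUs builds on PyTorch's distributed machinery. The snippet below is a minimal data-parallel sketch using torch.distributed and DistributedDataParallel; Makani layers additional forms of model parallelism on top of this, which are not shown here.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched via a distributed launcher (e.g. torchrun), which sets RANK,
# WORLD_SIZE, and LOCAL_RANK for every process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 128).cuda()
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks

# ... usual training loop; each rank processes its own shard of the data ...

dist.destroy_process_group()
```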
Inference
Inference, the process of making predictions with a trained model, is similarly straightforward in Makani. It is launched with a single command from the command line and can employ the same optimizations used during training, such as mixed precision, to keep it efficient.
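Conceptually, inference with an autoregressive weather model is a rollout: the prediction for one time step is fed back in as the input for the next. Below is a minimal sketch of this pattern in plain PyTorch (the placeholder network, state shapes, and the use of bfloat16 autocast are illustrative assumptions, not Makani's actual inference API):

```python
import torch

# Placeholder network standing in for a trained model restored from a checkpoint,
# e.g. via model.load_state_dict(torch.load(...)).
channels, nlat, nlon = 4, 32, 64
model = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1).eval()

state = torch.randn(1, channels, nlat, nlon)       # initial atmospheric state

predictions = []
with torch.no_grad():                              # no gradients at inference time
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # mixed precision, as in training
        for step in range(8):                      # autoregressive rollout
            state = model(state)
            predictions.append(state)
```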
More About Makani
Makani's repository is organized around the main stages of developing machine-learning models for weather and climate applications: it contains configuration files, data processing scripts, and utilities for model parallelism within a clearly structured directory layout. The repository also includes scripts for building Docker images, which are useful for deploying models in different computing environments.
Model and Training Configuration
Training runs in Makani are defined by YAML configuration files that specify the network architecture, loss function, optimizer, learning rate, and many other hyperparameters. These configurations tailor the training process to the requirements of the model being developed.
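The exact keys and layout are defined by the YAML files shipped with Makani; as a hedged illustration of the general pattern (the key names below are hypothetical), such a configuration is parsed and then mapped onto the model, loss, and optimizer:

```python
import yaml  # PyYAML

# Hypothetical configuration snippet; the real key names and structure are
# defined by the YAML files in the Makani repository.
config_text = """
model:
  architecture: sfno
  embed_dim: 256
  num_layers: 8
training:
  loss: l2
  optimizer: adam
  learning_rate: 1.0e-3
  max_epochs: 80
"""

cfg = yaml.safe_load(config_text)

# The parsed dictionary is then used to construct the network, loss, and optimizer.
print(cfg["model"]["architecture"], cfg["training"]["learning_rate"])
```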
Training Data
Makani reads training and test data in HDF5 format, with the data for each year stored in a single file. Input and target samples drawn from this data represent atmospheric states at two distinct points in time. Makani also requires metadata files describing dataset properties that are needed for correct data loading and processing.
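As a hedged sketch of how such a yearly HDF5 file might be read (the file name, the dataset key "fields", and the (time, channel, lat, lon) layout are assumptions for illustration):

```python
import h5py

# Hypothetical layout: one HDF5 file per year holding a single dataset of
# shape (num_timesteps, num_channels, num_lat, num_lon).
with h5py.File("era5_2018.h5", "r") as f:
    fields = f["fields"]            # assumed dataset name
    t = 100
    input_state = fields[t]         # atmospheric state at time t
    target_state = fields[t + 1]    # state one step later, used as the training target
    print(input_state.shape, target_state.shape)
```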
Known Issues
Architectures that use complex-valued weights are currently affected by limitations in PyTorch's support for complex tensors. A hotfix is available as a temporary workaround until these issues are resolved upstream.
Contributing
Makani welcomes contributions from the community. Whether by reporting bugs, suggesting features, or directly contributing code, user involvement is deeply appreciated. Contributors are encouraged to adhere to the project’s standards, including providing unit tests for new features.
Further Reading and References
For further reading, related resources include NVIDIA Modulus, the ECMWF ERA5 dataset, and libraries and tools such as torch-harmonics (differentiable spherical harmonic transforms) and NVIDIA Apex (mixed-precision utilities). These provide useful background on the methods and ecosystem that Makani builds on.