TorchMD-NET: A Comprehensive Overview
Introduction
TorchMD-NET provides state-of-the-art neural network potentials (NNPs) and a mechanism to train them. It integrates seamlessly with GPU-accelerated molecular dynamics engines such as ACEMD, OpenMM, and TorchMD, and represents NNPs as PyTorch modules. The primary objective of TorchMD-NET is to deliver fast, efficient implementations of several neural network potentials.
Documentation
To help users navigate and use the features of TorchMD-NET, comprehensive documentation is available on the project's documentation site.
Available Architectures
TorchMD-NET supports several state-of-the-art architectures, including:
- Equivariant Transformer (ET)
- Transformer (T)
- Graph Network (GN)
- TensorNet
Installation
TorchMD-NET is distributed via conda-forge and can be installed using Mamba with the command:
mamba install torchmd-net
For users preferring installation from source, detailed instructions are available on the installation documentation page.
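As a rough sketch, assuming the usual clone-and-install workflow (the authoritative steps, including environment setup, are on that page), a source installation looks like:

git clone https://github.com/torchmd/torchmd-net.git
cd torchmd-net
pip install -e .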
Usage
Users can specify training arguments either through a configuration YAML file or directly via command-line arguments; the repository contains several example configurations covering architecture and training settings that can serve as references. The GPUs used for training are selected through the CUDA_VISIBLE_DEVICES environment variable. For instance, to train the Equivariant Transformer architecture on the QM9 dataset, one might execute:
mkdir output
CUDA_VISIBLE_DEVICES=0 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --log-dir output/
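For orientation, a configuration file pairs the model and dataset choice with training hyperparameters. The excerpt below is illustrative rather than a verbatim copy of the shipped ET-QM9.yaml; the keys mirror the torchmd-train command-line options:

model: equivariant-transformer
dataset: QM9
dataset_arg: energy_U0
batch_size: 128
lr: 1e-4
num_epochs: 300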
Pretrained Models
TorchMD-NET provides pretrained models; instructions for loading them are given in the project documentation.
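As a minimal sketch, a checkpoint loads as a regular PyTorch module via load_model; the checkpoint filename and the toy molecule below are placeholders, and derivative=True requests forces as the negative gradient of the predicted energy:

import torch
from torchmdnet.models.model import load_model

# Load a trained checkpoint (placeholder path) and enable force computation.
model = load_model("model.ckpt", derivative=True)

# Toy input: atomic numbers and coordinates (in the dataset's units) for water.
z = torch.tensor([8, 1, 1], dtype=torch.long)
pos = torch.tensor([[0.00, 0.00, 0.00],
                    [0.76, 0.59, 0.00],
                    [-0.76, 0.59, 0.00]], dtype=torch.float32)

energy, forces = model(z, pos)  # per-molecule energy, per-atom forces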
Custom Dataset Creation
For tailored training applications, users can employ torchmdnet.datasets.Custom to manage custom datasets of atom types and coordinates. Alternatively, more bespoke datasets can be created by deriving from the Dataset or InMemoryDataset classes of the torch-geometric framework, ensuring all necessary data is returned in the expected format.
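As a minimal sketch of the second route, the class below derives from torch-geometric's Dataset. The field names z, pos, and y (atomic numbers, coordinates, target energy) follow the convention of TorchMD-NET's bundled datasets; MyDataset itself and its constructor arguments are hypothetical:

import torch
from torch_geometric.data import Data, Dataset

class MyDataset(Dataset):
    """Hypothetical dataset serving pre-computed conformations."""

    def __init__(self, types, coords, energies):
        super().__init__()
        # One sample per conformation: atom types, positions, and energy.
        self.samples = [
            Data(z=torch.as_tensor(t, dtype=torch.long),
                 pos=torch.as_tensor(c, dtype=torch.float32),
                 y=torch.as_tensor([[e]], dtype=torch.float32))
            for t, c, e in zip(types, coords, energies)
        ]

    def len(self):
        return len(self.samples)

    def get(self, idx):
        return self.samples[idx]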
Custom Prior Models
Custom prior models can be added by defining a new class in torchmdnet.priors and including it via the argument --prior-model <PriorModelName>. For guidance, refer to torchmdnet.priors.Atomref as an example.
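As a hedged sketch, a prior adding a constant per-atom energy offset might look like the class below. It assumes a BasePrior base class with pre_reduce and get_init_args hooks, mirroring Atomref; the exact interface may differ between versions, so torchmdnet.priors.Atomref remains the authoritative reference:

import torch
from torchmdnet.priors.base import BasePrior

class ConstantOffset(BasePrior):
    """Hypothetical prior adding a fixed per-atom energy offset."""

    def __init__(self, offset=0.0, dataset=None):
        super().__init__()
        self.offset = offset

    def get_init_args(self):
        # Arguments needed to re-create the prior when loading a checkpoint.
        return dict(offset=self.offset)

    def pre_reduce(self, x, z, pos, batch):
        # x holds per-atom predictions before they are reduced per molecule.
        return x + self.offset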
Multi-Node Training
TorchMD-NET facilitates multi-node training, which requires setting specific environment variables to enable inter-node communication via NCCL. Here’s an example setup script:
export NODE_RANK=0
export MASTER_ADDR=hostname1
export MASTER_PORT=12910
mkdir -p output
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/
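The same command is then repeated on every other node with NODE_RANK adjusted; MASTER_ADDR and MASTER_PORT must match across nodes. On the second node, for example:

export NODE_RANK=1
export MASTER_ADDR=hostname1
export MASTER_PORT=12910
mkdir -p output
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/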
Known Limitations
- Each node must use the same number of GPUs; mismatched setups cause errors.
- Performance degrades significantly when nodes use different GPU architectures.
- CUDA can occasionally hang during training; this can sometimes be mitigated by disabling peer-to-peer communication with export NCCL_P2P_DISABLE=1.
Citation
Researchers using TorchMD-NET in their academic endeavors are encouraged to cite relevant papers listed in the project documentation to acknowledge their contributions.
Developer Guide
For developers interested in extending TorchMD-NET, there are detailed steps for implementing new architectures and maintaining consistent code style using black. Running tests ensures the robustness and usability of any new addition to the package.
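As a quick sketch, assuming the repository's standard black and pytest setup, formatting and tests can be run from the repository root with:

pip install black pytest
black .
pytest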
In summary, TorchMD-NET stands as a powerful resource for the molecular dynamics community, offering flexibility in model training and execution. With its integration into widely-used platforms and the provision of comprehensive documentation, it is positioned well to support both academic researchers and industry professionals.