TorchMD-NET: A Comprehensive Overview
Introduction
TorchMD-NET provides state-of-the-art neural network potentials (NNPs) and a mechanism to train them. It integrates seamlessly with GPU-accelerated molecular dynamics engines such as ACEMD, OpenMM, and TorchMD, and represents NNPs as PyTorch modules. The primary objective of TorchMD-NET is to deliver fast, efficient implementations of several neural network potentials.
Documentation
To help users navigate and use the features of TorchMD-NET, comprehensive documentation is available on the project's documentation site.
Available Architectures
TorchMD-NET supports several state-of-the-art architectures, including:
- Equivariant Transformer (ET)
- Transformer (T)
- Graph Network (GN)
- TensorNet
Installation
TorchMD-NET is distributed via conda-forge and can be installed using Mamba with the command:
mamba install torchmd-net
For users preferring installation from source, detailed instructions are available on the installation documentation page.
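As a rough sketch, assuming the usual clone-and-install workflow (the authoritative steps, including environment setup, are on that page), a source installation looks like:

git clone https://github.com/torchmd/torchmd-net.git
cd torchmd-net
pip install -e .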
Usage
Users can specify training arguments either through a configuration YAML file or directly via command-line arguments; the repository contains several example configurations covering architecture and training settings that can serve as references. The GPUs used for training are selected through the CUDA_VISIBLE_DEVICES environment variable. For instance, to train the Equivariant Transformer architecture on the QM9 dataset, one might execute:
mkdir output
CUDA_VISIBLE_DEVICES=0 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --log-dir output/
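For orientation, a configuration file pairs the model and dataset choice with training hyperparameters. The excerpt below is illustrative rather than a verbatim copy of the shipped ET-QM9.yaml; the keys mirror the torchmd-train command-line options:

model: equivariant-transformer
dataset: QM9
dataset_arg: energy_U0
batch_size: 128
lr: 1e-4
num_epochs: 300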
Pretrained Models
TorchMD-NET provides pretrained models; instructions for loading them are given in the project documentation.
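As a minimal sketch, a checkpoint loads as a regular PyTorch module via load_model; the checkpoint filename and the toy molecule below are placeholders, and derivative=True requests forces as the negative gradient of the predicted energy:

import torch
from torchmdnet.models.model import load_model

# Load a trained checkpoint (placeholder path) and enable force computation.
model = load_model("model.ckpt", derivative=True)

# Toy input: atomic numbers and coordinates (in the dataset's units) for water.
z = torch.tensor([8, 1, 1], dtype=torch.long)
pos = torch.tensor([[0.00, 0.00, 0.00],
                    [0.76, 0.59, 0.00],
                    [-0.76, 0.59, 0.00]], dtype=torch.float32)

energy, forces = model(z, pos)  # per-molecule energy, per-atom forces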
Custom Dataset Creation
For tailored training applications, users can employ torchmdnet.datasets.Custom to manage custom datasets of atom types and coordinates. Alternatively, more bespoke datasets can be created by deriving from the Dataset or InMemoryDataset classes of the torch-geometric framework, ensuring all necessary data is returned in the expected format.
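As a minimal sketch of the second route, the class below derives from torch-geometric's Dataset. The field names z, pos, and y (atomic numbers, coordinates, target energy) follow the convention of TorchMD-NET's bundled datasets; MyDataset itself and its constructor arguments are hypothetical:

import torch
from torch_geometric.data import Data, Dataset

class MyDataset(Dataset):
    """Hypothetical dataset serving pre-computed conformations."""

    def __init__(self, types, coords, energies):
        super().__init__()
        # One sample per conformation: atom types, positions, and energy.
        self.samples = [
            Data(z=torch.as_tensor(t, dtype=torch.long),
                 pos=torch.as_tensor(c, dtype=torch.float32),
                 y=torch.as_tensor([[e]], dtype=torch.float32))
            for t, c, e in zip(types, coords, energies)
        ]

    def len(self):
        return len(self.samples)

    def get(self, idx):
        return self.samples[idx]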
Custom Prior Models
Custom prior models can be added by defining a new class in torchmdnet.priors and including it via the argument --prior-model <PriorModelName>. For guidance, refer to torchmdnet.priors.Atomref as an example.
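As a hedged sketch, a prior adding a constant per-atom energy offset might look like the class below. It assumes a BasePrior base class with pre_reduce and get_init_args hooks, mirroring Atomref; the exact interface may differ between versions, so torchmdnet.priors.Atomref remains the authoritative reference:

import torch
from torchmdnet.priors.base import BasePrior

class ConstantOffset(BasePrior):
    """Hypothetical prior adding a fixed per-atom energy offset."""

    def __init__(self, offset=0.0, dataset=None):
        super().__init__()
        self.offset = offset

    def get_init_args(self):
        # Arguments needed to re-create the prior when loading a checkpoint.
        return dict(offset=self.offset)

    def pre_reduce(self, x, z, pos, batch):
        # x holds per-atom predictions before they are reduced per molecule.
        return x + self.offset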
Multi-Node Training
TorchMD-NET facilitates multi-node training, which requires setting specific environment variables to enable inter-node communication via NCCL. Here’s an example setup script:
export NODE_RANK=0
export MASTER_ADDR=hostname1
export MASTER_PORT=12910
mkdir -p output
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/
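The same command is then repeated on every other node with NODE_RANK adjusted; MASTER_ADDR and MASTER_PORT must match across nodes. On the second node, for example:

export NODE_RANK=1
export MASTER_ADDR=hostname1
export MASTER_PORT=12910
mkdir -p output
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/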
Known Limitations
- Each node must use the same number of GPUs; mismatched setups cause errors.
- Performance degrades significantly when nodes use different GPU architectures.
- CUDA can occasionally hang during training; this can sometimes be mitigated by disabling peer-to-peer communication with export NCCL_P2P_DISABLE=1.
Citation
Researchers using TorchMD-NET in their academic endeavors are encouraged to cite relevant papers listed in the project documentation to acknowledge their contributions.
Developer Guide
For developers interested in extending TorchMD-NET, there are detailed steps for implementing new architectures and maintaining consistent code style using black. Running tests ensures the robustness and usability of any new addition to the package.
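As a quick sketch, assuming the repository's standard black and pytest setup, formatting and tests can be run from the repository root with:

pip install black pytest
black .
pytest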
In summary, TorchMD-NET stands as a powerful resource for the molecular dynamics community, offering flexibility in model training and execution. With its integration into widely-used platforms and the provision of comprehensive documentation, it is positioned well to support both academic researchers and industry professionals.