foldingdiff - Enhancing Protein Backbone Design with a Diffusion Model

Introduction to foldingdiff: A Diffusion Model for Protein Backbone Generation

The foldingdiff project introduces a diffusion model that generates new protein backbone structures. Its aim is to advance computational protein design, a field with applications in biotechnology and pharmaceuticals. Protein backbones act as the scaffold on which biological function is built, and the model offers a new way to generate them. A preprint is available on arXiv, and the trained model can be tried interactively through HuggingFace Spaces or SuperBio.

How to Get Started

To use foldingdiff, clone the project repository and set up a Python environment built on PyTorch, PyTorch Lightning, and HuggingFace's transformers library. Installation takes three commands, run from the repository root:

conda env create -f environment.yml
conda activate foldingdiff
pip install -e ./
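
After these commands finish, it can be useful to confirm that the environment actually contains the libraries named above. The following is a minimal sketch, not part of the project: the helper missing_packages is hypothetical, and the import names in REQUIRED (including foldingdiff as the name of the installed package) are assumptions based on the package names mentioned in this section.

```python
import importlib.util

# Import names assumed to match the libraries named in the text above.
# "foldingdiff" as an importable package name is an assumption.
REQUIRED = ["torch", "pytorch_lightning", "transformers", "foldingdiff"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it inside the activated conda environment. An empty result only shows that the packages are discoverable, not that their versions match environment.yml.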

Training requires additional data from the CATH dataset, which can be downloaded with a script provided in the data directory.

Training Your Own Models

Researchers who want to train custom models on the CATH dataset can use the script at bin/train.py. The script reads a configuration file that sets the model parameters, so users can modify the configuration to test different training conditions.
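
As background for what such training involves, a diffusion model learns to undo noise that is progressively added to its training data. The sketch below illustrates only the noising step and is not taken from bin/train.py. It assumes a backbone is represented as a list of angles in radians and uses a linear noise schedule for simplicity; both are assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random


def wrap_angle(x):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    Requires num_steps >= 2. The schedule endpoints are illustrative defaults.
    """
    alpha_bars = []
    product = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noise_angles(angles, t, alpha_bars, rng):
    """Sample noised angles at step t, then wrap them back into [-pi, pi)."""
    keep = math.sqrt(alpha_bars[t])         # weight on the clean angle
    noise = math.sqrt(1.0 - alpha_bars[t])  # weight on the Gaussian noise
    return [wrap_angle(keep * a + noise * rng.gauss(0.0, 1.0)) for a in angles]


if __name__ == "__main__":
    rng = random.Random(0)
    schedule = linear_alpha_bars(1000)
    clean = [-1.0, 0.5, 2.5]  # placeholder angles, not real protein data
    for step in (0, 499, 999):
        print(step, noise_angles(clean, step, schedule, rng))
```

At early steps the noised angles stay close to the clean ones, and at the final step they are close to pure noise. A model would be trained to predict the noise that was added, which is the part this sketch leaves out.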

Utilizing Pre-trained Models

foldingdiff provides pre-trained model weights through HuggingFace's model hub, so the model can be evaluated or used for protein backbone generation without training it first.

Sampling and Generating Protein Structures

To generate protein backbones, foldingdiff provides a script (bin/sample.py) that uses the pre-trained model weights. The script produces backbones of varying lengths and runs fastest on an NVIDIA GPU.

Evaluation of Backbone Designability

foldingdiff also evaluates the designability of the backbones it generates. Inverse folding methods predict amino acid sequences that might fold back into each generated structure. Both ProteinMPNN and ESM-IF1 are evaluated, with attention to how well their predicted sequences recover the intended structures.

Inverse Folding Techniques

Inverse folding is the task of designing an amino acid sequence that folds into a given protein structure. foldingdiff uses tools such as ESM-IF1 and ProteinMPNN to propose candidate sequences for the backbones generated by the model.

Structural Prediction

Tools like OmegaFold and AlphaFold2 predict the structure of each designed sequence, which shows whether it folds as anticipated. OmegaFold is preferred for its speed and because its design suits the experimental needs of the project; AlphaFold2 can be used instead where warranted.
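
The evaluation described in the last three sections can be summarized as a short calculation: each generated backbone gets several candidate sequences, each sequence is refolded, and each refolded structure is scored against the original backbone. The sketch below assumes those similarity scores have already been computed and lie between 0 and 1, and it treats a backbone as designable when its best score reaches a threshold. The function name, the threshold of 0.5, and the example numbers are all assumptions for illustration, not results from the project.

```python
from typing import Dict, List, Tuple


def summarize_designability(
    scores: Dict[str, List[float]], threshold: float = 0.5
) -> Tuple[Dict[str, float], float]:
    """Return the best score per backbone and the fraction deemed designable.

    Backbones without any scores are skipped. The threshold is an assumed
    cutoff, not a value taken from the text above.
    """
    best = {name: max(vals) for name, vals in scores.items() if vals}
    if not best:
        return best, 0.0
    n_designable = sum(1 for value in best.values() if value >= threshold)
    return best, n_designable / len(best)


if __name__ == "__main__":
    # Placeholder scores for illustration only; these are not real results.
    example = {
        "backbone_a": [0.31, 0.62, 0.48],
        "backbone_b": [0.22, 0.35],
        "backbone_c": [0.71],
        "backbone_d": [0.44, 0.51],
    }
    best_scores, fraction = summarize_designability(example)
    print(best_scores)
    print(f"designable fraction: {fraction:.2f}")
```

Taking the maximum over candidate sequences reflects the idea that a backbone counts as designable if at least one proposed sequence refolds into it.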

Testing and Validation

To ensure reliability, foldingdiff includes a series of doctests and unit tests. These tests check that the code behaves as expected, which supports the reliability of results produced with the model.
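
For readers unfamiliar with doctests, the following minimal sketch shows the general pattern: examples written in a function's docstring are executed and their output is compared against what the docstring says. The function angular_distance is a hypothetical helper written for this illustration and is not taken from the foldingdiff code base.

```python
import doctest
import math


def angular_distance(a, b):
    """Smallest absolute difference between two angles in radians.

    >>> angular_distance(0.0, 0.0)
    0.0
    >>> round(angular_distance(3.0, -3.0), 4)
    0.2832
    """
    diff = (a - b) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    print(doctest.testmod())
```

Saving this to a file and running it with Python executes both docstring examples. A unit test expresses the same kind of check as an explicit assertion in a separate test file.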

In conclusion, foldingdiff is a step forward in computational protein design, combining a diffusion model for backbone generation with inverse folding and structure prediction to assess what it generates. With released code, pre-trained weights, and hosted demos, it is already usable by researchers.