The PLINDER Project: A Comprehensive Resource for Protein-Ligand Interaction Studies
Overview
The PLINDER project, short for Protein Ligand INteractions Dataset and Evaluation Resource, is an annotated dataset for training and evaluating protein-ligand docking algorithms. It contains more than 400,000 protein-ligand interaction (PLI) systems, spanning over 11,000 SCOP domains and more than 50,000 unique small molecules.
Key Features
- Extensive Dataset: Each PLI system carries over 500 annotations, covering protein and ligand properties, system quality, and more.
- Automated Curation: The dataset uses an automated pipeline to stay current with updates from the Protein Data Bank (PDB).
- Evaluation Metrics: The project includes 14 different PLI metrics and provides over 20 billion similarity scores, enabling robust performance evaluation.
- Diverse Structure Availability: Unbound (apo) and predicted AlphaFold2 structures are linked to their bound (holo) counterparts.
- Data Splits: The resource provides train-validation-test splits that can be adjusted for specific learning tasks.
- Community Effort: PLINDER is a collaborative project initiated by leading institutions such as the University of Basel and the SIB Swiss Institute of Bioinformatics, among others.
Community and Challenge
To encourage adoption of PLINDER as a standard protein-ligand interaction dataset, a competition built on it is set to take place at the 2024 Machine Learning in Structural Biology (MLSB) Workshop at NeurIPS.
Version Control
PLINDER identifies each dataset version by two values:
- PLINDER_RELEASE: This refers to the latest synchronization with the RCSB PDB.
- PLINDER_ITERATION: This value supports iterative improvements within a release.
The dataset's current version (2024-06/v2) has introduced several enhancements, such as new systems, better stability in system definitions, improved ligand handling, and enriched dataset diversity.
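As a rough illustration of how the two values combine into a version label such as 2024-06/v2, the sketch below parses and rebuilds that label. It assumes PLINDER_RELEASE and PLINDER_ITERATION are read as environment variables, which the text above does not state explicitly; the class and function names are hypothetical and not part of the plinder package.

```python
import os
from typing import NamedTuple


class PlinderVersion(NamedTuple):
    """Hypothetical container for a 'release/iteration' version label."""

    release: str    # e.g. "2024-06", the PDB synchronization point
    iteration: str  # e.g. "v2", the iteration within that release

    def __str__(self) -> str:
        return f"{self.release}/{self.iteration}"


def parse_version(text: str) -> PlinderVersion:
    """Split a 'release/iteration' string into its two parts."""
    release, sep, iteration = text.strip().partition("/")
    if not sep or not release or not iteration:
        raise ValueError(f"expected 'release/iteration', got {text!r}")
    return PlinderVersion(release, iteration)


def version_from_env(default: str = "2024-06/v2") -> PlinderVersion:
    """Read both values from the environment (an assumption), else use the default."""
    fallback = parse_version(default)
    return PlinderVersion(
        os.environ.get("PLINDER_RELEASE", fallback.release),
        os.environ.get("PLINDER_ITERATION", fallback.iteration),
    )


print(parse_version("2024-06/v2"))
```

Keeping the release and iteration as separate fields makes it straightforward to pin an analysis to one exact dataset version.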
Benchmarking and Evaluation
PLINDER provides gold standard benchmark sets designed to minimize data leakage based on interaction similarity. These sets have been curated for minimal redundancy and enriched test set diversity, prioritizing high-quality structures for benchmarking purposes.
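To make the idea of similarity-based leakage control concrete, here is a minimal sketch that drops training systems which are too similar to any test system. It is not PLINDER's actual splitting procedure: the function name, the system identifiers, the score dictionary, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List


def remove_leaky_systems(
    train_ids: Iterable[str],
    test_ids: Iterable[str],
    similarity: Dict[FrozenSet[str], float],
    threshold: float,
) -> List[str]:
    """Keep only training systems whose similarity to every test system is below threshold.

    `similarity` maps an unordered pair of system IDs to a score;
    pairs that are absent are treated as having similarity 0.0.
    """
    test_list = list(test_ids)
    kept = []
    for train_id in train_ids:
        max_sim = max(
            (similarity.get(frozenset((train_id, test_id)), 0.0) for test_id in test_list),
            default=0.0,
        )
        if max_sim < threshold:
            kept.append(train_id)
    return kept


# Toy scores, invented for this example only.
scores = {
    frozenset(("sys_a", "sys_x")): 0.9,
    frozenset(("sys_b", "sys_x")): 0.2,
}
kept = remove_leaky_systems(["sys_a", "sys_b", "sys_c"], ["sys_x"], scores, threshold=0.5)
print(kept)  # ['sys_b', 'sys_c']
```

In practice the score would be one of the interaction similarity metrics the dataset provides, and the threshold would depend on the learning task.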
Getting Started
The PLINDER dataset is accessible in two primary ways:
- Direct Download: Data files can be downloaded directly using tools like gsutil.
- Python Integration: The plinder Python package is available for easy interaction with the dataset, installable via PyPI.
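For the direct-download route, a sketch of assembling a gsutil invocation from a release and iteration might look like the following. The bucket name, the bucket layout, and the local destination directory are assumptions rather than details given above, so check the project documentation before running the resulting command. The helper only builds the argument list and does not execute anything.

```python
import shlex
from typing import List


def build_gsutil_command(
    release: str,
    iteration: str,
    bucket: str = "plinder",                    # assumed bucket name
    dest_root: str = "~/.local/share/plinder",  # assumed local data directory
) -> List[str]:
    """Return a gsutil argument list for copying one dataset version."""
    source = f"gs://{bucket}/{release}/{iteration}/*"
    dest = f"{dest_root.rstrip('/')}/{release}/{iteration}"
    # -m enables parallel transfers; cp -r copies recursively.
    return ["gsutil", "-m", "cp", "-r", source, dest]


command = build_gsutil_command("2024-06", "v2", dest_root="/tmp/plinder")
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.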
Documentation and Further Information
For a deeper dive into the functionalities and applications of PLINDER, detailed documentation is available on the project's website.
Citation
For referencing PLINDER in academic work, please cite the foundational paper by Durairaj et al., available on bioRxiv.
In conclusion, PLINDER offers those who study and develop protein-ligand docking methods a large, extensively annotated dataset, benchmark sets designed to limit data leakage, and a community effort built around it.