scGPT: A Foundation Model for Single-Cell Multi-omics
Overview
scGPT is a project to build a foundation model for single-cell multi-omics using generative AI. It applies generative pretraining to the rich, complex data produced by single-cell studies and provides tools and resources that support research and analysis in biology and medicine.
Key Features and Updates
- Pretrained Model Checkpoints: scGPT offers a variety of pretrained model checkpoints, allowing users to explore and utilize models specifically trained on different cell types and datasets. These models are readily available for download, ranging from general human cell models to more specific ones like brain or lung cells.
- Integration with HuggingFace: The team has been working on integrating the pretraining workflow with HuggingFace, a popular machine learning platform. This integration is aimed at making scGPT models more accessible and easier to implement.
- Zero-Shot Applications: New tutorials have been released to guide users on zero-shot applications, which allow for immediate predictions without additional training. This feature is particularly useful for tasks such as cell annotation and gene regulatory network analysis.
- Flash-Attention Compatibility: scGPT now offers flexibility by making flash-attention an optional dependency. This allows pretrained weights to be efficiently loaded across different computing environments, including CPUs, GPUs, and flash-attn setups.
- Efficient Reference Mapping: With the help of the faiss library, scGPT achieves efficient reference mapping, capable of processing millions of cells quickly and with minimal resource usage. This feature significantly enhances the ability to analyze large-scale cell datasets.
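The reference mapping idea in the last item can be illustrated with a small nearest-neighbor sketch. This is not scGPT's own code: the function names `nearest_neighbors` and `map_labels`, the toy 2-D embeddings, and the majority-vote rule are assumptions made for illustration. It uses the faiss exact L2 index when faiss is installed and falls back to a brute-force NumPy search otherwise.

```python
from collections import Counter

import numpy as np

try:
    import faiss  # optional; used only when installed
except ImportError:
    faiss = None


def nearest_neighbors(ref_emb, query_emb, k):
    """Return, for each query row, the indices of its k nearest reference rows (L2)."""
    ref = np.ascontiguousarray(ref_emb, dtype=np.float32)
    query = np.ascontiguousarray(query_emb, dtype=np.float32)
    if faiss is not None:
        index = faiss.IndexFlatL2(ref.shape[1])  # exact (non-approximate) L2 index
        index.add(ref)
        _, idx = index.search(query, k)
        return idx
    # Fallback: pairwise squared distances, then take the k smallest per query.
    d2 = ((query[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]


def map_labels(ref_emb, ref_labels, query_emb, k=3):
    """Assign each query cell the majority label among its k nearest reference cells."""
    idx = nearest_neighbors(ref_emb, query_emb, k)
    return [Counter(ref_labels[j] for j in row).most_common(1)[0][0] for row in idx]


# Toy embeddings standing in for cell embeddings (hypothetical data).
ref_emb = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_emb = [[0.2, 0.1], [10.5, 10.2]]

predicted = map_labels(ref_emb, ref_labels, query_emb, k=3)
print(predicted)  # ['T cell', 'B cell']
```

The brute-force fallback materializes every query-to-reference distance, so it only suits small examples; an index library such as faiss is what makes this pattern practical at the scale of millions of cells.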
Online Applications
scGPT provides several web-based applications that allow users to experiment with the model directly through their browsers. These include:
- Reference Mapping App: Facilitates mapping samples to reference datasets.
- Cell Annotation App: Assists in identifying and labeling cell types.
- GRN Inference App: Supports inference of gene regulatory networks.
These applications are hosted on Superbio.ai, making them easily accessible for users to begin their explorations.
Installation
scGPT requires Python 3.7.13 or higher and R 3.6.1 or higher. It can be installed from PyPI, with flash-attention as an optional dependency for better performance. Alternative installation instructions are provided for users who run into issues with the orbax package or who need a specific CUDA version.
For development, scGPT uses the Poetry package manager. Some dependencies, such as flash-attn, may require particular GPU and CUDA configurations.
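A quick way to check an environment against these requirements is sketched below. It is a minimal example, not part of scGPT: the helper names (`python_ok`, `has_module`, `pick_attention_backend`) and the backend labels are hypothetical, and the import name `flash_attn` is assumed to be the one the flash-attn package exposes.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7, 13)  # minimum Python version stated above


def python_ok(version=None):
    """Return True if the given (or running) interpreter meets the minimum version."""
    version = tuple(sys.version_info[:3]) if version is None else tuple(version)
    return version >= MIN_PYTHON


def has_module(name):
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_attention_backend():
    """Prefer flash-attn when it is installed; otherwise fall back to plain PyTorch."""
    return "flash-attn" if has_module("flash_attn") else "pytorch"


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Attention backend:", pick_attention_backend())
```

Checking with `find_spec` avoids importing flash-attn itself, so the check does not fail on a machine where the package is installed but no compatible GPU is present.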
Pretrained Model Zoo
The scGPT Model Zoo includes a collection of pretrained models tailored to different cellular contexts. These models are available for download and include recommendations based on the scope of application, whether general human cells or specific organ-related datasets like heart or kidney cells.
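After downloading a checkpoint, it can help to confirm the folder is complete before loading it. The sketch below assumes a checkpoint folder contains `args.json`, `vocab.json`, and `best_model.pt`, and that `vocab.json` maps gene tokens to integer ids; these names and the helper `inspect_checkpoint` are assumptions for illustration, so verify them against the files you actually downloaded.

```python
import json
from pathlib import Path

# Assumed file names inside a downloaded checkpoint folder.
EXPECTED_FILES = ("args.json", "vocab.json", "best_model.pt")


def inspect_checkpoint(model_dir):
    """Report missing files and summarize the JSON files that are present."""
    model_dir = Path(model_dir)
    summary = {
        "missing": [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
    }

    vocab_path = model_dir / "vocab.json"
    if vocab_path.is_file():
        with vocab_path.open() as fh:
            vocab = json.load(fh)  # assumed: gene token -> integer id
        summary["vocab_size"] = len(vocab)

    args_path = model_dir / "args.json"
    if args_path.is_file():
        with args_path.open() as fh:
            summary["config_keys"] = sorted(json.load(fh))  # names of saved settings

    return summary
```

Calling `inspect_checkpoint("path/to/checkpoint")` returns a dictionary whose `missing` list should be empty for a complete download.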
Fine-Tuning and Customization
Guidelines and example scripts are provided for users interested in fine-tuning models for specific scRNA-seq integration tasks. The project encourages customization and adaptation to various research needs.
Development and Contributions
The scGPT project is open to contributions from the community. Researchers and developers are encouraged to submit improvements or report issues. Active development is ongoing, with plans to release new features and expand the model's applicability, including publishing to the HuggingFace model hub.
Acknowledgements and Citing
scGPT acknowledges several open-source projects, such as flash-attention and scvi-tools, that were essential to its development. Researchers using scGPT are encouraged to cite the foundational article published on bioRxiv.
With scGPT, researchers can apply generative AI to single-cell omics data, with the aim of better understanding cellular functions and disease mechanisms.