Enformer - Pytorch
Enformer - Pytorch is a PyTorch implementation of DeepMind's attention network for predicting gene expression. Originally released in TensorFlow, this port brings the model to PyTorch users and includes tools for fine-tuning pretrained models for downstream applications.
Overview
Enformer is a model developed by DeepMind to improve the prediction of gene expression from DNA sequence. At its core, it uses self-attention to integrate long-range interactions across the genome. The ability to predict gene expression has significant implications for genetics, biology, and medicine, and this PyTorch implementation makes the model accessible to a broader audience of researchers and developers.
Installation
Setting up Enformer in your PyTorch environment is straightforward. Install the package via pip:
$ pip install enformer-pytorch
Usage
The following example imports the necessary packages, instantiates a model from hyperparameters, and runs a random input sequence:
import torch
from enformer_pytorch import Enformer
model = Enformer.from_hparams(
    dim=1536,
    depth=11,
    heads=8,
    output_heads=dict(human=5313, mouse=1643),
    target_length=896,
)
seq = torch.randint(0, 5, (1, 196_608))  # random sequence indices for demonstration
output = model(seq)
output['human']  # (1, 896, 5313) tensor of predicted tracks for human
output['mouse']  # (1, 896, 1643) tensor of predicted tracks for mouse
Advanced Features
Enformer also provides advanced functionality, such as accepting one-hot encoded input, returning embeddings, and computing a training loss. These features add flexibility in research scenarios and facilitate fine-tuning for specific tasks.
One-Hot Encoding
You can input sequences using one-hot encoding, which represents nucleotides (A, C, G, T, N) as binary vectors:
import torch
from enformer_pytorch import seq_indices_to_one_hot

seq = torch.randint(0, 5, (1, 196_608))
one_hot = seq_indices_to_one_hot(seq)  # (1, 196608, 4)
output = model(one_hot)  # model instantiated as in the Usage section above
Embeddings and Loss
For fine-tuning and further analysis, you can extract embeddings:
output, embeddings = model(one_hot, return_embeddings=True)  # embeddings: (1, 896, 3072)
Compute the loss during training by passing the head name along with a target tensor:
target = torch.randn(1, 896, 5313)  # example targets for the human head
loss = model(seq, head='human', target=target)
loss.backward()
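As a hypothetical example of how the embeddings might be used (the probe below is not part of the package), a simple linear layer could map the 3072-dimensional embeddings to a handful of new tracks without updating the trunk:

from torch import nn

probe = nn.Linear(3072, 10)  # hypothetical probe: embeddings -> 10 new tracks

_, embeddings = model(seq, return_embeddings=True)  # embeddings: (1, 896, 3072)
preds = probe(embeddings.detach())  # (1, 896, 10); no gradients flow into the trunk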
Pretrained Models
Leverage pretrained models to kickstart your experiments with Enformer. These models are available through the HuggingFace Hub:
from enformer_pytorch import from_pretrained
enformer = from_pretrained('EleutherAI/enformer-official-rough')
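The loader also accepts overrides for certain hyperparameters, which is useful when fine-tuning on shorter sequences; exact argument support may vary by version:

enformer = from_pretrained(
    'EleutherAI/enformer-official-rough',
    target_length=128,  # predict over fewer output bins
    dropout_rate=0.1,
)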
Fine-Tuning
Tailor Enformer to your specific requirements through fine-tuning. This repository includes comprehensive methods for adapting the model:
- Fine-tune on new tracks.
- Incorporate contextual data, such as cell types.
- Utilize attention aggregation for enhanced predictions.
The examples provide a foundation for adding new features or adapting the model to your data; a sketch of the first approach follows.
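As a minimal sketch of fine-tuning on new tracks, assuming the HeadAdapterWrapper from the repository's finetune module (wrapper names and arguments may differ across versions):

import torch
from enformer_pytorch import from_pretrained
from enformer_pytorch.finetune import HeadAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')

# wrap the pretrained trunk with a freshly initialized head for 128 new tracks
model = HeadAdapterWrapper(
    enformer=enformer,
    num_tracks=128,
)

seq = torch.randint(0, 5, (1, 196_608))  # input sequence indices
target = torch.randn(1, 896, 128)  # placeholder targets for the new tracks

loss = model(seq, target=target)  # returns the training loss when a target is given
loss.backward()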
Data Handling
The package also aids in handling genomic data, such as sequences fetched from .bed files. It includes helpful tools such as GenomeIntervalDataset to process sequences efficiently:
from enformer_pytorch import GenomeIntervalDataset

ds = GenomeIntervalDataset(
    bed_file='./sequences.bed',  # intervals to fetch
    fasta_file='./hg38.ml.fa',  # reference genome
    context_length=196_608,
)
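Assuming the BED and FASTA paths above point to real files, a fetched sequence (one-hot encoded by default) can then be run straight through a pretrained model:

from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough')

seq = ds[0]  # one sequence from the dataset
output = enformer(seq.unsqueeze(0))  # add a batch dimension before the forward pass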
Acknowledgments
Enformer - Pytorch was made possible by resources from EleutherAI and contributions from community members, who helped verify the accuracy of its gene expression predictions.
Future Work
The project has ongoing goals, including improved training utilities and closer numerical agreement with the official TensorFlow model. As Enformer continues to evolve, it aims to remain a useful tool for genomic research and applications.
Citations
The work behind Enformer is credited to the original researchers: Avsec et al., "Effective gene expression prediction from sequence by integrating long-range interactions," Nature Methods (2021).