Enformer - Pytorch
Enformer - Pytorch is a PyTorch implementation of DeepMind's attention network for predicting gene expression. Originally released in TensorFlow, this port brings the model to PyTorch users and includes tools for fine-tuning pretrained models for downstream applications.
Overview
Enformer is a model developed by DeepMind to improve the prediction of gene expression from DNA sequence. At its core, it uses self-attention to integrate long-range interactions across the genome. The ability to predict gene expression has significant implications for genetics, biology, and medicine, and this PyTorch implementation makes the model accessible to a broader audience of researchers and developers.
Installation
Setting up Enformer in your PyTorch environment is straightforward. Install the package via pip:
$ pip install enformer-pytorch
Usage
The following example imports the necessary packages, instantiates a model from hyperparameters, and runs a random input sequence:
import torch
from enformer_pytorch import Enformer
model = Enformer.from_hparams(
    dim=1536,
    depth=11,
    heads=8,
    output_heads=dict(human=5313, mouse=1643),
    target_length=896,
)
seq = torch.randint(0, 5, (1, 196_608))  # random sequence indices for demonstration
output = model(seq)
output['human']  # (1, 896, 5313) tensor of predicted tracks for human
output['mouse']  # (1, 896, 1643) tensor of predicted tracks for mouse
Advanced Features
Enformer also provides advanced functionality, such as accepting one-hot encoded input, returning embeddings, and computing a training loss. These features add flexibility in research scenarios and facilitate fine-tuning for specific tasks.
One-Hot Encoding
You can input sequences using one-hot encoding, which represents nucleotides (A, C, G, T, N) as binary vectors:
import torch
from enformer_pytorch import seq_indices_to_one_hot

seq = torch.randint(0, 5, (1, 196_608))
one_hot = seq_indices_to_one_hot(seq)  # (1, 196608, 4)
output = model(one_hot)  # model instantiated as in the Usage section above
Embeddings and Loss
For fine-tuning and further analysis, you can extract embeddings:
output, embeddings = model(one_hot, return_embeddings=True)  # embeddings: (1, 896, 3072)
Compute the loss during training by passing the head name along with a target tensor:
target = torch.randn(1, 896, 5313)  # example targets for the human head
loss = model(seq, head='human', target=target)
loss.backward()
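As a hypothetical example of how the embeddings might be used (the probe below is not part of the package), a simple linear layer could map the 3072-dimensional embeddings to a handful of new tracks without updating the trunk:

from torch import nn

probe = nn.Linear(3072, 10)  # hypothetical probe: embeddings -> 10 new tracks

_, embeddings = model(seq, return_embeddings=True)  # embeddings: (1, 896, 3072)
preds = probe(embeddings.detach())  # (1, 896, 10); no gradients flow into the trunk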
Pretrained Models
Leverage pretrained models to kickstart your experiments with Enformer. These models are available through the HuggingFace Hub:
from enformer_pytorch import from_pretrained
enformer = from_pretrained('EleutherAI/enformer-official-rough')
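The loader also accepts overrides for certain hyperparameters, which is useful when fine-tuning on shorter sequences; exact argument support may vary by version:

enformer = from_pretrained(
    'EleutherAI/enformer-official-rough',
    target_length=128,  # predict over fewer output bins
    dropout_rate=0.1,
)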
Fine-Tuning
Tailor Enformer to your specific requirements through fine-tuning. This repository includes comprehensive methods for adapting the model:
- Fine-tune on new tracks.
- Incorporate contextual data, such as cell types.
- Utilize attention aggregation for enhanced predictions.
The examples provide a foundation for adding new features or adapting the model to your data; a sketch of the first approach follows.
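As a minimal sketch of fine-tuning on new tracks, assuming the HeadAdapterWrapper from the repository's finetune module (wrapper names and arguments may differ across versions):

import torch
from enformer_pytorch import from_pretrained
from enformer_pytorch.finetune import HeadAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')

# wrap the pretrained trunk with a freshly initialized head for 128 new tracks
model = HeadAdapterWrapper(
    enformer=enformer,
    num_tracks=128,
)

seq = torch.randint(0, 5, (1, 196_608))  # input sequence indices
target = torch.randn(1, 896, 128)  # placeholder targets for the new tracks

loss = model(seq, target=target)  # returns the training loss when a target is given
loss.backward()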
Data Handling
The package also aids in handling genomic data, such as sequences fetched from .bed files. It includes helpful tools such as GenomeIntervalDataset to process sequences efficiently:
from enformer_pytorch import GenomeIntervalDataset

ds = GenomeIntervalDataset(
    bed_file='./sequences.bed',  # intervals to fetch
    fasta_file='./hg38.ml.fa',  # reference genome
    context_length=196_608,
)
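Assuming the BED and FASTA paths above point to real files, a fetched sequence (one-hot encoded by default) can then be run straight through a pretrained model:

from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough')

seq = ds[0]  # one sequence from the dataset
output = enformer(seq.unsqueeze(0))  # add a batch dimension before the forward pass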
Acknowledgments
Enformer - Pytorch was made possible by resources from EleutherAI and contributions from community members, who helped verify the accuracy of its gene expression predictions.
Future Work
The project has ongoing goals, including improved training utilities and closer numerical agreement with the official TensorFlow model. As Enformer continues to evolve, it aims to remain a useful tool for genomic research and applications.
Citations
The work behind Enformer is credited to the original researchers: Avsec et al., "Effective gene expression prediction from sequence by integrating long-range interactions," Nature Methods (2021).