Evo: DNA Foundation Modeling from Molecular to Genome Scale
Evo is a biological foundation model for long-context DNA sequence modeling and design, from molecular to genome scale. It uses the StripedHyena architecture, which lets it model sequences at single-nucleotide, byte-level resolution while compute and memory scale efficiently as the context length grows.
What is Evo?
Evo has 7 billion parameters. It was trained on OpenGenome, a dataset of whole prokaryotic genomes containing around 300 billion tokens. Its capabilities and applications are described in a scientific paper and an accompanying blog post.
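To make byte-level, single-nucleotide resolution concrete, the sketch below maps each nucleotide to one token equal to its UTF-8 byte value. The function names are hypothetical, and this illustrates the idea only; it is not Evo's own tokenizer, which should be taken from the Evo package.

```python
def byte_tokenize(sequence: str) -> list[int]:
    """Map each character to its UTF-8 byte value, one token per nucleotide."""
    return list(sequence.encode("utf-8"))


def byte_detokenize(token_ids: list[int]) -> str:
    """Invert byte_tokenize for ASCII sequences such as DNA."""
    return bytes(token_ids).decode("utf-8")


tokens = byte_tokenize("ACGT")
print(tokens)                   # [65, 67, 71, 84]
print(byte_detokenize(tokens))  # ACGT
```

Under this scheme a sequence of N nucleotides always becomes N tokens, which is why context length can be read directly as sequence length in bases.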
Model Checkpoints
Evo offers different model checkpoints for various tasks:
evo-1-8k-base: Pretrained with a context length of 8,192. It serves as the base model for molecular-scale fine-tuning tasks.
evo-1-131k-base: Extended to a context length of 131,072, using evo-1-8k-base as its foundation. It supports sequence reasoning and generation at genome scale.
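One way to read the two context lengths is as a rule for choosing a checkpoint by input length. The helper below is a hypothetical sketch, not part of the Evo package; it encodes only the 8,192 and 131,072 limits listed above and assumes one token per nucleotide.

```python
# Context lengths as stated in the checkpoint list; the helper is hypothetical.
CHECKPOINT_CONTEXT = {
    "evo-1-8k-base": 8_192,
    "evo-1-131k-base": 131_072,
}


def pick_checkpoint(sequence_length: int) -> str:
    """Return the smallest-context checkpoint whose window fits the sequence."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    for name, context in sorted(CHECKPOINT_CONTEXT.items(), key=lambda kv: kv[1]):
        if sequence_length <= context:
            return name
    raise ValueError(
        f"{sequence_length} tokens exceeds the largest context "
        f"({max(CHECKPOINT_CONTEXT.values())})"
    )


print(pick_checkpoint(3_000))   # evo-1-8k-base
print(pick_checkpoint(50_000))  # evo-1-131k-base
```

Input length is only one criterion; the intended task (molecular-scale fine-tuning versus genome-scale generation) may matter more in practice.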
Updates and Revisions
The Evo team fixed a projection permutation issue that had degraded generation quality. Users loading the model through HuggingFace should use the updated revision, 1.1_fix.
Getting Started with Evo
Evo is built on the StripedHyena architecture and uses FlashAttention-2, which may not work on all GPU architectures. To get started, install the required Python libraries; PyTorch should be installed first to avoid dependency issues.
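A quick way to anticipate the GPU compatibility question is to compare the device's CUDA compute capability against a minimum. The sketch below assumes that FlashAttention-2 targets Ampere-or-newer NVIDIA GPUs (compute capability 8.0 and above); that threshold and the function name are assumptions for illustration and should be checked against the FlashAttention documentation for the installed version.

```python
# Assumption: FlashAttention-2 targets Ampere-or-newer NVIDIA GPUs,
# i.e. CUDA compute capability 8.0 and above.
MIN_CAPABILITY = (8, 0)


def likely_supports_flash_attention_2(capability: tuple[int, int]) -> bool:
    """Compare a (major, minor) compute capability with the assumed minimum."""
    return tuple(capability) >= MIN_CAPABILITY


print(likely_supports_flash_attention_2((8, 0)))  # True  (e.g. A100)
print(likely_supports_flash_attention_2((7, 5)))  # False (e.g. T4)
```

On a machine with PyTorch and a visible GPU, the tuple can be obtained from torch.cuda.get_device_capability(0) and passed to the function directly.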
Installation Instructions
Evo can be installed via pip:
pip install evo-model
Alternatively, users can clone the project from GitHub and install manually:
git clone https://github.com/evo-design/evo.git
cd evo/
pip install .
For protein sequence generation and analysis, additional software such as prodigal might be required. An environment file is provided for easy setup using the conda package manager.
Using Evo
To run Evo locally, download and run the model through the Python API. The following snippet loads a checkpoint and computes logits for a short sequence:
from evo import Evo
import torch

device = 'cuda:0'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)

with torch.no_grad():
    logits, _ = model(input_ids) # (batch, length, vocab)

print('Logits: ', logits)
print('Shape (batch, length, vocab): ', logits.shape)
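Logits of shape (length, vocab) for one sequence can be turned into a sequence log-likelihood, a common way to score a sequence under a causal language model. The sketch below uses plain Python lists and hypothetical function names so it runs without a GPU. It assumes the usual causal convention that the logits at position t score the token at position t + 1, and it is an illustration rather than the Evo package's own scoring code.

```python
import math


def log_softmax(row: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def sequence_log_likelihood(logits: list[list[float]], token_ids: list[int]) -> float:
    """Sum log P(token[t + 1] | tokens[: t + 1]) over the sequence.

    logits[t] scores the token at position t + 1, so the last row has no
    target and the first token is not scored.
    """
    if len(logits) != len(token_ids):
        raise ValueError("expected one row of logits per input token")
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy example: a vocabulary of 4 tokens with uniform logits everywhere.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_tokens = [0, 1, 2]
print(sequence_log_likelihood(toy_logits, toy_tokens))  # 2 * log(1/4), about -2.77
```

With real model output, the same computation is usually done on tensors, for example with torch.log_softmax(logits, dim=-1) followed by a gather over the shifted token ids.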
Integration with HuggingFace
Evo is compatible with the HuggingFace platform, allowing users to load and interact with the model seamlessly:
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'

model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
model_config.use_cache = True

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix"
)
Access via Together API
Evo can also be accessed through Together AI, which provides a web UI for generating DNA sequences interactively. For more technical or bulk operations, users can use the Together API, as shown in the example below:
import openai
import os

# Assumes the API key is stored in the TOGETHER_API_KEY environment variable.
client = openai.OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url='https://api.together.xyz',
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": ""
        },
        {
            "role": "user",
            "content": "ACGT", # Prompt with a sequence
        }
    ],
    model="togethercomputer/evo-1-131k-base",
    max_tokens=128,
    logprobs=True
)

print(
    chat_completion.choices[0].logprobs.token_logprobs,
    chat_completion.choices[0].message.content
)
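The per-token log-probabilities printed above can be summarized as a perplexity, which gives a single number for how confident the model was in a generated sequence. The sketch below is a hypothetical helper, not part of the Together API; it assumes the values are natural-log probabilities and uses made-up numbers in place of a real response.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Return exp of the negative mean log-probability per token."""
    scored = [lp for lp in token_logprobs if lp is not None]
    if not scored:
        raise ValueError("no log-probabilities to score")
    return math.exp(-sum(scored) / len(scored))


# Hypothetical values standing in for token_logprobs from an API response.
example = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(perplexity(example))  # about 2.52
```

A perplexity of 4 would correspond to a model that is, on average, as uncertain as a uniform choice among the four nucleotides.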
Academic Reference
For those referencing Evo in academic work, a preprint with detailed model information and findings is available. Here is how to cite it:
@article {nguyen2024sequence,
author = {Eric Nguyen and Michael Poli and Matthew G Durrant and Armin W Thomas and Brian Kang and Jeremy Sullivan and Madelena Y Ng and Ashley Lewis and Aman Patel and Aaron Lou and Stefano Ermon and Stephen A Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D Hsu and Brian L Hie},
title = {Sequence modeling and design from molecular to genome scale with Evo},
year = {2024},
doi = {10.1101/2024.02.27.582234},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/02/27/2024.02.27.582234},
journal = {bioRxiv}
}
Evo gives researchers a single model for analyzing and designing genetic sequences across scales, from individual molecules to whole genomes.