DeepVariant: A Deep Learning-Based Genomic Variant Caller
Introduction to DeepVariant
DeepVariant is a deep learning-based genomic variant caller. It takes aligned sequencing reads, identifies candidate variants, and classifies each candidate with a convolutional neural network. Results are reported in VCF or gVCF format, the standard file types for storing DNA sequence variation.
Key Features
Types of Supported Data
DeepVariant is capable of handling several types of sequencing data:
- Next-Generation Sequencing (NGS): For both whole genome and whole exome sequencing using Illumina technology.
- PacBio HiFi Sequencing: Specifically for high-fidelity data from PacBio sequencing.
- Oxford Nanopore Technologies (ONT): R10.4.1 Simplex and Duplex data.
- Hybrid Sequencing: Integrates data from both PacBio HiFi and Illumina for a comprehensive analysis.
- Complete Genomics Data: Supports analysis from Complete Genomics platforms.
Application for Various Genomic Studies
DeepVariant is designed for germline variant calling in diploid organisms. It does not support somatic variant calling or polyploid organisms (those with more than two copies of each chromosome), because its classifier assigns each candidate one of three diploid genotypes: homozygous reference, heterozygous, or homozygous alternate.
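To make the three genotype classes concrete, the sketch below labels a diploid GT value as it would appear in the sample column of a VCF file. The helper name `classify_genotype` is hypothetical and is not part of DeepVariant; it only illustrates the categories, and it rejects anything that is not diploid.

```python
import re

def classify_genotype(gt: str) -> str:
    """Label a diploid VCF GT string such as '0/1' or '1|1'.

    Hypothetical helper for illustration; not part of DeepVariant.
    """
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        return "no-call"
    a, b = (int(x) for x in alleles)
    if a == 0 and b == 0:
        return "homozygous reference"
    if a == b:
        return "homozygous alternate"
    # Covers 0/1 as well as two different alternate alleles such as 1/2.
    return "heterozygous"
```

A triploid call such as `0/1/1` raises `ValueError`, which mirrors the diploid-only scope described above.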
An Extension: DeepTrio
DeepTrio is built on DeepVariant and calls variants in trio datasets (two parents and their child). It supports similar sequencing data types, adds duo analysis (one parent and one child), and integrates with the GLnexus tool for merging VCF outputs.
Why Choose DeepVariant?
- Accuracy: DeepVariant has won awards in the precisionFDA challenges for its accuracy across sequencing technologies.
- Flexibility: Works out of the box across a range of sample types and sequencing quality, and can be adapted to non-human species.
- Ease of Use: Requires little filter tuning by the user.
- Cost-Effective: Runs efficiently on Google Cloud Platform, keeping costs manageable for both whole genome and whole exome data.
- Speed: Efficient runtime on 64-core CPU machines, with faster options available using hardware accelerators such as GPUs and TPUs.
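As an illustration of how light the filtering step mentioned under Ease of Use can be, here is a minimal sketch that keeps only VCF records at or above a minimum QUAL value. The function name, the example records, and the threshold of 20 are assumptions made for the example, not recommended settings.

```python
from typing import Iterable, Iterator

def filter_by_qual(vcf_lines: Iterable[str], min_qual: float) -> Iterator[str]:
    """Yield header lines unchanged, plus records whose QUAL >= min_qual."""
    for line in vcf_lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # the sixth VCF column is QUAL
        if qual != "." and float(qual) >= min_qual:
            yield line

# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t45.2\tPASS\t.",
    "chr1\t200\t.\tC\tT\t3.1\tPASS\t.",
]
kept = list(filter_by_qual(example, min_qual=20.0))
```

Here `kept` holds the two header lines and the record at position 100; the record with QUAL 3.1 is dropped.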
How DeepVariant Works
DeepVariant converts aligned sequencing reads (BAM or CRAM format) into pileup images: image-like arrays that encode the reads overlapping each candidate variant site. A convolutional neural network classifies each image, and the predicted genotypes are reported as variant calls.
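The idea of a pileup image can be sketched with a toy encoder. This is not DeepVariant's implementation: the real images carry more information per read, while this sketch assumes gapless reads, uses only two channels (base identity and mismatch against the reference), and picks arbitrary numeric base codes. The name `build_pileup` is hypothetical.

```python
from typing import List, Tuple

# Arbitrary codes chosen for this sketch; 0 means "no read coverage".
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def build_pileup(ref: str, reads: List[Tuple[int, str]], max_reads: int = 5):
    """Encode gapless reads over a reference window as two small matrices.

    reads: (start_offset_in_window, sequence) pairs; indels are ignored.
    Returns (bases, mismatches), each max_reads rows by len(ref) columns.
    """
    width = len(ref)
    bases = [[0] * width for _ in range(max_reads)]
    mismatches = [[0] * width for _ in range(max_reads)]
    for row, (start, seq) in enumerate(reads[:max_reads]):
        for i, base in enumerate(seq):
            col = start + i
            if 0 <= col < width:
                bases[row][col] = BASE_CODE.get(base, 0)
                mismatches[row][col] = int(base != ref[col])
    return bases, mismatches

ref = "ACGTAC"
reads = [(0, "ACGTAC"), (1, "CGAAC"), (2, "GAA")]
bases, mismatches = build_pileup(ref, reads, max_reads=4)
```

In this example two of the three reads carry an A where the reference has a T (window column 3), so the mismatch channel lights up in that column for rows 1 and 2. A stack of rows like this is the kind of pattern a classifier can learn to associate with a heterozygous site.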
Technical Setup
- System Requirements: A Unix-like operating system and Python 3.8.
- Official Deployment Methods: A Docker image for easy setup, pre-built binaries, or building from source on supported systems.
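For the Docker route, the sketch below assembles a typical invocation as an argument list. The image name, the `run_deepvariant` path, and the flags follow the project's published quick-start pattern as best I can tell, and the `--model_type` values in `MODEL_TYPES` are assumptions that should be checked against the documentation for the release in use. All paths and file names are placeholders, `<VERSION>` stands for a real release tag, and Complete Genomics data is left out because its model name is not given here.

```python
import shlex
from typing import List

# Assumed mapping from data type to --model_type; verify per release.
MODEL_TYPES = {
    "illumina_wgs": "WGS",
    "illumina_wes": "WES",
    "pacbio_hifi": "PACBIO",
    "ont_r104": "ONT_R104",
    "hybrid_pacbio_illumina": "HYBRID_PACBIO_ILLUMINA",
}

def docker_command(data_type: str, input_dir: str, output_dir: str,
                   ref: str, reads: str, version: str,
                   shards: int = 1) -> List[str]:
    """Build (but do not run) a docker command line for run_deepvariant."""
    model_type = MODEL_TYPES[data_type]  # KeyError for unknown data types
    return [
        "docker", "run",
        "-v", f"{input_dir}:/input",    # mount inputs into the container
        "-v", f"{output_dir}:/output",  # mount a directory for results
        f"google/deepvariant:{version}",
        "/opt/deepvariant/bin/run_deepvariant",
        f"--model_type={model_type}",
        f"--ref=/input/{ref}",
        f"--reads=/input/{reads}",
        "--output_vcf=/output/output.vcf.gz",
        "--output_gvcf=/output/output.g.vcf.gz",
        f"--num_shards={shards}",
    ]

cmd = docker_command("illumina_wgs", "/data/in", "/data/out",
                     "ref.fasta", "sample.bam", version="<VERSION>")
print(shlex.join(cmd))  # shell-quoted form, ready to review before running
```

Building the command as a list rather than a single string keeps quoting explicit and makes it easy to pass to `subprocess.run` once the values have been checked.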
Community and Contributions
DeepVariant is an open-source project that welcomes contributions and feedback from the community. External pull requests are difficult to merge directly because of the project's existing infrastructure, but contributions are reviewed and acknowledged in the release notes.
Acknowledgements
DeepVariant builds on several open-source libraries and tools, including TensorFlow and Nucleus.
Disclaimer
DeepVariant is not classified as a medical device and is not intended for clinical use. It is a research tool for the genomic research community.
DeepVariant continues to evolve as a deep learning-based approach to genomic variant identification.