Introduction to DeepSeek-VL
DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding. It has general multimodal understanding capabilities and can process logical diagrams, web pages, scientific literature, and natural images, as well as handle embodied intelligence tasks in complex scenarios.
The accompanying paper, "DeepSeek-VL: Towards Real-World Vision-Language Understanding," is a collaborative effort by Haoyu Lu, Wen Liu, Bo Zhang, and colleagues working to advance multimodal AI.
Release Overview
DeepSeek-VL was released in March 2024 as a family of models covering different use cases: 1.3B and 7B parameter variants, each available in base and chat versions. The release is intended to support both academic research and commercial applications.
A demonstration of the DeepSeek-VL-7B model is live on Hugging Face, offering users an opportunity to interact with the model and witness its capabilities firsthand.
Model Downloads
The DeepSeek-VL model weights are publicly available for download. The released models are:
- DeepSeek-VL-1.3B-base and DeepSeek-VL-1.3B-chat
- DeepSeek-VL-7B-base and DeepSeek-VL-7B-chat
All four models are hosted on Hugging Face, and each supports a sequence length of up to 4096 tokens.
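If you prefer to fetch a checkpoint ahead of time rather than letting transformers download it at load time, the huggingface_hub library can do so. The snippet below is a minimal sketch, assuming huggingface_hub is installed; the repository IDs are the ones listed above:
from huggingface_hub import snapshot_download

# Download the 1.3B chat checkpoint into the local Hugging Face cache
# (swap in any of the repository IDs listed above).
local_dir = snapshot_download(repo_id="deepseek-ai/deepseek-vl-1.3b-chat")
print(local_dir)  # local path containing the downloaded model files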
Quick Start Guide
Installation
To install DeepSeek-VL, make sure your environment uses Python 3.8 or later, then run the following command from the root of the repository to install the necessary dependencies:
pip install -e .
Inference Example
Below is a simple Python script that runs inference with the DeepSeek-VL-7B-chat model:
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images

# Specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Single-turn conversation: <image_placeholder> marks where the image is inserted
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.jpg"],
    },
    {"role": "Assistant", "content": ""},
]

# Load the images and batch the multimodal inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
).to(vl_gpt.device)

# Run the image encoder to obtain the combined image-and-text embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# Generate the response with the language model
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
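The script prints the formatted prompt (the sft_format string prepared by the processor) followed by the model's answer. To try the lighter 1.3B chat model, changing the model path should be sufficient, assuming the rest of the script stays the same:
model_path = "deepseek-ai/deepseek-vl-1.3b-chat"  # smaller model, lower GPU memory footprint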
Using Gradio for Demonstrations
To run a quick web demo, first install the Gradio dependencies:
pip install -e .[gradio]
Then launch the demo:
python deepseek_vl/serve/app_deepseek.py
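Once the app has started, Gradio prints a local URL to the console; open it in a browser to chat with the model through the web interface.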
Licensing
The code in this repository is released under the MIT License, while use of the DeepSeek-VL models is governed by a separate Model License. Under these terms, the models support commercial use.
How to Cite
For academic purposes, please cite DeepSeek-VL using the following BibTeX entry:
@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding},
      author={Haoyu Lu and Wen Liu and Bo Zhang and others},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
Contact
For further questions, please open an issue or contact the team by email. The DeepSeek-VL team welcomes feedback and collaboration from the community.