Introduction to DeepSeek-VL
DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding. It has general multimodal understanding capabilities and can process logical diagrams, web pages, scientific literature, and natural images, as well as handle embodied intelligence tasks in complex scenarios.
The accompanying paper, "DeepSeek-VL: Towards Real-World Vision-Language Understanding," is a collaborative effort by Haoyu Lu, Wen Liu, Bo Zhang, and colleagues working to advance multimodal AI.
Release Overview
DeepSeek-VL was released in March 2024 as a family of models covering different use cases: 1.3B and 7B parameter variants, each available in base and chat versions. The release is intended to support both academic research and commercial applications.
A demonstration of the DeepSeek-VL-7B model is live on Hugging Face, offering users an opportunity to interact with the model and witness its capabilities firsthand.
Model Downloads
The DeepSeek-VL model weights are publicly available for download. The released models are:
- DeepSeek-VL-1.3B-base and DeepSeek-VL-1.3B-chat
- DeepSeek-VL-7B-base and DeepSeek-VL-7B-chat
All four models are hosted on Hugging Face, and each supports a sequence length of up to 4096 tokens.
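If you prefer to fetch a checkpoint ahead of time rather than letting transformers download it at load time, the huggingface_hub library can do so. The snippet below is a minimal sketch, assuming huggingface_hub is installed; the repository IDs are the ones listed above:
from huggingface_hub import snapshot_download

# Download the 1.3B chat checkpoint into the local Hugging Face cache
# (swap in any of the repository IDs listed above).
local_dir = snapshot_download(repo_id="deepseek-ai/deepseek-vl-1.3b-chat")
print(local_dir)  # local path containing the downloaded model files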
Quick Start Guide
Installation
To install DeepSeek-VL, make sure your environment uses Python 3.8 or later, then run the following command from the root of the repository to install the necessary dependencies:
pip install -e .
Inference Example
Below is a simple Python script that runs inference with the DeepSeek-VL-7B-chat model:
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images

# Specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Single-turn conversation: <image_placeholder> marks where the image is inserted
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.jpg"],
    },
    {"role": "Assistant", "content": ""},
]

# Load the images and batch the multimodal inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
).to(vl_gpt.device)

# Run the image encoder to obtain the combined image-and-text embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# Generate the response with the language model
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
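The script prints the formatted prompt (the sft_format string prepared by the processor) followed by the model's answer. To try the lighter 1.3B chat model, changing the model path should be sufficient, assuming the rest of the script stays the same:
model_path = "deepseek-ai/deepseek-vl-1.3b-chat"  # smaller model, lower GPU memory footprint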
Using Gradio for Demonstrations
To run a quick web demo, first install the Gradio dependencies:
pip install -e .[gradio]
Then launch the demo:
python deepseek_vl/serve/app_deepseek.py
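Once the app has started, Gradio prints a local URL to the console; open it in a browser to chat with the model through the web interface.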
Licensing
The code in this repository is released under the MIT License, while use of the DeepSeek-VL models is governed by a separate Model License. Under these terms, the models support commercial use.
How to Cite
For academic purposes, please cite DeepSeek-VL using the following BibTeX entry:
@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding},
      author={Haoyu Lu and Wen Liu and Bo Zhang and others},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
Contact
For further questions, please open an issue or contact the team by email. The DeepSeek-VL team welcomes feedback and collaboration from the community.