MPP-Qwen-Next: Multimodal Pipeline Parallel based on QwenLM
MPP-Qwen-Next builds a multimodal model on top of the Qwen language model, extending it from text-only input to images and video. Training relies on pipeline parallelism combined with data parallelism (PP+DP), so the full model can be pretrained and fine-tuned across multiple GPUs.
News
Over the past months, several key updates have been made to the project:
- June 2024: Released the open-source MPP-Qwen-Next supervised fine-tuning (SFT) weights (15 GB), available via ModelScope and a Baidu Cloud link.
- June 2024: Added LLaVA's multi-turn dialogue SFT data and VideoChatGPT's 100k SFT data, enabling image multi-turn dialogue, video dialogue, and an emergent ability to converse about multiple images.
- May 2024: Code support expanded to include multi-turn dialogue SFT, video SFT, and multi-image SFT.
- April 2024: Improved dialogue quality, with support for multi-GPU inference and chat template adjustments.
Framework
The model follows a BLIP-2-style design: a ViT image encoder and Q-Former (taken from Lavis) convert visual inputs into tokens that are fed to the Qwen LLM, which then handles image Q&A, multi-turn image dialogue, and video dialogue. Training is implemented with DeepSpeed pipeline parallelism combined with data parallelism.
Features
- Image Single-turn Q&A: answer a single question about an image.
- Image Multi-turn Dialogue: hold an ongoing, multi-turn conversation about an image's content.
- Video Dialogue: answer questions and converse about video content.
- Multi-image Dialogue: although no multi-image data was used during training, the model shows an emergent ability to compare and discuss multiple images after video SFT.
Installation
Create and activate a new conda environment, then install the repository in editable mode:
conda create -n minigpt4qwen python=3.8 && conda activate minigpt4qwen
pip install -e .
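A quick way to confirm the editable install worked is to import the bundled package (assuming it installs under the name lavis, as the lavis/ paths elsewhere in this README suggest):

# Minimal sanity check for the editable install; assumes the repo installs a
# package named `lavis` (as suggested by the lavis/ paths used in this README).
import lavis
print(lavis.__file__)  # should point into your local clone of the repo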
Weight & Data Preparation
Place all required model weights and datasets under the cache directory, following the layout described in WEIGHT.md (weights) and DATA.md (data).
Inference
Before running inference, prepare the model weights as described in WEIGHT.md and download the SFT checkpoint (15 GB) from ModelScope or the Baidu Cloud link.
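If you take the ModelScope route, a sketch like the following can pull the checkpoint programmatically; the model ID below is a placeholder and must be replaced with the one given in WEIGHT.md. Once the weights are in place, run one of the commands below.

# Hypothetical sketch of downloading the SFT weights via ModelScope.
# Replace the placeholder model ID with the one listed in WEIGHT.md;
# snapshot_download ships with the modelscope package (pip install modelscope).
from modelscope import snapshot_download

local_dir = snapshot_download(
    "<modelscope-id-from-WEIGHT.md>",  # placeholder, not a real model ID
    cache_dir="cache",                 # keep weights under the cache/ directory used by this repo
)
print(local_dir)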
- Single-GPU Inference:
python cli_demo.py --model-type qwen7b_chat -c lavis/output/pp_7b_video/sft_video/global_step2005/unfreeze_llm_model.pth
- Multi-GPU Inference using device map:
python cli_demo.py --model-type qwen7b_chat -c lavis/output/pp_7b_video/sft_video/global_step2005/unfreeze_llm_model.pth --llm_device_map "auto"
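The --llm_device_map "auto" option presumably forwards a Hugging Face-style device_map to the underlying LLM so that its layers are sharded across the visible GPUs. As an illustration of that mechanism only (not this repo's actual loading code), a standalone Transformers load looks like:

# Illustration of device_map="auto" sharding in Hugging Face Transformers.
# This is an assumption about what --llm_device_map ultimately controls,
# not a copy of this repository's loading code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",          # shard layers across all visible GPUs
    trust_remote_code=True,     # Qwen's modeling code lives in the model repo
)
print(model.hf_device_map)      # shows which layer landed on which device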
Pipeline Parallel Training (PP+DP)
Training combines pipeline parallelism (PP), which splits the model into stages across GPUs, with data parallelism (DP), which replicates that pipeline across groups of GPUs. The --num-stages argument controls how many pipeline stages each replica is split into.
Pretrain
Configure and run the pretrain script:
python -m torch.distributed.run --nproc_per_node=8 train_pipeline.py --cfg-path lavis/projects/pp_qwen7b_video/pretrain.yaml --num-stages 2
SFT
Configure and run the SFT script:
python -m torch.distributed.run --nproc_per_node=8 train_pipeline.py --cfg-path lavis/projects/pp_qwen7b_video/sft.yaml --num-stages 8
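In both commands, the 8 launched processes are divided between the two forms of parallelism: each pipeline replica spans --num-stages GPUs, and the remaining factor becomes the data-parallel degree. A small sketch of that arithmetic (DeepSpeed performs the actual process grouping at runtime):

# Back-of-the-envelope view of how PP and DP share the GPUs in the commands above.
# DeepSpeed's pipeline engine does the real grouping; this only shows the arithmetic.
def pp_dp_layout(world_size: int, num_stages: int) -> str:
    assert world_size % num_stages == 0, "world size must be divisible by the stage count"
    dp_replicas = world_size // num_stages
    return f"{world_size} GPUs -> {dp_replicas} pipeline replica(s), each split into {num_stages} stage(s)"

print(pp_dp_layout(8, 2))  # pretrain: 8 GPUs -> 4 replicas x 2 stages
print(pp_dp_layout(8, 8))  # SFT:      8 GPUs -> 1 replica  x 8 stages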
Acknowledgement
The project utilizes frameworks and data from various sources:
- Lavis for core scaffolding and use of components like BLIP2's ViT and Q-former.
- QwenLM for its language modeling.
- DeepSpeed for facilitating efficient multi-GPU operations and pipeline parallelism.
- LLaVA and VideoChatGPT for training paradigms and datasets.
License
This project is primarily based on Lavis, which is released under the BSD 3-Clause License, and incorporates Qwen-7B-Chat, which is licensed for research and development under its own terms.
MPP-Qwen-Next brings the components and data from the projects above together into a single multimodal training and inference stack built around pipeline-parallel fine-tuning of Qwen.