VisCPM: A Bilingual Multimodal Model Series
VisCPM is an open-source series of large multimodal models that can both understand and generate content across visual and linguistic modalities. It supports bilingual dialogue in Chinese and English and achieves strong performance among open-source multimodal models, especially Chinese ones. The series builds on the 10-billion-parameter CPM-Bee language model, pairing it with a visual encoder (Muffin) for image understanding and a visual decoder (Diffusion-UNet) for image generation, so the same language backbone can process both visual inputs and visual outputs.
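To make this composition concrete, here is a minimal, illustrative PyTorch sketch of how a shared language backbone can be bridged to a visual encoder and a diffusion decoder. This is not VisCPM's actual code: the class and method names (BilingualMultimodalModel, understand, generate_image) are hypothetical, and the real wiring is more involved.

```python
import torch.nn as nn

class BilingualMultimodalModel(nn.Module):
    """Hypothetical sketch: one language backbone shared by vision input/output paths."""

    def __init__(self, language_model, visual_encoder, visual_decoder):
        super().__init__()
        self.language_model = language_model  # e.g. a CPM-Bee-style bilingual LM
        self.visual_encoder = visual_encoder  # e.g. a Muffin-style image encoder
        self.visual_decoder = visual_decoder  # e.g. a Diffusion-UNet image decoder

    def understand(self, image, text_ids):
        # Image understanding: encode the image into token-like embeddings and
        # let the language model attend over them together with the text.
        vision_tokens = self.visual_encoder(image)
        return self.language_model(vision_tokens, text_ids)

    def generate_image(self, text_ids):
        # Image generation: condition the diffusion decoder on the language
        # model's representation of the text prompt (text-only forward pass).
        text_features = self.language_model(None, text_ids)
        return self.visual_decoder(text_features)
```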
Key Features of VisCPM
- Bilingual Multimodal Dialogues: The VisCPM-Chat model supports interactive dialogue grounded in visual content, drawing on strong visual and language understanding.
- Text-to-Image Generation: VisCPM-Paint generates images from text descriptions in either language.
- Strong Bilingual Performance: Built on the bilingual capabilities of CPM-Bee, VisCPM transfers multimodal skills learned from predominantly English pre-training data to Chinese.
Technical Details
VisCPM comprises two main models:
VisCPM-Chat
VisCPM-Chat is designed for interactive dialogue about images. It is trained in two stages:
- Pretraining: Uses over 100 million high-quality English text-image pairs to align visual and linguistic representations. The language model's parameters stay frozen; only the visual encoder is updated.
- Instruction Tuning: Fine-tunes on English instruction data together with machine-translated Chinese instruction data to align the model with user intent across both languages and a range of scenarios (a minimal sketch of this recipe follows the list).
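The sketch below illustrates this two-stage recipe under stated assumptions: the data loaders, the caption_loss and instruction_loss helpers, and the hyperparameters are hypothetical, and the exact set of trainable parameters varies by release. Only the freeze-the-LM-then-instruction-tune structure is taken from the description above.

```python
import torch

def align_pretraining(model, pair_loader, steps, lr=1e-4):
    # Stage 1: align vision and language on English image-text pairs.
    # The language model is frozen; only the visual encoder is updated.
    for p in model.language_model.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(model.visual_encoder.parameters(), lr=lr)
    for _, (image, text_ids) in zip(range(steps), pair_loader):
        loss = model.caption_loss(image, text_ids)  # hypothetical helper
        loss.backward()
        opt.step()
        opt.zero_grad()

def instruction_tuning(model, instruct_loader, steps, lr=2e-5):
    # Stage 2: fine-tune on English plus machine-translated Chinese
    # instruction data to align responses with user intent.
    for p in model.parameters():
        p.requires_grad = True  # assumption: the trainable subset varies by release
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, (image, prompt_ids, answer_ids) in zip(range(steps), instruct_loader):
        loss = model.instruction_loss(image, prompt_ids, answer_ids)  # hypothetical helper
        loss.backward()
        opt.step()
        opt.zero_grad()
```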
Performance evaluations show strong capability in both Chinese and English, with the model performing well in open-domain dialogue, detailed image description, and complex reasoning. Two versions of VisCPM-Chat are available: balance, optimized for balanced performance across the two languages, and zhplus, optimized for enhanced Chinese capability.
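As a quick-start illustration, the sketch below follows the interface published in the VisCPM repository; the checkpoint path is a placeholder, and the class and argument names (VisCPMChat, image_safety_checker, chat) may differ across releases, so treat them as assumptions.

```python
from PIL import Image
from VisCPM import VisCPMChat

# Load a downloaded checkpoint (path is a placeholder).
viscpm_chat = VisCPMChat('/path/to/checkpoint', image_safety_checker=True)

image = Image.open('example.png').convert('RGB')
question = 'Describe this image in detail.'  # Chinese questions work the same way

# chat() returns the answer plus conversational state for multi-turn use.
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)
print(answer)
```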
VisCPM-Paint
VisCPM-Paint generates images from text prompts in both Chinese and English. The language model's parameters remain frozen during training, while the diffusion-based visual decoder (a UNet) is trained to condition image generation on the language model's text representations, enabling complex, high-quality image synthesis.
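A corresponding text-to-image sketch, again following the repository's published interface; the path is a placeholder, and the class and option names (VisCPMPaint, prompt_safety_checker, add_ranker) reflect the project's examples and may vary by release, so treat them as assumptions.

```python
from VisCPM import VisCPMPaint

# Load a downloaded checkpoint (path is a placeholder). Safety checkers and
# a reranker over candidate images can reportedly be toggled at load time.
painter = VisCPMPaint('/path/to/checkpoint',
                      image_safety_checker=True,
                      prompt_safety_checker=True,
                      add_ranker=True)

image = painter.generate('A lone fishing boat on a misty lake at dawn')  # Chinese prompts also supported
image.save('viscpm_paint_result.png')
```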
Recent Updates and Developments
The project continues to evolve, and recent releases have expanded its functionality and accessibility:
- API Accessibility: Users can now access VisCPM-Chat through newly released APIs.
- Enhanced Model Versions: Newer models in the series, such as OmniLMM, aim to improve dialogue quality and efficiency across diverse platforms, including mobile and edge devices.
- Academic Recognition: The VisCPM research paper has been accepted at ICLR 2024.
In sum, VisCPM reflects the rapid progress of multimodal AI, opening new possibilities for cross-lingual communication and creative expression across media. Whether for research, personal exploration, or application development, it offers a versatile, openly available toolset.