Introduction to the Dolphin Project
Overview
Dolphin is a video interaction platform under development that uses large language models to orchestrate its capabilities. It is designed as a chatbot that can understand, process, and generate video content. The project is led by researchers from Beihang University and Nanyang Technological University, who welcome community feedback as it evolves.
Key Features
The Dolphin platform is built with diverse video-related capabilities that fall into three main categories: video understanding, video processing, and video generation. Here’s a closer look at these features:
Video Understanding
- Dolphin can answer natural-language questions about a video's content, letting users query and explore video material conversationally; a rough sketch of one common way to approximate this appears below.
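Dolphin's internal question-answering pipeline is not spelled out here, but a common pattern is to sample frames, caption them with an image captioner, and let the LLM answer over the captions. In the sketch below, the BLIP checkpoint, the file path, and the five-frame sampling scheme are all illustrative assumptions:

```python
from moviepy.editor import VideoFileClip
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# The captioner here is an assumption for illustration; Dolphin may
# use a different video/image model internally.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

clip = VideoFileClip("input.mp4")  # placeholder path

# Sample a handful of evenly spaced frames and caption each one;
# an LLM can then answer questions over the resulting captions.
captions = []
num_samples = 5
for i in range(num_samples):
    t = clip.duration * i / num_samples
    frame = Image.fromarray(clip.get_frame(t))
    inputs = processor(images=frame, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    captions.append(f"t={t:.1f}s: " + processor.decode(out[0], skip_special_tokens=True))

print("\n".join(captions))
```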
Video Processing
- Basic processing, built on moviepy, covers trimming clips, adding subtitles, and extracting or adding audio tracks; a minimal moviepy sketch follows below. Dolphin can also transform a video into pose, depth, or Canny-edge representations.
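As a point of reference for the basic operations above, here is a minimal moviepy sketch. This is not Dolphin's internal code; the file names, trim window, and caption text are placeholders:

```python
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

clip = VideoFileClip("input.mp4").subclip(5, 15)   # trim to seconds 5-15
clip.audio.write_audiofile("extracted_audio.wav")  # extract the audio track

# Burn a single caption into the clip (TextClip requires ImageMagick)
caption = (TextClip("Hello from Dolphin", fontsize=40, color="white")
           .set_position(("center", "bottom"))
           .set_duration(clip.duration))
CompositeVideoClip([clip, caption]).write_videofile("output.mp4")
```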
Video Generation
- The platform supports generating video from a text prompt, conditioning generation on pose or depth inputs, and performing video-to-video translation akin to pix2pix; see the sketch below.
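To give a flavor of the text-to-video path, here is a minimal sketch using the diffusers library. The ModelScope checkpoint, prompt, and step count are assumptions, and Dolphin's actual generation backbone may differ:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Checkpoint choice is an assumption for illustration
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

result = pipe("a dolphin leaping over ocean waves at sunset", num_inference_steps=25)
frames = result.frames  # on recent diffusers versions this may be result.frames[0]
export_to_video(frames, "generated.mp4")
```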
Demo and Development Updates
A demo video showcasing Dolphin's capabilities is available on YouTube. On May 6, 2023, the project saw a significant update with the release of its code and an online demo reflecting the practical applications of its features.
Getting Started
For those interested in exploring or contributing to Dolphin, a quick start guide walks through setting up the environment: users create a conda environment, clone the repository, and install the dependencies (illustrative commands below). Dolphin also offers flexible hardware configuration, letting users decide which tasks run on the CPU and which on the GPU.
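The commands below are an illustrative quick start assuming a standard conda workflow; the repository URL, Python version, and requirements file name follow common convention, so defer to the project README for the exact steps:

```
# Illustrative setup; details may differ from the official instructions
conda create -n dolphin python=3.8
conda activate dolphin
git clone https://github.com/kaleido-lab/dolphin.git
cd dolphin
pip install -r requirements.txt
```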
GPU Usage
Dolphin's various video models have specific GPU memory requirements, offering everything from video captioning to complex transformations like pose or canny to text-to-video. A detailed GPU memory usage table is available for users to plan resource allocation effectively.
Expandability and Future Plans
The Dolphin framework is designed with expansion in mind, inviting users to integrate additional video models or large language models. Developers add new functionality by implementing a model wrapper and registering it in the configuration, as sketched below, so the platform can grow and adapt over time.
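Below is a minimal sketch of what registering a new video tool might look like. Dolphin's real extension API is not documented here; the `prompts` decorator, the registration metadata, and the `inference` signature are assumptions modeled on similar LLM tool-use frameworks:

```python
def prompts(name, description):
    """Attach metadata the LLM controller uses to decide when to call a tool."""
    def decorator(func):
        func.name = name
        func.description = description
        return func
    return decorator

class VideoStabilizer:
    """Hypothetical new tool: remove camera shake from an input video."""

    def __init__(self, device="cpu"):
        self.device = device  # CPU/GPU placement chosen at load time

    @prompts(
        name="Stabilize Video",
        description="Useful when the user asks to remove camera shake. "
                    "Input: path to a video file. Output: path to the result.",
    )
    def inference(self, video_path):
        output_path = video_path.replace(".mp4", "_stable.mp4")
        # ... run the actual stabilization model here ...
        return output_path
```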
Future plans include a pre-trained unified video model with in-context learning, benchmarks for new video tasks, and extended services covering Gradio, Web, API, and Docker deployments.
Community and Contributions
Dolphin draws inspiration and tooling from prominent open-source projects such as Hugging Face, LangChain, and MoviePy, and the team credits that community. Users are encouraged to open GitHub issues with questions or to contact the developers directly by email.
Dolphin remains an evolving project, poised to become a comprehensive tool for video interaction, backed by robust academic and community support. For those using or studying the Dolphin platform, a citation is provided to acknowledge the project's contribution to the field.