Overview of Multimedia GPT
Multimedia GPT extends OpenAI's GPT models with vision and audio capabilities. Users can send multimedia inputs such as images, audio recordings, and PDF documents to the OpenAI API and receive responses in both text and image formats. Support for video input is under active development, which will further expand the project's multimedia handling.
Integration with Other Models
The project leverages several advanced models to support its functionalities. Beyond the foundational vision models included in Microsoft Visual ChatGPT, Multimedia GPT integrates OpenAI's Whisper for speech recognition and DALLE for image generation. Because these run through the OpenAI API, users do not need their own GPUs for speech recognition or image generation, although running the models locally remains an option.
The base chat model can be any OpenAI large language model (LLM), such as ChatGPT or GPT-4, with the default being text-davinci-003. Users can customize the project by forking it and adding models that meet their specific needs, for example with tools like llama_index, by creating new model classes and suitable runner methods.
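As a rough illustration of what adding a custom model class might look like, here is a minimal sketch; the class name, attributes, and the stub logic are all hypothetical and do not come from the project's code:

```python
# Hypothetical sketch of a custom tool class with an inference runner.
# The name/description metadata and the device argument mirror the
# general pattern described above; the sentiment logic is a toy stub
# standing in for a real model.
class SentimentTool:
    name = "Sentiment Analysis"
    description = "Classifies the sentiment of input text."

    def __init__(self, device: str = "cpu"):
        self.device = device  # where a real model's weights would live

    def inference(self, text: str) -> str:
        # A real tool would run a model here; this stub keys on wording.
        positive = {"good", "great", "love", "happy"}
        words = set(text.lower().split())
        return "positive" if words & positive else "negative"

tool = SentimentTool()
print(tool.inference("I love this fairy tale"))  # positive
```

A class structured this way can be registered alongside the existing vision and audio tools, with the `device` argument deciding where its weights are placed.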
Demonstrations
In practice, a demo shows Multimedia GPT processing an audio recording of a person narrating the tale of Cinderella, illustrating the system's ability to understand multimedia input and generate a meaningful response.
Setting Up and Starting Multimedia GPT
To use Multimedia GPT, users can clone the repository and set up an environment with conda. After installation, users must provide their private OpenAI API key to run the project. Multimedia GPT lets users assign each model to a GPU or CPU; users without GPUs can instead run Whisper and DALLE remotely through the API.
Here’s how one might start Multimedia GPT:
python multimedia_gpt.py --load ImageCaptioning_cpu,DALLE_cpu,Whisper_cpu
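The `--load` flag above lists Model_device pairs. A minimal sketch of how such a string could be parsed into model-to-device placements (the function name is illustrative, not the project's actual code):

```python
# Parse a "--load" specification like "ImageCaptioning_cpu,DALLE_cpu"
# into a mapping from model name to device. rpartition splits on the
# last underscore, so device strings such as "cuda:0" also work.
def parse_load(spec: str) -> dict[str, str]:
    placements = {}
    for item in spec.split(","):
        model, _, device = item.rpartition("_")
        placements[model] = device
    return placements

print(parse_load("ImageCaptioning_cpu,DALLE_cpu,Whisper_cpu"))
# {'ImageCaptioning': 'cpu', 'DALLE': 'cpu', 'Whisper': 'cpu'}
```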
Users can also choose which OpenAI LLM to use, as in the example command below:
python multimedia_gpt.py --llm text-davinci-003
This makes it easy to switch between backends, for example from the default text-davinci-003 to gpt-3.5-turbo (ChatGPT).
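The flag handling described above can be sketched with Python's argparse; only the flag names come from the commands shown, while the defaults and help text here are illustrative assumptions:

```python
import argparse

# Hypothetical reconstruction of the command-line interface.
parser = argparse.ArgumentParser(description="Multimedia GPT launcher sketch")
parser.add_argument("--llm", default="text-davinci-003",
                    help="OpenAI LLM backend, e.g. gpt-3.5-turbo")
parser.add_argument("--load", default="ImageCaptioning_cpu",
                    help="Comma-separated Model_device pairs to load")

# Simulate: python multimedia_gpt.py --llm gpt-3.5-turbo
args = parser.parse_args(["--llm", "gpt-3.5-turbo"])
print(args.llm)   # gpt-3.5-turbo
print(args.load)  # ImageCaptioning_cpu
```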
Future Plans
Although Multimedia GPT is an experimental project not intended for production deployment, it aims to explore and harness the potential of prompting. The roadmap includes support for OpenAI Whisper and DALLE as well as extracting key frames from videos. The team acknowledges existing limitations and is actively working on solutions to enhance the functionality of Multimedia GPT.
Overall, Multimedia GPT represents a promising leap forward in integrating language models with multimedia inputs, providing diverse and powerful tools for users to explore and innovate within the realm of AI-driven interactions.