#multimodal

keras-llm-robot
Keras-llm-robot uses the LangChain and FastChat frameworks behind a Streamlit UI for offline deployment of Hugging Face models, with features such as model integration, multimodal support, and customizations including quantization and fine-tuning. It also offers tools for retrieval, speech, and image recognition, plus environment setup guides for multiple operating systems, making it well suited to developers exploring AI model deployment.
data-juicer
Data-Juicer is a versatile platform that streamlines the processing of multimodal data for large language models, supporting text, image, audio, and video formats. Its integration with Alibaba Cloud's AI services supports data-model co-development, allowing swift iteration and refinement. With extensive features and flexible configurations, it improves data quality and processing efficiency, in line with leading industry practice.
wit
The WIT dataset offers 37.6 million image-text examples drawn from Wikipedia across 108 languages, designed for pretraining multimodal machine learning models. Its strengths include broad multilingual coverage, rich metadata, and a challenging real-world test set. By using images as a language-agnostic anchor, it helps bridge language barriers and improve text understanding across languages, supporting advances in multilingual and multimodal research. WIT is widely used in the research community and is available for download.
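A minimal loading sketch using the Hugging Face datasets library; the hub identifier "wikimedia/wit_base" is an assumption about how the mirror is published, so check the dataset card before relying on it:

```python
# pip install datasets
from datasets import load_dataset

# Hub identifier is an assumption; verify the dataset card for the official mirror.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

# Stream a single example and inspect its fields instead of downloading everything.
example = next(iter(wit))
print(sorted(example.keys()))
```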
VideoLLaMA2
Discover cutting-edge spatial-temporal modeling and audio-visual integration with VideoLLaMA2, a project offering advanced video and audio question answering. Recent updates include new checkpoints, and the project also covers multi-source video captioning, making it useful for researchers and developers building high-performance video understanding systems.
generative-ai-python
The Google AI Python SDK provides access to the Gemini API, with multimodal support spanning text, images, and code. Gemini models, developed by Google DeepMind, enable advanced integrations. Getting started is straightforward: obtain an API key from Google AI Studio and follow the SDK quickstart. Developers can draw on resources such as the Gemini API Cookbook and comprehensive tutorials for working with the models from Python. The package installs from PyPI and is backed by thorough documentation and community-driven open-source contributions.
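A minimal sketch of a text-plus-image request with the SDK, assuming the google-generativeai package from PyPI and a GOOGLE_API_KEY environment variable; the model name is only an example and may differ from what your key has access to:

```python
# pip install google-generativeai pillow
import os

import google.generativeai as genai
from PIL import Image

# Configure the client with an API key obtained from Google AI Studio.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Model name is an example; pick any Gemini model available to your account.
model = genai.GenerativeModel("gemini-1.5-flash")

# Multimodal prompt: plain text plus a PIL image in the same request.
image = Image.open("chart.png")
response = model.generate_content(["Summarize what this chart shows.", image])
print(response.text)
```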
VisCPM
VisCPM is a series of open-source bilingual (Chinese-English) multimodal models covering dialogue ('VisCPM-Chat') and image generation ('VisCPM-Paint'). Built on the CPM-Bee language model with added visual encoders and decoders, it delivers strong bilingual processing and particularly good Chinese-language capabilities. Aimed at research use, VisCPM continues to add features such as low-resource inference and web deployment, making it easier to adopt.
chatgpt-on-wechat
chatgpt-on-wechat is a versatile AI chatbot supporting platforms like WeChat Official Accounts, Enterprise WeChat, Lark, and DingTalk. It integrates multiple AI models such as GPT-3.5 and GPT-4, enabling text, voice, and image processing. With flexible deployment options, it features multi-round conversation memory, speech recognition, image generation, and extensible plugins, making it ideal for personalized enterprise AI solutions with a customizable knowledge base.
Visual-Chinese-LLaMA-Alpaca
VisualCLA is a Chinese multimodal language model that extends Chinese-LLaMA/Alpaca with an image encoder. Pre-trained on Chinese image-text pairs, it aligns visual and textual representations to strengthen multimodal understanding. It is then fine-tuned on a range of multimodal instruction datasets to improve instruction comprehension, execution, and dialogue involving complex instructions. Still in its testing phase, the project aims to refine performance on understanding and conversational tasks, and it ships inference code plus deployment scripts for Gradio/Text-Generation-WebUI. Released as the test version VisualCLA-7B-v0.1, it shows promising multimodal interaction and invites further exploration across diverse applications.
LLaVA
Investigate how visual instruction tuning is advancing large language-and-vision models toward GPT-4-level capabilities. LLaVA introduces refined techniques for integrating visual signals, improving performance on complex multimodal tasks. The LLaVA-NeXT release adds models built on LLaMA-3 and Qwen and reports strong zero-shot results on video tasks. The project also emphasizes community involvement, offering a comprehensive Model Zoo and a straightforward installation process, and continues to set strong marks on current benchmarks.
datacomp
This competition aims to design effective datasets for pre-training CLIP models, prioritizing dataset curation. Participants focus on achieving high accuracy in downstream tasks by selecting optimal image-text pairs, with a fixed model setup. The competition offers two tracks, allowing varying computational resources: one with a provided data pool and another that accepts additional external data. With scales from small to xlarge, it covers different computational demands. The project offers tools for downloading, selecting subsets, training, and evaluation to support flexible and robust participation.
Lumina-mGPT
Lumina-mGPT is a suite of cutting-edge multimodal autoregressive models specializing in text-to-image generation with precision and adaptability. Equipped with extensive training resources, Lumina-mGPT supports complex multimodal tasks through the xllmx module. Its functionality is demonstrated via local Gradio demos on image creation and interpretation. This open-source project serves as a resource for both research and practical applications, accommodating expanding model configurations within the AI field.
AnyGPT
AnyGPT is a versatile any-to-any model that handles speech, text, images, and music through discrete representations, enabling smooth conversion between modalities. Trained with the AnyInstruct dataset, it supports tasks such as text-to-image and text-to-speech and demonstrates generative training over highly compressed discrete representations, unlocking capabilities beyond traditional text-only models.
towhee
Towhee enhances unstructured data processing by leveraging LLM-based orchestration, converting text, images, audio, and video into efficient database-ready formats such as embeddings. It supports multiple data modalities and provides comprehensive models across CV, NLP, and additional fields. Offering prebuilt ETL pipelines and efficient backend support using Triton Inference Server, Towhee's Pythonic API allows for the easy development of custom data workflows. Streamline data operations for production environments with Towhee's adaptable and scalable technology.
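A minimal pipeline sketch using Towhee's pipe/ops API; the specific operator names (image_decode.cv2, image_embedding.timm) and the backbone name are assumptions drawn from operators published on the Towhee hub, so substitute whichever operators your installation resolves:

```python
# pip install towhee
from towhee import ops, pipe

# Build a small image -> embedding pipeline with Towhee's Pythonic API.
# Operator names are assumptions; they are resolved from the Towhee hub on first use.
img_embedding = (
    pipe.input("path")
    .map("path", "img", ops.image_decode.cv2())                           # decode the image file
    .map("img", "vec", ops.image_embedding.timm(model_name="resnet50"))   # embed with a timm backbone
    .output("path", "vec")
)

# Run the pipeline on a local image and read the embedding back out.
result = img_embedding("example.jpg")
print(result.get())
```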
Otter
Explore the latest version of Otter for multimodal instruction tuning, centered on OtterHD-8B and the MagnifierBench evaluation. Otter introduces techniques such as fine-grained visual interpretation without a separate vision encoder and more efficient training with Flash-Attention-2. The MIMIC-IT dataset supports instruction tuning over both video and image inputs. Otter handles complex visual inputs well, serving as a valuable resource for AI visual tasks.
TencentPretrain
TencentPretrain is a comprehensive AI toolkit for pre-training and fine-tuning in text, vision, and audio modalities. Its modular architecture allows for flexible model configuration and scalability. With a diverse model zoo, it offers a range of pre-trained models suited for various tasks. Supporting CPU, single GPU, and distributed training, including DeepSpeed, it ensures superior performance in AI research and applications such as classification and reading comprehension.
multimodal
TorchMultimodal is a PyTorch library designed for comprehensive multimodal multi-task model training. It provides modular fusion layers, adaptable datasets, and pretrained model classes while enabling integration with elements from the PyTorch framework. The library includes numerous examples for training, fine-tuning, and evaluating models on various multimodal tasks. Models such as ALBEF, BLIP-2, CLIP, and DALL-E 2 facilitate the replication of state-of-the-art research, providing a valuable resource for researchers and developers aiming to advance in multimodal model training.
Seeing-and-Hearing
Discover a method for enhancing video and audio content creation by integrating existing models through a shared latent space. This approach supports joint and conditional tasks such as video-to-audio and audio-to-video generation, utilizing a multimodal latent aligner and the pre-trained ImageBind, serving the needs of professionals in the film industry.
NExT-GPT
NExT-GPT is a versatile any-to-any multimodal model that both accepts and generates text, images, videos, and audio. It couples pre-trained encoders and language models with diffusion decoders to enhance semantic understanding and multimodal content generation. Recent updates include the release of code and datasets, supporting further research and development. Developers can customize NExT-GPT with their own datasets and model backbones, and instruction tuning strengthens its performance across different tasks, making it a solid foundation for AI research.
MPP-LLaVA
This project enables exploration into advanced multimodal communication and processing, supporting image and video dialogues. It leverages QwenLM for seamless multi-round conversations, offering efficient solutions for complex interactions via pipeline and model parallelism. The framework is optimized for training and inference on multiple GPUs with DeepSpeed implementations and provides open-source pre-trained and SFT weights for diverse AI applications.
dynalang
Discover how Dynalang uses language to help predict future observations within a multimodal world model. Detailed in the paper 'Learning to Model the World with Language,' the repository includes guides for installation and use in environments such as HomeGrid, Messenger, VLN, and LangRoom, and covers training, text pretraining, and finetuning, offering practical resources for implementation.
open_flamingo
OpenFlamingo is an open-source PyTorch implementation of a multimodal language model inspired by DeepMind's Flamingo. By integrating image and text inputs with pretrained vision encoders and language models, it performs various tasks efficiently. The project allows training and evaluation through provided scripts and offers multiple model versions tailored for specific functions. It simplifies tasks like image captioning and context-based text generation, with future enhancements to include video input support.
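A minimal instantiation sketch modeled on the project's documented usage pattern; the encoder and tokenizer paths are examples and assume the corresponding Hugging Face weights are available locally or downloadable:

```python
# pip install open-flamingo
from open_flamingo import create_model_and_transforms

# Pair a pretrained CLIP vision encoder with a small pretrained language model.
# The exact paths are examples; swap in whichever released configuration you use.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Interleaved prompts mark image positions with "<image>" and segment boundaries
# with "<|endofchunk|>", mirroring Flamingo-style few-shot prompting.
prompt = "<image>An image of a cat.<|endofchunk|><image>An image of"
lang_x = tokenizer([prompt], return_tensors="pt")
```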
scenic
Scenic offers a robust framework for creating attention-based computer vision models, supporting tasks like classification and segmentation across multiple modalities. Utilizing JAX and Flax, it simplifies large-scale training through efficient pipelines and established baselines, ideal for research. Explore projects with state-of-the-art models like ViViT. Scenic provides adaptable solutions for both newcomers and experts, facilitating easy integration into existing workflows.
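Scenic models are ordinary Flax modules, so a toy attention block illustrates the kind of layer Scenic-style vision models stack; this is a generic JAX/Flax sketch, not Scenic's own API:

```python
import jax
import jax.numpy as jnp
from flax import linen as nn


class ToyAttentionBlock(nn.Module):
    """A minimal pre-norm self-attention block of the kind ViT-style models stack."""
    num_heads: int = 4

    @nn.compact
    def __call__(self, tokens):  # tokens: [batch, num_tokens, dim]
        x = nn.LayerNorm()(tokens)
        x = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(x, x)
        return tokens + x  # residual connection


# Initialize and apply the block on dummy patch tokens.
rng = jax.random.PRNGKey(0)
dummy = jnp.ones((2, 16, 64))  # batch of 2, 16 tokens, 64-dim embeddings
params = ToyAttentionBlock().init(rng, dummy)
out = ToyAttentionBlock().apply(params, dummy)
print(out.shape)  # (2, 16, 64)
```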