LLaVA-Plus-Codebase

Integrating Tools to Enhance Multimodal Agent Capabilities

Product Description

LLaVA-Plus extends language-and-vision assistants with tool use for vision tasks, forming multimodal agents that can learn and invoke a variety of skills. The project is designed for straightforward installation, with platform-specific guidance for Linux, macOS, and Windows. Detailed demo setup and training guides cover model deployment and the use of public checkpoints from the Model Zoo. Training proceeds in two stages: feature alignment, followed by tool-augmented visual instruction tuning on large datasets such as COCO and Visual Genome. The project is released for research purposes under its stated licenses, which permit non-commercial use only.
Project Details