Visual-Chinese-LLaMA-Alpaca
VisualCLA is a Chinese multimodal language model that extends Chinese-LLaMA/Alpaca with an image encoding module. It is pre-trained on Chinese image-text pairs to align visual and textual representations, and then fine-tuned on a collection of multimodal instruction datasets to improve its ability to understand and follow complex instructions and to carry on multimodal dialogue. The project is still in a testing phase and aims to further refine the model's understanding and conversational performance; it provides inference code as well as deployment scripts for Gradio and Text-Generation-WebUI. The current test release, VisualCLA-7B-v0.1, shows promising results in multimodal interaction and is intended to encourage further exploration across diverse applications.
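
The adapter-style design described above (an image encoder whose features are projected into the language model's embedding space and consumed alongside text tokens) can be illustrated with a minimal conceptual sketch. This is not VisualCLA's actual implementation; all module names, dimensions, and layer choices below are illustrative stand-ins.

```python
# Conceptual sketch only: a toy model showing how projected image features can be
# prepended to text embeddings before a transformer-based language model.
# Dimensions, module names, and the simplified (non-causal) decoder are illustrative,
# not VisualCLA's real architecture or code.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768):
        super().__init__()
        # Stand-in for a ViT-style patch embedding: 224x224 image -> 7x7 = 49 patches.
        self.vision_encoder = nn.Conv2d(3, d_vision, kernel_size=32, stride=32)
        # Projection/adapter mapping visual features into the LM embedding space.
        self.visual_proj = nn.Linear(d_vision, d_model)
        # Stand-in for the LLaMA-style token embeddings and decoder stack.
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, input_ids):
        # Encode the image into a sequence of patch features: (B, 49, d_vision).
        patches = self.vision_encoder(pixel_values).flatten(2).transpose(1, 2)
        # Align visual features with the text embedding space.
        visual_embeds = self.visual_proj(patches)
        text_embeds = self.embed_tokens(input_ids)
        # Prepend visual tokens to text tokens and process them jointly.
        hidden = self.decoder(torch.cat([visual_embeds, text_embeds], dim=1))
        return self.lm_head(hidden)

model = ToyMultimodalLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, 49 + 16, vocab_size)
```

In this kind of setup, image-text pre-training typically trains the projection (and optionally parts of the encoders) so that visual tokens land in a space the language model can attend to, and instruction fine-tuning then adapts the combined model to follow multimodal prompts.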