DIVA
This project introduces DIVA, a post-training, self-supervised framework that uses a DIffusion model as a Visual Assistant to enhance CLIP. By optimizing CLIP's representations with generative feedback from text-to-image diffusion models, it notably improves fine-grained visual perception, boosting performance on the MMVP-VLM benchmark by 3-7%, while preserving CLIP's zero-shot capability across 29 classification benchmarks.
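To make the generative-feedback idea concrete, here is a minimal sketch (not the repo's actual training code): a frozen diffusion-style denoiser is conditioned on CLIP's visual features, and the self-supervised denoising loss backpropagates only into CLIP. The module names, toy architectures, and the simplified noising step are all illustrative assumptions, not DIVA's real components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCLIPVisualEncoder(nn.Module):
    """Stand-in for a pretrained CLIP visual encoder (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x):
        # Patchify and pool into a single global feature per image: (B, dim)
        return self.proj(x).flatten(2).mean(-1)

class ToyConditionalDenoiser(nn.Module):
    """Stand-in for a frozen text-to-image diffusion U-Net (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.cond = nn.Linear(dim, 3)
        self.body = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy, cond):
        # Inject the CLIP feature as a per-channel conditioning signal
        return self.body(noisy + self.cond(cond)[:, :, None, None])

clip_visual = ToyCLIPVisualEncoder()                        # trainable
denoiser = ToyConditionalDenoiser().requires_grad_(False)   # frozen
opt = torch.optim.AdamW(clip_visual.parameters(), lr=1e-5)

images = torch.randn(4, 3, 224, 224)      # a batch of unlabeled images
noise = torch.randn_like(images)
t = torch.rand(4, 1, 1, 1)                # toy noise level in [0, 1]
noisy = (1 - t) * images + t * noise      # simplistic forward process

# Self-supervised objective: predict the added noise, conditioned on CLIP
# features of the clean image. The frozen denoiser's loss acts as generative
# feedback, so the gradient flows only into the CLIP encoder.
opt.zero_grad()
pred = denoiser(noisy, clip_visual(images))
loss = F.mse_loss(pred, noise)
loss.backward()
opt.step()
```

Under these assumptions, the better CLIP's features capture fine-grained visual detail, the more they help the frozen denoiser reconstruct the noise, which is what drives the improvement in visual precision.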