GiT
GiT is a general vision model that uses a plain Vision Transformer (ViT) to unify multiple vision tasks. The codebase is kept clean, with minimal dependencies. The model covers object detection, semantic segmentation, and vision-language tasks. Expressing every task through a unified language interface improves multi-task training results, and the model performs well on zero-shot and few-shot benchmarks. Its training strategy follows that of modern language models, which is intended to make it scalable and adaptable.
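
To make the idea of a language interface for a vision task concrete, here is a minimal sketch of how a detection result could be written as a short token sequence and read back. This is an illustration only, not GiT's actual format: the coordinate-binning scheme is a technique commonly used in sequence-based detection, and the names `NUM_BINS`, `quantize`, `box_to_tokens`, and `tokens_to_box` are hypothetical.

```python
from typing import List, Tuple

# Assumption: coordinates are discretized into a fixed number of bins.
# The value 1000 is illustrative, not taken from GiT.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)


def box_to_tokens(box: Box, label: str, width: float, height: float) -> List[str]:
    """Encode one box and its class label as five text tokens."""
    x0, y0, x1, y1 = box
    coords = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return [f"<bin_{c}>" for c in coords] + [label]


def tokens_to_box(tokens: List[str], width: float, height: float) -> Tuple[Box, str]:
    """Decode five tokens back to a pixel box, using the centre of each bin."""
    bins = [int(tok[len("<bin_"):-1]) for tok in tokens[:4]]
    sizes = [width, height, width, height]
    coords = [(b + 0.5) / NUM_BINS * s for b, s in zip(bins, sizes)]
    return (coords[0], coords[1], coords[2], coords[3]), tokens[4]


if __name__ == "__main__":
    tokens = box_to_tokens((0, 0, 320, 240), "cat", width=640, height=480)
    print(tokens)
    print(tokens_to_box(tokens, width=640, height=480))
```

Once outputs are plain token sequences like this, detection, segmentation, and captioning targets can in principle share one vocabulary and one sequence-prediction objective, which is the general appeal of a language-style interface.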