CogVLM
CogVLM and CogAgent are visual language models with strong cross-modal performance. CogVLM-17B, with 10 billion visual parameters and 7 billion language parameters, handles image understanding and multi-turn dialogue at 490x490 resolution and achieves top results on benchmarks such as NoCaps and Flickr30k. CogAgent, an enhanced variant with 11 billion visual parameters and added GUI agent capabilities, supports 1120x1120 resolution, performs strongly on VQAv2 and DocVQA, and outperforms prior models on GUI datasets such as AITW and Mind2Web. These models are well suited to complex image-text integration tasks.