CogVLM
CogVLM and CogAgent are visual language models with strong cross-modal performance. CogVLM-17B, with 10 billion visual parameters and 7 billion language parameters, handles image understanding and multi-turn dialogue at 490x490 resolution and achieves top results on benchmarks such as NoCaps and Flickr30k. CogAgent, an enhanced variant with 11 billion visual parameters and added GUI agent capabilities, supports 1120x1120 resolution, performs strongly on VQAv2 and DocVQA, and outperforms prior models on GUI datasets such as AITW and Mind2Web. These models are well suited to complex image-text integration tasks.