#Visual Grounding

Groma enhances multimodal language models with localized visual tokenization, which improves region comprehension and visual grounding. It performs strongly on referring expression comprehension benchmarks such as RefCOCO and RefCOCOg. Its training pipeline progresses from datasets like COCO to the Groma Instruct dataset, and the model targets applications that connect visual and linguistic contexts.
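To make the benchmark claim concrete: referring expression comprehension on RefCOCO-style benchmarks is commonly scored as the fraction of expressions whose predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The following is a minimal sketch of that metric, not Groma's evaluation code. The names `iou` and `rec_accuracy`, the `(x1, y1, x2, y2)` box layout, and the default threshold are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(predictions, ground_truth))
```

In this usage example the first prediction matches its target exactly, while the second overlaps its target with an IoU of only 1/7, so just one of the two counts as a hit.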
