Groma
Groma is a multimodal large language model built around localized visual tokenization: regions of interest in an image are encoded as region tokens, which gives the model region-level comprehension and visual grounding in addition to whole-image understanding. It reports strong results on referring expression comprehension benchmarks such as RefCOCO and RefCOCOg. Its training pipeline runs from datasets such as COCO through to the Groma Instruct dataset, and the model targets applications that need to tie language to specific image regions.
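
Referring expression comprehension on benchmarks such as RefCOCO is conventionally scored as the fraction of expressions whose predicted box overlaps the ground-truth box with an intersection over union (IoU) of at least 0.5. The description above does not state the metric, so the threshold, the (x1, y1, x2, y2) box format, and the function names below are assumptions for illustration, not Groma's evaluation code. A minimal sketch:

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Each referring expression contributes exactly one predicted box and one target box, so the score is a plain accuracy rather than a detection-style average precision.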