VITA
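VITA is an open-source multimodal large language model that accepts video, image, text, and audio inputs in a single model, with a focus on multilingual, vision, and audio understanding. It supports two interaction modes that need no manual activation: non-awakening interaction, where it answers spoken queries without a wake word or button press, and audio interrupt interaction, where a user can cut in with a new query while it is still responding. State tokens let it tell genuine queries apart from other input such as background noise, and a duplex scheme keeps it listening while it generates, so it can adapt its response when interrupted. These features make it suited to real-time, interactive multimodal applications.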
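The interrupt behavior described above can be illustrated with a toy sketch. This is not VITA's code: the `AudioState` labels, the `Instance` and `DuplexController` classes, and the role-swap rule are all hypothetical names and assumptions chosen to show the general idea of filtering input by state and handing over on an interruption.

```python
from dataclasses import dataclass
from enum import Enum


class AudioState(Enum):
    """Hypothetical stand-ins for the state tokens that label each input."""
    QUERY = "query"   # speech actually addressed to the model
    NOISE = "noise"   # background audio that should be ignored


@dataclass
class Instance:
    name: str
    role: str  # "generate" or "monitor"


class DuplexController:
    """Toy model of two instances that swap roles on an interruption."""

    def __init__(self) -> None:
        self.instances = [Instance("A", "generate"), Instance("B", "monitor")]

    def by_role(self, role: str) -> Instance:
        return next(i for i in self.instances if i.role == role)

    def on_audio(self, state: AudioState) -> bool:
        """Return True if the audio interrupts the current response."""
        if state is AudioState.NOISE:
            return False  # noise is filtered out, so the response continues
        generator, monitor = self.by_role("generate"), self.by_role("monitor")
        # The monitor takes over answering; the old generator stops and listens.
        generator.role, monitor.role = "monitor", "generate"
        return True
```

In this sketch, calling `on_audio(AudioState.NOISE)` leaves the roles unchanged, while `on_audio(AudioState.QUERY)` swaps them so that the listening instance answers the new query. A real system would also need to cancel the in-flight generation and carry over conversation context, which is omitted here.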