MultiPLY
MultiPLY is a multisensory embodied large language model whose agent actively interacts with objects in a 3D environment to gather visual, audio, tactile, and thermal information. It encodes each scene as an object-centric representation and integrates the collected sensory data to tie language, action, and perception more closely together. Sensory details are not all available up front: they are revealed as the agent interacts with objects, and specially designed tokens mark those interactions and the observations that result, which lets the language model reason about and act in 3D scenes more faithfully.
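The interaction loop described above can be illustrated with a toy sketch. This is not MultiPLY's implementation: the token strings, the `SceneObject` class, the `interact` function, and the mapping from actions to modalities are all hypothetical names chosen for illustration, and plain strings stand in for the learned sensory embeddings a real model would use.

```python
from dataclasses import dataclass, field

# Hypothetical action tokens; the names are illustrative assumptions only.
ACTION_TOKENS = {"observe": "<OBSERVE>", "touch": "<TOUCH>", "hit": "<HIT>"}

# Assumed mapping from each action to the modalities it reveals.
REVEALS = {
    "observe": ("visual",),
    "touch": ("tactile", "thermal"),
    "hit": ("audio",),
}


@dataclass
class SceneObject:
    """One entry of a toy object-centric scene representation."""
    name: str
    hidden: dict                                   # modality -> reading, unknown to the agent at first
    observed: dict = field(default_factory=dict)   # readings gathered so far


def interact(obj: SceneObject, action: str) -> list:
    """Apply an action to an object and return the tokens it produces.

    The sequence is: action token, object token, then one observation
    token per modality that the action reveals.
    """
    tokens = [ACTION_TOKENS[action], f"<OBJ:{obj.name}>"]
    for modality in REVEALS[action]:
        if modality in obj.hidden:
            reading = obj.hidden[modality]
            obj.observed[modality] = reading
            tokens.append(f"<{modality.upper()}:{reading}>")
    return tokens


mug = SceneObject(
    name="mug",
    hidden={"visual": "white ceramic", "tactile": "smooth",
            "thermal": "hot", "audio": "clink"},
)

# The agent looks at the mug, then touches it; it never hits it,
# so the audio reading stays unobserved.
sequence = interact(mug, "observe") + interact(mug, "touch")
print(sequence)
```

In this sketch the mug's sound never enters the token sequence because the agent never performs the action that would reveal it, which mirrors the idea that sensory details only become available through interaction.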