EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
About the Project
EmbodiedScan is a cutting-edge project in computer vision and robotics, designed to advance the capabilities of embodied artificial intelligence (AI). In essence, embodied AI refers to machines and robots that understand and interact with their environment much as humans do. To perform tasks and follow instructions, these machines need a full 3D understanding of their surroundings from their own perspective, rather than the bird's-eye view that traditional setups often assume.
EmbodiedScan fills this gap by offering a robust dataset and benchmark for holistic 3D scene understanding. It includes over 5,000 scans and one million ego-centric RGB-D views, that is, visual data captured from the perspective of the robot or agent. The dataset also contains one million language prompts and over 160,000 oriented 3D boxes covering more than 760 object categories. Together, this data supports training embodied agents to interpret scenes and carry out tasks that combine language and visual understanding.
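To make the shape of these annotations concrete, the sketch below shows what a single annotated sample might contain. It is purely illustrative: the field names and layout are assumptions for explanation, not the dataset's actual schema.

```python
# Purely illustrative sketch of one annotated sample (field names are
# assumptions for explanation, not the dataset's actual schema).
sample = {
    "rgb_path": "scans/scene0000/frame_000000.jpg",     # ego-centric color image
    "depth_path": "scans/scene0000/frame_000000.png",   # aligned depth map
    "camera_pose": [[1.0, 0.0, 0.0, 0.0],               # 4x4 camera-to-world matrix
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.5],
                    [0.0, 0.0, 0.0, 1.0]],
    "boxes_3d": [
        {
            "category": "chair",
            "center": [1.2, 0.4, 0.5],        # x, y, z in meters
            "size": [0.6, 0.6, 0.9],          # width, depth, height
            "orientation": [0.0, 0.0, 1.57],  # oriented box: rotation in radians
        }
    ],
    "language_prompt": "the chair next to the desk",  # grounding description
}
```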
Exciting Features
- Multi-Modal Perception: EmbodiedScan integrates multiple data modalities, such as RGB-D images and textual descriptions, to help AI systems build a richer, more nuanced understanding of their environment.
- Ego-Centric Perspective: Unlike traditional datasets that provide a global view of a scene, EmbodiedScan focuses on the first-person perspective of the agent, resembling how humans perceive and interact with the world.
- Language Integration: The project marries language comprehension with visual inputs, enabling an agent not only to see its surroundings but also to understand textual instructions that refer to them.
- Comprehensive Dataset: With millions of data points spanning RGB-D views, language prompts, and oriented 3D boxes, EmbodiedScan offers a versatile resource for developing AI capable of complex tasks.
- Advanced Benchmarking: The benchmark associated with EmbodiedScan sets a standard for evaluating the performance of AI systems in complex, real-world scenarios.
Getting Started
EmbodiedScan is designed to be accessible to developers and researchers. It targets systems running Ubuntu 20.04 with NVIDIA drivers installed, and depends on libraries such as PyTorch3D. Installation involves cloning the project repository and setting up the required computational environment.
For data preparation, users are guided through a straightforward process for downloading and organizing the necessary files. A tutorial and a demo inference example are also provided, offering a hands-on introduction to using the dataset in experiments.
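As a rough idea of what working with the prepared data might look like, the following is a minimal, hypothetical sketch of walking a data directory and pairing RGB frames with depth maps. The directory names and file layout here are placeholders; the real layout follows the repository's data preparation guide.

```python
# Hypothetical sketch: pairing RGB frames with depth maps in a prepared
# data directory. Paths and layout are placeholders, not the real structure.
from pathlib import Path

data_root = Path("data/embodiedscan")          # placeholder location
for scene_dir in sorted(data_root.glob("scans/*")):
    for rgb in sorted(scene_dir.glob("*.jpg")):
        depth = rgb.with_suffix(".png")        # assume depth is stored alongside RGB
        if depth.exists():
            print(f"{scene_dir.name}: paired {rgb.name} with {depth.name}")
```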
Training and Evaluation
EmbodiedScan provides configurations for various tasks, such as 3D detection and occupancy prediction. Researchers can train and test these models on a single GPU or across multiple GPUs, depending on their available compute.
Furthermore, EmbodiedScan includes several pretrained models that users can download and evaluate, with detailed logs and performance metrics provided for comparison.
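As a hedged illustration of what launching a run could look like: the project is built on the OpenMMLab ecosystem, so an MMEngine-style workflow along the following lines is plausible. The config path and work directory below are placeholders, not actual files shipped with the project.

```python
# Minimal sketch of programmatic training/evaluation, assuming an
# MMEngine-style config system; config path and work_dir are placeholders.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile("configs/detection/example_mv_3d_det.py")  # placeholder config
cfg.work_dir = "work_dirs/example_mv_3d_det"                     # where logs/checkpoints go

runner = Runner.from_cfg(cfg)
runner.train()   # trains the model described by the config
# runner.test()  # evaluates using the checkpoint referenced in the config
```

For multi-GPU runs, MMEngine-based projects typically wrap the same entry point in a distributed launcher such as torchrun; the exact commands are documented in the repository.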
Benchmarks
Baseline benchmarks are available for tasks such as multi-view 3D detection, occupancy prediction, and visual grounding. These baselines demonstrate how the dataset can drive advances in AI's understanding of, and interaction with, complex 3D environments.
Future Plans and Updates
The project is continually evolving, with ongoing releases of data and improvements. Future plans include expanding the dataset and refining models and evaluation methods.
EmbodiedScan serves as a foundational resource for the next generation of AI systems, pushing the boundaries of how machines perceive and interact with the world through a uniquely human-like perspective.