Introduction to DriveLM: Driving with Graph Visual Question Answering
DriveLM is a project that enhances autonomous driving through graph visual question answering (GVQA). Its central idea is to integrate language models with driving systems to improve reasoning and decision-making.
Project Overview
DriveLM serves as a major initiative under the CVPR 2024 Autonomous Driving Challenge. It involves creating a comprehensive dataset known as DriveLM-Data, which builds on existing data from platforms like nuScenes and CARLA. Accompanying this dataset is a baseline model called DriveLM-Agent, designed to handle tasks related to both GVQA and end-to-end driving.
Key Highlights
- Comprehensive Datasets: DriveLM-Data builds on the popular driving datasets nuScenes and CARLA. It supports tasks spanning perception, prediction, planning, behavior, and motion, all annotated with human-written reasoning logic.
- GVQA Approach: The project introduces Graph Visual Question Answering, which structures question-answer pairs as nodes in a graph with logical dependencies between them, enabling step-by-step reasoning about the driving scene (see the sketch after this list).
- Autonomous Driving Challenge: DriveLM is part of the CVPR 2024 Autonomous Driving Challenge, where participants can access all necessary resources, such as baseline models and test data, to compete effectively.
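To make the graph structure concrete, here is a minimal sketch of how GVQA-style QA pairs and their logical dependencies could be represented. Everything in it (the `QANode` class, the example questions, the stage labels) is an illustrative assumption, not DriveLM's actual API or data format.

```python
from dataclasses import dataclass, field

@dataclass
class QANode:
    """One question-answer pair in a GVQA graph (illustrative, not DriveLM's API)."""
    qid: str
    stage: str       # e.g. "perception", "prediction", "planning"
    question: str
    answer: str
    parents: list = field(default_factory=list)  # QA pairs this node logically depends on

# Hypothetical three-node chain: perception feeds prediction, which feeds planning.
q1 = QANode("q1", "perception",
            "What objects are ahead of the ego vehicle?",
            "A pedestrian is crossing at the intersection.")
q2 = QANode("q2", "prediction",
            "What will the pedestrian do next?",
            "Continue crossing from left to right.",
            parents=[q1])
q3 = QANode("q3", "planning",
            "What should the ego vehicle do?",
            "Slow down and yield until the crosswalk is clear.",
            parents=[q2])

def reasoning_path(node: QANode) -> list:
    """Walk parent links to recover the ordered reasoning chain ending at `node`."""
    path = []
    for parent in node.parents:
        path.extend(reasoning_path(parent))
    path.append(node)
    return path

for step in reasoning_path(q3):
    print(f"[{step.stage}] {step.question} -> {step.answer}")
```

Traversing the parent links recovers an ordered reasoning chain, which is the property that distinguishes GVQA from flat, independent VQA pairs.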
Current Endeavors and Future Directions
The project is tackling significant challenges:
- Data Shortages: DriveLM-Data provides a robust benchmark, making it easier to study driving scenarios with language components.
- Embodied Applications: GVQA offers new directions for integrating language models in embodied tasks within autonomous vehicles.
- Closed-loop Systems: DriveLM-CARLA explores the potential of closed-loop planning, using language in simulation to better reflect real-world driving.
DriveLM-Data
DriveLM-Data supports the full stack of autonomous driving tasks by linking perception, prediction, planning, behavior, and motion through a cohesive reasoning process. The dataset forms logic-driven QA pairs that connect these tasks: for example, a perception question about a crossing pedestrian feeds a prediction question about its trajectory, which in turn informs a planning question about the ego vehicle's next maneuver.
Structure and Content:
- Each Q&A pair forms part of a larger graph, enabling a structured reasoning path.
- Comparisons and statistics suggest DriveLM-Data is the first dataset to cover all of these driving tasks with structured logical dependencies between QA pairs (a hypothetical on-disk layout is sketched below).
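As a rough illustration of how such graph-structured annotations might be laid out on disk, here is a hypothetical JSON-like structure with a few lines of Python to traverse it. The field names (`scene_description`, `key_frames`, `QA`) and the example content are assumptions made for this sketch; consult the DriveLM repository for the authoritative schema.

```python
# Hypothetical annotation layout for one scene; not the official DriveLM schema.
sample = {
    "scene_0001": {
        "scene_description": "Ego vehicle approaches a signalized intersection.",
        "key_frames": {
            "frame_0001": {
                "QA": {
                    "perception": [
                        {"Q": "What objects are ahead?",
                         "A": "A pedestrian at the crosswalk."}
                    ],
                    "prediction": [
                        {"Q": "Where will the pedestrian move?",
                         "A": "Across the crosswalk, left to right."}
                    ],
                    "planning": [
                        {"Q": "What should the ego vehicle do?",
                         "A": "Yield until the crosswalk is clear."}
                    ],
                },
            },
        },
    },
}

# Walk scenes, key frames, and task stages in a fixed reasoning order,
# printing each QA pair along the perception -> prediction -> planning path.
for scene_id, scene in sample.items():
    for frame_id, frame in scene["key_frames"].items():
        for stage in ("perception", "prediction", "planning"):
            for qa in frame["QA"].get(stage, []):
                print(f"{scene_id}/{frame_id} [{stage}] {qa['Q']} -> {qa['A']}")
```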
License and Citation
The DriveLM code and assets are released under the Apache 2.0 license, while the language data is released under a Creative Commons license. Researchers who use these resources are encouraged to cite the project to acknowledge its contributions.
Conclusion
DriveLM integrates advanced language reasoning into autonomous driving systems. By leveraging existing datasets and introducing a graph-structured question-answering framework for driving, it pushes the boundaries of how language and driving systems interact, aiming for safer, more reliable autonomous driving.