3D Bounding Box Estimation Using Deep Learning and Geometry
Introduction
The 3D Bounding Box Estimation project is a PyTorch implementation of the research paper of the same name. It estimates 3D bounding boxes for detected objects by combining deep learning with geometric constraints. At roughly 0.4 seconds per frame, the project aims to provide a quick and efficient way to identify objects and estimate their spatial dimensions, with future work expected to increase its speed further.
Requirements
To run this project, the following tools and libraries are required:
- PyTorch: A popular machine learning library for Python.
- CUDA: NVIDIA's parallel computing platform, used to leverage the computing power of NVIDIA graphics cards.
- OpenCV (version 3.4.3 or higher): A library designed for real-time computer vision.
Usage
To start using the 3D Bounding Box Estimation system, the pre-trained weights need to be downloaded. By executing a simple script, users can obtain the necessary weights for both the 3D bounding box network and the YOLOv3 object detection network:
cd weights/
./get_weights.sh
Alternatively, the pre-trained weights can be directly downloaded from available online sources.
For exploring different options and settings, users can run the following command:
python Run.py --help
To process images in the default directory and optionally display 2D bounding boxes, use:
python Run.py [--show-yolo]
Video processing uses the same script with the --video flag and, by default, runs on sample video data from the KITTI dataset. Users can also supply their own KITTI videos along with the corresponding calibration data:
python Run.py --video [--hide-debug]
Training
The training process requires downloading about 13 GB of data from the KITTI dataset, including left color images, training labels, and camera calibration matrices. After organizing the data into the specified directory, the model can be trained by executing:
python Train.py
The training script saves the model every 10 epochs. It is important to note that the loss function for orientation is designed to converge to -1, so a negative loss value is normal. The script allows tuning of the weighting parameters (alpha and w) to refine performance; satisfactory results are often achieved after about 10 epochs, though training can run for up to 100.
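Why the orientation loss converges to -1 follows from its form: the confidence-weighted orientation term is the negative cosine of the angular error, so a perfect prediction gives cos(0) = 1 and hence a loss of -1. A minimal sketch of such a loss term, assuming (sin, cos) outputs per orientation bin; the function name and tensor shapes here are illustrative, not the project's actual code:

```python
import torch

def orientation_loss(pred_sin_cos, gt_angle_offset):
    """Negative cosine of the angular error between the predicted and
    ground-truth orientation offsets within a bin.

    pred_sin_cos: (N, 2) tensor of predicted (sin, cos) pairs
    gt_angle_offset: (N,) tensor of ground-truth angle offsets in radians
    """
    # Normalize the predicted (sin, cos) pair to unit length
    pred = pred_sin_cos / torch.norm(pred_sin_cos, dim=1, keepdim=True)
    # cos(theta - theta*) = cos(theta)*cos(theta*) + sin(theta)*sin(theta*)
    cos_err = (pred[:, 1] * torch.cos(gt_angle_offset)
               + pred[:, 0] * torch.sin(gt_angle_offset))
    # A perfect prediction gives cos_err = 1, hence a loss of -1
    return -cos_err.mean()

# A perfect prediction of a 0.5 rad offset yields a loss of about -1
angles = torch.tensor([0.5])
pred = torch.stack([torch.sin(angles), torch.cos(angles)], dim=1)
loss = orientation_loss(pred, angles)  # ≈ -1.0
```

This also explains why a more negative loss is better during training: it means the predicted angles lie closer to the ground truth.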
How it Works
The system first resizes each detected object crop to 224x224 pixels. A neural network then predicts the object's orientation and its dimensions as offsets from the per-class average. A separate detector, such as YOLOv3 run through OpenCV, supplies the 2D bounding boxes and class labels. From these predictions, the project's algorithm solves for the 3D position and projects the resulting box back onto the original image.
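The crop-and-resize step can be sketched as follows; the function name, box format, and normalization constants (standard ImageNet statistics) are assumptions for illustration, not the project's actual code:

```python
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def prepare_crop(image, box):
    """Crop a detected object from `image` (a 3xHxW float tensor in [0, 1])
    using a 2D box (x1, y1, x2, y2), then resize to the 224x224 input
    expected by the regression network."""
    x1, y1, x2, y2 = box
    crop = image[:, y1:y2, x1:x2]
    # Bilinear resize to the fixed network input size
    crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                         mode='bilinear', align_corners=False).squeeze(0)
    # Normalize with ImageNet statistics, as is typical for VGG-style backbones
    return (crop - IMAGENET_MEAN) / IMAGENET_STD
```

Each 2D detection from YOLOv3 would be passed through such a routine before the orientation and dimension regression.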
Two main assumptions for this process include:
- The 2D bounding box tightly encloses the object.
- The object, like a car on the road, has minimal pitch and roll angles.
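Under the second assumption, the box is rotated only about the vertical axis, which makes the final projection step straightforward. A minimal sketch, assuming a KITTI-style 3x4 camera projection matrix; the function name, corner ordering, and calibration values are illustrative, not the project's actual code:

```python
import numpy as np

def project_box_corners(K, center, dims, yaw):
    """Project the 8 corners of a 3D box (in camera coordinates) onto the
    image plane with a 3x4 projection matrix K. center is (x, y, z), dims
    is (h, w, l), and yaw is the rotation about the vertical axis (pitch
    and roll assumed zero)."""
    h, w, l = dims
    # Rotation about the camera's vertical (y) axis only
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    # The 8 corners of the box, centred on the origin
    x = [l/2, l/2, -l/2, -l/2, l/2, l/2, -l/2, -l/2]
    y = [h/2, h/2, h/2, h/2, -h/2, -h/2, -h/2, -h/2]
    z = [w/2, -w/2, -w/2, w/2, w/2, -w/2, -w/2, w/2]
    corners = R @ np.vstack([x, y, z]) + np.array(center).reshape(3, 1)
    # Homogeneous projection followed by division by depth
    pts = K @ np.vstack([corners, np.ones(8)])
    return (pts[:2] / pts[2]).T  # (8, 2) pixel coordinates

# Example with a hypothetical calibration matrix
K = np.array([[721.5, 0.0, 609.6, 0.0],
              [0.0, 721.5, 172.9, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pts = project_box_corners(K, center=(1.0, 1.5, 10.0),
                          dims=(1.5, 1.6, 3.9), yaw=0.3)
```

The tight-fit assumption is what makes the 3D position recoverable in the first place: each side of the 2D box must touch the projection of at least one 3D corner, which yields enough constraints to solve for the translation.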
Future Goals
The project's developers have set goals to enhance its capabilities and expand its utility:
- Training a custom YOLO network on the KITTI dataset for improved object detection accuracy.
- Implementing pose visualization, potentially through the Robot Operating System (ROS).
Credit
The project began as a fork of a now-archived repository, from which the training scripts are derived. The 2D-to-3D geometric estimation is inspired by existing academic resources.
In conclusion, the 3D Bounding Box Estimation project offers a practical approach to visualizing objects in three dimensions, built on PyTorch and established computer vision techniques. Development continues toward faster processing and expanded functionality, contributing to advancements in visual perception technology.