3D-OVS: An Introduction to Weakly Supervised 3D Open-vocabulary Segmentation
3D-OVS, a pioneering project presented at NeurIPS 2023, revolutionizes the way 3D scenes are segmented. This project introduces a method called "Weakly Supervised 3D Open-vocabulary Segmentation," which segments 3D scenes using open-vocabulary texts without the need for traditional, extensive segmentation annotations.
Project Overview
At its core, 3D-OVS employs state-of-the-art technology to identify and label different parts of a 3D scene based on text descriptions. Unlike conventional methods that require large datasets of pre-labeled training data, this approach requires minimal supervision, relying instead on natural language inputs to guide the segmentation process. This drastically reduces the time and resources needed to prepare datasets for training machine learning models in the context of 3D environments.
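To make the idea concrete, here is a minimal sketch (not the project's actual code) of open-vocabulary classification with CLIP: class names are embedded as text, and each feature rendered from the scene is assigned the class with the highest cosine similarity. The model variant and the 512-dimensional stand-in features are assumptions for illustration only.

```python
# Minimal sketch of the open-vocabulary idea (not the project's actual code):
# class names are encoded with CLIP, and each rendered feature is assigned the
# class whose text embedding it matches most closely.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)  # model variant is an assumption

class_names = ["a green apple", "a white keyboard", "a wooden table"]
tokens = clip.tokenize(class_names).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Stand-in for features rendered from the 3D scene, one 512-d vector per pixel.
pixel_feats = torch.randn(4, 512, device=device)
pixel_feats = pixel_feats / pixel_feats.norm(dim=-1, keepdim=True)

# Cosine similarity -> per-pixel class label.
labels = (pixel_feats.float() @ text_feats.float().T).argmax(dim=-1)
print([class_names[i] for i in labels.tolist()])
```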
Installation
To get started with 3D-OVS, users should have an Ubuntu 20.04 system with PyTorch 1.12.1 installed. The environment setup involves creating a Conda environment and installing the necessary Python packages such as torch, torchvision, opencv-python, and others. Additionally, the CLIP model from OpenAI is integral to the segmentation process and must also be installed.
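As a quick sanity check after following the project's own installation steps, a short script along these lines (a sketch, assuming the packages above were installed) confirms that the core dependencies import cleanly and that a GPU is visible:

```python
# Quick environment check after installing the requirements
# (a sanity-check sketch; the version expectations follow the README's
# Ubuntu 20.04 / PyTorch 1.12.1 setup).
import torch
import torchvision
import cv2
import clip

print("torch:", torch.__version__)              # expected around 1.12.1
print("torchvision:", torchvision.__version__)
print("opencv:", cv2.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())
```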
Dataset Preparation
For effective use, the 3D-OVS project requires downloading and organizing datasets into specified folders. Each dataset includes images, test view segmentations, the classes’ textual descriptions, and camera poses. It is essential to adjust file paths in the configuration files if datasets are stored in alternative directories.
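A small helper along the following lines can verify that a downloaded scene is laid out where the configuration files expect it. The folder and file names used here (images, segmentations, poses_bounds.npy) are assumptions about a typical forward-facing dataset layout, not the project's documented structure:

```python
# Hedged sketch of a dataset sanity check; the names below are assumptions
# about a typical LLFF-style layout and should be matched to the actual data.
from pathlib import Path

def check_scene(scene_dir: str) -> None:
    scene = Path(scene_dir)
    for name in ["images", "segmentations"]:        # assumed sub-folders
        status = "ok" if (scene / name).is_dir() else "MISSING"
        print(f"{scene / name}: {status}")
    poses = scene / "poses_bounds.npy"              # assumed pose file
    print(f"{poses}: {'ok' if poses.exists() else 'MISSING'}")

check_scene("data/3dovs/bed")  # path is illustrative only
```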
Quick Start Guide
Checkpoints for different scenes, a kind of snapshot of the model's learning process, are available for use. Users can test the segmentation capabilities by running a simple bash script that uses these checkpoints. The configuration for each scene is stored in a respective file within a configs directory.
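Before running the provided bash script, it can help to confirm that a downloaded checkpoint loads at all. The path below is illustrative, and the printed keys depend on how the checkpoint was saved:

```python
# Hedged sketch of inspecting a downloaded checkpoint; the path and the keys
# printed here are illustrative, not guaranteed to match the released files.
import torch

ckpt_path = "log/bed/bed.th"  # illustrative path
state = torch.load(ckpt_path, map_location="cpu")
if isinstance(state, dict):
    print("top-level keys:", list(state.keys())[:10])
else:
    print("loaded object of type:", type(state))
```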
Data Extraction
A significant part of preparing for training involves extracting a hierarchy of CLIP features from image patches using provided scripts. This specialized data extraction is crucial as it serves as the foundation for teaching the model to understand different parts of a scene under minimal supervision.
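The sketch below illustrates what extracting a hierarchy of patch-level CLIP features can look like in principle: the image is tiled at several patch sizes, and each patch passes through CLIP's image encoder. The patch sizes, model variant, and file path are assumptions rather than the project's exact extraction settings:

```python
# Hedged sketch of multi-scale patch feature extraction with CLIP.
# Patch sizes and the model variant are assumptions, not the project's settings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def patch_features(image: Image.Image, patch_sizes=(64, 128, 256)):
    feats_per_scale = {}
    for ps in patch_sizes:
        feats = []
        for top in range(0, image.height - ps + 1, ps):
            for left in range(0, image.width - ps + 1, ps):
                patch = image.crop((left, top, left + ps, top + ps))
                x = preprocess(patch).unsqueeze(0).to(device)
                with torch.no_grad():
                    f = model.encode_image(x)
                feats.append(f / f.norm(dim=-1, keepdim=True))
        feats_per_scale[ps] = torch.cat(feats, dim=0)
    return feats_per_scale

feats = patch_features(Image.open("data/3dovs/bed/images/0000.jpg"))  # illustrative path
print({k: tuple(v.shape) for k, v in feats.items()})
```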
Training Process
Training is divided into two main parts:
- TensoRF Reconstruction: This initial step reconstructs the scene. Users set the desired dataset path and experiment name in the configuration file (a hedged config-editing sketch follows this list). Once modified, training can commence, and the TensoRF model's progress will be logged.
- Segmentation Training: The actual segmentation training script uses configurations specific to the dataset being used. The training process is efficient, requiring around 90 minutes while consuming approximately 14GB of GPU memory.
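As referenced in the first step above, here is a hedged sketch of pointing a TensoRF-style text config at a local dataset before launching reconstruction. The key names datadir and expname follow common TensoRF configs and are assumptions here; check them against the actual file in the configs directory:

```python
# Hedged sketch of editing a TensoRF-style text config.
# The key names ("datadir", "expname") are assumptions based on common
# TensoRF configs; verify them against the file under configs/.
from pathlib import Path

def set_config_values(cfg_path: str, **overrides: str) -> None:
    lines = Path(cfg_path).read_text().splitlines()
    out = []
    for line in lines:
        key = line.split("=")[0].strip()
        if key in overrides:
            out.append(f"{key} = {overrides[key]}")
        else:
            out.append(line)
    Path(cfg_path).write_text("\n".join(out) + "\n")

set_config_values(
    "configs/bed.txt",                 # illustrative config file
    datadir="data/3dovs/bed",
    expname="bed_reconstruction",
)
```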
Troubleshooting
The 3D-OVS documentation offers guidance on common issues users might face: slow loading of CLIP features due to their size, prompt engineering to improve the relevance of segmentation maps, handling custom data, and adjusting model parameters when segmentation results are unsatisfactory.
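One of those remedies, prompt engineering, can be as simple as wrapping each class name in several templates and averaging the resulting CLIP text embeddings, which tends to give more stable similarities than a bare class name. The templates below are illustrative, not the project's own prompts:

```python
# Hedged sketch of simple prompt engineering with CLIP: embed each class name
# under several templates and average the normalized text embeddings.
# The templates are illustrative, not the project's own prompts.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

templates = ["a photo of a {}", "a close-up photo of a {}", "a {} in a scene"]

def class_embedding(name: str) -> torch.Tensor:
    tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    mean = feats.mean(dim=0)
    return mean / mean.norm()

print(class_embedding("keyboard").shape)
```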
Future Developments
3D-OVS currently supports forward-facing scenes. Future enhancements might include extending support for unbounded 360-degree scenes through the use of coordinate transformations, broadening its applicability and functionality.
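One coordinate transformation commonly used for unbounded scenes in the NeRF literature is a contraction that maps all of space into a bounded ball, as popularized by Mip-NeRF 360. Whether 3D-OVS would adopt exactly this form is speculation; the sketch below only shows the general idea:

```python
# Hedged sketch of a Mip-NeRF 360-style scene contraction for unbounded scenes.
# This is an example from the wider literature, not 3D-OVS's planned approach.
import torch

def contract(x: torch.Tensor) -> torch.Tensor:
    # x: (..., 3) points; points with norm <= 1 are unchanged, farther points
    # are squashed into the shell between radius 1 and 2.
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-9)
    contracted = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, contracted)

pts = torch.tensor([[0.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(contract(pts))
```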
Acknowledgments
The 3D-OVS project heavily builds on the work of the TensoRF project. The creators acknowledge the contributions of the researchers and developers who laid the groundwork for the current advancements.
Citation
For those interested in referencing the work scientifically, the relevant citation information is provided, showcasing the contributions of multiple authors to this innovative research.
This project's breakthrough in reducing the dependency on extensive datasets and annotations marks a significant shift in how we can utilize machine learning for 3D scene analysis, opening up numerous possibilities for future applications and research.