3D ResNets for Action Recognition

Overview

The 3D-ResNets-PyTorch project is a PyTorch implementation of 3D convolutional neural networks (3D CNNs) for action recognition. It provides training and evaluation scripts together with pretrained models that use both spatial and temporal information to recognize human actions in videos.

Key Publications

The project accompanies four research papers:

Mega-scale Datasets for 3D CNNs: Examines whether very large datasets further improve the performance of spatiotemporal 3D CNNs. Available on arXiv.
Good Practice for Action Recognition: Discusses good practices for applying spatiotemporal 3D convolutions to action recognition. Published in the ICPR proceedings.
Retracing 2D CNNs History: Investigates whether 3D CNNs can retrace the success that 2D CNNs achieved with ImageNet. Published at CVPR 2018.
Spatio-Temporal Feature Learning: Focuses on learning spatio-temporal features with 3D residual networks for action recognition. Published in the ICCV Workshop proceedings.

Major Updates

April 2020 Updates

Published a paper on using large-scale datasets for enhancing 3D CNNs.
Released pretrained models, including ResNet-50 models trained on datasets such as Kinetics-700 and Moments in Time.
Significantly refactored project scripts to improve usability with newer PyTorch versions.
Added support for distributed training and new datasets such as Moments in Time.
Introduced R(2+1)D models alongside traditional 3D ResNets.
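
To illustrate how an R(2+1)D block differs from a plain 3D convolution, the sketch below compares weight counts for the two. It assumes the factorization described in the R(2+1)D paper (Tran et al., CVPR 2018): a t x d x d convolution is replaced by a 1 x d x d spatial convolution with M intermediate channels followed by a t x 1 x 1 temporal convolution, with M chosen so the two parameter counts roughly match. The function names are illustrative and are not taken from this project's code; bias terms are ignored.

```python
def conv3d_params(in_ch, out_ch, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (no bias)."""
    return in_ch * out_ch * t * d * d


def mid_channels(in_ch, out_ch, t=3, d=3):
    """Intermediate channel count M that keeps the (2+1)D block
    at or just below the parameter count of the full 3D convolution."""
    return (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)


def conv2plus1d_params(in_ch, out_ch, t=3, d=3):
    """Weights in the factorized block: spatial conv, then temporal conv."""
    m = mid_channels(in_ch, out_ch, t, d)
    spatial = in_ch * m * d * d   # 1 x d x d convolution: in_ch -> m
    temporal = m * out_ch * t     # t x 1 x 1 convolution: m -> out_ch
    return spatial + temporal


if __name__ == "__main__":
    for in_ch, out_ch in [(64, 64), (64, 128)]:
        print(in_ch, out_ch,
              conv3d_params(in_ch, out_ch),
              mid_channels(in_ch, out_ch),
              conv2plus1d_params(in_ch, out_ch))
```

With 64 input and 64 output channels, the division is exact and both variants have the same number of weights; in other cases the factorized block has slightly fewer because M is rounded down.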

Pretrained Models and Usage

Multiple pretrained models are available, trained on datasets such as Kinetics-700, Moments in Time, and STAIR-Actions. Users can fine-tune these models by setting options for the model architecture and depth. The number of pretrained classes depends on the dataset or datasets a model was trained on, so it must be set to match the chosen model.
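
The dependence of the class count on the training data can be sketched as follows. This is a minimal illustration, assuming that a model trained on a combination of datasets has one output per class of every dataset involved. The one-letter dataset codes and the per-dataset class counts (700 for Kinetics-700, 339 for Moments in Time, 100 for STAIR-Actions) are assumptions for the example and should be checked against the released models.

```python
# Assumed class counts per dataset; verify against the dataset releases.
DATASET_CLASSES = {
    "K": 700,  # Kinetics-700
    "M": 339,  # Moments in Time
    "S": 100,  # STAIR-Actions
}


def n_pretrain_classes(code):
    """Return the assumed output size for a model trained on the
    datasets named by `code`, e.g. "K" or "KMS" (hypothetical codes)."""
    seen = set()
    total = 0
    for letter in code:
        if letter not in DATASET_CLASSES:
            raise ValueError("unknown dataset code: " + letter)
        if letter in seen:
            raise ValueError("dataset listed twice: " + letter)
        seen.add(letter)
        total += DATASET_CLASSES[letter]
    return total


if __name__ == "__main__":
    for code in ("K", "KM", "KMS"):
        print(code, n_pretrain_classes(code))
```

When fine-tuning, the value computed this way would describe the pretrained model's output layer, while the number of classes in the target dataset is set separately.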

Requirements for Use

PyTorch: Version 0.4 or higher is required. It can be installed via Conda.
FFmpeg and FFprobe: Required for video processing.
Python 3: The project is developed with Python 3.
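
The requirements above can be checked before running anything else. The following is a small sketch using only the Python standard library; the function names are illustrative, and the version check only compares the leading major and minor numbers of the PyTorch version string.

```python
import importlib.util
import re
import shutil
import sys


def version_at_least(version, minimum=(0, 4)):
    """Compare the leading 'major.minor' of a version string to `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_requirements():
    """Return a dict mapping each requirement to True or False."""
    report = {
        "python3": sys.version_info[0] >= 3,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "ffprobe": shutil.which("ffprobe") is not None,
        "pytorch": False,
    }
    # Only import torch if it is installed, so the check itself never fails.
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["pytorch"] = version_at_least(torch.__version__)
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(name, "OK" if ok else "MISSING")
```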

Dataset Preparation

The project works with popular video datasets whose videos (for example, AVI files) must first be converted to JPG frames. Scripts are provided to perform this conversion and to generate the annotation files required for datasets such as ActivityNet, Kinetics, UCF-101, and HMDB-51.
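
The video-to-JPG step can be sketched as a thin wrapper around FFmpeg. This is not the project's own conversion script. It is a minimal illustration that assumes one output directory per video and a frame-name pattern of image_%05d.jpg; both the pattern and the function names are assumptions for the example.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_dir, pattern="image_%05d.jpg"):
    """Build the FFmpeg argument list that writes every frame as a JPG."""
    return ["ffmpeg", "-i", str(video_path), str(Path(frames_dir) / pattern)]


def extract_frames(video_path, dst_root):
    """Extract the frames of one video into dst_root/<video name>/."""
    video_path = Path(video_path)
    frames_dir = Path(dst_root) / video_path.stem
    frames_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if FFmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(video_path, frames_dir), check=True)
    return frames_dir


if __name__ == "__main__":
    # Print the command only; call extract_frames() to actually run FFmpeg.
    print(build_ffmpeg_command("videos/example.avi", "jpg/example"))
```

Keeping the command construction separate from the call to FFmpeg makes it possible to inspect or log the command without having FFmpeg installed.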

Running the Project

Users must make sure that their data directories are structured as the scripts expect. The provided scripts support training, resuming training, fine-tuning, and evaluation on the supported datasets. Commonly set options include the batch size, the model depth, and whether to run on the CPU or the GPU.
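
A training run with these options can be sketched as a command line assembled in Python. The flag names below (--root_path, --video_path, --annotation_path, --result_path, --dataset, --n_classes, --model, --model_depth, --batch_size, --no_cuda) are recalled from the upstream project and should be treated as assumptions; confirm them against the help output of the project's main script before use. The helper function itself is hypothetical.

```python
import shlex


def build_train_command(root_path, video_path, annotation_path, result_path,
                        dataset, n_classes, model="resnet", model_depth=50,
                        batch_size=128, use_gpu=True):
    """Assemble an argument list for a training run (assumed flag names)."""
    cmd = [
        "python", "main.py",
        "--root_path", str(root_path),
        "--video_path", str(video_path),
        "--annotation_path", str(annotation_path),
        "--result_path", str(result_path),
        "--dataset", str(dataset),
        "--n_classes", str(n_classes),
        "--model", str(model),
        "--model_depth", str(model_depth),
        "--batch_size", str(batch_size),
    ]
    if not use_gpu:
        cmd.append("--no_cuda")  # assumed flag for CPU-only execution
    return cmd


if __name__ == "__main__":
    cmd = build_train_command("data", "kinetics_videos/jpg", "kinetics.json",
                              "results", "kinetics", 700, use_gpu=False)
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so paths containing spaces are passed through safely if the list is later handed to subprocess.run.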

Conclusion

The 3D-ResNets-PyTorch project is a practical resource for researchers and developers working on video action recognition with 3D CNNs. Its dataset preparation scripts and pretrained models provide a starting point for training, fine-tuning, and evaluating spatiotemporal models.