UniRepLKNet - Comprehensive Recognition Across Modalities with Large-Kernel ConvNet

UniRepLKNet: A Universal Perception Large-Kernel ConvNet

Overview

UniRepLKNet is a convolutional neural network (ConvNet) designed to handle multiple data modalities, including audio, video, point clouds, time-series, and images. It is called a "universal perception" model because it aims to process these modalities with a single architectural framework, with the potential to outperform traditional, modality-specific models.

Motivation

The project addresses two challenges in the field of large-kernel ConvNets. First, existing large-kernel architectures mostly follow the design of earlier models without substantial adjustments for large kernels, so architectural design specific to large kernels remains underexplored. Second, while Transformers are widely used as universal backbones in multimodal research, ConvNets have not been applied in the same way across different data types. UniRepLKNet attempts to bridge this gap with a unified architecture intended to provide universal perception abilities.

Highlights

UniRepLKNet unifies processing across multiple modalities and outperforms models tailored to individual data types. The result matters for two ongoing lines of research: structural re-parameterization (Structural Re-param, since the introduction of RepVGG in 2021) and very-large-kernel ConvNets (since RepLKNet in 2022).

Key performance metrics include an ImageNet accuracy of 88.0%, a COCO average precision (AP) of 56.4, and an ADE20K mean Intersection over Union (mIoU) of 55.6, all achieved with only ImageNet-22K pretraining. The model is also reported to be faster and more efficient than recent models such as ConvNeXt v2.

Furthermore, UniRepLKNet performs strongly in audio recognition and, notably, in forecasting global temperature and wind speed. The latter application shows that the model can handle large-scale time-series forecasting, a task that has traditionally been challenging for neural networks.

Architectural Contributions

UniRepLKNet proposes four architectural guidelines for designing large-kernel ConvNets. Their core idea and outcomes are summarized below:

Large kernels achieve wide coverage without deep stacking of layers: they can see wide without going deep.
The guidelines exploit the fundamental characteristics that distinguish large kernels from small kernels.
Following these guidelines leads to superior performance in image recognition tasks.
The model adapts across various domains with minimal preprocessing, achieving state-of-the-art results in time-series forecasting and audio recognition without modifying the core architecture for different modalities.
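
The claim that large kernels cover a wide area without deep stacking can be checked with simple receptive-field arithmetic. The sketch below is illustrative only and is not taken from the project code: it assumes stride-1 convolutions, uses the standard theoretical receptive-field formula, and the function names `receptive_field` and `layers_needed` are hypothetical helpers. The kernel size 13 is chosen only as an example of a large kernel.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Theoretical receptive field of a stack of stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


def layers_needed(target_rf, kernel_size=3):
    """Smallest number of stacked stride-1 layers whose receptive field reaches target_rf."""
    n = 0
    while receptive_field([kernel_size] * n) < target_rf:
        n += 1
    return n


# One 13-wide kernel covers 13 positions in a single layer.
single_large = receptive_field([13])

# Reaching the same coverage with 3-wide kernels takes a stack of several layers.
small_stack_depth = layers_needed(single_large, kernel_size=3)

print(single_large, small_stack_depth)
```

Under these assumptions a single large kernel matches the theoretical coverage of a six-layer stack of 3-wide kernels, which is the sense in which large kernels "see wide without going deep".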

Advancements and Releases

UniRepLKNet represents a "comeback" for ConvNets, showing their potential both in traditional domains and in new areas. The project's releases include:

The project has released model code and pretrained weights on platforms like Google Drive and Hugging Face.
Efficient PyTorch implementations and training code are available, along with supporting materials for tasks involving images, audio, video, and time-series.
While most functionalities are live, certain checkpoints are still being finalized.

Technical Design and Usage

The project code is designed to integrate with frameworks such as MMDetection and MMSegmentation. Models can be instantiated and deployed through Python libraries such as timm. The code also supports structural re-parameterization: parallel branches used during training are merged into a single equivalent convolution, which simplifies the network and improves inference efficiency.
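
The idea behind reparameterization can be illustrated with a toy example. The sketch below is not the project's implementation: it works in 1-D with plain Python lists, ignores normalization layers, and all function names are hypothetical. It assumes a large-kernel branch running in parallel with a dilated small-kernel branch, and shows that the two branches can be folded into one large kernel that produces the same output.

```python
def conv1d_same(x, kernel, dilation=1):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel size so the kernel is centered on each position.
    """
    half = (len(kernel) - 1) * dilation // 2
    out = []
    for i in range(len(x)):
        acc = 0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < len(x):  # positions outside the input count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def dilated_to_dense(kernel, dilation):
    """Insert (dilation - 1) zeros between taps to get an equivalent dense kernel."""
    dense = [0] * ((len(kernel) - 1) * dilation + 1)
    for j, w in enumerate(kernel):
        dense[j * dilation] = w
    return dense


def merge_into_large(large, small_dense):
    """Zero-pad the dense small kernel to the large kernel's size and add the two."""
    pad = (len(large) - len(small_dense)) // 2
    merged = list(large)
    for j, w in enumerate(small_dense):
        merged[pad + j] += w
    return merged


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
large = [1, 0, -2, 3, 1, -1, 2]  # kernel size 7
small = [2, -1, 1]               # kernel size 3, applied with dilation 2

# Training-time view: two parallel branches whose outputs are summed.
two_branch = [a + b for a, b in zip(conv1d_same(x, large),
                                    conv1d_same(x, small, dilation=2))]

# Inference-time view: one merged kernel, one convolution.
merged = merge_into_large(large, dilated_to_dense(small, 2))
single_branch = conv1d_same(x, merged)

print(two_branch == single_branch)
```

Because convolution is linear, summing the branch outputs equals convolving once with the summed kernels. A real implementation would operate on 2-D convolution weights and would also need to fold any normalization layers into the kernels before merging.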

Model Variants and Checkpoints

UniRepLKNet offers several model variants pretrained on datasets such as ImageNet-1K and ImageNet-22K. The variants differ in parameter count and computational cost to suit different application requirements. Benchmarks on tasks such as COCO object detection and ADE20K semantic segmentation are available, along with the corresponding weights.

Conclusion

UniRepLKNet is an early effort in universal perception models. It demonstrates that ConvNets, traditionally treated as modality-specific models, can be applied across data types with a carefully designed large-kernel architecture. By balancing efficiency and performance, it marks a notable step toward universal perception with ConvNets.