Project Overview: LitePose
LitePose is a project focused on efficient human pose estimation in settings where computational resources are limited. It targets edge devices, where deploying state-of-the-art pose estimation models, typically built on the HRNet framework, is difficult because of their computational cost: such models often require over 150 GMACs per frame, which is prohibitive for hardware with limited processing power.
The Innovation: LitePose Design
LitePose introduces an efficient single-branch architecture for real-time multi-person pose estimation tailored to edge computing environments. The key observation is that HRNet's high-resolution branches are redundant for models in the low-computation regime; removing them makes LitePose both more efficient and more accurate at these scales.
Key Components
- Fusion Deconv Head: removes the redundancy of HRNet-style high-resolution branches while still enabling scale-aware multi-resolution feature fusion at low computational overhead.
- Large Kernel Convolutions: larger kernel sizes increase model capacity and enlarge the receptive field without a large increase in computational cost. For instance, adopting a $7 \times 7$ kernel achieves a 14.0 mAP improvement over a $3 \times 3$ kernel on the CrowdPose dataset, with only a 25% increase in computation. A sketch of both components follows this list.
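To make these two components concrete, the following is a minimal PyTorch sketch, not the official LitePose implementation: the module names, channel sizes, and fusion strategy are illustrative assumptions chosen for exposition.

```python
# Illustrative sketch (not the official LitePose code): a depthwise-separable block
# with a large 7x7 kernel, plus a simplified fusion deconv head that upsamples the
# deepest feature map and fuses it with higher-resolution backbone features.
import torch
import torch.nn as nn


class LargeKernelBlock(nn.Module):
    """Depthwise 7x7 conv followed by a pointwise 1x1 conv. The large kernel
    enlarges the receptive field cheaply because the depthwise conv is lightweight."""

    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class FusionDeconvHead(nn.Module):
    """Upsamples the deepest feature map with deconvolutions, fusing it with
    higher-resolution features before predicting keypoint heatmaps."""

    def __init__(self, in_channels=(32, 64, 128), num_joints=17):
        super().__init__()
        c8, c16, c32 = in_channels  # features at stride 8, 16, 32 (illustrative)
        self.deconv1 = nn.ConvTranspose2d(c32, c16, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(c16, c8, kernel_size=4, stride=2, padding=1)
        self.refine = LargeKernelBlock(c8)
        self.final = nn.Conv2d(c8, num_joints, kernel_size=1)

    def forward(self, feat8, feat16, feat32):
        x = self.deconv1(feat32) + feat16   # fuse with the stride-16 feature map
        x = self.deconv2(x) + feat8         # fuse with the stride-8 feature map
        return self.final(self.refine(x))


if __name__ == "__main__":
    head = FusionDeconvHead()
    f8, f16, f32 = (torch.randn(1, 32, 64, 64),
                    torch.randn(1, 64, 32, 32),
                    torch.randn(1, 128, 16, 16))
    print(head(f8, f16, f32).shape)  # torch.Size([1, 17, 64, 64])
```

Swapping `kernel_size=7` for `kernel_size=3` in `LargeKernelBlock` is the kind of trade-off the large-kernel result above quantifies: a modest increase in MACs in exchange for a much larger receptive field.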
Performance Milestones
LitePose reduces latency on mobile platforms by up to 5.0× compared with previous state-of-the-art models without compromising accuracy, making real-time multi-person pose estimation practical on edge devices.
Results and Benchmarks
CrowdPose Dataset Performance
LitePose has demonstrated superior performance across different models within its family on the CrowdPose dataset:
- LitePose-Auto-S: achieves 58.3 mAP with just 5.0 GMACs, offering a significant latency reduction compared to models such as HigherHRNet and EfficientHRNet.
- LitePose-Auto-XS: balances performance and efficiency, providing fast predictions suitable for real-time applications.
COCO Dataset Performance
On the COCO 2017 validation and test-dev sets, LitePose also performs strongly; in particular, the Auto-M variant reaches 59.8 mAP, maintaining high accuracy at a reduced computational cost.
Getting Started with LitePose
Prerequisites and Setup
To use LitePose, first install PyTorch and the other required dependencies. Then download the COCO and CrowdPose datasets and arrange them following the data preparation guidelines of the HigherHRNet repository.
Training and Evaluation
LitePose can be trained either as a super-net or as a specific sub-network, and the provided training scripts allow the model configuration and optimization settings to be chosen. Evaluation is equally straightforward, with scripts for assessing a chosen architecture on the benchmark datasets.
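For context on how the reported numbers are computed, keypoint results on COCO-style annotations are usually scored with pycocotools. The snippet below shows that standard evaluation flow; the file paths are placeholders and this is not the project's own evaluation script.

```python
# Standard COCO keypoint evaluation with pycocotools (illustrative; paths are placeholders).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = "data/coco/annotations/person_keypoints_val2017.json"  # ground-truth annotations
results_file = "results/keypoints_val2017_results.json"           # predictions in COCO results format

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(results_file)

evaluator = COCOeval(coco_gt, coco_dt, iouType="keypoints")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR numbers such as the mAP values quoted above
```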
Pre-trained Models
The project provides pre-trained models that can be used to reproduce the results reported in the research, covering configurations for different computational budgets.
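As a starting point, a downloaded checkpoint can be inspected with plain PyTorch as sketched below. The file name is a placeholder and the commented-out model builder is hypothetical; refer to the repository for the actual model definitions and configuration files.

```python
# Inspecting a pre-trained checkpoint with plain PyTorch (illustrative only).
import torch

checkpoint = torch.load("litepose_checkpoint.pth", map_location="cpu")  # placeholder path

# Checkpoints commonly store either a bare state_dict or a dict containing one.
state_dict = checkpoint.get("state_dict", checkpoint)
print(f"{len(state_dict)} parameter tensors in the checkpoint")

# model = build_model(cfg)              # hypothetical: use the repo's model builder/config
# model.load_state_dict(state_dict)
# model.eval()
```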
Acknowledgements and Further Reading
LitePose draws inspiration from and builds upon the HRNet family of models, especially the structure of HigherHRNet. Its large kernel convolution approach aligns with contemporary findings in works such as ConvNeXt and RepLKNet. Readers interested in further technical details and background can consult the paper and supplementary materials.
LitePose marks a substantial step toward accessible, efficient pose estimation, making sophisticated human pose applications feasible on a broader range of devices.