MIMDet
This project adapts a vanilla ViT pre-trained with Masked Image Modeling (MIM) to object detection and instance segmentation. A compact convolutional stem is added to supply multi-scale representations, which turns the backbone into a hybrid ViT-ConvNet. On COCO it reaches 51.7 box AP and 46.2 mask AP. The sample ratio can be varied, so training can favor efficiency while inference favors accuracy.
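
The sample ratio presumably refers to the fraction of image patch tokens passed to the ViT encoder. The following is a minimal sketch of that idea in plain Python, assuming uniform random sampling without replacement. The function name `sample_patch_indices`, the 14x14 patch grid, and the 0.5 training ratio are illustrative assumptions, not taken from the MIMDet code.

```python
import random


def sample_patch_indices(num_patches, sample_ratio, seed=None):
    """Return the sorted indices of the patch tokens kept for the encoder.

    Hypothetical helper: illustrates partial-observation sampling only.
    """
    if not 0.0 < sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be in (0, 1]")
    # Keep at least one token even for very small ratios.
    num_keep = max(1, round(num_patches * sample_ratio))
    rng = random.Random(seed)
    # Uniform sampling without replacement; sorting preserves spatial order.
    return sorted(rng.sample(range(num_patches), num_keep))


# Assumed 14x14 patch grid (196 tokens), for illustration only.
train_idx = sample_patch_indices(196, 0.5, seed=0)  # fewer tokens: cheaper training
infer_idx = sample_patch_indices(196, 1.0)          # all tokens: full input at inference
print(len(train_idx), len(infer_idx))
```

With a ratio of 0.5 the encoder would see half of the tokens in each training step, while a ratio of 1.0 keeps every token, which matches the efficiency-versus-accuracy trade-off described above.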