# CVPR 2024
DriveLM
This article explores the application of Graph Visual Question Answering (Graph VQA) to autonomous driving systems, particularly through the associated 2024 challenge. It uses datasets such as nuScenes and CARLA to develop a VLM-based baseline that combines Graph VQA with end-to-end driving. The project seeks to mimic the staged reasoning a human driver performs, offering a holistic framework for perception, prediction, and planning. It merges language models with autonomous systems for explainable planning and improved decision-making in self-driving vehicles. Learn about the project's novel methodology and its impact on the field of autonomous vehicles.
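To make the graph-of-questions idea concrete, here is a minimal sketch of how perception, prediction, and planning questions could be chained as nodes. The node fields and example QA pairs are invented for illustration and do not reproduce DriveLM's actual annotation schema.

```python
# Minimal illustration of a Graph VQA structure for driving scenes.
# Node fields and example questions are invented for clarity; they are not
# DriveLM's actual schema.
from dataclasses import dataclass, field


@dataclass
class QANode:
    stage: str                # e.g. "perception", "prediction", "planning"
    question: str
    answer: str
    children: list = field(default_factory=list)  # downstream QA nodes


# Perception -> prediction -> planning, mirroring the reasoning chain.
perception = QANode("perception", "What objects are ahead of the ego vehicle?",
                    "A pedestrian is crossing at the intersection.")
prediction = QANode("prediction", "What will the pedestrian do next?",
                    "Continue crossing from right to left.")
planning = QANode("planning", "What should the ego vehicle do?",
                  "Slow down and yield until the crosswalk is clear.")

perception.children.append(prediction)
prediction.children.append(planning)


def traverse(node, depth=0):
    """Print the QA chain in reasoning order."""
    print("  " * depth + f"[{node.stage}] Q: {node.question} A: {node.answer}")
    for child in node.children:
        traverse(child, depth + 1)


traverse(perception)
```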
mickey
MicKey implements a keypoint detection pipeline that estimates keypoint positions directly in camera space, enabling metric correspondences and relative pose estimation. Featured at CVPR 2024, this end-to-end differentiable model is trained through differentiable pose optimization, requiring only image pairs with ground-truth relative poses for supervision. Addressing AR challenges, MicKey supports the map-free relocalization benchmark using a single reference image. Pre-trained models and demo scripts support streamlined evaluation and custom processing, illustrating MicKey's practical contributions to image alignment and pose estimation.
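Once metric 3D correspondences between the two images are available, a relative pose can be recovered with a closed-form rigid alignment. The sketch below shows that generic Kabsch/Procrustes step on synthetic correspondences; the matched point arrays stand in for MicKey's camera-space keypoint outputs, and this is not the project's own differentiable solver.

```python
# Sketch: recovering a relative pose (R, t) from matched metric 3D keypoints.
# The correspondences below are synthetic stand-ins for MicKey's camera-space
# keypoint outputs; this is the generic Kabsch/Procrustes step, not the
# project's own differentiable solver.
import numpy as np


def kabsch(p0: np.ndarray, p1: np.ndarray):
    """Least-squares rigid transform (R, t) mapping p0 onto p1 (both N x 3)."""
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)
    H = (p0 - c0).T @ (p1 - c1)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t


# Synthetic matched keypoints (metres) related by a known rotation + translation.
rng = np.random.default_rng(0)
p0 = rng.normal(size=(50, 3))
theta = 0.2
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, 0.0, 0.3])
p1 = p0 @ R_true.T + t_true

R_est, t_est = kabsch(p0, p1)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```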
SIFU
SIFU reconstructs detailed 3D models of clothed humans from a single image. Its Side-view Conditioned Implicit Function improves feature extraction and geometric accuracy, while its 3D Consistent Texture Refinement enhances texture quality. The method adapts well to complex poses and loose clothing, making it suitable for applications such as 3D printing and animation. Discover the project's outcomes and technical insights, emphasizing its practical utility.
CFLD
This article introduces an advanced pose-guided image synthesis method based on Coarse-to-Fine Latent Diffusion, showcased at CVPR 2024. It demonstrates gains in fidelity and resolution on datasets such as DeepFashion. Resources including code, models, and a customizable Jupyter notebook are provided to help researchers optimize image synthesis pipelines. Access to generated images and pre-trained models allows for experimentation and further study.
richdreamer
RichDreamer applies a Normal-Depth Diffusion Model to convert text prompts into detailed 3D content, suitable for various applications. The model, accepted at CVPR 2024, includes MultiView-ND and Albedo Diffusion components for enhanced detail. Installation instructions and pre-trained weights are provided, and the pipeline runs efficiently on a range of GPUs. Resources are available on ModelScope's 3D Object Generation platform.
CVPR2024-Papers-with-Code
Access a wide range of CVPR 2024's notable papers and their open-source code, covering areas such as 3D modeling, AI advancements, and multimodal learning. Connect with a global network of experts to keep abreast of the latest in computer vision and related technology.
Depth-Anything
Depth Anything leverages a large-scale corpus of over 63.5 million images to improve monocular depth estimation. Accepted at CVPR 2024, the method strengthens depth prediction in both relative and metric settings. The project also provides an optimized depth-conditioned ControlNet and supports downstream scene understanding. With the release of Depth Anything V2 and its integration into platforms like Hugging Face, the project offers accessible tools for enhancing depth perception technologies.
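Through the Hugging Face integration, the model can be called via the standard `transformers` depth-estimation pipeline. The checkpoint id below is an assumption and should be verified against the model hub.

```python
# Minimal sketch of running Depth Anything through the Hugging Face
# depth-estimation pipeline. The checkpoint id is an assumption; check it
# against the model hub before use.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-small-hf",  # assumed checkpoint id
)

image = Image.open("example.jpg")          # any RGB image
result = depth_estimator(image)
result["depth"].save("example_depth.png")  # PIL image of the predicted depth map
```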
murf
Discover MuRF's advancements in multi-baseline radiance fields, delivering top performance in various evaluations. Utilizing PyTorch and CUDA, it provides comprehensive scripts for installation, training, evaluation, and rendering. Models are available on Hugging Face with detailed datasets to thoroughly explore MuRF's capabilities. Suitable for researchers and developers looking to improve radiance field representations. Consult the documentation for easy integration and enhance computer vision projects with MuRF.
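Because checkpoints are hosted on the Hugging Face Hub, a typical first step is to fetch a weight file with `huggingface_hub` before running the provided training or evaluation scripts. The repository id and filename below are placeholders, not the project's actual paths.

```python
# Sketch of fetching a MuRF checkpoint from the Hugging Face Hub before running
# the project's evaluation scripts. The repo id and filename are placeholders;
# substitute the ones listed in the MuRF documentation.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<murf-repo-id>",      # placeholder
    filename="<checkpoint>.pth",   # placeholder
)
print("Checkpoint downloaded to:", ckpt_path)
```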
PointTransformerV3
PointTransformerV3 offers an efficient approach to 3D point cloud segmentation, providing improved speed and accuracy in semantic segmentation tasks on benchmarks like nuScenes and ScanNet. The project is continually updated in Pointcept v1.5, supplying valuable resources such as model weights and experiment records. Selected for oral presentation at CVPR'24, it utilizes Flash Attention to enhance computational efficiency and support scalable multi-dataset 3D representation learning.
UniDepth
UniDepth provides universal metric depth estimation from single images, handling varied input shapes, producing confidence estimates alongside its predictions, and running quickly. This leading depth estimation model features a strong decoder design and ONNX compatibility, excelling on benchmarks such as KITTI and NYUv2. The project continues to evolve with compact model variants and ongoing image-processing improvements. Models are accessible via Hugging Face and TorchHub, easing integration into various AI frameworks. Researchers are encouraged to contribute to this tool, advancing automated visual comprehension.
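Model access typically looks like the sketch below. The import path, class name, checkpoint id, and `infer` call follow the project's README as understood here; treat them as assumptions and verify against the repository.

```python
# Sketch of metric depth inference with UniDepth. Import path, class name,
# checkpoint id, and the `infer` call are assumptions based on the README;
# verify them against the repository before use.
import numpy as np
import torch
from PIL import Image
from unidepth.models import UniDepthV1  # assumed import path

model = UniDepthV1.from_pretrained("lpiccinelli/unidepth-v1-vitl14")  # assumed id
model = model.eval()

rgb = np.array(Image.open("example.jpg"))        # H x W x 3, uint8
rgb = torch.from_numpy(rgb).permute(2, 0, 1)     # 3 x H x W

with torch.no_grad():
    predictions = model.infer(rgb)               # camera intrinsics are optional

depth = predictions["depth"]             # metric depth map (metres)
intrinsics = predictions["intrinsics"]   # estimated camera intrinsics
print(depth.shape, intrinsics.shape)
```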
mPLUG-Owl
Examine the progressive developments in multi-modal large language models achieved by the mPLUG-Owl family, which uses a modular architecture to boost multimodality. Stay informed on the advancements of mPLUG-Owl3, which emphasizes long image-sequence comprehension, and note mPLUG-Owl2's acceptance at CVPR 2024. Gain insight into the enriched features of the Chinese-language version, mPLUG-Owl2.1, which collectively contribute to advancing multimodal AI capabilities.
Marigold
Discover how a diffusion-based image generator can be repurposed for monocular depth estimation, drawing on the rich priors of generative models. This project excels in zero-shot transfer and integrates seamlessly with platforms such as Hugging Face and Google Colab, delivering leading-edge results in visual computing.
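Thanks to the Hugging Face integration, depth maps can be produced with the Marigold pipeline in `diffusers`. The pipeline class and checkpoint id below are assumptions based on that integration and should be checked against the diffusers documentation.

```python
# Sketch of monocular depth estimation with the Marigold pipeline in diffusers.
# The pipeline class and checkpoint id are assumptions; verify them against the
# diffusers documentation.
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image

pipe = MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")                            # assumes a CUDA-capable GPU

image = load_image("example.jpg")
result = pipe(image)                                            # diffusion-based depth prediction
vis = pipe.image_processor.visualize_depth(result.prediction)   # colorized depth maps
vis[0].save("example_depth.png")
```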
InstanceDiffusion
InstanceDiffusion offers precise, instance-level control for text-to-image diffusion models, significantly improving location-conditioned generation, with notable AP50 gains for box inputs over prior specialized models. It supports varied location inputs, such as points and masks. Recent updates include integration with ComfyUI and support for flash attention, reducing memory usage. The model is thoroughly evaluated on datasets like MSCOCO, making it suitable for research and academic exploration.
APISR
The project improves the quality of anime images and videos with advanced super-resolution techniques inspired by real-world degradations. It enables upscaling of low-resolution anime content affected by various degradations. Features include multiple upscaling factors, different architecture weights, and integration with Toon Crafter for better results. It provides easy installation, a Gradio demo for fast inference, and comprehensive guides on dataset creation and training. Suitable for developers focusing on advanced anime visual enhancement.
Pointcept
Pointcept offers a robust framework for point cloud perception research, incorporating models such as Point Transformer V3 and OA-CNNs. It supports extensive 3D representation learning, integrating a vast array of datasets and tasks like semantic and instance segmentation. As a pivotal resource in 3D model training, Pointcept is an invaluable tool for researchers delving into cutting-edge computer vision and perception capabilities.
4DGaussians
The 4DGaussians project provides an efficient method for real-time and high-quality rendering of dynamic scenes using 4D Gaussian splatting. It optimizes setups and hyperparameters for datasets such as HyperNeRF and D-NeRF, offering minimal training times while supporting multiple view configurations. This project facilitates environment setup, data preparation, and model training using PyTorch. Check out the project's GitHub for scripts and further insights into rendering, evaluation, and checkpoint usage.
awesome-cvpr-2024
Discover the latest advances in Computer Vision and Pattern Recognition at CVPR 2024, featuring workshops, research papers, and challenges across emerging topics like Vision Transformers and multimodal interaction. Selected from over 11,000 submissions, the accepted work highlights cutting-edge innovation and collaboration in algorithm development, data exploration, and practical applications.
Ranni
The project introduces a text-to-image generation pipeline that pairs a large language model for semantic comprehension with a diffusion-based model for drawing. Comprising an LLM-based planning component and a diffusion model, the system aligns generated images with text prompts in two phases. Presented as a CVPR 2024 oral paper, the package includes model weights such as a LoRA-finetuned LLaMA-2-7B and a fully finetuned SDv2.1. Users can explore image creation interactively through Gradio demos and apply continuous edits for targeted image changes.
Feedback Email: [email protected]