# Vision Transformer

## Awesome-Transformer-Attention
Explore a meticulously curated repository of Vision Transformer and attention resources, featuring papers, code, and links to relevant websites. Maintained by Min-Hung Chen, the list is regularly updated with the latest developments from major conferences such as NeurIPS 2023 and ICCV 2023. Researchers can contribute by opening issues or pull requests for any missed papers, keeping the resource relevant for academic and enthusiast communities alike.
## tensorflow-image-models
Discover a broad collection of pretrained image models for TensorFlow, including Vision Transformers and ResNets with ImageNet weights. The models are Keras-compatible and straightforward to install, can be adapted to tasks beyond classification, and include recent additions such as the Segment Anything Model.
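For a quick orientation, here is a minimal usage sketch based on tfimm's `create_model`/`list_models` API; the specific architecture name below is an assumption and should be checked against the `list_models` output.

```python
import tfimm

# List architectures that ship with pretrained weights converted from timm.
print(tfimm.list_models(pretrained="timm")[:5])

# Create a Vision Transformer with ImageNet weights (name is an assumption).
model = tfimm.create_model("vit_base_patch16_224", pretrained="timm")

# The result is a regular Keras model.
model.summary()
```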
## pixel
Explore an innovative language-modeling approach that processes text as images, removing the limitations of a fixed vocabulary and enabling smooth adaptation across scripts. Pretrained on 3.2 billion words, the model surpasses BERT on non-Latin scripts. It combines a text renderer, an encoder, and a decoder that reconstructs masked image patches at the pixel level, improving performance on syntactic and semantic tasks. Detailed pretraining and finetuning guides are available via Hugging Face.
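To make the pipeline concrete, below is an illustrative PyTorch sketch of the core pretraining idea (render text to an image, mask patches, reconstruct pixels). All helper names are hypothetical and this is not the repository's actual API; note that PIXEL, like MAE, drops masked patches from the encoder, whereas this simplification just blanks them.

```python
import torch
import torch.nn.functional as F

def render_text(text: str, height: int = 16, width: int = 256) -> torch.Tensor:
    # Hypothetical stand-in for PIXEL's text renderer (text -> grayscale image).
    # A random image keeps the sketch runnable without font rendering.
    return torch.rand(1, height, width)

def pixel_pretraining_step(text, encoder, decoder, patch=16, mask_ratio=0.25):
    img = render_text(text)                             # (1, H, W)
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = p.reshape(-1, patch * patch)              # (num_patches, patch*patch)
    n_mask = int(mask_ratio * patches.size(0))
    masked = torch.randperm(patches.size(0))[:n_mask]
    corrupted = patches.clone()
    corrupted[masked] = 0.0                             # blank out masked patches
    recon = decoder(encoder(corrupted))                 # predict pixel values
    return F.mse_loss(recon[masked], patches[masked])   # loss on masked patches only

# Tiny linear encoder/decoder keep the example self-contained.
enc = torch.nn.Linear(256, 128)
dec = torch.nn.Linear(128, 256)
loss = pixel_pretraining_step("hello world", enc, dec)
```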
## fast-reid
FastReID is a research platform for re-identification with state-of-the-art algorithms and performance optimizations. Beyond standard re-identification, it supports image retrieval and face recognition, and integrates Vision Transformer backbones and automatic mixed-precision training. Features include model distillation, diverse visualization tools, and comprehensive evaluation metrics. The platform supports distributed training and model conversion to Caffe, ONNX, and TensorRT, serving as a flexible library for varied research; continuous updates such as DG-ReID enhancements and Partial FC keep it current.
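As a rough sketch of the detectron2-style workflow FastReID exposes, assuming the `get_cfg`/`DefaultTrainer` entry points and an example config shipped with the repository (paths may differ in your checkout, and datasets must be prepared separately):

```python
from fastreid.config import get_cfg
from fastreid.engine import DefaultTrainer

cfg = get_cfg()
# Example config from the repo; adjust the path to your checkout.
cfg.merge_from_file("configs/Market1501/bagtricks_R50.yml")
cfg.OUTPUT_DIR = "logs/market1501/bagtricks_R50"

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```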
## SAM4MIS
Discover a continuously updated survey of SAM and SAM2 for medical image segmentation, covering benchmarks, method adaptations, and emerging research directions, and helping researchers stay current with advances in biomedical imaging.
## MultiModalMamba
Discover MultiModalMamba, an AI model that combines a Vision Transformer with Mamba, built on the Zeta framework for efficient multi-modal data processing. It handles text and image data in a single model, offers customizable parameters and the option to return raw embeddings, and can be tailored to needs such as transfer learning, providing a versatile way to streamline multi-modal workflows.
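A minimal usage sketch, patterned on the project README; the import path, constructor arguments, and their names are assumptions and should be verified against the repository:

```python
import torch
from mm_mamba import MultiModalMamba  # import path is an assumption

# Token IDs for text and a batch of RGB images.
text = torch.randint(0, 10000, (1, 196))
images = torch.randn(1, 3, 224, 224)

# All parameter names below are assumptions patterned on the README.
model = MultiModalMamba(
    vocab_size=10000,     # text vocabulary size
    dim=512,              # shared model width
    depth=6,              # number of Mamba blocks
    heads=8,
    image_size=224,       # ViT encoder input size
    patch_size=16,
    encoder_dim=512,
    encoder_depth=6,
    encoder_heads=8,
    fusion_method="mlp",  # how text and image streams are fused
)

out = model(text, images)  # fused multi-modal output
print(out.shape)
```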
## hiera
Hiera is a streamlined hierarchical vision transformer that delivers strong performance on image and video tasks with fast inference, thanks to MAE pretraining. Models are available through Torch Hub and the Hugging Face Hub, enabling seamless integration into projects.
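Loading a model through Torch Hub might look like the following; the entrypoint and checkpoint names follow the README's pattern and should be verified against the repo's model zoo.

```python
import torch

# Load an MAE-pretrained, ImageNet-finetuned Hiera-Base via Torch Hub.
model = torch.hub.load(
    "facebookresearch/hiera",
    model="hiera_base_224",
    pretrained=True,
    checkpoint="mae_in1k_ft_in1k",
)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)    # ImageNet-1K class logits
print(logits.shape)      # torch.Size([1, 1000])
```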
## MIMDet
This project applies Masked Image Modeling with a vanilla ViT to object detection and instance segmentation. A compact convolutional stem supplies multi-scale representations, forming a hybrid ViT-ConvNet backbone. It reaches 51.7 box AP and 46.2 mask AP on COCO, and trains efficiently by feeding the ViT encoder only a sampled fraction of the image patches.
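For intuition, here is a generic PyTorch sketch of such a hybrid backbone; it is a simplification for illustration, not MIMDet's actual code.

```python
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    """Illustrative conv-stem + plain-ViT backbone (not MIMDet's real code)."""

    def __init__(self, dim=384, depth=4):
        super().__init__()
        # Compact convolutional stem: 3 -> dim channels at 1/16 resolution,
        # providing the local, multi-scale bias a bare ViT lacks.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 2, 7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=4, padding=1),
        )
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        feat = self.stem(x)                       # (B, dim, H/16, W/16)
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, dim)
        tokens = self.vit(tokens)                 # plain ViT encoder blocks
        return tokens.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 3, 224, 224)
print(HybridBackbone()(x).shape)  # torch.Size([1, 384, 14, 14])
```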
## GiT
Discover a general vision model that employs a plain Vision Transformer to unify multiple vision tasks. The model keeps dependencies minimal and the codebase clean while performing strongly on object detection, semantic segmentation, and vision-language tasks. Its unified language interface improves multi-task training and yields strong zero-shot and few-shot results, and the training strategy mirrors modern language-model practice for broad scalability and adaptability.
## ml-fastvit
This repository features FastViT, a fast hybrid vision transformer that uses structural reparameterization to boost image-classification accuracy. Models are trained on ImageNet-1K and benchmarked for latency on an iPhone 12 Pro via the ModelBench app. The repository includes setup guides for configuring environments, training, and evaluating models, along with implementation scripts, a varied collection of pre-trained models (including variants trained with knowledge distillation), and instructions for dataset preparation and model export.
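Inference with reparameterization follows this pattern from the README; the checkpoint path is a placeholder, and `fastvit_t8` is one of several available variants.

```python
import torch
import models  # from the ml-fastvit repository; registers variants with timm
from timm.models import create_model
from models.modules.mobileone import reparameterize_model

# Build an unfused model and load a downloaded checkpoint (placeholder path).
model = create_model("fastvit_t8")
checkpoint = torch.load("/path/to/fastvit_t8.pth.tar", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])

# Fuse training-time branches into plain convolutions for fast inference.
model.eval()
model_inf = reparameterize_model(model)

x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    logits = model_inf(x)
```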
## vit-pytorch
Discover the Vision Transformer (ViT) implemented in PyTorch, a powerful approach to vision classification built on a single transformer encoder. The project collects diverse variants such as Simple ViT, NaViT, and Deep ViT, optimized for efficient training and higher accuracy across datasets, along with further architectures like Token-to-Token ViT, CaiT, and CrossViT, and advanced features such as distillation and efficient attention for robust machine-learning applications.
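The core `ViT` class needs only a handful of arguments to produce a working classifier; this mirrors the README's basic usage.

```python
import torch
from vit_pytorch import ViT

model = ViT(
    image_size=256,   # input resolution
    patch_size=32,    # 256/32 = 8, so 8x8 = 64 patches
    num_classes=1000,
    dim=1024,         # token embedding width
    depth=6,          # transformer encoder layers
    heads=16,
    mlp_dim=2048,
    dropout=0.1,
    emb_dropout=0.1,
)

img = torch.randn(1, 3, 256, 256)
preds = model(img)    # (1, 1000) class logits
```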
## Awesome-Diffusion-Transformers
This compilation surveys diffusion transformers across fields such as text, speech, and video generation. It highlights groundbreaking research, including text-driven motion generation and scalable image-synthesis models, with emphasis on methodologies like transformer-based denoising, high-resolution image synthesis, and efficient training techniques. Featuring works such as MotionDiffuse and scalable diffusion models, it offers researchers and practitioners a comprehensive overview of innovations in the area, paired with accessible resources and recent research.