#transformer

VisionLLaMA
VisionLLaMA is a unified vision transformer modeled on LLaMA and designed for image tasks spanning perception and generation. Validated with typical pre-training methods, it is reported to outperform previous state-of-the-art vision transformers in many cases, which supports its efficacy and adaptability in processing 2D images. It is intended to serve as a strong baseline for vision tasks.
spreadsheets-are-all-you-need
Learn the building blocks of GPT-2, the precursor to ChatGPT, through standard Excel functions. The project implements a working transformer model in a spreadsheet, so users can explore it with little coding knowledge. Download the Excel binary file from the repository; it runs on both Mac and PC and does not depend on VBA or macros. For added safety, use Excel's Trust Center to manage macro settings. It suits anyone interested in AI's inner workings in a familiar spreadsheet setting.
open-muse
The project is an open-source reproduction of the MUSE model for efficient text-to-image synthesis using transformers and VQGAN. Work proceeds in stages, including class-conditional modeling and training on large datasets. It relies on masking strategies during training, is built with tools such as PyTorch and WebDataset, and shares its work on Hugging Face.
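As a rough illustration of what a masking strategy for image tokens can look like, the sketch below samples a mask ratio from a cosine schedule and hides that fraction of a token sequence. It is a minimal, dependency-free sketch and not the project's code: the cosine schedule is assumed here as one common choice for masked image-token modeling, and `MASK_ID`, `cosine_mask_ratio`, and `mask_tokens` are hypothetical names.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for a [MASK] token id


def cosine_mask_ratio(u):
    """Map u in [0, 1) to a mask ratio in (0, 1] with a cosine schedule (assumed)."""
    return math.cos(0.5 * math.pi * u)


def mask_tokens(tokens, rng):
    """Replace a randomly sized, randomly placed subset of tokens with MASK_ID."""
    ratio = cosine_mask_ratio(rng.random())
    # Always mask at least one position so there is something to predict.
    num_to_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


rng = random.Random(0)
tokens = list(range(100, 116))  # toy stand-in for VQGAN codebook indices
masked, positions = mask_tokens(tokens, rng)
print(len(positions), "of", len(tokens), "tokens masked")
```

In a training loop, a transformer would be asked to predict the original token at each masked position; here the function only returns the masked sequence and the positions that were hidden.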
Gemini
This article explores an open-source implementation of Gemini, a multimodal transformer that processes text, images, audio, and video. Its architecture feeds image embeddings directly into the transformer, bypassing a separate visual encoder for efficiency, similar to Fuyu's architecture but with a broader scope. The model initially focuses on image embeddings, with audio and video support planned. Techniques such as flash attention and QK norm are used to improve performance for potential production use. For implementation discussions, consider joining the Agora Discord community.
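To make the idea of feeding image embeddings straight into the transformer more concrete, the sketch below splits an image into patches, flattens each patch, and maps it to the model width with a single linear projection instead of a separate visual encoder. It is a pure-Python illustration under stated assumptions, not the repository's implementation: `patchify` and `linear_project` are hypothetical helpers, and the patch size, model width, and random weights are made up for the toy example.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, row-major patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append all channels of this pixel
            patches.append(flat)
    return patches


def linear_project(patches, weight):
    """Project each flattened patch with a (patch_dim x d_model) weight matrix."""
    columns = list(zip(*weight))  # one tuple of weights per output dimension
    return [
        [sum(x * w for x, w in zip(vec, column)) for column in columns]
        for vec in patches
    ]


rng = random.Random(0)
# Toy 8 x 8 RGB image, 4 x 4 patches, model width 16 (all assumed values).
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patches = patchify(image, patch=4)
weight = [[rng.gauss(0.0, 0.02) for _ in range(16)] for _ in range(4 * 4 * 3)]
image_tokens = linear_project(patches, weight)
print(len(image_tokens), "image tokens of width", len(image_tokens[0]))
```

The resulting vectors can be placed in the same sequence as text token embeddings, which is the sense in which the visual encoder is bypassed.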
SpecVQGAN
This project presents a method for generating sound guided by visual inputs through a spectrogram-based codebook. A Spectrogram VQGAN model provides the codebook, and a transformer is trained on visual features to sample from it, producing coherent, high-quality audio across a range of data classes. The approach supports long, high-quality sound sequences, which makes it useful for multimedia and audio synthesis applications. The project includes instructions for environment setup and data management using open-source tools such as Conda and Docker. It also provides pretrained models and transformers for sampling and evaluation, helping users generate sound from visual input.
nlp_paper_study
Study key NLP concepts through close reading of papers from leading conferences and through code replication. The project offers an organized workflow from paper discovery to manuscript drafting, helping scholars build their research skills. It includes tools for paper translation and analysis, with the aim of bridging language gaps and promoting collaborative learning. Topics include Transformers, pre-trained models, and knowledge graphs, among others. Dedicated study groups are available for readers who want to learn together.