Introduction to TransformerHub
TransformerHub is a project dedicated to implementing various forms of transformer models, which have become a vital part of many deep learning applications. The repository aims to provide a space for learning advanced programming skills and serves as a valuable reference for those passionate about deep learning and machine intelligence.
The project draws inspiration from and would not have been possible without the contributions from open-source repositories such as NanoGPT, ViT, MAE, CLIP, and OpenCLIP. These foundational models have paved the way for ongoing development within TransformerHub.
Key Features
Transformer Architectures
TransformerHub covers a range of transformer architectures, each tailored for different functionalities:
- Encoder-only Models: Encode full input sequences into contextual representations for understanding tasks such as classification, rather than autoregressive generation.
- Decoder-only Models: Primarily used for sequence generation based on input prompts.
- Encoder-Decoder Models: Handle tasks that require understanding input before producing output, like translation.
- Unified Models: Currently under development, these models aim to blend functionalities of the above architectures.
Attention Modules
Attention mechanisms are central to transformer functionality. TransformerHub includes:
- Unmasked Attention: Used in architectures like BERT for understanding entire sequences.
- Causal Masked Attention: Employed in models like GPT to ensure predictions are based only on previous inputs.
- Prefix Causal Attention: Utilized by models such as T5 for specific task adjustments.
- Sliding-Window Attention: Featured in Mistral; each token attends only to a fixed-size window of recent tokens, reducing the cost of processing long sequences.
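To make the masking idea concrete, here is a minimal NumPy sketch of causal masked attention, the variant used in GPT-style models. It is an illustrative single-head implementation, not the repository's actual code: positions above the diagonal are set to negative infinity before the softmax, so each position can only attend to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask:
    position i may only attend to positions <= i."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)               # (T, T) similarity scores
    mask = np.triu(np.ones_like(scores), k=1)   # 1s strictly above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights

T, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out, w = causal_attention(q, k, v)
# w is lower-triangular: no position attends to the future
```

Unmasked attention is the same computation with the masking step removed; a sliding-window mask would additionally zero out positions more than a fixed distance in the past.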
Position Embedding
Because self-attention is order-agnostic, transformer models rely on position embeddings to encode sequence order:
- Fixed (Sinusoidal) Position Embedding: The original method from the Transformer paper, computed with sine and cosine functions of the position.
- Learnable Position Embedding: A flexible version implemented in models including BERT.
- Rotary Position Embedding: Introduced in RoFormer; encodes relative position by rotating query and key vectors, enhancing context comprehension.
- Extrapolable Position Embedding: Allows models to adapt to varying sequence lengths, like in the Length-Extrapolatable Transformer.
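The fixed variant is easy to state exactly, since it comes from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, with geometrically spaced frequencies. A small NumPy sketch (illustrative, not the repository's code):

```python
import numpy as np

def sinusoidal_embedding(max_len, d_model):
    """Fixed sinusoidal position embeddings from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dims: sine
    pe[:, 1::2] = np.cos(angles)                # odd dims: cosine
    return pe

pe = sinusoidal_embedding(32, 16)
```

Learnable embeddings simply replace this fixed table with a trained parameter matrix of the same shape; rotary and extrapolable variants instead inject position information inside the attention computation.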
Sampling Techniques
For generating outputs, different sampling methods ensure variety and relevance in responses:
- Temperature-based Sampler: Adjusts the randomness of predictions.
- Top-k Sampler: Limits choices to the top k probable outputs.
- Nucleus (top-p) Sampler: Samples from the smallest set of tokens whose cumulative probability exceeds p, balancing quality and diversity.
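The three samplers compose naturally: temperature rescales the logits, then top-k or top-p restricts the candidate set before drawing. The following is a minimal NumPy sketch of that pipeline under assumed semantics (the function name and signature are illustrative, not the repository's API):

```python
import numpy as np

def sample_logits(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Apply temperature, optional top-k / nucleus filtering, then sample an index."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature  # higher T = more random
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # Keep only the k most probable tokens
        keep = np.argsort(probs)[-top_k:]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask / (probs * mask).sum()
    if top_p is not None:
        # Keep the smallest set whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs = probs * mask / (probs * mask).sum()
    return rng.choice(len(probs), p=probs)

# top_k=1 reduces to greedy decoding: always picks the argmax
idx = sample_logits([0.0, 5.0, 1.0], top_k=1)
```

With `top_k=1` this degenerates to greedy decoding, which is a handy sanity check when debugging a sampler.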
Current Progress
Currently, TransformerHub is focused on implementing DINO, a self-supervised training method for ViT models. Here is the status of the various models:
| Model | Implemented | Trained | Evaluated |
| --- | --- | --- | --- |
| Transformer | ✅ | No | No |
| GPT | ✅ | No | No |
| BERT | ✅ | Yes | No |
| ViT | ✅ | No | No |
| MAE | No | No | No |
| CLIP | No | No | No |
Important Consideration
Given the appeal and adaptability of transformer models, TransformerHub is intended for educational use and as a reference for anyone implementing parts of a transformer architecture. However, users are reminded not to copy the repository's contents directly, as doing so violates academic integrity policies in most institutions. For a deeper understanding of transformer models, The Annotated Transformer blog by Harvard NLP is recommended.
In summary, TransformerHub provides a comprehensive playground for exploring the intricacies of transformer models, fostering both learning and innovation in machine learning.