seemore Project Introduction
The seemore project builds a vision language model (VLM) from scratch using PyTorch. It offers insight into how complex machine learning models are constructed and is developed with love using Databricks.
Overview
The seemore project is a simplified version of sophisticated models such as Grok 1.5 and GPT-4 Vision. It is consolidated into a single PyTorch file, seeMoE.py, while the accompanying Jupyter notebook, seeMoE_from_scratch.ipynb, offers a step-by-step guide to the architecture.
Main Components
- Image Encoder: The image encoder extracts visual features from images. It uses a vision transformer similar to the one in CLIP, a popular choice among modern VLMs. Some models, like the Fuyu series from Adept, take a different approach and send patchified images directly to the projection layer.
- Vision-Language Projector: This module adjusts the dimensionality of image features to align them with the text embeddings used by the decoder. A Multi-Layer Perceptron (MLP) turns image features into 'visual tokens' for the language decoder.
- Decoder-Only Language Model: The project incorporates a decoder-only language model, which generates the textual output. Unlike other implementations, this one includes the projection module within the decoder itself, a part that is usually left untouched when working with pretrained models.
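The first two components above can be sketched in a few lines of PyTorch. This is a minimal illustration with hypothetical dimensions (patch size, embedding widths, and class names are my own, not the project's actual configuration): a ViT-style patch embedding for the image encoder's front end, followed by an MLP projector that maps patch features into the decoder's embedding space as 'visual tokens'.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patchify + linear embed (hypothetical sizes)."""
    def __init__(self, patch_size=8, in_ch=3, dim=64):
        super().__init__()
        # A strided conv is equivalent to splitting into patches
        # and applying a shared linear projection to each one.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)

class VisionLanguageProjector(nn.Module):
    """MLP mapping image features into the decoder's embedding space."""
    def __init__(self, vis_dim=64, txt_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),
        )

    def forward(self, x):
        return self.mlp(x)                      # (B, num_patches, txt_dim)

imgs = torch.randn(2, 3, 32, 32)                # toy batch of 32x32 images
patches = PatchEmbed()(imgs)                    # (2, 16, 64): 4x4 grid of patches
visual_tokens = VisionLanguageProjector()(patches)
print(visual_tokens.shape)                      # torch.Size([2, 16, 128])
```

The projected visual tokens now share the decoder's embedding dimension, so they can be concatenated with text token embeddings before being fed to the language model.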
Technical Insights
- The project borrows its scaled dot-product self-attention implementation from Andrej Karpathy's makemore, and the decoder functions as an autoregressive character-level language model.
- Everything in this project, from attention mechanisms to patch creation, is built from the ground up using PyTorch, ensuring a clear understanding of each component.
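To make the from-scratch flavor concrete, here is a hedged sketch of scaled dot-product attention with a causal mask, built only from PyTorch primitives. The function name and toy dimensions are mine for illustration; the project's actual implementation (following makemore) may differ in detail.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, from primitives."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (B, T, T)
    if mask is not None:
        # Block attention to masked-out (future) positions.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)           # rows sum to 1
    return weights @ v                                # (B, T, d)

B, T, D = 2, 5, 16
q = k = v = torch.randn(B, T, D)
# Causal (lower-triangular) mask: each position attends only to
# itself and the past, as in an autoregressive character-level decoder.
causal = torch.tril(torch.ones(T, T))
out = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape)                                      # torch.Size([2, 5, 16])
```

With the causal mask in place, position 0 can only attend to itself, so its output equals its own value vector, which is a handy sanity check when building attention by hand.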
Resources and Recommendations
Publications heavily referenced in this project include:
- Large Multimodal Models: Notes on CVPR 2023 Tutorial
- Visual Instruction Tuning
- Language Is Not All You Need: Aligning Perception with Language Models
To delve deeper into the model’s architecture, the seemore_from_Scratch.ipynb notebook walks through the logical flow and implementation. For experimentation and customization, seemore_Concise.ipynb offers a hackable version of the model.
Infrastructure and Tools
The project was developed on Databricks using an A100 GPU, with the option to scale to larger GPU clusters on any cloud provider. Databricks features MLflow, a helpful tool for tracking and logging metrics; it is fully open source and can easily be installed elsewhere.
Final Thoughts
The seemore project prioritizes readability and hackability over performance, inviting users to explore, improve, and make it their own. This approach makes it particularly beneficial for those looking to gain hands-on experience with building vision language models from scratch.
Enjoy the journey into the world of vision language models with seemore, and happy hacking!