uform - Compact and Versatile Multimodal AI for Efficient Content Analysis and Creation

Introducing UForm: A Pocket-Sized Multimodal AI

UForm is a multimodal AI library for both understanding and generating content. It handles short texts and images, and potentially video clips and long documents. This overview covers its features, models, and technical details.

Features of UForm

UForm's main features center on small, fast, and portable models:

Tiny Embeddings: UForm produces Matryoshka-style embeddings as small as 64 dimensions, which keeps similarity search fast.
High Throughput: Thanks to the models' compact size, inference is reported to be two to four times faster than competing models.
Portability: With native ONNX support, the models are portable and can be deployed across a range of platforms.
Quantization Aware: Embeddings can be down-cast from f32 to i8 with little loss in recall.
Multilingual Capability: UForm supports over 20 languages, with high recall across them.
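To make the quantization point concrete, here is a minimal sketch of down-casting an embedding from f32 to i8 and checking how much the cosine similarity changes. It assumes unit-normalized embeddings, so every component lies in [-1, 1] and a fixed scale of 127 is enough. The helper names and the scaling scheme are illustrative assumptions, not UForm's API, and the vectors are random stand-ins for real embeddings.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length so every component lies in [-1, 1]."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def quantize_i8(vec):
    """Map components in [-1, 1] to integers in [-127, 127] (assumed scheme)."""
    return [max(-127, min(127, round(x * 127))) for x in vec]

def cosine(a, b):
    """Cosine similarity; works for float and integer vectors alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
# Two random 64-dimensional vectors standing in for real embeddings.
a = normalize([rng.gauss(0, 1) for _ in range(64)])
b = normalize([rng.gauss(0, 1) for _ in range(64)])

exact = cosine(a, b)
approx = cosine(quantize_i8(a), quantize_i8(b))
print(f"f32 cosine: {exact:.4f}, i8 cosine: {approx:.4f}")
```

Stored as i8, a 64-dimensional embedding takes 64 bytes instead of the 256 bytes needed for f32, a fourfold reduction.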

UForm Models

UForm provides both embedding and generative models focused on various applications:

Embedding Models

These models are designed for different use-cases involving language and image processing. They vary in size and complexity to cater to specific needs:

uform3-image-text-english-large: With 365 million parameters, it utilizes a 12-layer BERT and ViT-L/14 architecture.
uform3-image-text-english-base: A more compact model with 143 million parameters and a 4-layer BERT and ViT-B/16 architecture.
uform3-image-text-multilingual-base: Supporting 21 languages with 206 million parameters, it combines a 12-layer BERT with ViT-B/16.
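The parameter counts above translate roughly into weight sizes. The following back-of-envelope sketch assumes the weights dominate the footprint and ignores file-format and runtime overhead, so the numbers are estimates rather than measured sizes. The helper name `weights_megabytes` is hypothetical.

```python
# Parameter counts as listed above, in millions.
PARAMS_MILLIONS = {
    "uform3-image-text-english-large": 365,
    "uform3-image-text-english-base": 143,
    "uform3-image-text-multilingual-base": 206,
}

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"f32": 4, "f16": 2}

def weights_megabytes(params_millions, dtype):
    """Approximate weight size in MB (1 MB = 10**6 bytes), ignoring overhead."""
    return params_millions * BYTES_PER_PARAM[dtype]

for name, millions in PARAMS_MILLIONS.items():
    sizes = ", ".join(
        f"{dtype}: {weights_megabytes(millions, dtype)} MB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```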

Generative Models

These are designed for tasks like chat, image captioning, and visual question answering (VQA):

uform-gen2-dpo and uform-gen2-qwen-500m: Each has 1.2 billion parameters and pairs the qwen1.5-0.5B language model with a ViT-H/14 image encoder.
uform-gen: With 1.5 billion parameters, this model specializes in image captioning and VQA, pairing the llama-1.3B language model with a ViT-B/16 image encoder.

How to Get Started

To use the UForm embedding models, install UForm via pip and load the desired model. Images and text queries can then be embedded using the provided scripts, and the resulting embeddings can be incorporated into larger projects written in Python, JavaScript, or Swift.
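Once image and query embeddings are available, incorporating them into a larger project usually comes down to a similarity search. The sketch below ranks a few image embeddings against a query embedding by cosine similarity. The vectors are tiny hand-written placeholders rather than real model output, and the file names and the `rank_by_cosine` helper are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_cosine(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder embeddings; a real model would produce these vectors.
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0, 0.1],
    "city.jpg": [0.1, 0.9, 0.2, 0.0],
    "forest.jpg": [0.0, 0.2, 0.9, 0.1],
}
query_embedding = [0.8, 0.2, 0.1, 0.0]  # stands in for an embedded text query

ranking = rank_by_cosine(query_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For large collections, a dedicated vector index would replace the linear scan, but the scoring idea stays the same.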

Technical Details

UForm uses down-casting and quantization to stay efficient even on older hardware. Matryoshka embeddings allow hierarchical search using smaller parts of a large embedding: a short prefix of the vector can serve for a fast first pass, and the full vector can then refine the shortlist, which makes them efficient for retrieval tasks.
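A hierarchical search of this kind can be sketched as a two-stage lookup: a coarse pass over the first 64 dimensions builds a shortlist, and the full vectors re-rank it. This is an illustrative sketch, not UForm's implementation. The function name, the 256-dimensional size, and the shortlist size are assumptions, and the random vectors only stand in for Matryoshka embeddings, which are trained so that their prefixes remain meaningful.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query, corpus, prefix_dims=64, shortlist_size=10):
    """Coarse pass on the first `prefix_dims` dimensions, then full re-ranking."""
    q_prefix = query[:prefix_dims]
    # Stage 1: cheap scoring on truncated vectors, keep the best candidates.
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_prefix, corpus[i][:prefix_dims]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: exact scoring on full vectors, but only for the shortlist.
    return sorted(shortlist, key=lambda i: cosine(query, corpus[i]), reverse=True)

rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(256)] for _ in range(200)]
# The query is a slightly perturbed copy of item 42, so item 42 should win.
query = [x + rng.gauss(0, 0.05) for x in corpus[42]]

results = two_stage_search(query, corpus)
print("best match:", results[0])
```

The first stage touches only a quarter of each vector, so most of the work happens on small prefixes, while the second stage restores full precision for the few remaining candidates.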

Compact Deployment

Deploying UForm through ONNX simplifies setup and significantly reduces memory usage. ONNX also supports multiple execution providers to match different hardware, which makes UForm well suited to Edge and IoT applications.
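Choosing an execution provider typically means preferring an accelerator when one is present and falling back to the CPU otherwise. The helper below is a hypothetical sketch of that selection logic. It is written as a pure function so it runs without ONNX Runtime installed, and the preference order and the example list of available providers are assumptions.

```python
def pick_providers(available, preferred):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("none of the preferred execution providers is available")
    return chosen

# Assumed preference order: GPU first, then Apple CoreML, then plain CPU.
PREFERRED = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider"]

# Example: a machine that only reports the CPU provider (invented for illustration).
available = ["CPUExecutionProvider"]
providers = pick_providers(available, PREFERRED)
print(providers)
```

With ONNX Runtime installed, `available` would come from `onnxruntime.get_available_providers()`, and the resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`.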

Chat Functionality

Users can hold chat-like conversations from the command line using UForm's generative models, which shows how the library can be used interactively.

UForm stands as a highly efficient and adaptable AI tool, tailored for handling visual and textual data across various platforms and languages with remarkable speed and accuracy.