MLX-VLM: A Vision Language Model Toolkit for Mac
MLX-VLM is a toolkit for running inference and fine-tuning Vision Language Models (VLMs) on Mac computers. Built on Apple's MLX framework, the package offers a straightforward setup and a practical set of features for visual language tasks.
Installation
Getting started with MLX-VLM is straightforward. Install the package with pip, Python's package manager:
pip install mlx-vlm
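If you want the latest development version rather than the PyPI release, pip can also install directly from the project's Git repository (assuming the usual Blaizzy/mlx-vlm location on GitHub):
pip install git+https://github.com/Blaizzy/mlx-vlm.git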
Usage
MLX-VLM offers multiple ways to interact with Vision Language Models:
Command Line Interface (CLI)
MLX-VLM can generate model output directly from the command line. The following example loads a 4-bit quantized Qwen2-VL model and runs it on an image fetched from a URL:
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temp 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
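To pass your own instruction text, add the --prompt flag (the same flag used in the multi-image example later in this section):
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temp 0.0 --prompt "Describe this image." --image http://images.cocodataset.org/val2017/000000039769.jpg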
Chat UI with Gradio
For a more interactive experience, MLX-VLM includes a Gradio chat UI that wraps the same models in a conversational interface:
python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
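As with most Gradio apps, this command starts a local web server and prints a URL (typically http://127.0.0.1:7860) that you open in a browser to chat with the model.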
Python Script
Developers can also use MLX-VLM directly from Python to load models, prepare inputs, and generate outputs. Here's an illustrative example:
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
# Generate output
output = generate(model, processor, image, formatted_prompt, verbose=False)
print(output)
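Note that the image list accepts local file paths as well as URLs; the multi-image example in the next section uses local paths.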
Multi-Image Chat Support
MLX-VLM also supports multiple images within the same conversation, which enables more involved visual reasoning tasks such as comparing or relating images.
Supported Models
Multi-image chat is supported by several models, including Idefics 2, LLaVA (Interleave), Qwen2-VL, Phi3-Vision, and Pixtral.
Usage Example: Python Script
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, images, formatted_prompt, verbose=False)
print(output)
Usage Example: Command Line
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
Fine-tuning
For users looking to adapt models to specialized tasks, MLX-VLM supports parameter-efficient fine-tuning with LoRA and QLoRA.
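The project ships a LoRA training entry point (commonly invoked as python -m mlx_vlm.lora). The invocation below is only a sketch: the dataset path is a placeholder and the flag names are assumptions for illustration, so check the project's fine-tuning documentation for the exact options:
python -m mlx_vlm.lora --model-path mlx-community/Qwen2-VL-2B-Instruct-4bit --dataset path/to/your/dataset --epochs 1  # illustrative flags; verify against the docs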
In summary, MLX-VLM offers a versatile set of tools for working with Vision Language Models on a Mac: a command-line generator, a conversational Gradio UI, a Python API, and LoRA/QLoRA fine-tuning. Whichever interface you choose, the package makes it practical to experiment with visual language processing locally.