MLX-VLM: A Vision Language Model Toolkit for Mac
MLX-VLM is a toolkit for running inference and fine-tuning Vision Language Models (VLMs) on Mac computers. Built on Apple's MLX framework, the package offers a straightforward setup and a practical set of features for visual language tasks.
Installation
Getting started with MLX-VLM is straightforward. Install the package with pip, Python's package manager:
pip install mlx-vlm
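If you want the latest development version rather than the PyPI release, pip can also install directly from the project's Git repository (assuming the usual Blaizzy/mlx-vlm location on GitHub):
pip install git+https://github.com/Blaizzy/mlx-vlm.git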
Usage
MLX-VLM offers multiple ways to interact with Vision Language Models:
Command Line Interface (CLI)
MLX-VLM can generate model output directly from the command line. The following example loads a 4-bit quantized Qwen2-VL model and runs it on an image fetched from a URL:
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temp 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
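To pass your own instruction text, add the --prompt flag (the same flag used in the multi-image example later in this section):
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temp 0.0 --prompt "Describe this image." --image http://images.cocodataset.org/val2017/000000039769.jpg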
Chat UI with Gradio
For a more interactive experience, MLX-VLM includes a Gradio chat UI that wraps the same models in a conversational interface:
python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
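As with most Gradio apps, this command starts a local web server and prints a URL (typically http://127.0.0.1:7860) that you open in a browser to chat with the model.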
Python Script
Developers can also use MLX-VLM directly from Python to load models, prepare inputs, and generate outputs. Here's an illustrative example:
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
# Generate output
output = generate(model, processor, image, formatted_prompt, verbose=False)
print(output)
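Note that the image list accepts local file paths as well as URLs; the multi-image example in the next section uses local paths.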
Multi-Image Chat Support
MLX-VLM also supports multiple images within the same conversation, which enables more involved visual reasoning tasks such as comparing or relating images.
Supported Models
Multi-image chat is supported by several models, including Idefics 2, LLaVA (Interleave), Qwen2-VL, Phi3-Vision, and Pixtral.
Usage Example: Python Script
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, images, formatted_prompt, verbose=False)
print(output)
Usage Example: Command Line
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
Fine-tuning
For users looking to adapt models to specialized tasks, MLX-VLM supports parameter-efficient fine-tuning with LoRA and QLoRA.
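The project ships a LoRA training entry point (commonly invoked as python -m mlx_vlm.lora). The invocation below is only a sketch: the dataset path is a placeholder and the flag names are assumptions for illustration, so check the project's fine-tuning documentation for the exact options:
python -m mlx_vlm.lora --model-path mlx-community/Qwen2-VL-2B-Instruct-4bit --dataset path/to/your/dataset --epochs 1  # illustrative flags; verify against the docs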
In summary, MLX-VLM offers a versatile set of tools for working with Vision Language Models on a Mac: a command-line generator, a conversational Gradio UI, a Python API, and LoRA/QLoRA fine-tuning. Whichever interface you choose, the package makes it practical to experiment with visual language processing locally.