Introducing the mlx-llm Project
The mlx-llm project is designed to harness the power of Large Language Models (LLMs) on Apple Silicon, utilizing Apple MLX for real-time applications and tools. It provides a robust platform for engaging with advanced language models efficiently and effectively.
Installation Guide
Getting started with mlx-llm is straightforward: install it with pip using the following command:
pip install mlx-llm
Supported Models
The mlx-llm project comes pre-configured with a variety of LLMs from different families. These range from LLaMA 2 and 3 to more niche models like TinyLLaMA and Gemma. Each model family offers multiple configurations to suit different needs. For example, LLaMA 2 supports models like llama_2_7b_chat_hf and llama_2_7b_hf, while the Gemma family includes gemma_1.1_2b_it and others.
Creating a model with pre-trained weights from HuggingFace is simple. Here's a quick example:
from mlx_llm.model import create_model
# Load weights from HuggingFace
model = create_model("llama_3_8b_instruct")
Additionally, users can load alternative versions of pre-trained weights for a given model architecture, configuring them as needed.
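A minimal sketch of what that could look like, assuming create_model accepts a weights argument pointing to a compatible HuggingFace repository (the hf:// prefix and the repository name below are placeholders, not verified API details):

from mlx_llm.model import create_model

# Assumption: "weights" can point to a compatible HuggingFace repository;
# the repository name below is a hypothetical placeholder.
model = create_model(
    model_name="llama_3_8b_instruct",
    weights="hf://some-org/some-llama-3-8b-instruct-finetune",
)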
Quantization
The quantization feature in mlx-llm lets users optimize models for performance and efficiency by storing weights at reduced precision, which shrinks the model's size and lowers its computational requirements.
from mlx_llm.model import create_model, quantize, get_weights
from mlx_llm.utils.weights import save_weights
# Create the model from original weights
model = create_model("llama_3_8b_instruct")
# Quantize the model
model = quantize(model, group_size=64, bits=4)
# Retrieve and save weights
weights = get_weights(model)
save_weights(weights, "llama_3_8b_instruct-4bit.safetensors")
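Once saved, the quantized weights could be reused when the model is created again. The following is only a sketch, assuming create_model can take a local safetensors path and a flag marking the weights as quantized (both are unverified assumptions, not confirmed API):

from mlx_llm.model import create_model

# Assumptions: "weights" accepts a local .safetensors path and a
# "quantization" flag exists to signal 4-bit weights; treat both as unverified.
model = create_model(
    model_name="llama_3_8b_instruct",
    weights="llama_3_8b_instruct-4bit.safetensors",
    quantization=True,
)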
Model Embeddings
mlx-llm models can also extract embeddings, turning text into numerical representations that can be compared or searched.
import mlx.core as mx
from mlx_llm.model import create_model, create_tokenizer

# Create the model and its tokenizer
model = create_model("llama_3_8b_instruct")
tokenizer = create_tokenizer("llama_3_8b_instruct")

# Tokenize a small batch of sentences and compute their embeddings
text = ["I like to play basketball", "I like to play tennis"]
tokens = tokenizer(text)
x = mx.array(tokens["input_ids"])
embeds, _ = model.embed(x, norm=True)
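To see what the embeddings are good for, the two sentences above can be compared directly. This assumes embeds holds one L2-normalized vector per input sentence, so a dot product equals cosine similarity:

# Assumption: embeds is one normalized vector per sentence (shape [2, dim]),
# so the dot product of the two rows is their cosine similarity.
similarity = mx.sum(embeds[0] * embeds[1])
print(similarity.item())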
Applications of mlx-llm
The mlx-llm project offers a wide array of applications:
- Chat with LLM: Engage in conversations with an LLM on Apple Silicon, customizing the tone and context.
- Fine-Tuning: Further tune models with methods like LoRA or QLoRA.
- Retrieval Augmented Generation (RAG): enhance question answering by grounding answers in retrieved documents (see the sketch after this list).
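RAG itself is still listed under future work, but the embedding API shown earlier already covers the retrieval half. The sketch below is an illustration built on that API, not the project's RAG implementation; it assumes model.embed returns one normalized vector per input string:

import mlx.core as mx
from mlx_llm.model import create_model, create_tokenizer

model = create_model("llama_3_8b_instruct")
tokenizer = create_tokenizer("llama_3_8b_instruct")

# A tiny "document store" to retrieve from
docs = [
    "The Dinner Party is a famous episode of The Office.",
    "Apple Silicon chips include the M1, M2, and M3 families.",
]
query = "Which episode of The Office is famous?"

# Embed documents one at a time to avoid padding concerns
doc_embeds = []
for doc in docs:
    tokens = tokenizer([doc])
    embed, _ = model.embed(mx.array(tokens["input_ids"]), norm=True)
    doc_embeds.append(embed[0])

query_tokens = tokenizer([query])
query_embed, _ = model.embed(mx.array(query_tokens["input_ids"]), norm=True)

# With normalized vectors, the dot product is cosine similarity
scores = mx.stack([mx.sum(d * query_embed[0]) for d in doc_embeds])
best = int(mx.argmax(scores).item())
print(docs[best])  # most relevant document to pass to the LLM as context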
Chat with LLM
Starting a conversation with an LLM takes only a few lines. For instance:
from mlx_llm.chat import ChatSetup, LLMChat

chat = LLMChat(
    model_name="tiny_llama_1.1B_chat_v1.0",
    prompt_family="tinyllama",
    chat_setup=ChatSetup(
        system="You are Michael Scott from The Office. Your goal is to answer like him, so be funny and inappropriate, but be brief.",
        history=[
            {"question": "What is your name?", "answer": "Michael Scott"},
            {"question": "What is your favorite episode of The Office?", "answer": "The Dinner Party"},
        ],
    ),
    quantized=False,
)
chat.start()
Please note that certain features, such as OpenELM chat mode, are still under development.
Future Work
Current plans for mlx-llm include implementing and enhancing features like LoRA, QLoRA, and RAG.
Contact Information
For inquiries or more information, you can reach out to [email protected]. The mlx-llm project continues to evolve and aims to serve as a powerful tool for developers and researchers working with language models on Apple Silicon.