Introducing the mlx-llm Project
The mlx-llm project is designed to harness the power of Large Language Models (LLMs) on Apple Silicon, utilizing Apple MLX for real-time applications and tools. It provides a robust platform for engaging with advanced language models efficiently and effectively.
Installation Guide
Getting started with mlx-llm is straightforward: install it with pip using the following command:
pip install mlx-llm
Supported Models
The mlx-llm project comes pre-configured with a variety of LLMs from different families. These range from LLaMA 2 and 3 to more niche models like TinyLLaMA and Gemma. Each model family offers multiple configurations to suit different needs. For example, LLaMA 2 supports models like llama_2_7b_chat_hf and llama_2_7b_hf, while the Gemma family includes gemma_1.1_2b_it and others.
Creating a model with pre-trained weights from HuggingFace is simple. Here's a quick example:
from mlx_llm.model import create_model
# Load weights from HuggingFace
model = create_model("llama_3_8b_instruct")
Additionally, users can load alternative versions of pre-trained weights for a given model architecture, configuring them as needed.
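A minimal sketch of what that could look like, assuming create_model accepts a weights argument pointing to a compatible HuggingFace repository (the hf:// prefix and the repository name below are placeholders, not verified API details):

from mlx_llm.model import create_model

# Assumption: "weights" can point to a compatible HuggingFace repository;
# the repository name below is a hypothetical placeholder.
model = create_model(
    model_name="llama_3_8b_instruct",
    weights="hf://some-org/some-llama-3-8b-instruct-finetune",
)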
Quantization
The quantization feature in mlx-llm lets users optimize models for performance and efficiency by storing weights at reduced precision, which shrinks the model's size and lowers its computational requirements.
from mlx_llm.model import create_model, quantize, get_weights
from mlx_llm.utils.weights import save_weights
# Create the model from original weights
model = create_model("llama_3_8b_instruct")
# Quantize the model
model = quantize(model, group_size=64, bits=4)
# Retrieve and save weights
weights = get_weights(model)
save_weights(weights, "llama_3_8b_instruct-4bit.safetensors")
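Once saved, the quantized weights could be reused when the model is created again. The following is only a sketch, assuming create_model can take a local safetensors path and a flag marking the weights as quantized (both are unverified assumptions, not confirmed API):

from mlx_llm.model import create_model

# Assumptions: "weights" accepts a local .safetensors path and a
# "quantization" flag exists to signal 4-bit weights; treat both as unverified.
model = create_model(
    model_name="llama_3_8b_instruct",
    weights="llama_3_8b_instruct-4bit.safetensors",
    quantization=True,
)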
Model Embeddings
mlx-llm models can also extract embeddings, turning text into numerical representations that can be compared or searched.
import mlx.core as mx
from mlx_llm.model import create_model, create_tokenizer

# Create the model and its tokenizer
model = create_model("llama_3_8b_instruct")
tokenizer = create_tokenizer("llama_3_8b_instruct")

# Tokenize a small batch of sentences and compute their embeddings
text = ["I like to play basketball", "I like to play tennis"]
tokens = tokenizer(text)
x = mx.array(tokens["input_ids"])
embeds, _ = model.embed(x, norm=True)
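To see what the embeddings are good for, the two sentences above can be compared directly. This assumes embeds holds one L2-normalized vector per input sentence, so a dot product equals cosine similarity:

# Assumption: embeds is one normalized vector per sentence (shape [2, dim]),
# so the dot product of the two rows is their cosine similarity.
similarity = mx.sum(embeds[0] * embeds[1])
print(similarity.item())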
Applications of mlx-llm
The mlx-llm project offers a wide array of applications:
- Chat with LLM: Engage in conversations with an LLM on Apple Silicon, customizing the tone and context.
- Fine-Tuning: Further tune models with methods like LoRA or QLoRA.
- Retrieval Augmented Generation (RAG): enhance question answering by grounding answers in retrieved documents (see the sketch after this list).
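RAG itself is still listed under future work, but the embedding API shown earlier already covers the retrieval half. The sketch below is an illustration built on that API, not the project's RAG implementation; it assumes model.embed returns one normalized vector per input string:

import mlx.core as mx
from mlx_llm.model import create_model, create_tokenizer

model = create_model("llama_3_8b_instruct")
tokenizer = create_tokenizer("llama_3_8b_instruct")

# A tiny "document store" to retrieve from
docs = [
    "The Dinner Party is a famous episode of The Office.",
    "Apple Silicon chips include the M1, M2, and M3 families.",
]
query = "Which episode of The Office is famous?"

# Embed documents one at a time to avoid padding concerns
doc_embeds = []
for doc in docs:
    tokens = tokenizer([doc])
    embed, _ = model.embed(mx.array(tokens["input_ids"]), norm=True)
    doc_embeds.append(embed[0])

query_tokens = tokenizer([query])
query_embed, _ = model.embed(mx.array(query_tokens["input_ids"]), norm=True)

# With normalized vectors, the dot product is cosine similarity
scores = mx.stack([mx.sum(d * query_embed[0]) for d in doc_embeds])
best = int(mx.argmax(scores).item())
print(docs[best])  # most relevant document to pass to the LLM as context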
Chat with LLM
Starting a conversation with an LLM takes only a few lines. For instance:
from mlx_llm.chat import ChatSetup, LLMChat

chat = LLMChat(
    model_name="tiny_llama_1.1B_chat_v1.0",
    prompt_family="tinyllama",
    chat_setup=ChatSetup(
        system="You are Michael Scott from The Office. Your goal is to answer like him, so be funny and inappropriate, but be brief.",
        history=[
            {"question": "What is your name?", "answer": "Michael Scott"},
            {"question": "What is your favorite episode of The Office?", "answer": "The Dinner Party"},
        ],
    ),
    quantized=False,
)
chat.start()
Please note that certain features, such as OpenELM chat mode, are still under development.
Future Work
Current plans for mlx-llm include implementing and enhancing features like LoRA, QLoRA, and RAG.
Contact Information
For inquiries or more information, you can reach out to [email protected]. The mlx-llm project continues to evolve and aims to serve as a powerful tool for developers and researchers working with language models on Apple Silicon.