llama_cpp-rs: A Comprehensive Guide
Introduction
llama_cpp-rs is a high-level, safe Rust interface to the C++ project llama.cpp, hosted on GitHub. Its goal is to let users, including those with no prior machine learning experience, run large language models directly on their CPUs, with a working program requiring as little as fifteen lines of code.
Designed for ease of use, the library provides safe bindings that condense otherwise complex operations into manageable steps. Its focus on GGUF-based models keeps model loading and inference accessible and efficient for anyone exploring language model capabilities.
Core Functionality
Model Creation
Creating a model with llama_cpp-rs is straightforward: a model can be loaded from anything that implements the AsRef<Path> trait, such as a file path. Here's a simple example of loading a model from a file:
let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()).expect("Could not load model");
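If panicking on failure is undesirable, the same call can handle the error explicitly. A minimal sketch, assuming the import path matches the llama_cpp crate's published examples (verify it against your version):

use llama_cpp::{LlamaModel, LlamaParams};

// Load the model, reporting a readable error instead of panicking.
let model = match LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default()) {
    Ok(model) => model,
    Err(e) => {
        eprintln!("Could not load model: {e:?}");
        return;
    }
};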
Session Management
Once a model is loaded, many sessions can be created from it: the model itself holds the bulk of the data (typically several gigabytes of weights), while each session consumes far less (a few dozen to a hundred megabytes). Establishing a session is intuitive with the following code:
let mut ctx = model.create_session(SessionParams::default()).expect("Failed to create session");
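Because sessions are comparatively cheap, one loaded model can back several independent contexts at once. A brief sketch (the variable names are illustrative):

// Each session tracks its own context, while the weights live in the shared model.
let mut story_ctx = model
    .create_session(SessionParams::default())
    .expect("Failed to create session");
let mut chat_ctx = model
    .create_session(SessionParams::default())
    .expect("Failed to create session");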
Advancing Context and Generating Tokens
A session's context is advanced by feeding it text, which the model uses to predict and generate the next segments of the sequence:
ctx.advance_context("This is the story of a man named Stanley.").unwrap();
To generate tokens, llama_cpp-rs creates a worker thread that drives token generation in the background, so even long completions can be streamed as they are produced:
let completions = ctx.start_completing_with(StandardSampler::default(), 1024).into_strings();
for completion in completions {
    print!("{completion}");
    let _ = io::stdout().flush();
}
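Putting the pieces together, here is a hedged end-to-end sketch of the flow described above (the use paths follow the llama_cpp crate's published examples and may need adjusting for your version):

use std::io::{self, Write};

use llama_cpp::standard_sampler::StandardSampler;
use llama_cpp::{LlamaModel, LlamaParams, SessionParams};

fn main() {
    // Load a GGUF model from disk; this is the expensive step.
    let model = LlamaModel::load_from_file("path_to_model.gguf", LlamaParams::default())
        .expect("Could not load model");

    // Create a session whose context the model will complete.
    let mut ctx = model
        .create_session(SessionParams::default())
        .expect("Failed to create session");

    // Feed the prompt into the context.
    ctx.advance_context("This is the story of a man named Stanley.").unwrap();

    // Stream up to 1024 generated tokens to stdout as they arrive.
    let completions = ctx
        .start_completing_with(StandardSampler::default(), 1024)
        .into_strings();
    for completion in completions {
        print!("{completion}");
        let _ = io::stdout().flush();
    }
}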
Building and Optimization
llama_cpp-rs is performance-sensitive: inference is computationally intensive, and standard debug builds lack the necessary optimizations. Building with Cargo's --release flag is therefore strongly recommended to get the full potential of the library.
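For example, building and running with optimizations enabled:

cargo run --release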
Supported Features
The project supports several backends through Cargo features (see the Cargo.toml sketch after this list):
- CUDA: Utilizes the CUDA backend; requires CUDA Toolkit.
- Vulkan: Integrates the Vulkan backend; requires Vulkan SDK.
- Metal: Available exclusively for macOS.
- hipBLAS: Employs the hipBLAS/ROCm backend; requires ROCm compatibility.
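A backend is enabled by turning on the corresponding Cargo feature. A hedged Cargo.toml sketch, assuming the feature flag is simply the lowercase backend name (check the crate's documentation for the exact feature names and current version):

[dependencies]
# Replace the version requirement with the latest published release of llama_cpp.
llama_cpp = { version = "*", features = ["cuda"] }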
Experimental Features
llama_cpp-rs also attempts to predict how much memory a session's context will occupy. This capability is experimental and may report inaccurate measurements, since llama.cpp itself does not offer such predictions. Nonetheless, a best effort is made to ensure that the reported sizes are never less than the actual values.
License Information
llama_cpp-rs offers license flexibility, allowing users to choose between the MIT and Apache-2.0 licenses. Full details are available in the respective LICENSE-MIT and LICENSE-APACHE files.
llama_cpp-rs stands out as an accessible, high-performance library for anyone who wants to run large language models from Rust with minimal setup.