llama.go - LLaMA Model Inference Framework in Golang for Efficient Machine Learning

Introducing Llama.go

Llama.go is a project that aims to let machine learning enthusiasts run large GPT-style models in home labs, without the need for expensive GPU clusters. It seeks to make these models more accessible by implementing inference in the Go programming language instead of the more complex C++.

Motivation

The project is driven by the vision of a world where anyone can learn from and experiment with large language models (LLMs). Llama.go draws inspiration from the ggml.cpp framework written in C++ by Georgi Gerganov. Its developers aim to offer similar performance and elegance in Go, a language known for its simplicity and efficiency, and hope this will attract developers who find C++ too intricate.

Implementation and Roadmaps

V0 Roadmap

The initial phase of Llama.go emphasizes foundational components:

Implementation of tensor mathematics in pure Go.
Integration of LLaMA neural network architecture with model loading capabilities.
Testing using smaller models like LLaMA-7B.
Ensuring that inference in Go matches the results achieved with C++.
Optimizing performance through multi-threading and messaging.
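
As an illustration of what tensor mathematics in pure Go combined with multi-threading can look like, here is a minimal sketch of a matrix-vector product whose rows are split across goroutines. It is not taken from the Llama.go source: the function name matVec, the row-major layout, and the chunking strategy are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// matVec computes y = W·x for a row-major matrix w (rows x cols),
// dividing the rows among up to `workers` goroutines.
// Hypothetical helper for illustration only.
func matVec(w []float32, rows, cols int, x []float32, workers int) []float32 {
	y := make([]float32, rows)
	if workers < 1 {
		workers = 1
	}
	chunk := (rows + workers - 1) / workers // rows per goroutine, rounded up
	var wg sync.WaitGroup
	for start := 0; start < rows; start += chunk {
		end := start + chunk
		if end > rows {
			end = rows
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for r := start; r < end; r++ {
				var sum float32
				for c, v := range w[r*cols : (r+1)*cols] {
					sum += v * x[c]
				}
				y[r] = sum // each goroutine writes a disjoint range of y
			}
		}(start, end)
	}
	wg.Wait()
	return y
}

func main() {
	w := []float32{
		1, 2, 3,
		4, 5, 6,
	}
	x := []float32{1, 0, -1}
	fmt.Println(matVec(w, 2, 3, x, 2)) // prints [-2 -2]
}
```

Because every goroutine writes to its own slice of the output, no locking is needed; the WaitGroup only marks the point where all rows are done.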

V1 Roadmap - Spring 2023

Further development in the first version focuses on:

Achieving cross-platform compatibility (Mac, Linux, Windows).
Publishing the first stable version for machine learning enthusiasts.
Enabling larger models such as 13B, 30B, and 65B.
Supporting ARM NEON on Apple Silicon and ARM servers.
Boosting performance on Intel and AMD platforms with x64 AVX2.
Improving memory utilization and garbage collection.
Introducing a Server Mode with a REST API for real-world applications.
Providing free access to converted models via the internet.

V2 Roadmap - Winter 2023

The second version envisions several advancements:

Supporting LLaMA V2 models and implementing grouped-query attention.
Introducing efficient INT8 quantization for handling larger models.
Benchmarking against other popular frameworks.
Supporting LLaMA-family models such as Vicuna and Alpaca.
Improving performance with memory-aligned tensors, and adding extensive logging.
Adding an interactive mode for real-time chat with the model, along with CPU/GPU feature detection.
Developing a standalone GUI or web interface for easier access.
Extending support to open models like Open Assistant and StableLM.
Adding AVX512 support for better performance on newer AMD and Intel processors.

V3 Roadmap - Spring 2024

Future plans include expansive features:

Enabling plugins and external APIs for complex project deployments.
Offering model training and fine-tuning capabilities.
Accelerating execution on GPUs and clusters.
Supporting FP16, BF16, INT4, and GPTQ quantization.
Extending GPU support to AMD Radeon.

Running Llama.go

To run Llama.go:

Acquire the LLaMA models or download pre-converted versions.
Build the application from source code or download a pre-built version for your platform.
Run it from the command line with the flags your task needs, such as the model to load and the prompt to process.

For example:

llama-go-v1.4.0-macos --model ~/models/llama-7b-fp32.bin --prompt "Why Golang is so popular?"

Production Usage

Llama.go can operate as a standalone HTTP server with REST API functionality:

Use flags to run it in server mode and to define the host, port, and number of processing threads.
Set the number of parallel executions with pods, and control CPU usage with threads.

Practical Examples and Building

The application supports various command-line flags for customization. It can be compiled from source by installing Golang and Git and then following the build instructions.

The project's documentation also covers the REST API with examples, such as placing jobs and checking their status via HTTP requests.
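
The place-a-job, then check-its-status flow can be sketched as a small Go client. Everything specific here is an assumption made for illustration: the paths /jobs and /jobs/status/, the JSON field names, and the "queued" status value are hypothetical and should be checked against the project's own API documentation. To keep the example self-contained, it talks to an in-memory stub server rather than a running Llama.go instance.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// job is a hypothetical request body; the field names are assumptions.
type job struct {
	ID     string `json:"id"`
	Prompt string `json:"prompt"`
}

// newStubServer returns an in-memory stand-in for a job server.
func newStubServer() *httptest.Server {
	var mu sync.Mutex
	status := map[string]string{}
	mux := http.NewServeMux()
	mux.HandleFunc("/jobs", func(w http.ResponseWriter, r *http.Request) {
		var j job
		if r.Method != http.MethodPost || json.NewDecoder(r.Body).Decode(&j) != nil || j.ID == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		mu.Lock()
		status[j.ID] = "queued"
		mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	})
	mux.HandleFunc("/jobs/status/", func(w http.ResponseWriter, r *http.Request) {
		id := strings.TrimPrefix(r.URL.Path, "/jobs/status/")
		mu.Lock()
		s, ok := status[id]
		mu.Unlock()
		if !ok {
			http.NotFound(w, r)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": s})
	})
	return httptest.NewServer(mux)
}

// placeJob submits a job and returns the HTTP status code of the response.
func placeJob(baseURL string, j job) (int, error) {
	body, err := json.Marshal(j)
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(baseURL+"/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

// jobStatus asks the server for the current status of a job.
func jobStatus(baseURL, id string) (string, error) {
	resp, err := http.Get(baseURL + "/jobs/status/" + id)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected HTTP status %d", resp.StatusCode)
	}
	var out map[string]string
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out["status"], nil
}

func main() {
	srv := newStubServer()
	defer srv.Close()
	code, err := placeJob(srv.URL, job{ID: "job-1", Prompt: "Why Golang is so popular?"})
	fmt.Println(code, err)
	s, err := jobStatus(srv.URL, "job-1")
	fmt.Println(s, err)
}
```

Against a real server, a client like this would poll jobStatus until the job finishes and then fetch the generated text.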

FAQs

The FAQ covers how to acquire the original LLaMA models and how to convert them to a supported format. Users are directed to request the models from Meta or to look for alternative sources, and conversion scripts are available to transform the models into a format compatible with Llama.go.

Conclusion

With its focus on accessibility, performance, and cross-platform support, Llama.go aims to make experimentation with large language models practical for individual developers and researchers. By reducing the dependency on costly hardware, it brings these models within reach of a home lab.