Introduction to Llama3.java
Llama3.java is a project that builds on its predecessor, llama2.java, to provide inference for the Llama 3 models in Java. It supports Llama 3, 3.1, and 3.2 in a single Java file. Beyond its educational value, the project also serves as a testbed for compiler features and performance work on the Java Virtual Machine (JVM), particularly for the Graal compiler.
Features
Llama3.java comes packed with an impressive range of features:
- Single File, No Dependencies: The entire implementation lives in a single Java file with no external dependencies, making it easy to read, build, and embed.
- GGUF Format Parser: Includes a parser for the GGUF format, the container format that packages Llama model weights and metadata.
- Llama 3 Tokenizer: Uses a tokenizer based on minbpe for efficient text processing.
- Inference Capabilities: Performs Llama 3 inference with Grouped-Query Attention, and supports Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings).
- Quantization Support: Supports Q8_0 and Q4_0 quantizations for memory-efficient model processing.
- Matrix-Vector Multiplication: Offers fast matrix-vector multiplication routines for quantized tensors, leveraging Java's Vector API.
- User-Friendly CLI: Provides a simple command-line interface with `--chat` and `--instruct` modes for conversational or instruction-style interactions.
- GraalVM and Native Image Support: Supports GraalVM's Native Image for improved performance, with optional Ahead-of-Time (AOT) model pre-loading to achieve instant inference.
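To make the quantization feature concrete, here is a minimal sketch of Q4_0-style block quantization in plain Java. It is not the project's actual code: Q4_0 packs weights in blocks of 32, storing one scale per block plus a signed 4-bit value per weight, so each weight is reconstructed as `q * scale` with `q` in [-8, 7]. The class and method names are invented for illustration.

```java
// Hypothetical sketch of Q4_0-style block quantization (not Llama3.java's code).
// Each block of 32 weights stores one float scale plus 32 signed 4-bit values,
// so a weight w is reconstructed as w ≈ q * scale with q in [-8, 7].
public class Q4_0Sketch {
    static final int BLOCK = 32;

    // Quantize one block of 32 floats: fills q with values in [-8, 7], returns the scale.
    static float quantizeBlock(float[] w, byte[] q) {
        float max = 0f;
        for (float v : w) if (Math.abs(v) > Math.abs(max)) max = v; // signed extreme
        float scale = max / -8f;                 // map the extreme value to -8
        float inv = scale != 0f ? 1f / scale : 0f;
        for (int i = 0; i < BLOCK; i++) {
            int x = Math.round(w[i] * inv);
            q[i] = (byte) Math.max(-8, Math.min(7, x)); // clamp to the 4-bit range
        }
        return scale;
    }

    static void dequantizeBlock(float scale, byte[] q, float[] out) {
        for (int i = 0; i < BLOCK; i++) out[i] = q[i] * scale;
    }

    public static void main(String[] args) {
        float[] w = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) w[i] = (i - 16) * 0.25f;
        byte[] q = new byte[BLOCK];
        float scale = quantizeBlock(w, q);
        float[] back = new float[BLOCK];
        dequantizeBlock(scale, q, back);
        float maxErr = 0f;
        for (int i = 0; i < BLOCK; i++) maxErr = Math.max(maxErr, Math.abs(w[i] - back[i]));
        System.out.println("max reconstruction error = " + maxErr);
    }
}
```

The key trade-off this illustrates: Q4_0 cuts weight storage to roughly 4.5 bits per weight, at the cost of small, bounded reconstruction error within each block.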
Setup and Usage
To get started with Llama3.java, users are encouraged to download quantized GGUF files. These can be obtained from multiple sources, including Hugging Face, with a preference for pure Q4_0 quantized models for optimal performance. If needed, users can manually quantize a model to pure Q4_0 using the `llama-quantize` utility.
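For a sense of what parsing a GGUF file involves, the sketch below reads the start of a GGUF header: the 4-byte magic "GGUF", then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count. This is an illustration with invented names, using `ByteBuffer` rather than the MemorySegment API the project itself leverages.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch of reading a GGUF header's leading fields (not the project's parser).
// Layout: 4-byte magic "GGUF", little-endian uint32 version, uint64 tensor count,
// uint64 metadata key/value count.
public class GgufHeaderSketch {
    record Header(int version, long tensorCount, long metadataKvCount) {}

    static Header readHeader(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        buf.get(magic);
        if (magic[0] != 'G' || magic[1] != 'G' || magic[2] != 'U' || magic[3] != 'F')
            throw new IllegalArgumentException("not a GGUF file");
        return new Header(buf.getInt(), buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Synthetic header: version 3, 2 tensors, 5 metadata entries.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(new byte[] {'G', 'G', 'U', 'F'}).putInt(3).putLong(2).putLong(5).flip();
        Header h = readHeader(buf);
        System.out.println("GGUF v" + h.version() + ", tensors=" + h.tensorCount());
    }
}
```

After these fixed fields, a real parser would go on to read the metadata key/value pairs and tensor descriptors that follow in the file.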
To execute the project:
- Java 21+ Required: The project requires Java 21 or later to leverage recent features such as the MemorySegment API.
- Building and Running: The project can be run directly with `jbang`, or compiled and launched manually with Java command lines. A Makefile is also provided to streamline building and running the application.
- Native Compilation: For optimal performance, Llama3.java can be compiled into a native image using GraalVM's `native-image` tool, reducing overhead and improving start-up times.
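As an illustration (not copied from the project's README), typical invocations might look like the following. The model filename is a placeholder, and flags other than the `--chat`/`--instruct` modes mentioned above are assumptions:

```shell
# Run the single source file directly with jbang
# (model filename and --model flag are illustrative assumptions):
jbang Llama3.java --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --chat

# Or launch from source with a Java 21+ JDK, enabling preview features
# and the incubating Vector API module:
java --enable-preview --source 21 --add-modules jdk.incubator.vector \
    Llama3.java --model Meta-Llama-3-8B-Instruct-Q4_0.gguf --instruct
```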
Performance
The performance of Llama3.java is noteworthy, especially on recent GraalVM releases with improved support for Vector API operations. The project has been benchmarked against other implementations, such as llama.cpp, with favorable results. Even running on a single CPU core, the application makes efficient use of system resources while delivering rapid inference times.
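Much of that performance hinges on the quantized matrix-vector multiplication path mentioned under Features. As a conceptual sketch (with invented names, not the project's code), the scalar loop below fuses dequantization into a Q8_0-style dot product, applying each block's scale once per 32 weights instead of materializing a float copy of the tensor; the real implementation vectorizes this inner loop with Java's Vector API.

```java
// Plain-Java sketch of a fused dequantize-and-dot-product over Q8_0-style blocks
// (one float scale per 32 int8 weights). Illustrative only: the actual project
// vectorizes this loop using the jdk.incubator.vector API.
public class QuantDotSketch {
    static final int BLOCK = 32;

    // q holds int8 weights, scales[b] is the per-block scale, x is the fp32 input.
    static float dot(byte[] q, float[] scales, float[] x) {
        float sum = 0f;
        for (int b = 0; b < scales.length; b++) {
            float partial = 0f;
            for (int i = 0; i < BLOCK; i++)
                partial += q[b * BLOCK + i] * x[b * BLOCK + i];
            sum += scales[b] * partial;   // apply the scale once per block
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] q = new byte[BLOCK];
        java.util.Arrays.fill(q, (byte) 2);   // every weight quantized to 2
        float[] scales = {0.5f};              // so each weight dequantizes to 1.0
        float[] x = new float[BLOCK];
        java.util.Arrays.fill(x, 1f);
        System.out.println(dot(q, scales, x)); // 32 weights of 1.0 times 1.0
    }
}
```

Hoisting the scale out of the inner loop keeps that loop in pure integer-times-float arithmetic, which is exactly the shape SIMD lanes handle well.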
Conclusion
Llama3.java is a robust solution for Llama model inference in Java, combining ease of use with powerful features. It exemplifies cutting-edge Java capabilities, providing users and developers with a practical tool for natural language processing tasks in academia, research, or industry applications. Licensed under the MIT License, it is freely available for modification and use, fostering further innovation and exploration in the field of AI.