🦙 Jlama: A Modern LLM Inference Engine for Java
Jlama is an advanced inference engine designed specifically for running large language models (LLMs) within Java applications. It serves as a bridge between powerful language models and Java developers, making it easier to deploy and utilize these models effectively in Java environments.
🚀 Features
Jlama supports a variety of well-known models including:
- Gemma & Gemma 2
- Llama, Llama2 & Llama3
- Mistral & Mixtral
- Qwen2
- IBM Granite
- GPT-2
- BERT
Jlama also supports the BPE and WordPiece tokenizers, along with additional functionality:
- Paged Attention: Manages the model's attention key/value cache in pages for efficient memory use during generation.
- Mixture of Experts: Utilizes specialized subnetworks within the model to provide better performance.
- Tool Calling: Supports calling external tools during model inference.
- Generate Embeddings: Lets users obtain embeddings from text inputs.
- Classifier Support: Enables model use in classification tasks.
- Huggingface SafeTensors: Supports SafeTensors model and tokenizer formats.
- Data Type Support: Handles F32, F16, BF16, Q8, and Q4 model quantization and fast GEMM operations.
- Distributed Inference: Allows model inference to be distributed across multiple machines.
These features require Java 20 or newer, since Jlama leverages the Vector API for fast inference.
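To make the SafeTensors and quantization support concrete, here is a minimal sketch that downloads a pre-quantized model and loads it with F32 working memory and I8 (Q8) working quantization. The class names (SafeTensorSupport, ModelSupport, DType) follow Jlama's public API, but treat the exact package locations and signatures as assumptions to verify against the version you use:

import java.io.File;
import java.io.IOException;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;

public class LoadQuantizedModel {
    public static void main(String[] args) throws IOException {
        // Downloads the SafeTensors model from HuggingFace, or reuses a cached local copy.
        File localModelPath = SafeTensorSupport.maybeDownloadModel(
                "./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

        // F32 working memory with I8 (Q8) working quantization; other DType
        // combinations trade memory footprint against accuracy.
        AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);
        System.out.println("Model loaded from " + localModelPath);
    }
}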
🤔 What is it used for?
Jlama is used to integrate advanced LLM inference capabilities directly into Java applications, streamlining the development process for Java developers who want to harness the power of LLMs.
🔬 Quick Start
🕵️‍♀️ Local Client Usage (with jbang!)
Jlama offers a command-line interface (CLI) that simplifies the process of working with models.
- Install jbang: curl -Ls https://sh.jbang.dev | bash -s - app setup
- Install Jlama CLI: jbang app install --force jlama@tjake
- Download and Run a Model: jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download

Access the chat UI at http://localhost:8080/.
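The REST server is designed to be OpenAI-API-compatible, so it can also be exercised without the browser UI. The route and request shape below are assumptions based on the standard OpenAI chat-completions convention; verify both against your Jlama version:

curl http://localhost:8080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Tell me a joke about llamas."}]}'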
👨‍💻 Java Project Integration
To incorporate Jlama into a Java project, developers can use the Langchain4j Integration or directly add the Maven dependencies:
<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-core</artifactId>
  <version>${jlama.version}</version>
</dependency>
<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-native</artifactId>
  <classifier>${os.detected.name}-${os.detected.arch}</classifier>
  <version>${jlama.version}</version>
</dependency>
If needed, enable the Vector API incubator module and Java preview features:
export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview"
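With the dependencies and JVM flags in place, models can be run directly from Java. The sketch below follows the usage pattern from Jlama's documentation: download a pre-quantized model, build a chat prompt, and stream the response. Treat the package locations and exact signatures (maybeDownloadModel, loadModel, promptSupport, generate) as assumptions to check against the release you depend on:

import java.io.File;
import java.io.IOException;
import java.util.UUID;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

public class JlamaChatExample {
    public static void main(String[] args) throws IOException {
        String model = "tjake/Llama-3.2-1B-Instruct-JQ4";

        // Download the model from HuggingFace, or reuse the cached local copy.
        File localModelPath = SafeTensorSupport.maybeDownloadModel("./models", model);

        // F32 working memory with I8 (Q8) working quantization.
        AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Use the model's chat template when available, else send the raw prompt.
        String prompt = "What is the best season to plant avocados?";
        PromptContext ctx = m.promptSupport().isPresent()
                ? m.promptSupport().get().builder()
                        .addSystemMessage("You are a helpful chatbot who writes short responses.")
                        .addUserMessage(prompt)
                        .build()
                : PromptContext.of(prompt);

        // Stream tokens to stdout; temperature 0.0, at most 256 new tokens.
        Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256,
                (token, time) -> System.out.print(token));
        System.out.println(r.responseText);
    }
}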
⭐ Show Support
If Jlama proves useful in your projects, consider giving it a star on its repository. This helps show support and encourages further development.
🗺️ Roadmap
Future development plans for Jlama include:
- Expanding support for more models
- Adding LoRA and GraalVM support
- Continuing enhancements for distributed inference
🏷️ License and Citation
Jlama is released under the Apache License. If used in research, cite it with the following BibTeX entry:
@misc{jlama2024,
  title = {Jlama: A modern Java inference engine for large language models},
  url = {https://github.com/tjake/jlama},
  author = {T Jake Luciani},
  month = {January},
  year = {2024}
}
Jlama stands out as a modern, capable solution for integrating and running large language models directly within Java applications, serving developers who want LLM capabilities without leaving the JVM.