🦙 Jlama: A Modern LLM Inference Engine for Java
Jlama is an advanced inference engine designed specifically for running large language models (LLMs) within Java applications. It serves as a bridge between powerful language models and Java developers, making it easier to deploy and utilize these models effectively in Java environments.
🚀 Features
Jlama supports a variety of well-known models including:
- Gemma & Gemma 2
- Llama, Llama2 & Llama3
- Mistral & Mixtral
- Qwen2
- IBM Granite
- GPT-2
- BERT
Jlama also supports the BPE and WordPiece tokenizers, along with additional functionality:
- Paged Attention: Manages the model's attention key/value cache in pages for efficient memory use during generation.
- Mixture of Experts: Utilizes specialized subnetworks within the model to provide better performance.
- Tool Calling: Supports calling external tools during model inference.
- Generate Embeddings: Lets users obtain embeddings from text inputs.
- Classifier Support: Enables model use in classification tasks.
- Huggingface SafeTensors: Supports SafeTensors model and tokenizer formats.
- Data Type Support: Handles F32, F16, BF16, Q8, and Q4 model quantization and fast GEMM operations.
- Distributed Inference: Allows model inference to be distributed across multiple machines.
These features require Java 20 or newer, since Jlama leverages the Vector API for fast inference.
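To make the SafeTensors and quantization support concrete, here is a minimal sketch that downloads a pre-quantized model and loads it with F32 working memory and I8 (Q8) working quantization. The class names (SafeTensorSupport, ModelSupport, DType) follow Jlama's public API, but treat the exact package locations and signatures as assumptions to verify against the version you use:

import java.io.File;
import java.io.IOException;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;

public class LoadQuantizedModel {
    public static void main(String[] args) throws IOException {
        // Downloads the SafeTensors model from HuggingFace, or reuses a cached local copy.
        File localModelPath = SafeTensorSupport.maybeDownloadModel(
                "./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

        // F32 working memory with I8 (Q8) working quantization; other DType
        // combinations trade memory footprint against accuracy.
        AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);
        System.out.println("Model loaded from " + localModelPath);
    }
}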
🤔 What is it used for?
Jlama is used to integrate advanced LLM inference capabilities directly into Java applications, streamlining the development process for Java developers who want to harness the power of LLMs.
🔬 Quick Start
🕵️‍♀️ Local Client Usage (with jbang!)
Jlama offers a command-line interface (CLI) that simplifies the process of working with models.
- Install jbang: curl -Ls https://sh.jbang.dev | bash -s - app setup
- Install Jlama CLI: jbang app install --force jlama@tjake
- Download and Run a Model: jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download

Access the chat UI at http://localhost:8080/.
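The REST server is designed to be OpenAI-API-compatible, so it can also be exercised without the browser UI. The route and request shape below are assumptions based on the standard OpenAI chat-completions convention; verify both against your Jlama version:

curl http://localhost:8080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Tell me a joke about llamas."}]}'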
👨‍💻 Java Project Integration
To incorporate Jlama into a Java project, developers can use the Langchain4j Integration or directly add the Maven dependencies:
<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-core</artifactId>
  <version>${jlama.version}</version>
</dependency>
<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-native</artifactId>
  <classifier>${os.detected.name}-${os.detected.arch}</classifier>
  <version>${jlama.version}</version>
</dependency>
If needed, enable the Vector API incubator module and Java preview features:
export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview"
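With the dependencies and JVM flags in place, models can be run directly from Java. The sketch below follows the usage pattern from Jlama's documentation: download a pre-quantized model, build a chat prompt, and stream the response. Treat the package locations and exact signatures (maybeDownloadModel, loadModel, promptSupport, generate) as assumptions to check against the release you depend on:

import java.io.File;
import java.io.IOException;
import java.util.UUID;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

public class JlamaChatExample {
    public static void main(String[] args) throws IOException {
        String model = "tjake/Llama-3.2-1B-Instruct-JQ4";

        // Download the model from HuggingFace, or reuse the cached local copy.
        File localModelPath = SafeTensorSupport.maybeDownloadModel("./models", model);

        // F32 working memory with I8 (Q8) working quantization.
        AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Use the model's chat template when available, else send the raw prompt.
        String prompt = "What is the best season to plant avocados?";
        PromptContext ctx = m.promptSupport().isPresent()
                ? m.promptSupport().get().builder()
                        .addSystemMessage("You are a helpful chatbot who writes short responses.")
                        .addUserMessage(prompt)
                        .build()
                : PromptContext.of(prompt);

        // Stream tokens to stdout; temperature 0.0, at most 256 new tokens.
        Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256,
                (token, time) -> System.out.print(token));
        System.out.println(r.responseText);
    }
}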
⭐ Show Support
If Jlama proves useful in your projects, consider giving it a star on its repository. This helps show support and encourages further development.
🗺️ Roadmap
Future development plans for Jlama include:
- Expanding support for more models
- Adding LoRA and GraalVM support
- Continuing enhancements for distributed inference
🏷️ License and Citation
Jlama is released under the Apache License. If used in research, cite it with the following BibTeX entry:
@misc{jlama2024,
  title = {Jlama: A modern Java inference engine for large language models},
  url = {https://github.com/tjake/jlama},
  author = {T Jake Luciani},
  month = {January},
  year = {2024}
}
Jlama stands out as a modern, capable solution for integrating and running large language models directly within Java applications, serving developers who want LLM capabilities without leaving the JVM.