Introduction to llama-tokenizer-js Project
llama-tokenizer-js is a JavaScript library designed to tokenize text for the LLaMA 1 and LLaMA 2 models. Developed with TypeScript support, the library works both in web browsers and Node.js environments. Its primary purpose is accurate client-side token counting, making it a valuable tool for developers working with LLaMA models. For LLaMA 3, a separate project, llama3-tokenizer-js, is available.
Features
- Easy to use: The library requires no external dependencies. All necessary code and data are included in a single file, llama-tokenizer.js.
- Compatible: llama-tokenizer-js works with a broad range of LLaMA models. A detailed list is available in the compatibility section.
- Optimized: Both runtime efficiency and bundle size are optimized. The tokenizer features an efficient Byte-Pair Encoding (BPE) implementation (a simplified sketch of the idea follows this list) and a minimized bundle size of 670KiB before compression.
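As a rough illustration of the BPE idea (a simplified sketch of the general algorithm, not the library's actual implementation), a greedy merge loop can look like this:

// Simplified BPE sketch: repeatedly merge the adjacent pair of symbols
// with the lowest merge rank until no known merge applies. The merges
// map (pair -> rank) is toy example data, not the library's vocabulary.
function bpe(symbols, merges) {
  while (true) {
    let best = null;
    for (let i = 0; i < symbols.length - 1; i++) {
      const rank = merges.get(symbols[i] + " " + symbols[i + 1]);
      if (rank !== undefined && (best === null || rank < best.rank)) {
        best = { rank, index: i };
      }
    }
    if (best === null) return symbols; // no applicable merge left
    const merged = symbols[best.index] + symbols[best.index + 1];
    symbols.splice(best.index, 2, merged); // replace the pair in place
  }
}

const merges = new Map([["h e", 0], ["l l", 1], ["he ll", 2], ["hell o", 3]]);
console.log(bpe(["h", "e", "l", "l", "o", "!"], merges)); // ["hello", "!"]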
Importing the Library
Recommended Method: npm Package
To integrate the tokenizer into your project using npm, simply install it as a package:
npm install llama-tokenizer-js
Import it as an ES6 module in your code:
import llamaTokenizer from 'llama-tokenizer-js';
console.log(llamaTokenizer.encode("Hello world!").length);
Alternative Methods
For those preferring script tags, you can load the module directly in your HTML:
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
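When loaded this way, the module registers llamaTokenizer in the global namespace (see the Usage section below), so a later script can use it directly. A minimal sketch, assuming the script tag above has been included on the page:

<script type="module">
  // Module scripts execute in document order, so llamaTokenizer has
  // already been registered globally by the previous script tag.
  console.log(llamaTokenizer.encode("Hello world!").length); // 4
</script>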
Or, for CommonJS projects, you can load it with a dynamic import (the ES module's exports land on the .default property):
async function main() {
  // Dynamic import returns a module namespace object;
  // the tokenizer itself is the default export.
  const llamaTokenizer = await import('llama-tokenizer-js');
  console.log(llamaTokenizer.default.encode("Hello world!"));
}
main();
Usage
Once imported, users can encode and decode text with llama-tokenizer-js. It's important to note that training is not supported. When used in a browser, the library introduces llamaTokenizer to the global namespace.
Encoding
Encode a string:
llamaTokenizer.encode("Hello world!");
// Output: [1, 15043, 3186, 29991]
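Since the library's primary use case is client-side token counting, a common pattern is checking whether a prompt fits a model's context window before sending it. A minimal sketch (the 4096-token limit and the helper name are example values, not part of the library):

// Check whether a prompt plus a reserved response budget fits the
// context window. CONTEXT_LIMIT is an example value, not a library constant.
const CONTEXT_LIMIT = 4096;
function fitsContext(prompt, reservedForResponse = 512) {
  const promptTokens = llamaTokenizer.encode(prompt).length;
  return promptTokens + reservedForResponse <= CONTEXT_LIMIT;
}
console.log(fitsContext("Hello world!")); // true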
Decoding
Decode an array of token IDs:
llamaTokenizer.decode([1, 15043, 3186, 29991]);
// Output: 'Hello world!'
For cases where you do not want the default handling of the beginning-of-sequence (BOS) token and the preceding space, two additional boolean parameters adjust the behavior:
llamaTokenizer.decode([3186], false, false);
// Output: 'world'
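With the default parameters, encoding and then decoding should reproduce the original string, which makes a convenient sanity check:

// Round-trip sanity check: decode(encode(text)) recovers the input
// when both functions use their default parameters.
const text = "Hello world!";
console.log(llamaTokenizer.decode(llamaTokenizer.encode(text)) === text); // true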
Testing
Run tests using:
llamaTokenizer.runTests();
The test suite is small but covers a wide range of edge cases, and it runs both in browsers and in Node.js.
Comparison to Alternatives
llama-tokenizer-js was the first JavaScript tokenizer for LLaMA models to run client-side in the browser. Other commonly used tokenizers, particularly OpenAI's, are incompatible with LLaMA and produce token counts that differ by as much as 20%. Tokenizing over the network avoids bundling the tokenizer but adds round-trip latency, especially when repeated requests are required. Since its release, llama-tokenizer-js has inspired other tokenizers, such as the one integrated into transformers.js.
Compatibility
llama-tokenizer-js is designed for compatibility with most LLaMA models that use the SentencePiece Byte-Pair Encoding tokenizer. It works with models trained from Facebook's released checkpoints, such as llama2-13b-4bit-gptq, but not with models whose tokenizers were trained from scratch, like OpenLLaMA. For custom tokenizers, users may swap in their own vocabulary and merge data. A simple compatibility check is sketched below.
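One practical way to verify compatibility with a particular model (a sketch that uses only the documented encode API): tokenize a sample string and compare the result with token IDs produced by the model's reference tokenizer. The expected IDs below are the ones shown earlier for "Hello world!":

// Expected IDs come from the model's reference tokenizer for the same
// string; here we reuse the example from the Encoding section.
const expected = [1, 15043, 3186, 29991];
const actual = llamaTokenizer.encode("Hello world!");
const compatible = actual.length === expected.length &&
  actual.every((id, i) => id === expected[i]);
console.log(compatible ? "tokenizer matches" : "tokenizer mismatch");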
Maintenance and Contribution
The library's maintenance involves a structured release process, including code testing, version updates, and publication steps. Contributions have come from multiple developers and organizations, highlighting a collaborative effort in its ongoing development.
For more information, you can explore the demo or contribute to the project via GitHub. llama-tokenizer-js is actively maintained by belladore.ai and contributors from the community.