Introduction to llama-tokenizer-js Project
llama-tokenizer-js is a JavaScript library designed to tokenize text for the LLaMA 1 and LLaMA 2 models. Developed with TypeScript support, the library works both in web browsers and Node.js environments. Its primary purpose is accurate client-side token counting, making it a valuable tool for developers working with LLaMA models. For LLaMA 3, a separate project, llama3-tokenizer-js, is available.
Features
- Easy to use: The library requires no external dependencies. All necessary code and data are included in a single file, llama-tokenizer.js.
- Compatible: llama-tokenizer-js works with a broad range of LLaMA models. A detailed list is available in the compatibility section.
- Optimized: Both runtime efficiency and bundle size are optimized. The tokenizer features an efficient Byte-Pair Encoding (BPE) implementation (a simplified sketch of the idea follows this list) and a minimized bundle size of 670KiB before compression.
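As a rough illustration of the BPE idea (a simplified sketch of the general algorithm, not the library's actual implementation), a greedy merge loop can look like this:

// Simplified BPE sketch: repeatedly merge the adjacent pair of symbols
// with the lowest merge rank until no known merge applies. The merges
// map (pair -> rank) is toy example data, not the library's vocabulary.
function bpe(symbols, merges) {
  while (true) {
    let best = null;
    for (let i = 0; i < symbols.length - 1; i++) {
      const rank = merges.get(symbols[i] + " " + symbols[i + 1]);
      if (rank !== undefined && (best === null || rank < best.rank)) {
        best = { rank, index: i };
      }
    }
    if (best === null) return symbols; // no applicable merge left
    const merged = symbols[best.index] + symbols[best.index + 1];
    symbols.splice(best.index, 2, merged); // replace the pair in place
  }
}

const merges = new Map([["h e", 0], ["l l", 1], ["he ll", 2], ["hell o", 3]]);
console.log(bpe(["h", "e", "l", "l", "o", "!"], merges)); // ["hello", "!"]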
Importing the Library
Recommended Method: npm Package
To integrate the tokenizer into your project using npm, simply install it as a package:
npm install llama-tokenizer-js
Import it as an ES6 module in your code:
import llamaTokenizer from 'llama-tokenizer-js';
console.log(llamaTokenizer.encode("Hello world!").length);
Alternative Methods
For those preferring script tags, you can load the module directly in your HTML:
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
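When loaded this way, the module registers llamaTokenizer in the global namespace (see the Usage section below), so a later script can use it directly. A minimal sketch, assuming the script tag above has been included on the page:

<script type="module">
  // Module scripts execute in document order, so llamaTokenizer has
  // already been registered globally by the previous script tag.
  console.log(llamaTokenizer.encode("Hello world!").length); // 4
</script>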
Or, for CommonJS projects, you can load it with a dynamic import (the ES module's exports land on the .default property):
async function main() {
  // Dynamic import returns a module namespace object;
  // the tokenizer itself is the default export.
  const llamaTokenizer = await import('llama-tokenizer-js');
  console.log(llamaTokenizer.default.encode("Hello world!"));
}
main();
Usage
Once imported, users can encode and decode text with llama-tokenizer-js. It's important to note that training is not supported. When used in a browser, the library introduces llamaTokenizer to the global namespace.
Encoding
Encode a string:
llamaTokenizer.encode("Hello world!");
// Output: [1, 15043, 3186, 29991]
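Since the library's primary use case is client-side token counting, a common pattern is checking whether a prompt fits a model's context window before sending it. A minimal sketch (the 4096-token limit and the helper name are example values, not part of the library):

// Check whether a prompt plus a reserved response budget fits the
// context window. CONTEXT_LIMIT is an example value, not a library constant.
const CONTEXT_LIMIT = 4096;
function fitsContext(prompt, reservedForResponse = 512) {
  const promptTokens = llamaTokenizer.encode(prompt).length;
  return promptTokens + reservedForResponse <= CONTEXT_LIMIT;
}
console.log(fitsContext("Hello world!")); // true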
Decoding
Decode an array of token IDs:
llamaTokenizer.decode([1, 15043, 3186, 29991]);
// Output: 'Hello world!'
For cases where you do not want the default handling of the beginning-of-sequence (BOS) token and the preceding space, two additional boolean parameters adjust the behavior:
llamaTokenizer.decode([3186], false, false);
// Output: 'world'
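With the default parameters, encoding and then decoding should reproduce the original string, which makes a convenient sanity check:

// Round-trip sanity check: decode(encode(text)) recovers the input
// when both functions use their default parameters.
const text = "Hello world!";
console.log(llamaTokenizer.decode(llamaTokenizer.encode(text)) === text); // true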
Testing
Run tests using:
llamaTokenizer.runTests();
The test suite is small but covers a wide range of edge cases, and it runs both in browsers and in Node.js.
Comparison to Alternatives
llama-tokenizer-js was the first JavaScript tokenizer for LLaMA models to run client-side in the browser. Other commonly used tokenizers, particularly OpenAI's, are incompatible with LLaMA and produce token counts that differ by as much as 20%. Tokenizing over the network avoids bundling the tokenizer but adds round-trip latency, especially when repeated requests are required. Since its release, llama-tokenizer-js has inspired other tokenizers, such as the one integrated into transformers.js.
Compatibility
llama-tokenizer-js is designed for compatibility with most LLaMA models that use the SentencePiece Byte-Pair Encoding tokenizer. It works with models trained from Facebook's released checkpoints, such as llama2-13b-4bit-gptq, but not with models whose tokenizers were trained from scratch, like OpenLLaMA. For custom tokenizers, users may swap in their own vocabulary and merge data. A simple compatibility check is sketched below.
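One practical way to verify compatibility with a particular model (a sketch that uses only the documented encode API): tokenize a sample string and compare the result with token IDs produced by the model's reference tokenizer. The expected IDs below are the ones shown earlier for "Hello world!":

// Expected IDs come from the model's reference tokenizer for the same
// string; here we reuse the example from the Encoding section.
const expected = [1, 15043, 3186, 29991];
const actual = llamaTokenizer.encode("Hello world!");
const compatible = actual.length === expected.length &&
  actual.every((id, i) => id === expected[i]);
console.log(compatible ? "tokenizer matches" : "tokenizer mismatch");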
Maintenance and Contribution
The library's maintenance involves a structured release process, including code testing, version updates, and publication steps. Contributions have come from multiple developers and organizations, highlighting a collaborative effort in its ongoing development.
For more information, you can explore the demo or contribute to the project via GitHub. llama-tokenizer-js is actively maintained by belladore.ai and contributors from the community.