Introduction to tiktoken-rs
`tiktoken-rs` is a Rust library for tokenizing text with the encodings used by OpenAI models, benefiting developers who work with OpenAI technologies like GPT. It builds on the foundational `tiktoken` library while adding enhancements that streamline its integration and use in Rust projects.
Key Features
The `tiktoken-rs` library is packed with features that help developers tokenize text and count tokens efficiently. This is particularly useful in applications involving large language models, where understanding and controlling token counts is crucial for performance and accuracy.
Some of its main offerings include:
- Ready-made tokenizer libraries specifically designed for GPT and other OpenAI models.
- A convenient API for Rust developers to integrate into their projects seamlessly.
- Support for counting tokens in text as well as calculating the maximum token parameters for chat completion requests.
Usage Example
To start using `tiktoken-rs`, add it to your project with Cargo, the Rust package manager:

```shell
cargo add tiktoken-rs
```
Here’s a simple example demonstrating how to count tokens in a string:
```rust
use tiktoken_rs::o200k_base;

fn main() {
    let bpe = o200k_base().unwrap();
    let tokens = bpe.encode_with_special_tokens("This is a sentence with spaces");
    println!("Token count: {}", tokens.len());
}
```
Advanced Usage
`tiktoken-rs` can also handle more complex tasks, such as calculating the maximum tokens available for a chat completion request once the prompt messages are accounted for. Here’s how it can be achieved:
```rust
use tiktoken_rs::{get_chat_completion_max_tokens, ChatCompletionRequestMessage};

fn main() {
    let messages = vec![
        ChatCompletionRequestMessage {
            content: Some("You are a helpful assistant that only speaks French.".to_string()),
            role: "system".to_string(),
            name: None,
            function_call: None,
        },
        ChatCompletionRequestMessage {
            content: Some("Hello, how are you?".to_string()),
            role: "user".to_string(),
            name: None,
            function_call: None,
        },
        ChatCompletionRequestMessage {
            content: Some("Parlez-vous francais?".to_string()),
            role: "system".to_string(),
            name: None,
            function_call: None,
        },
    ];
    let max_tokens = get_chat_completion_max_tokens("o1-mini", &messages).unwrap();
    println!("max_tokens: {}", max_tokens);
}
```
Supported OpenAI Model Encodings
`tiktoken-rs` supports a range of encodings compatible with different OpenAI models, ensuring flexibility across various applications:
- `o200k_base` for GPT-4o and o1 models
- `cl100k_base` for ChatGPT models and text-embedding-ada-002
- `p50k_base` for Code models, text-davinci-002, and text-davinci-003
- `p50k_edit` for editing tasks like text-davinci-edit-001
- `r50k_base` (also known as `gpt2`) for older GPT-3 models
Contribution and Feedback
The `tiktoken-rs` library is an open-source project that welcomes feedback, suggestions, and contributions. If you encounter a bug or have ideas for improvement, you are encouraged to raise an issue on the library's GitHub repository.
Acknowledgements and Licensing
Special thanks to @spolu for the original code and the associated `.tiktoken` files. The project is available under the MIT License, allowing for wide usage and adaptation in the community.
In closing, `tiktoken-rs` is a robust tool for developers looking to harness OpenAI’s tokenization capabilities within the Rust programming environment. Its ease of integration and rich feature set make it a valuable asset in the toolkit of anyone working with language models.