Introduction to tiktoken-rs
`tiktoken-rs` is a Rust library for tokenizing text with the encodings used by OpenAI models, benefiting developers who work with OpenAI technologies like GPT. It builds on the foundational `tiktoken` library while adding enhancements that streamline its integration and use in Rust projects.
Key Features
The `tiktoken-rs` library is packed with features that help developers tokenize text and count tokens efficiently. This is particularly useful in applications involving large language models, where understanding and controlling token counts is crucial for performance and accuracy.
Some of its main offerings include:
- Ready-made tokenizer libraries specifically designed for GPT and other OpenAI models.
- A convenient API for Rust developers to integrate into their projects seamlessly.
- Support for counting tokens in text as well as calculating the maximum token parameters for chat completion requests.
Usage Example
To start using `tiktoken-rs`, add it to your project with Cargo, the Rust package manager:

```shell
cargo add tiktoken-rs
```
Here’s a simple example demonstrating how to count tokens in a string:
```rust
use tiktoken_rs::o200k_base;

fn main() {
    let bpe = o200k_base().unwrap();
    let tokens = bpe.encode_with_special_tokens("This is a sentence with spaces");
    println!("Token count: {}", tokens.len());
}
```
Advanced Usage
`tiktoken-rs` can also handle more complex tasks, such as calculating the maximum tokens available for a chat completion request once the prompt messages are accounted for. Here’s how it can be achieved:
```rust
use tiktoken_rs::{get_chat_completion_max_tokens, ChatCompletionRequestMessage};

fn main() {
    let messages = vec![
        ChatCompletionRequestMessage {
            content: Some("You are a helpful assistant that only speaks French.".to_string()),
            role: "system".to_string(),
            name: None,
            function_call: None,
        },
        ChatCompletionRequestMessage {
            content: Some("Hello, how are you?".to_string()),
            role: "user".to_string(),
            name: None,
            function_call: None,
        },
        ChatCompletionRequestMessage {
            content: Some("Parlez-vous francais?".to_string()),
            role: "system".to_string(),
            name: None,
            function_call: None,
        },
    ];
    let max_tokens = get_chat_completion_max_tokens("o1-mini", &messages).unwrap();
    println!("max_tokens: {}", max_tokens);
}
```
Supported OpenAI Model Encodings
`tiktoken-rs` supports a range of encodings compatible with different OpenAI models, ensuring flexibility across various applications:
- `o200k_base` for GPT-4o and o1 models
- `cl100k_base` for ChatGPT models and text-embedding-ada-002
- `p50k_base` for Code models, text-davinci-002, and text-davinci-003
- `p50k_edit` for editing tasks like text-davinci-edit-001
- `r50k_base` (also known as `gpt2`) for older GPT-3 models
Contribution and Feedback
The `tiktoken-rs` library is an open-source project that welcomes feedback, suggestions, and contributions. If you encounter a bug or have ideas for improvement, you are encouraged to raise an issue on the library's GitHub repository.
Acknowledgements and Licensing
Special thanks to @spolu for the original code and the associated `.tiktoken` files. The project is available under the MIT License, allowing for wide usage and adaptation in the community.
In closing, `tiktoken-rs` is a robust tool for developers looking to harness OpenAI’s tokenization capabilities within the Rust programming environment. Its ease of integration and rich feature set make it a valuable asset in the toolkit of anyone working with language models.