Introduction to tiktoken-go
tiktoken-go is a Go adaptation of OpenAI's highly efficient Byte Pair Encoding (BPE) tokenizer, known as tiktoken. The library is useful for developers working with OpenAI models because it converts text into the token IDs those models consume as input. The original tiktoken project is well regarded for its speed and reliability, and tiktoken-go brings those advantages to the Go programming language.
Installation
To incorporate tiktoken-go into a Go project, users can easily install it using:
go get github.com/pkoukk/tiktoken-go
This single command fetches the package and adds it to the project's module dependencies, making tiktoken-go available to import.
Caching Mechanism
tiktoken-go employs a caching mechanism similar to its original counterpart, reducing load times on subsequent runs. Users can define a specific cache directory by setting the TIKTOKEN_CACHE_DIR environment variable. When this is configured, the tokenizer caches the token dictionary locally, avoiding repeated downloads. If it is not set, the dictionary is downloaded each time tiktoken-go is initialized.
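A minimal sketch of configuring the cache from within Go; the directory path here is illustrative, and exporting the variable in the shell before running works just as well:

package main

import (
    "fmt"
    "os"

    "github.com/pkoukk/tiktoken-go"
)

func main() {
    // Illustrative cache location; any writable directory works.
    os.Setenv("TIKTOKEN_CACHE_DIR", "/tmp/tiktoken-cache")

    // The first call downloads the token dictionary into the cache;
    // later runs load it from disk instead of the network.
    tke, err := tiktoken.GetEncoding("cl100k_base")
    if err != nil {
        fmt.Printf("Error getting encoding: %v\n", err)
        return
    }
    fmt.Println(len(tke.Encode("warm cache", nil, nil)))
}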
Alternative BPE Loaders
For those looking to bypass the caching system or the default dictionary download, tiktoken-go offers the flexibility of alternative BPE loaders. Developers can implement their own by satisfying the BpeLoader interface. An offline BPE loader is also available; it loads the dictionary from embedded files, which is helpful when runtime downloads are undesirable. The offline loader lives in a separate project: tiktoken_loader.
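Based on the tiktoken_loader project's documented usage, wiring in the offline loader looks roughly like the sketch below; the import path and the NewOfflineLoader constructor come from that project, so check its README for the current API:

package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
    tiktoken_loader "github.com/pkoukk/tiktoken-go-loader"
)

func main() {
    // Swap the default downloading loader for the offline one,
    // which serves the dictionaries from files embedded in the binary.
    tiktoken.SetBpeLoader(tiktoken_loader.NewOfflineLoader())

    tke, err := tiktoken.GetEncoding("cl100k_base")
    if err != nil {
        fmt.Printf("Error getting encoding: %v\n", err)
        return
    }
    fmt.Println(len(tke.Encode("no network required", nil, nil)))
}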
Practical Usage Examples
Token Encoding
Below is a simple example of how to use tiktoken-go to encode text:
package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
)

func main() {
    text := "Hello, world!"
    encoding := "cl100k_base"

    // Look up the tokenizer by encoding name.
    tke, err := tiktoken.GetEncoding(encoding)
    if err != nil {
        fmt.Printf("Error getting encoding: %v\n", err)
        return
    }

    // Encode returns the token IDs; the nil slices are the
    // allowedSpecial and disallowedSpecial parameters.
    tokens := tke.Encode(text, nil, nil)
    fmt.Println(tokens)      // the token IDs
    fmt.Println(len(tokens)) // the token count
}
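The same Tiktoken value can reverse the mapping; appending these lines to main above round-trips the tokens, assuming the library's Decode method, which takes a slice of token IDs and returns the original string:

    decoded := tke.Decode(tokens)
    fmt.Println(decoded) // "Hello, world!"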
Model-Based Tokenization
Using tiktoken-go to tokenize data according to specific OpenAI models is straightforward:
package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
)

func main() {
    text := "Hello, world!"
    model := "gpt-3.5-turbo"

    // Resolve the encoding associated with the model name.
    tkm, err := tiktoken.EncodingForModel(model)
    if err != nil {
        fmt.Printf("Error getting model encoding: %v\n", err)
        return
    }

    tokens := tkm.Encode(text, nil, nil)
    fmt.Println(tokens)
    fmt.Println(len(tokens))
}
Token Counting for Chat Applications
Using tiktoken-go, developers can effectively count tokens for messages exchanged with models like gpt-3.5-turbo or gpt-4, which is crucial for optimizing API usage and cost.
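A sketch of the widely used heuristic from the OpenAI cookbook, adapted to Go; the ChatMessage type and the per-message overhead constants below are assumptions for gpt-3.5-turbo/gpt-4 style models, not part of tiktoken-go's API:

package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
)

// ChatMessage is a minimal stand-in for an OpenAI chat message.
type ChatMessage struct {
    Role    string
    Name    string
    Content string
}

// numTokensFromMessages estimates the prompt token count following the
// OpenAI cookbook heuristic: each message carries a fixed overhead, and
// the reply is primed with a few extra tokens.
func numTokensFromMessages(messages []ChatMessage, model string) (int, error) {
    tkm, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return 0, err
    }
    const (
        tokensPerMessage = 3 // per-message framing overhead (cookbook value)
        tokensPerName    = 1 // extra token when a name is present
    )
    numTokens := 0
    for _, msg := range messages {
        numTokens += tokensPerMessage
        numTokens += len(tkm.Encode(msg.Role, nil, nil))
        numTokens += len(tkm.Encode(msg.Content, nil, nil))
        if msg.Name != "" {
            numTokens += len(tkm.Encode(msg.Name, nil, nil))
            numTokens += tokensPerName
        }
    }
    numTokens += 3 // every reply is primed with <|start|>assistant<|message|>
    return numTokens, nil
}

func main() {
    messages := []ChatMessage{
        {Role: "system", Content: "You are a helpful assistant."},
        {Role: "user", Content: "Hello, world!"},
    }
    count, err := numTokensFromMessages(messages, "gpt-3.5-turbo")
    if err != nil {
        fmt.Printf("Error counting tokens: %v\n", err)
        return
    }
    fmt.Println("estimated prompt tokens:", count)
}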
Supported Encodings and Models
tiktoken-go supports various encoding schemes and model types:
- Encodings: such as o200k_base, cl100k_base, p50k_base, and r50k_base.
- Models: including gpt-4, gpt-3.5-turbo, text-davinci, and others, each mapped to one of these encoding schemes.
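As a quick illustration, the same string can be encoded under each scheme to compare token counts; support for o200k_base depends on the library version, hence the per-encoding error check:

package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
)

func main() {
    text := "Hello, world!"
    // Token counts differ per encoding scheme.
    for _, name := range []string{"o200k_base", "cl100k_base", "p50k_base", "r50k_base"} {
        tke, err := tiktoken.GetEncoding(name)
        if err != nil {
            fmt.Printf("Error getting %s: %v\n", name, err)
            continue
        }
        fmt.Printf("%s: %d tokens\n", name, len(tke.Encode(text, nil, nil)))
    }
}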
Testing and Benchmarking
Testing compatibility with the original tiktoken and benchmarking performance are integral to maintaining tiktoken-go's reliability. Benchmarks indicate performance on par with the original library, especially on macOS, while also identifying areas, such as o200k_base encoding, where improvements could be made.
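For local measurements, the standard Go tooling applies; after cloning the repository, the benchmarks can be run with the usual go test flags:

go test -bench=. -benchmem ./...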
Licensing
tiktoken-go is released under the MIT License, allowing users flexibility in using and modifying the software.
By integrating tiktoken-go, developers can enhance their Go-based applications with robust text encoding capabilities tailored to OpenAI's model requirements. Whether through efficient caching, adaptable encoding schemes, or precise token counting, tiktoken-go stands as a reliable tool for advanced natural language processing tasks.