Introduction to TiktokenSharp
TiktokenSharp is a practical library for developers working with tokenization in C#. The project draws inspiration from OpenAI's official tiktoken implementation (whose core is written in Rust), aiming to bring similar functionality to the C# environment. It supports encodings such as o200k_base, cl100k_base, and p50k_base, and lets users obtain an encoder either by model name or by encoding name.
Getting Started
One can easily incorporate TiktokenSharp into their C# projects via the NuGet package. This integration facilitates straightforward usage of the library by providing methods to encode and decode strings using different models or encoding names.
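Assuming the package is published on NuGet under the name TiktokenSharp (as described above), installation is a single command:

```
dotnet add package TiktokenSharp
```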
Example Usage
Here is a basic illustration demonstrating how to use TiktokenSharp in a C# project:
```csharp
using TiktokenSharp;

// Obtain an encoder from a model name
TikToken tikToken = TikToken.EncodingForModel("gpt-3.5-turbo");
var encoded = tikToken.Encode("hello world");  // [15339, 1917]
var decoded = tikToken.Decode(encoded);        // "hello world"

// Or obtain an encoder from an encoding name
TikToken cl100k = TikToken.GetEncoding("cl100k_base");
var encoded2 = cl100k.Encode("hello world");   // [15339, 1917]
var decoded2 = cl100k.Decode(encoded2);        // "hello world"
```
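A common use case is estimating how many tokens a prompt will consume before sending it to an API. A minimal sketch, assuming only the Encode method shown above (the prompt string is illustrative):

```csharp
using System;
using TiktokenSharp;

class TokenCounter
{
    static void Main()
    {
        TikToken tikToken = TikToken.EncodingForModel("gpt-3.5-turbo");

        string prompt = "Translate the following text to French: hello world";

        // Encode returns the sequence of token ids;
        // its length is the prompt's token usage.
        var tokens = tikToken.Encode(prompt);
        Console.WriteLine($"Prompt uses {tokens.Count} tokens.");
    }
}
```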
On first use, the library downloads the required tiktoken file for an encoder from the internet; this happens only once per machine. Developers can choose where these files are stored by setting TikToken.PBEFileDirectory before creating the encoder. This is particularly useful in environments with network restrictions, or in cloud deployments where local file reads and writes are constrained.
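For example, in a locked-down deployment the encoder files can be shipped alongside the application and the library pointed at them before any encoder is created. A short sketch, with an illustrative directory path:

```csharp
using TiktokenSharp;

// Point the library at a pre-populated (or at least writable) directory
// before the first encoder is constructed.
TikToken.PBEFileDirectory = "/app/data/tiktoken";

TikToken tikToken = TikToken.GetEncoding("cl100k_base");
var tokens = tikToken.Encode("hello world");
```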
Advantages of External File Management
TiktokenSharp deliberately does not bundle tiktoken files inside the package, which keeps the package size small and mirrors the behavior of OpenAI's Python tiktoken library, which likewise fetches encoder files at runtime. Managing the files outside the package keeps the implementation lightweight and flexible.
Benchmark Testing
For performance-conscious users, TiktokenSharp provides benchmark comparisons against other libraries such as SharpToken. Running on .NET 8.0 with the cl100k_base encoder, tests show TiktokenSharp to be relatively fast, albeit with a trade-off in memory allocation:
| Method        | Job      | Runtime  | Mean      | Error    | StdDev   | Gen0      | Allocated  |
|---------------|----------|----------|-----------|----------|----------|-----------|------------|
| TiktokenSharp | .NET 8.0 | .NET 8.0 | 98.34 ms  | 0.198 ms | 0.176 ms | 9833.3333 | 82321080 B |
| SharpToken    | .NET 8.0 | .NET 8.0 | 116.38 ms | 1.026 ms | 0.909 ms | 2000.0000 | 23201696 B |
These results suggest that TiktokenSharp delivers faster token processing than SharpToken, at the cost of noticeably higher memory allocation.
Updates
TiktokenSharp is regularly updated to integrate new features and optimizations. Recent updates have included:
- 1.1.5: Incorporation of support for o1 models (o200k_base).
- 1.1.4: Addition of support for gpt-4o (o200k_base).
- 1.1.0: Algorithm efficiency improvements.
- 1.0.9: Support for new OpenAI embeddings.
These updates demonstrate a commitment to keeping the library relevant and efficient for modern development needs.
In conclusion, TiktokenSharp is a robust option for developers needing reliable tokenization in C#, with a focus on efficiency and compatibility with established OpenAI models. It’s a valuable tool for those working extensively with modern NLP tools and models, especially within the .NET ecosystem.