Introduction to TiktokenSharp
TiktokenSharp is a practical library for developers working with tokenization in C#. The project draws inspiration from OpenAI's official tiktoken implementation (whose core is written in Rust), aiming to bring similar functionality to the C# environment. It supports encodings such as o200k_base, cl100k_base, and p50k_base, and lets users obtain an encoder either by model name or by encoding name.
Getting Started
One can easily incorporate TiktokenSharp into their C# projects via the NuGet package. This integration facilitates straightforward usage of the library by providing methods to encode and decode strings using different models or encoding names.
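Assuming the package is published on NuGet under the name TiktokenSharp (as described above), installation is a single command:

```
dotnet add package TiktokenSharp
```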
Example Usage
Here is a basic illustration demonstrating how to use TiktokenSharp in a C# project:
```csharp
using TiktokenSharp;

// Obtain an encoder from a model name
TikToken tikToken = TikToken.EncodingForModel("gpt-3.5-turbo");
var encoded = tikToken.Encode("hello world");  // [15339, 1917]
var decoded = tikToken.Decode(encoded);        // "hello world"

// Or obtain an encoder from an encoding name
TikToken cl100k = TikToken.GetEncoding("cl100k_base");
var encoded2 = cl100k.Encode("hello world");   // [15339, 1917]
var decoded2 = cl100k.Decode(encoded2);        // "hello world"
```
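A common use case is estimating how many tokens a prompt will consume before sending it to an API. A minimal sketch, assuming only the Encode method shown above (the prompt string is illustrative):

```csharp
using System;
using TiktokenSharp;

class TokenCounter
{
    static void Main()
    {
        TikToken tikToken = TikToken.EncodingForModel("gpt-3.5-turbo");

        string prompt = "Translate the following text to French: hello world";

        // Encode returns the sequence of token ids;
        // its length is the prompt's token usage.
        var tokens = tikToken.Encode(prompt);
        Console.WriteLine($"Prompt uses {tokens.Count} tokens.");
    }
}
```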
On first use, the library downloads the required tiktoken file for an encoder from the internet; this happens only once per machine. Developers can choose where these files are stored by setting TikToken.PBEFileDirectory before creating the encoder. This is particularly useful in environments with network restrictions, or in cloud deployments where local file reads and writes are constrained.
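For example, in a locked-down deployment the encoder files can be shipped alongside the application and the library pointed at them before any encoder is created. A short sketch, with an illustrative directory path:

```csharp
using TiktokenSharp;

// Point the library at a pre-populated (or at least writable) directory
// before the first encoder is constructed.
TikToken.PBEFileDirectory = "/app/data/tiktoken";

TikToken tikToken = TikToken.GetEncoding("cl100k_base");
var tokens = tikToken.Encode("hello world");
```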
Advantages of External File Management
TiktokenSharp deliberately does not bundle tiktoken files inside the package, which keeps the package size small and mirrors the behavior of OpenAI's Python tiktoken library, which likewise fetches encoder files at runtime. Managing the files outside the package keeps the implementation lightweight and flexible.
Benchmark Testing
For performance-conscious users, TiktokenSharp provides benchmark comparisons against other libraries such as SharpToken. Running on .NET 8.0 with the cl100k_base encoder, tests show TiktokenSharp to be relatively fast, albeit with a trade-off in memory allocation:
| Method        | Job      | Runtime  | Mean      | Error    | StdDev   | Gen0      | Allocated  |
|---------------|----------|----------|-----------|----------|----------|-----------|------------|
| TiktokenSharp | .NET 8.0 | .NET 8.0 | 98.34 ms  | 0.198 ms | 0.176 ms | 9833.3333 | 82321080 B |
| SharpToken    | .NET 8.0 | .NET 8.0 | 116.38 ms | 1.026 ms | 0.909 ms | 2000.0000 | 23201696 B |
These results suggest that TiktokenSharp delivers faster token processing than SharpToken, at the cost of noticeably higher memory allocation.
Updates
TiktokenSharp is regularly updated to integrate new features and optimizations. Recent updates have included:
- 1.1.5: Incorporation of support for o1 models (o200k_base).
- 1.1.4: Addition of support for gpt-4o (o200k_base).
- 1.1.0: Algorithm efficiency improvements.
- 1.0.9: Support for new OpenAI embeddings.
These updates demonstrate a commitment to keeping the library relevant and efficient for modern development needs.
In conclusion, TiktokenSharp is a robust option for developers needing reliable tokenization in C#, with a focus on efficiency and compatibility with established OpenAI models. It’s a valuable tool for those working extensively with modern NLP tools and models, especially within the .NET ecosystem.