GPT3 Tokenizer
Overview
GPT3 Tokenizer is a TypeScript library for tokenizing text in a way that is compatible with OpenAI's GPT-3 models. It supports two types of tokenization: gpt3 and codex. The tokenizer runs in both Node.js and browser environments, so the same code can be used on the server and on the client.
Key Features
- Isomorphic Design: The GPT3 Tokenizer is built to work seamlessly in different environments, catering to both server-side and client-side applications.
- GPT-3 and Codex Support: It provides tokenization support specifically tailored to OpenAI's gpt3 and codex models.
- High Compatibility: The tokenization results are consistent with the output of OpenAI's GPT-3 Playground, so you get reliable and accurate tokenization.
How to Use
To start using the GPT3 Tokenizer, follow these simple steps:
- Installation: Install the library with yarn by running the following command in your terminal:
    yarn add gpt3-tokenizer
- Implementation: Once installed, use the tokenizer in your codebase as shown below:
    import GPT3Tokenizer from 'gpt3-tokenizer';

    const tokenizer = new GPT3Tokenizer({ type: 'gpt3' }); // You can also specify 'codex'
    const str = "hello 👋 world 🌍";
    const encoded: { bpe: number[]; text: string[] } = tokenizer.encode(str);
    const decoded = tokenizer.decode(encoded.bpe);
This snippet demonstrates basic encoding of a string into tokens and decoding of those tokens back into text.
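A common follow-up to encoding is counting tokens, for example to check whether a prompt fits within a model's limit. The sketch below is illustrative only: countTokens, fitsTokenBudget, and the TokenEncoder interface are hypothetical names, not part of the library. It assumes only the encode() result shape shown in the snippet above ({ bpe: number[]; text: string[] }), so any object with a compatible encode() method can be passed in.

```typescript
// Shape of the encode() result, as shown in the usage snippet.
interface EncodeResult {
  bpe: number[];
  text: string[];
}

// Hypothetical structural type: anything exposing a compatible encode().
// A GPT3Tokenizer instance is assumed (not verified here) to satisfy it.
interface TokenEncoder {
  encode(input: string): EncodeResult;
}

// Hypothetical helper: number of tokens the encoder produces for `input`.
function countTokens(encoder: TokenEncoder, input: string): number {
  return encoder.encode(input).bpe.length;
}

// Hypothetical helper: true when `input` uses at most `maxTokens` tokens.
function fitsTokenBudget(
  encoder: TokenEncoder,
  input: string,
  maxTokens: number
): boolean {
  return countTokens(encoder, input) <= maxTokens;
}
```

With the real library, the tokenizer instance created in the snippet above would presumably be passed as the first argument, as in countTokens(tokenizer, str). Accepting an interface rather than the concrete class also keeps the helpers easy to test with a stub encoder.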
Underlying Technology
The GPT3 Tokenizer is built upon resources and inspiration from the following:
- OpenAI Tokenizer Page Source: The official tokenizer page from OpenAI provides foundational insights.
- gpt-3-encoder: This existing library served as a critical reference, though GPT3 Tokenizer offers additional functionality, such as support for both gpt3 and codex tokenization.
Moreover, the library uses the Map API instead of plain JavaScript objects. This is particularly beneficial for bpeRanks, the lookup table of BPE merge ranks, where it improves performance.
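To illustrate the idea of a Map-based rank table, here is a minimal sketch. It is not the library's actual implementation: pairKey, buildBpeRanks, bestPair, the space-joined key format, and the sample merges are all assumptions made for this example. It shows a table that maps each symbol pair to its merge priority and a lookup that picks the highest-priority pair among candidates.

```typescript
type Pair = [string, string];

// Assumed key format: the two symbols joined by a single space.
function pairKey([left, right]: Pair): string {
  return `${left} ${right}`;
}

// Build the rank table: earlier merges get lower (higher-priority) ranks.
function buildBpeRanks(merges: Pair[]): Map<string, number> {
  const ranks = new Map<string, number>();
  merges.forEach((pair, rank) => ranks.set(pairKey(pair), rank));
  return ranks;
}

// Return the candidate with the lowest rank, or undefined if none is ranked.
function bestPair(ranks: Map<string, number>, candidates: Pair[]): Pair | undefined {
  let best: Pair | undefined;
  let bestRank = Infinity;
  for (const pair of candidates) {
    const rank = ranks.get(pairKey(pair));
    if (rank !== undefined && rank < bestRank) {
      best = pair;
      bestRank = rank;
    }
  }
  return best;
}

// Illustrative merges only; not taken from any real vocabulary file.
const bpeRanks = buildBpeRanks([["h", "e"], ["l", "l"], ["he", "ll"]]);
```

Besides any speed difference, a Map has no inherited keys: bpeRanks.get("constructor") is undefined, whereas looking up "constructor" on a plain object literal returns an inherited function. A Map also exposes its entry count directly through size.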
Licensing
The GPT3 Tokenizer is released under the MIT License, which allows for flexibility in usage and modification. This means developers are free to use, copy, modify, and distribute the software under the terms of this license.
In summary, GPT3 Tokenizer offers a robust, flexible, and precise way to handle text tokenization for those utilizing OpenAI's models, providing ease of use in diverse programming environments.