Floret: Enhanced Word Representation with FastText and Bloom Embeddings
Floret is an extended version of the widely-used FastText library, designed to produce a word representation for any word while keeping the vector table compact. It combines FastText's subword embeddings with Bloom embeddings to cover an effectively unlimited vocabulary efficiently, and it integrates directly with the spaCy NLP library.
Core Features of Floret
- Subword Embeddings with FastText: FastText uses subword information to generate embeddings even for words never seen during training. Each word is broken into character ngrams (subwords), so out-of-vocabulary words can still be represented robustly.
- Bloom Embeddings for Compact Storage: Floret employs Bloom embeddings, also known as the "hashing trick." Each entry is hashed into several rows of a shared table, and its vector is the sum of those rows. Individual rows do collide, but because every entry combines multiple rows, entries remain distinguishable in practice while the table stays drastically smaller (see the sketch after this list).
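To make the hashing trick concrete, here is a minimal, self-contained sketch in Python. It is illustrative only, with a toy hash function and table size, not Floret's actual implementation:

    import numpy as np

    def bloom_vector(key, table, hash_count=2):
        # Select hash_count rows with independent hashes and sum them:
        # many keys share individual rows, but the summed combination
        # is almost always unique to the key.
        rows = [hash((key, seed)) % table.shape[0] for seed in range(hash_count)]
        return table[rows].sum(axis=0)

    # A small table: 1,000 rows instead of one row per vocabulary entry.
    table = np.random.default_rng(0).normal(size=(1000, 300))
    print(bloom_vector("apple", table).shape)  # (300,)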
Getting Started with Floret
To experiment with Floret, users can explore an example notebook that demonstrates how to work with English vectors. This hands-on introduction is available on Google Colab for interactive learning.
Installation
- From Source: To build Floret from source:

    git clone https://github.com/explosion/floret
    cd floret
    make

  This generates the floret binary.

- Python Integration: For Python users, Floret can be installed via pip:

    pip install floret

  Alternatively, you can install directly from the source in developer mode.
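As a quick check after installation, training can also be driven from Python. A minimal sketch, assuming the bindings mirror fastText's Python API and the CLI flags described in the next section (data.txt is a hypothetical plain-text corpus):

    import floret

    # Train floret-mode vectors; keyword names follow the CLI flags.
    model = floret.train_unsupervised(
        "data.txt",
        model="cbow",
        mode="floret",
        hashCount=2,
        bucket=50000,
        minn=4,
        maxn=5,
        dim=300,
    )
    model.save_model("vectors.bin")
    model.save_vectors("vectors.floret")  # table format that spaCy can import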
Usage of Floret
Floret extends the FastText command-line tool with two options that control hashing and compact storage:

- -mode: set to either fasttext or floret. In floret mode, both word and character ngrams are hashed into buckets.
- -hashCount: the number of hashes (from 1 to 4) used per word/subword.

The additional floret mode creates a highly compact vector table by storing word entries together with subword embeddings, significantly reducing space while maintaining performance.
An example command that trains embeddings with the Continuous Bag of Words (CBOW) technique, using 4- and 5-gram subwords:
floret cbow -dim 300 -minn 4 -maxn 5 -mode floret -hashCount 2 -bucket 50000 -input input.txt -output vectors
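Once training finishes, any string can be mapped to a vector, including words absent from the corpus. A hedged sketch using the Python bindings (assuming they keep fastText's load_model/get_word_vector API; file names follow the -output prefix above):

    import floret

    model = floret.load_model("vectors.bin")

    # In-vocabulary and out-of-vocabulary strings both receive vectors,
    # assembled from the hashed word/subword entries.
    print(model.get_word_vector("apple")[:5])
    print(model.get_word_vector("applle")[:5])  # misspelling still gets a vector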
How Floret Operates
FastText stores words and subwords in two separate, extensive tables; Floret removes this redundancy by merging them into a single compact table, using Bloom embeddings for efficient storage.
With Bloom embeddings, each entry's vector is the sum of multiple hashed rows. Under settings such as -minn 4 -maxn 5 -mode floret -hashCount 2, each word is decomposed into character ngrams (e.g., for the word "apple", the ngrams include <appl, apple, and more), and the word and each of its ngrams are hashed into two rows apiece; summing the selected rows yields the vector. This keeps representations distinct with minimal storage, as the sketch below illustrates.
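The sketch below illustrates the lookup scheme with a toy hash and a random table (real FastText/Floret use a different hash and normalize the sum): extract the boundary-marked ngrams, hash each entry hashCount times, and add up the selected rows.

    import numpy as np

    def ngrams(word, minn=4, maxn=5):
        # Boundary-marked character ngrams plus the full word, as in FastText.
        marked = f"<{word}>"
        grams = [marked]
        for n in range(minn, maxn + 1):
            grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
        return grams

    def word_vector(word, table, hash_count=2):
        # Each word/ngram entry selects hash_count rows; the word's vector
        # is the sum of all selected rows.
        vec = np.zeros(table.shape[1])
        for gram in ngrams(word):
            for seed in range(hash_count):
                vec += table[hash((gram, seed)) % table.shape[0]]
        return vec

    table = np.random.default_rng(0).normal(size=(50000, 300))
    print(ngrams("apple"))  # ['<apple>', '<app', 'appl', 'pple', 'ple>', '<appl', 'apple', 'pple>']
    print(word_vector("apple", table).shape)  # (300,)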
Integration with spaCy
Floret produces a .floret vector table, compatible with spaCy from version 3.2 onwards, which can be imported using:
spacy init vectors LANG vectors.floret spacy_vectors_dir --mode floret
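The resulting directory is a loadable spaCy pipeline. A minimal sketch (the directory name comes from the command above):

    import spacy

    # Load the pipeline created by `spacy init vectors`.
    nlp = spacy.load("spacy_vectors_dir")

    # With floret vectors, out-of-vocabulary tokens also receive vectors.
    for token in nlp("apple applle"):
        print(token.text, token.vector[:3])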
Conclusion
Floret cleverly combines the strengths of FastText and Bloom embeddings to offer an efficient and compact solution for word representation, suitable even for languages with rich morphology. Its seamless integration with spaCy makes it an attractive choice for developers working on cutting-edge NLP projects.