setfit - Streamline Fine-Tuning with Sentence Transformers Without Prompts or Large Models

Introduction to SetFit: Efficient Few-shot Learning with Sentence Transformers

SetFit is an efficient, prompt-free framework for few-shot fine-tuning of Sentence Transformers. It reaches high accuracy with very little labeled data. For example, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit achieves accuracy comparable to fine-tuning RoBERTa Large on the full training set of 3,000 examples.

Unique Features of SetFit

SetFit sets itself apart from other few-shot learning techniques with several distinct features:

No Prompts or Verbalizers: Few-shot methods typically rely on handcrafted prompts or verbalizers to convert examples into a format the underlying language model can use. SetFit removes this step by generating rich embeddings directly from the text examples.
Fast Training: SetFit does not need large-scale models such as T0 or GPT-3 to reach high accuracy. As a result, training and inference are typically an order of magnitude faster.
Multilingual Support: SetFit works with any Sentence Transformer available on the Hugging Face Hub, so text in other languages can be classified by fine-tuning a multilingual checkpoint.

Installation Instructions

To get started with SetFit, you can easily install it using pip:

pip install setfit

For those interested in the latest features, you can install the bleeding-edge version from the source:

pip install git+https://github.com/huggingface/setfit.git
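After either install command, it can be useful to confirm which version ended up in the active environment. The snippet below is a minimal sketch that uses only the Python standard library; `installed_version` is a hypothetical helper name, not part of SetFit.

```python
from importlib import metadata


def installed_version(package="setfit"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"setfit {version}" if version else "setfit is not installed in this environment")
```

Querying the package metadata rather than importing the package keeps the check fast and avoids loading any model code.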

How to Use SetFit

The Quickstart Guide is an excellent resource for understanding SetFit's training, saving, and inference capabilities. Additional examples and guidance can be found in the Notebooks, Tutorials, and How-to Guides.

Sample Training Process

SetFit provides two main classes integrated with the Hugging Face Hub:

SetFitModel: Combines a pretrained Sentence Transformer with a classification head from scikit-learn or SetFit's own differentiable head.
Trainer: Manages the fine-tuning process for SetFit models.
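The split between an embedding body and a classification head can be pictured with a toy, dependency-free sketch. Everything in it is hypothetical: `toy_embed` is a bag-of-words stand-in for a Sentence Transformer, `CentroidHead` is a nearest-centroid stand-in for the classification head, and `ToyModel` is not SetFit's implementation. The only point is that text goes straight to a vector and then to a label, with no prompt in between.

```python
from collections import Counter, defaultdict


def toy_embed(text):
    """Hypothetical stand-in for the embedding body: sparse bag-of-words counts."""
    return Counter(text.lower().split())


class CentroidHead:
    """Hypothetical stand-in for the classification head: nearest class centroid."""

    def fit(self, embeddings, labels):
        grouped = defaultdict(list)
        for emb, label in zip(embeddings, labels):
            grouped[label].append(emb)
        # Average the token counts of all embeddings that share a label.
        self.centroids = {}
        for label, embs in grouped.items():
            total = Counter()
            for emb in embs:
                total.update(emb)
            self.centroids[label] = {tok: n / len(embs) for tok, n in total.items()}
        return self

    def predict(self, embeddings):
        def sq_dist(a, b):
            keys = set(a) | set(b)
            return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

        return [
            min(self.centroids, key=lambda label: sq_dist(emb, self.centroids[label]))
            for emb in embeddings
        ]


class ToyModel:
    """Body plus head, mirroring the two-part structure described above."""

    def __init__(self, body, head):
        self.body = body
        self.head = head

    def fit(self, texts, labels):
        self.head.fit([self.body(t) for t in texts], labels)
        return self

    def predict(self, texts):
        return self.head.predict([self.body(t) for t in texts])


model = ToyModel(toy_embed, CentroidHead()).fit(
    ["great film", "great acting", "awful film", "awful plot"],
    ["positive", "positive", "negative", "negative"],
)
print(model.predict(["great plot", "awful acting"]))  # ['positive', 'negative']
```

In SetFit itself the body is a pretrained Sentence Transformer that is fine-tuned on the few-shot examples, which is what lets a small head generalize from so little data.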

Below is a simple end-to-end example of training a SetFit model with the default scikit-learn classification head:

from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

# Load dataset from Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 labeled examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
# Use the first 100 validation examples for evaluation during training; hold out the rest for testing
eval_dataset = dataset["validation"].select(range(100))
test_dataset = dataset["validation"].select(range(100, len(dataset["validation"])))

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=["negative", "positive"]
)

args = TrainingArguments(
    batch_size=16,
    num_epochs=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"sentence": "text", "label": "label"}  # Map dataset columns to the text/label names the trainer expects
)

# Training and evaluation
trainer.train()
metrics = trainer.evaluate(test_dataset)
print(metrics)  # Outputs: {'accuracy': 0.8691709844559585}

# Publish model to the Hub
trainer.push_to_hub("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")

# Download model from the Hub
model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")
# Perform inference
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)  # Outputs: ["positive", "negative"]
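The two data-preparation steps in the example above, drawing a fixed number of examples per class and renaming columns through `column_mapping`, can be sketched in plain Python on a list of dictionaries. This is an illustration under stated assumptions, not SetFit's code: `sample_per_class` and `apply_column_mapping` are hypothetical helpers, the rows are synthetic, and dropping unmapped columns is a simplification chosen for this sketch.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_column="label", num_samples=8, seed=42):
    """Draw up to `num_samples` rows for each distinct label (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    rng.shuffle(sampled)  # avoid feeding the classes in blocks
    return sampled


def apply_column_mapping(rows, column_mapping):
    """Rename mapped columns; unmapped columns are dropped in this sketch."""
    return [{new: row[old] for old, new in column_mapping.items()} for row in rows]


# Synthetic stand-in data: 50 rows per class with SST-2-style column names.
rows = [{"sentence": f"review {i}", "label": i % 2, "idx": i} for i in range(100)]

few_shot = sample_per_class(rows, num_samples=8)
train_rows = apply_column_mapping(few_shot, {"sentence": "text", "label": "label"})

print(len(train_rows))        # 16: 8 examples for each of the 2 classes
print(sorted(train_rows[0]))  # ['label', 'text']
```

Sampling per class rather than from the whole dataset keeps the tiny training set balanced, which matters when each class has only a handful of examples.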

Reproducing Paper Results

Scripts are provided for reproducing the SetFit and baseline results reported in Table 2 of the paper. Setup and training instructions are in the scripts/ directory.

Developer Information

For development, first create a Python virtual environment, for example with Conda:

conda create -n setfit python=3.9 && conda activate setfit

Then install SetFit in editable mode together with its development dependencies:

pip install -e '.[dev]'

Code Formatting

SetFit uses black and isort for consistent code formatting. After completing the installation steps above, format and check your code locally with:

make style && make quality

Project Architecture

The project is structured as follows:

├── LICENSE
├── Makefile                <- Commands like `make style` or `make tests`
├── README.md               <- Top-level README
├── docs                    <- Documentation source files
├── notebooks               <- Jupyter notebooks
├── final_results           <- Predictions from the paper
├── scripts                 <- Scripts for training and inference
├── setup.cfg               <- Package metadata config
├── setup.py                <- Make pip installable with `pip install -e`
├── src                     <- Source code
└── tests                   <- Unit tests

Summary

Because it needs neither large models nor handcrafted prompts, SetFit is an efficient option for few-shot text classification, and it extends to other languages by swapping in a multilingual Sentence Transformer checkpoint.