Introduction to SetFit: Efficient Few-shot Learning with Sentence Transformers
SetFit is an efficient, prompt-free framework for few-shot fine-tuning of Sentence Transformers. It reaches high accuracy with very little labeled data. For example, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit achieves accuracy comparable to fine-tuning RoBERTa Large on the full training set of 3,000 examples.
Unique Features of SetFit
SetFit sets itself apart from other few-shot learning techniques with several distinct features:
- No Prompts or Verbalizers: Traditional few-shot methods typically rely on handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model. SetFit removes this step by generating rich embeddings directly from the text.
- Fast Training: SetFit does not need large-scale models such as T0 or GPT-3 to reach high accuracy. As a result, it is often an order of magnitude faster, both in training and at inference.
- Multilingual Support: SetFit works with any Sentence Transformer model on the Hugging Face Hub, so you can classify text in multiple languages by fine-tuning a multilingual checkpoint.
Installation Instructions
To get started with SetFit, you can easily install it using pip:
pip install setfit
To get the latest features, install the bleeding-edge version from source:
pip install git+https://github.com/huggingface/setfit.git
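After either install command, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and is not part of SetFit.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("setfit")
    if version is None:
        print("setfit is not installed in this environment")
    else:
        print(f"setfit {version} is installed")
```

Returning `None` instead of raising keeps the check usable in setup scripts that only need to decide whether to run `pip install`.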
How to Use SetFit
The Quickstart Guide is an excellent resource for understanding SetFit's training, saving, and inference capabilities. Additional examples and guidance can be found in the Notebooks, Tutorials, and How-to Guides.
Sample Training Process
SetFit provides two main classes integrated with the Hugging Face Hub:
- SetFitModel: Combines a pretrained Sentence Transformer with a classification head from scikit-learn or SetFit's own differentiable head.
- Trainer: Manages the fine-tuning process for SetFit models.
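The split between an embedding body and a classification head can be pictured with a toy stand-in. The sketch below is not SetFit's implementation and does not use the library: `toy_embed` is a hypothetical bag-of-words encoder standing in for the Sentence Transformer body, `CentroidHead` is a hypothetical nearest-centroid classifier standing in for the head, and `ToyBodyPlusHead` is an illustrative wrapper. It only shows the data flow from texts to embeddings to labels.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical toy vocabulary; a real body produces dense learned embeddings.
VOCAB = ["good", "great", "loved", "bad", "awful", "worst"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for the embedding body: count occurrences of vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidHead:
    """Stand-in for the classification head: predict the nearest class centroid."""

    def __init__(self) -> None:
        self.centroids: Dict[str, List[float]] = {}

    def fit(self, embeddings: List[List[float]], labels: List[str]) -> None:
        grouped: Dict[str, List[List[float]]] = {}
        for vector, label in zip(embeddings, labels):
            grouped.setdefault(label, []).append(vector)
        for label, vectors in grouped.items():
            # Average each dimension over all vectors of this class.
            self.centroids[label] = [sum(dim) / len(vectors) for dim in zip(*vectors)]

    def predict(self, embeddings: List[List[float]]) -> List[str]:
        return [
            min(self.centroids, key=lambda lab: squared_distance(vec, self.centroids[lab]))
            for vec in embeddings
        ]


class ToyBodyPlusHead:
    """Compose the two parts: texts -> embeddings -> labels."""

    def __init__(self) -> None:
        self.head = CentroidHead()

    def fit(self, texts: List[str], labels: List[str]) -> None:
        self.head.fit([toy_embed(t) for t in texts], labels)

    def predict(self, texts: List[str]) -> List[str]:
        return self.head.predict([toy_embed(t) for t in texts])


if __name__ == "__main__":
    model = ToyBodyPlusHead()
    model.fit(
        ["loved it great", "good good film", "awful plot", "the worst and bad"],
        ["positive", "positive", "negative", "negative"],
    )
    print(model.predict(["a great movie", "bad acting"]))  # ['positive', 'negative']
```

Because the head only ever sees embeddings, either part can be swapped without touching the other, which is the property the SetFitModel description above relies on.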
Below is a simple end-to-end example of training a SetFit model:
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"].select(range(100))
test_dataset = dataset["validation"].select(range(100, len(dataset["validation"])))

# Load a SetFit model from the Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=["negative", "positive"],
)

args = TrainingArguments(
    batch_size=16,
    num_epochs=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to the text/label names the trainer expects
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate(test_dataset)
print(metrics)  # Outputs: {'accuracy': 0.8691709844559585}

# Publish the model to the Hub
trainer.push_to_hub("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")

# Download the model from the Hub
model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")

# Perform inference
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)  # Outputs: ["positive", "negative"]
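Two data-preparation steps in the example above, the per-class sampling done by `sample_dataset` and the renaming described by `column_mapping`, can be pictured in plain Python. This is a rough sketch of the idea under stated assumptions, not the library's code: `sample_per_class` and `rename_columns` are hypothetical helpers, the rows are made-up records shaped like SST-2 entries, and the default seed is an arbitrary choice made here for reproducibility.

```python
import random
from collections import defaultdict
from typing import Dict, List


def rename_columns(rows: List[dict], column_mapping: Dict[str, str]) -> List[dict]:
    """Keep only the mapped columns, renamed from dataset names to target names."""
    return [{new: row[old] for old, new in column_mapping.items()} for row in rows]


def sample_per_class(
    rows: List[dict],
    label_column: str = "label",
    num_samples: int = 8,
    seed: int = 42,
) -> List[dict]:
    """Return up to `num_samples` randomly chosen rows for every label value."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)

    rng = random.Random(seed)  # fixed seed so the few-shot subset is reproducible
    sampled: List[dict] = []
    for label in sorted(by_label):
        group = by_label[label]
        # Classes with fewer than `num_samples` rows contribute everything they have.
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    return sampled


if __name__ == "__main__":
    # Hypothetical toy rows with "sentence" and "label" columns.
    rows = [{"sentence": f"review {i}", "label": i % 2} for i in range(100)]
    few_shot = sample_per_class(rows, label_column="label", num_samples=8)
    few_shot = rename_columns(few_shot, {"sentence": "text", "label": "label"})
    print(len(few_shot))  # 16: 8 rows for each of the 2 labels
```

Sampling per class rather than from the whole dataset keeps the few-shot training set balanced, which matters when each class has only a handful of examples.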
Reproducing Paper Results
Scripts are available to reproduce the results for SetFit and its baselines reported in Table 2 of the paper. Setup and training instructions can be found in the scripts/ directory.
Developer Information
A Python virtual environment is recommended for development. With Conda, create and activate one:
conda create -n setfit python=3.9 && conda activate setfit
Then, install the base requirements:
pip install -e '.[dev]'
Code Formatting
SetFit uses black and isort for consistent code formatting. After installation, developers can check their code with:
make style && make quality
Project Architecture
The project is structured as follows:
├── LICENSE
├── Makefile        <- Commands like `make style` or `make tests`
├── README.md       <- Top-level README
├── docs            <- Documentation source files
├── notebooks       <- Jupyter notebooks
├── final_results   <- Predictions from the paper
├── scripts         <- Scripts for training and inference
├── setup.cfg       <- Package metadata config
├── setup.py        <- Make pip installable with `pip install -e`
├── src             <- Source code
└── tests           <- Unit tests
Related Work
By eliminating the need for large model structures and crafted prompts, SetFit stands as a versatile and efficient solution for few-shot learning, enabling multilingual capabilities straight out of the box.