concept-erasure - Utilizing Concept Erasure for Improved Model Fairness and Interpretability

Introduction to Least-Squares Concept Erasure (LEACE)

Concept erasure removes specific features or concepts from a data representation. It can improve fairness by preventing an algorithm from using sensitive information, such as gender or race, when making decisions. It can also improve interpretability: by comparing a model's behavior with and without a concept, we can see how much the model relies on it. LEAst-squares Concept Erasure (LEACE) is a closed-form method that guarantees no linear classifier can detect the erased concept, while changing the representation as little as possible in the least-squares sense. The method is described in the paper "LEACE: Perfect linear concept erasure in closed form".

Installation

To start using LEACE, ensure you have Python 3.10 or newer. The package can be easily installed from PyPI with the following command:

pip install concept-erasure

Core Components: LeaceFitter and LeaceEraser

The LEACE project revolves around two primary classes: LeaceFitter and LeaceEraser.

LeaceFitter: This class tracks the covariance and cross-covariance statistics needed to compute the LEACE erasure function. The statistics can be updated incrementally with LeaceFitter.update(). The erasure function itself is computed lazily, only when the .eraser property is accessed. Because the fitter stores a full covariance matrix, it needs O(d²) memory, where d is the feature dimension.
LeaceEraser: This class provides a more memory-efficient representation of the LEACE erasure function, requiring only O(dk) memory. Here, k represents the number of classes associated with the concept to be erased.
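
To make the difference between the two representations concrete, here is a minimal NumPy sketch of a least-squares eraser built from the same kind of statistics the fitter tracks. It is not the library's implementation: the names (proj_left, proj_right, erase), the synthetic data, and the numerical tolerances are assumptions chosen for illustration. It follows the closed form given in the LEACE paper, r(x) = x - W^+ P W (x - mean), where W whitens the features and P projects onto the column space of W times the cross-covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 8, 2  # samples, feature dim, concept dim (illustrative sizes)

# Synthetic data: one-hot concept labels Z that leak linearly into X.
labels = rng.integers(0, k, size=n)
Z = np.eye(k)[labels]                                   # (n, k)
X = rng.normal(size=(n, d)) + Z @ rng.normal(size=(k, d))

# Statistics a fitter would accumulate: O(d^2) for the covariance, O(dk) for the rest.
mu = X.mean(axis=0)
Xc, Zc = X - mu, Z - Z.mean(axis=0)
sigma_xx = Xc.T @ Xc / n                                # (d, d)
sigma_xz = Xc.T @ Zc / n                                # (d, k)

# Whitening matrix W = sigma_xx^(-1/2) and its pseudo-inverse, via eigendecomposition.
evals, evecs = np.linalg.eigh(sigma_xx)
keep = evals > 1e-10 * evals.max()
V, lam = evecs[:, keep], evals[keep]
W = (V / np.sqrt(lam)) @ V.T
W_pinv = (V * np.sqrt(lam)) @ V.T

# Orthonormal basis U for the column space of W @ sigma_xz.
u, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
U = u[:, s > 1e-10 * s.max()]

# Low-rank factors: together with mu, these are all the eraser needs (O(dk) memory).
proj_left = W_pinv @ U                                  # (d, r) with r <= k
proj_right = U.T @ W                                    # (r, d)

def erase(x):
    """Apply x - proj_left @ proj_right @ (x - mu) to each row of x."""
    return x - (x - mu) @ proj_right.T @ proj_left.T

X_erased = erase(X)
cross_cov_after = (X_erased - X_erased.mean(axis=0)).T @ Zc / n
print(np.abs(sigma_xz).max(), np.abs(cross_cov_after).max())
```

Once the two factors and the mean are computed, the d × d covariance matrix is no longer needed, which is why an eraser can be stored far more compactly than the statistics used to fit it.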

Batch Usage

If all of your data fits in memory, as feature vectors X with corresponding concept labels Z, the simplest way to apply concept erasure is the LeaceEraser.fit() method. In the example below, the class labels Y play the role of the concept Z:

import torch
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from concept_erasure import LeaceEraser

n, d, k = 2048, 128, 2
X, Y = make_classification(n_samples=n, n_features=d, n_classes=k, random_state=42)
X_t = torch.from_numpy(X)
Y_t = torch.from_numpy(Y)

# Before erasure, logistic regression learns non-trivial weights
real_lr = LogisticRegression(max_iter=1000).fit(X, Y)
beta = torch.from_numpy(real_lr.coef_)
assert beta.norm(p=torch.inf) > 0.1

# Fit eraser
eraser = LeaceEraser.fit(X_t, Y_t)
X_ = eraser(X_t)

# After erasure, logistic regression learns (numerically) zero weights
null_lr = LogisticRegression(max_iter=1000, tol=0.0).fit(X_.numpy(), Y)
beta = torch.from_numpy(null_lr.coef_)
assert beta.norm(p=torch.inf) < 1e-4
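
The final assertion above can be understood through a short calculation, sketched here with NumPy on hypothetical synthetic data (this is an illustration of the reasoning, not code from the library). At zero weights, with the intercept set to the log-odds of the base rate, the gradient of the mean logistic loss with respect to the weights equals the negative cross-covariance between features and labels. If erasure drives that cross-covariance to zero, zero weights become a stationary point, and because the loss is convex they are optimal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
y = rng.integers(0, 2, size=n).astype(float)
# Features that leak the label linearly (illustrative data only).
X = rng.normal(size=(n, d)) + np.outer(y, rng.normal(size=d))

# Zero weights plus an intercept at the base-rate log-odds: every prediction is mean(y).
p = np.full(n, y.mean())

# Gradient of the mean logistic loss with respect to the weights at that point.
grad_at_zero = X.T @ (p - y) / n

# Cross-covariance between features and labels.
cross_cov = (X - X.mean(axis=0)).T @ (y - y.mean()) / n

print(np.allclose(grad_at_zero, -cross_cov))
```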

Streaming Usage

When the data arrives as a stream, or is too large to hold in memory at once, call LeaceFitter.update() on each batch to update the statistics incrementally. The following example illustrates this:

from concept_erasure import LeaceFitter
from sklearn.datasets import make_classification
import torch

n, d, k = 2048, 128, 2
X, Y = make_classification(n_samples=n, n_features=d, n_classes=k, random_state=42)
X_t = torch.from_numpy(X)
Y_t = torch.from_numpy(Y)

# Feature dimension d; the binary labels form a single concept column
fitter = LeaceFitter(d, 1, dtype=X_t.dtype)

# Update with batched data
for x, y in zip(X_t.chunk(2), Y_t.chunk(2)):
    fitter.update(x, y)

# Erase the concept from a single example; the eraser is computed on first access
x_ = fitter.eraser(X_t[0])
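
For intuition about what a batched update has to do, the following NumPy sketch maintains a running mean, covariance, and cross-covariance with a Welford-style update. RunningStats and its attribute names are hypothetical and written for this illustration; the library's own internals may differ. The point is that the accumulators have a fixed size, O(d² + dk), no matter how many batches are seen.

```python
import numpy as np

class RunningStats:
    """Accumulates mean, covariance, and cross-covariance over batches (illustrative helper)."""

    def __init__(self, x_dim, z_dim):
        self.n = 0
        self.mean_x = np.zeros(x_dim)
        self.mean_z = np.zeros(z_dim)
        self.m2_xx = np.zeros((x_dim, x_dim))  # running sum of centered x x^T products
        self.m2_xz = np.zeros((x_dim, z_dim))  # running sum of centered x z^T products

    def update(self, x, z):
        self.n += len(x)
        # Deviations from the old mean, then shift the mean, then deviations from the new mean.
        dx_old = x - self.mean_x
        self.mean_x += dx_old.sum(axis=0) / self.n
        dx_new = x - self.mean_x
        self.mean_z += (z - self.mean_z).sum(axis=0) / self.n
        dz_new = z - self.mean_z
        # Pairing old-mean and new-mean deviations keeps the sums exact across batches.
        self.m2_xx += dx_old.T @ dx_new
        self.m2_xz += dx_old.T @ dz_new
        return self

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
Z = rng.normal(size=(1000, 2))

stats = RunningStats(4, 2)
for xb, zb in zip(np.array_split(X, 5), np.array_split(Z, 5)):
    stats.update(xb, zb)

sigma_xx = stats.m2_xx / stats.n
sigma_xz = stats.m2_xz / stats.n

# Reference values computed from the full data in one pass.
Xc, Zc = X - X.mean(axis=0), Z - Z.mean(axis=0)
print(np.allclose(sigma_xx, Xc.T @ Xc / len(X)), np.allclose(sigma_xz, Xc.T @ Zc / len(X)))
```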

Paper Replication and Concept Scrubbing

The scripts used to generate part-of-speech tags for the concept scrubbing experiments are available in a separate repository, and the tagged datasets will soon be accessible on the HuggingFace Hub. Concept scrubbing is implemented separately for each model family; the current implementations, for families such as LLaMA and GPT-NeoX, live in the concept_erasure.scrubbing submodule.