LESS: Selecting Influential Data for Targeted Instruction Tuning
The LESS project changes how data is selected for training machine learning models by homing in on the most influential data points. The approach is described in the ICML 2024 paper "LESS: Selecting Influential Data for Targeted Instruction Tuning." At its core, LESS estimates how much each candidate training example would improve a desired capability and selects the examples most likely to help.
Install Requirements
To begin using the LESS framework, first install the required software:
- PyTorch Installation: Ensure you have PyTorch installed first, since the other dependencies build against it:

```
pip3 install torch==2.1.2 torchvision torchaudio
```

- Additional Packages: Once PyTorch is set up, install the other dependencies required by the LESS framework:

```
cd LESS
pip install -r requirement.txt
```

- LESS Package Installation: Finally, install the LESS package in editable mode so local changes take effect without reinstalling:

```
pip install -e .
```
Data Preparation
LESS trains on a combination of four instruction tuning datasets: Flan v2, CoT, Dolly, and Open Assistant. To evaluate the framework's performance, the MMLU, TydiQA, and BBH benchmarks are used as target tasks. A processed version of these datasets can be accessed here.
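For orientation, processed instruction-tuning data of this kind is typically stored as chat-style records; a minimal Python illustration follows. The field names below ("dataset", "id", "messages") are assumptions for illustration only, and the released files may use a different schema.

```python
# Illustrative shape of one processed instruction-tuning example.
# Field names are assumptions; consult the released files for the
# actual schema.
example = {
    "dataset": "dolly",
    "id": "dolly_0",
    "messages": [
        {"role": "user", "content": "Explain LoRA in one sentence."},
        {"role": "assistant",
         "content": "LoRA fine-tunes a model by training small low-rank "
                    "adapter matrices instead of the full weights."},
    ],
}
```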
Data Selection Pipeline
The data selection pipeline in LESS is a structured multistep process aimed at identifying and utilizing the most impactful training data, broken down as follows:
Step 1: Warmup Training
This initial phase, known as warmup training, trains LoRA (Low-Rank Adaptation) adapters on a small percentage of the whole dataset. Starting from this warmup checkpoint is crucial for the quality of the later data selection, since the gradients used for selection are extracted from it. A sample command for launching warmup training is shared in the project documents.
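As a rough illustration of what this phase involves, the sketch below sets up a LoRA model with Hugging Face peft. It is not the repository's training script: the base model, rank, alpha, and target modules are illustrative choices, and the actual hyperparameters live in the project's scripts.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model used in the paper's experiments; requires access to the
# gated Llama-2 weights. Any causal LM works for this sketch.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

# LoRA configuration: rank/alpha/dropout/target modules here are
# illustrative, not necessarily the repository's exact settings.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=512,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # only adapter weights are trainable
model.print_trainable_parameters()

# Warmup: fine-tune on a small random fraction (e.g. 5%) of the
# instruction-tuning mix with a standard training loop, then keep the
# checkpoints -- gradients are extracted from these models next.
```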
Step 2: Building the Gradient Datastore
Following warmup training, the next step is to compute gradients for every example in the training set using the warmup checkpoint. These per-example gradients, projected to a low dimension so they remain cheap to store and search, form the datastore that the selection step matches against.
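The sketch below outlines the idea under simplifying assumptions: per-example gradients of the LoRA parameters are flattened and compressed with a fixed random projection, so that training and validation gradients land in the same low-dimensional space. All function names are hypothetical; the real implementation uses a memory-efficient projector (a dense Gaussian matrix, as here, is only feasible for small models) and additionally preconditions training gradients with Adam's moment estimates, which is omitted for brevity.

```python
import torch

def make_projection(n_params: int, proj_dim: int, seed: int = 0) -> torch.Tensor:
    """Fixed Gaussian random projection (seeded so that training and
    validation gradients are projected into the same space)."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(n_params, proj_dim, generator=gen) / proj_dim ** 0.5

def example_gradient(model, batch) -> torch.Tensor:
    """Flattened gradient of the loss for one example w.r.t. the
    trainable (LoRA) parameters. `batch` must include labels."""
    model.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    grads = [p.grad.reshape(-1) for p in model.parameters()
             if p.requires_grad and p.grad is not None]
    return torch.cat(grads)

def build_datastore(model, dataloader, proj: torch.Tensor) -> torch.Tensor:
    """Stack of projected per-example gradients: [n_examples, proj_dim]."""
    rows = []
    for batch in dataloader:              # batch size 1: one example each
        g = example_gradient(model, batch)
        rows.append(g.to(proj.dtype) @ proj)
    return torch.stack(rows)
```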
Step 3: Selecting Data for a Task
To select data for a downstream task, examples from that task must first be prepared in the same instruction-tuning format used for training. LESS provides data-loading modules for the predefined evaluation datasets BBH, TydiQA, and MMLU; for additional tasks, the selection scripts can be adapted. The validation-side gradients are computed with stochastic gradient descent (SGD), and each training example is scored by how well its stored gradient matches them.
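A minimal sketch of the scoring itself, assuming the projected gradients from the previous steps: each training example is scored by the cosine similarity between its stored gradient and the validation gradients, averaged over the validation set, and the top-scoring fraction is kept. The paper additionally aggregates scores across the warmup checkpoints (weighted by learning rate); that is left out here, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def influence_scores(train_grads: torch.Tensor,
                     val_grads: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each training gradient to the validation
    gradients, averaged over the validation set.
    train_grads: [n_train, proj_dim]; val_grads: [n_val, proj_dim]."""
    train_n = F.normalize(train_grads, dim=1)
    val_n = F.normalize(val_grads, dim=1)
    return (train_n @ val_n.T).mean(dim=1)   # [n_train]

def select_top(scores: torch.Tensor, fraction: float = 0.05) -> torch.Tensor:
    """Indices of the top-scoring `fraction` of training examples."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```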
Step 4: Train with Your Selected Data
Once the influential data has been selected, train your model on the curated subset. Training can be run with a predefined script, and full-parameter finetuning is also possible by adjusting the relevant training parameters.
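As a small illustration of the hand-off between selection and training, the hypothetical helper below writes the chosen examples to a JSONL file that a training script could consume; the file name and the 5% default above are arbitrary.

```python
import json
import torch

def write_selected(train_examples, indices: torch.Tensor,
                   path: str = "selected_train.jsonl") -> None:
    """Write the selected subset (by index) to a JSONL file."""
    with open(path, "w") as f:
        for i in indices.tolist():
            f.write(json.dumps(train_examples[i]) + "\n")
```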
Evaluation
Detailed guidance for evaluating the performance of models trained with selected data is found in the evaluation section of the project documentation. This evaluation step ensures that the selected data is indeed enhancing model performance as intended.
Bugs or Questions?
For questions regarding the LESS code or related research, Mengzhou is the point of contact and can be reached via email. Should any technical issues arise while using the LESS framework, opening an issue with detailed information will help expedite a resolution.
Citation
If the LESS framework proves beneficial to your work, crediting the authors by citing their ICML 2024 paper is appreciated:
@inproceedings{xia2024less,
title={{LESS}: Selecting Influential Data for Targeted Instruction Tuning},
author={Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi},
booktitle={International Conference on Machine Learning (ICML)},
year={2024}
}
By prioritizing the training examples that most improve performance on a target task, LESS offers a practical way to make instruction tuning both more targeted and more data-efficient.