Fine-tuning Florence-2: Microsoft's Cutting-edge Vision Language Models
Florence-2 is an innovative vision-language model introduced by Microsoft in June 2024. Despite its compact size, with versions at 0.2 billion and 0.7 billion parameters, it delivers impressive performance across a range of computer vision and vision-language tasks, including captioning, object detection, and optical character recognition (OCR). This versatility makes it attractive for a wide variety of applications.
While Florence-2 offers robust pre-trained capabilities, there are scenarios where fine-tuning is essential. This is particularly true when your task isn't directly supported or when you need to tailor the model's output to suit specific requirements.
Installation
To get started with Florence-2, this project uses uv, a fast Python package installer written in Rust, to manage dependencies. With uv installed, you can set up your environment and install the requirements as follows:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
If you run into issues with flash-attn, reinstalling it without build isolation usually resolves them:
uv pip install -U flash-attn --no-build-isolation
Data Preparation
For experimentation, the DocVQA dataset is used. It is readily available on the Hugging Face Hub, preprocessed for easy use, and loading it with the datasets library is straightforward:
from datasets import load_dataset
data = load_dataset('HuggingFaceM4/DocumentVQA')
print(data)
This outputs a structured dataset with training, validation, and test splits, each containing the features needed for visual question answering.
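To get a feel for the data, you can inspect a single training example. The field names used below (question, answers, image) should match this dataset's features, but it is worth checking them against the print(data) output:
sample = data["train"][0]
print(sample["question"])   # the question asked about the document
print(sample["answers"])    # a list of acceptable answers
sample["image"]             # a PIL image of the document page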
Updating Florence-2 for Fine-Tuning
Certain adjustments had to be made to the Florence2Seq2SeqLMOutput class so that Florence-2 can be fine-tuned effectively. These changes are documented in pull requests against the original model repositories. To pick up the modifications, either load a checkpoint that already includes them or point to the relevant pull request revision:
import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "andito/Florence-2-large-ft", trust_remote_code=True
).to(device)
alternative_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft", trust_remote_code=True, revision="refs/pr/10"
).to(device)
Either way, these revisions ensure the model is ready to be fine-tuned across various tasks.
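Fine-tuning also requires the matching processor, which handles both image preprocessing and tokenization. A minimal sketch, assuming the same checkpoint as the model above:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "andito/Florence-2-large-ft", trust_remote_code=True
)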
Single GPU Training
Conducting training on a single GPU is simple with the provided script:
python train.py
This will start training on the DocVQA dataset. Note that training on The Cauldron dataset with a single GPU is not recommended due to its computational demands.
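Under the hood, fine-tuning Florence-2 follows the standard seq2seq recipe: the processor turns (question, image) pairs into input_ids and pixel_values, the answers are tokenized as labels, and the model returns a loss. The sketch below illustrates one training step; the variable names, the learning rate, and the <DocVQA> prompt prefix are illustrative assumptions rather than the exact contents of train.py.
from torch.optim import AdamW

# Reuses model, processor, and device from the snippets above.
optimizer = AdamW(model.parameters(), lr=1e-6)  # illustrative learning rate

def train_step(questions, images, answers):
    # Prefix each question with a task token; the exact prefix in train.py may differ.
    prompts = ["<DocVQA>" + q for q in questions]
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True).to(device)
    labels = processor.tokenizer(
        answers, return_tensors="pt", padding=True, return_token_type_ids=False
    ).input_ids.to(device)
    outputs = model(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], labels=labels)
    loss = outputs.loss  # cross-entropy over the answer tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()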
Distributed Training
For greater efficiency, particularly when multiple GPUs are available, the distributed_train.py script performs distributed data parallel training. To use it, run:
python distributed_train.py --dataset <dataset_name> --epochs <num_epochs> --eval-steps <evaluation_steps>
For instance:
python distributed_train.py --dataset docvqa --epochs 10 --eval-steps 1000
dataset_name: Specifies the dataset to use (e.g., docvqa or cauldron).
num_epochs: Defines how many epochs to train for (default is 10).
evaluation_steps: Sets how often evaluation runs during training (default is every 10000 steps).
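The script handles process setup internally; conceptually it follows PyTorch's standard distributed data parallel pattern, where each GPU runs its own process, sees its own shard of the data, and holds a DDP-wrapped copy of the model. The sketch below illustrates that general pattern (the function names are illustrative, not a copy of distributed_train.py):
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup(rank, world_size):
    # One process per GPU; NCCL is the usual backend for multi-GPU training.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def wrap_for_ddp(model, train_dataset, rank, world_size, batch_size, collate_fn):
    # Each rank samples a disjoint shard of the dataset every epoch.
    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler, collate_fn=collate_fn)
    # Gradients are synchronized across ranks after each backward pass.
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    return ddp_model, loader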
By adhering to these guidelines, users can effectively fine-tune Florence-2 to meet their specific project requirements, unlocking the model's full potential in vision-language tasks.