Introduction to TF-ID: Table/Figure Identifier
TF-ID, or Table/Figure Identifier, is a project that applies object detection to locating tables and figures in academic papers. It was developed by Yifei Hu and released as open source. Here, we dive into the details of the TF-ID project to understand its significance and utility for academic research.
Model Overview
TF-ID is centered on a family of object detection models designed with precision to extract tables and figures from academic literature. The models are categorized into four distinct versions, offering flexibility based on the user's needs:
- TF-ID-base: This variant, with a model size of 0.23B, focuses on extracting both tables/figures and their associated captions.
- TF-ID-large: Recommended for most users, this model, at 0.77B, also extracts tables/figures along with their captions, providing enhanced accuracy.
- TF-ID-base-no-caption: Similar in size to TF-ID-base at 0.23B, this version extracts tables and figures without the captions.
- TF-ID-large-no-caption: With the same 0.77B capacity as TF-ID-large, this option extracts tables and figures without their captions, suiting analyses that do not need caption text.
All these models build on the pre-existing architecture of the Florence-2 models by Microsoft.
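Because the TF-ID checkpoints inherit the Florence-2 interface, they can be loaded directly from Hugging Face with AutoModelForCausalLM and AutoProcessor. The snippet below is a minimal sketch of that usage pattern, assuming the standard Florence-2 "<OD>" object-detection prompt and an illustrative page image file; the repository's own inference.py may differ in its details.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load a TF-ID checkpoint (any of the four variants is loaded the same way).
model_id = "yifeihu/TF-ID-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# A rendered page from an academic paper (illustrative file name).
image = Image.open("page_1.png").convert("RGB")

# Florence-2 models are prompted with a task token; "<OD>" requests object detection.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

# Decode the raw text output and convert it into bounding boxes and labels.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(result)  # e.g. {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
```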
Practical Application
TF-ID models simplify the process of extracting important visual information from academic papers:
- Running python inference.py produces bounding boxes for the tables and figures in a single image.
- Running python pdf_to_table_figures.py extracts all tables and figures from an entire PDF paper and automatically saves the results to a designated output directory.
By default, the scripts use the larger model; switching to a different version only requires changing the model_id in the scripts.
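For readers curious what the PDF workflow involves, the sketch below shows one way such a pipeline can be put together: render each page to an image, run detection on it, and crop out the returned boxes. The pdf2image dependency, the file names, and the run_detection helper are illustrative assumptions rather than the actual internals of pdf_to_table_figures.py.

```python
import os
from pdf2image import convert_from_path  # requires the poppler utilities

def run_detection(image):
    """Placeholder for the Florence-2 call shown earlier; expected to return
    {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['table', 'figure', ...]}."""
    raise NotImplementedError

pdf_path = "paper.pdf"        # illustrative input path
output_dir = "sample_output"  # illustrative output directory
os.makedirs(output_dir, exist_ok=True)

# Render each PDF page to a PIL image, detect tables/figures, and save the crops.
pages = convert_from_path(pdf_path, dpi=200)
for page_idx, page in enumerate(pages, start=1):
    result = run_detection(page)
    for box_idx, (box, label) in enumerate(zip(result["bboxes"], result["labels"])):
        crop = page.crop(tuple(int(v) for v in box))
        crop.save(os.path.join(output_dir, f"page{page_idx}_{label}_{box_idx}.png"))
```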
Building TF-ID Models from Scratch
For those inclined to develop or adapt TF-ID models themselves, the process is straightforward:
- Begin by cloning the repository and navigating to the TF-ID directory.
- Download the necessary dataset from Hugging Face.
- Organize the dataset by moving annotation files and image data to specified directories.
- Convert the dataset annotations into the format expected by the Florence-2 training code.
- Launch training with the Accelerate tool; checkpoints are saved regularly during the run.
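As a rough illustration of the final step, here is a heavily simplified Accelerate-style fine-tuning loop. The DocLayoutDataset class, the collate function, and the hyperparameters are hypothetical placeholders; the repository's actual training script will differ in details such as data loading, loss handling, and checkpoint naming.

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoProcessor

class DocLayoutDataset(Dataset):
    """Hypothetical dataset yielding {'image': PIL.Image, 'target_text': str} items,
    where target_text encodes the boxes in Florence-2's location-token format."""
    def __init__(self, root):
        self.items = []  # populate from the converted annotation files
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

accelerator = Accelerator()
base_model_id = "microsoft/Florence-2-large-ft"  # one of the Florence-2 checkpoints
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

def collate_fn(batch):
    # Pair each page image with its serialized box annotations.
    images = [item["image"] for item in batch]
    targets = [item["target_text"] for item in batch]
    inputs = processor(text=["<OD>"] * len(images), images=images, return_tensors="pt")
    labels = processor.tokenizer(
        targets, return_tensors="pt", padding=True, return_token_type_ids=False
    ).input_ids  # pad tokens could additionally be masked with -100
    return inputs, labels

train_loader = DataLoader(DocLayoutDataset("dataset/train"), batch_size=4,
                          shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for epoch in range(3):
    for inputs, labels in train_loader:
        outputs = model(input_ids=inputs["input_ids"],
                        pixel_values=inputs["pixel_values"],
                        labels=labels)
        accelerator.backward(outputs.loss)  # loss from the model's language-modeling head
        optimizer.step()
        optimizer.zero_grad()
    # Save a checkpoint at the end of every epoch.
    accelerator.save_state(f"checkpoints/epoch_{epoch}")
```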
Hardware Requirements
The complexity and size of the models dictate specific hardware demands. Training the larger Florence-2 model requires at least 40GB of VRAM at a batch size of 4 on a single GPU. Reducing the batch size or adjusting the checkpointing settings can lower these requirements.
Performance Evaluations
TF-ID models have demonstrated high accuracy in detecting tables and figures, even on papers not seen during training. Reported success rates vary slightly across versions, hovering near or above 97%, which makes the models reliable for academic use.
Concluding Remarks
TF-ID not only advances the capabilities of object detection models but also remains openly available for community collaboration and development. The project draws on Roboflow's tutorial on using Florence-2 models, alongside contributions from collaborators like Yi Zhang.
For researchers, developers, and academic professionals, TF-ID stands as a testament to the power of open-source collaboration and innovation in enhancing academic research methodologies.