Tabular LLM: Building Large Language Models for Table-Centric Intelligent Tasks
Overview
The Tabular LLM project sets out to create large language models (LLMs) specifically tailored for intelligent tasks involving tabular data. Building on the foundation of the Alpaca-CoT project, a lightweight LLM fine-tuning platform, the primary aim is to gather and organize datasets for various tasks such as table question answering and table-to-text generation. These datasets will be converted into instruction-tuning formats and used to fine-tune LLMs, thereby enhancing their understanding of table data.
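To make the conversion concrete, here is a minimal sketch of what one converted training record might look like, assuming an Alpaca-style instruction/input/output schema; the field names and table serialization here are illustrative assumptions, not the project's published format.

```python
# A hypothetical instruction-tuning record for table question answering.
# Field names follow the common Alpaca-style schema; they are an assumption,
# not the project's documented format.
import json

record = {
    "instruction": "Answer the question using the table below.",
    "input": (
        "| Player | Team | Goals |\n"
        "| --- | --- | --- |\n"
        "| Kane | Bayern | 36 |\n"
        "| Haaland | City | 27 |\n"
        "\n"
        "Question: Which player scored more goals?"
    ),
    "output": "Kane scored more goals (36 vs. 27).",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```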
This project focuses on consolidating existing academic datasets related to table intelligence and encourages contributions of any table-centric task datasets not yet included. The overarching goal is to boost the open-source community's ability to replicate and extend the table processing capabilities of AI models, like ChatGPT, and provide a robust foundation for researchers to develop domain-specific table-oriented LLMs.
Interested individuals are invited to join the project's discussion groups to collaborate with like-minded researchers. The Tabular LLM project also maintains an up-to-date list of relevant research papers for those engaged in LLM and table-related studies.
Project Objectives
- Comprehensive Dataset Collection: Gathering open-source datasets for a variety of table-related tasks, transforming the raw data into a format suitable for instruction tuning, and using the Alpaca-CoT platform to fine-tune LLMs (a generic fine-tuning sketch follows this list).
- Enhancing Table Comprehension in LLMs: By fine-tuning models on task-specific data, the project aims to improve the models' ability to understand and work with table data, ultimately developing LLMs highly adept at table-related tasks.
- Contributing to Open Source and Research: The project seeks to facilitate the reproduction and enhancement of ChatGPT-like table capabilities within the open-source community while also providing a stronger data and model basis for further research.
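Alpaca-CoT supplies its own training entry points, so the following is only a generic sketch of the kind of parameter-efficient (LoRA) fine-tuning such a platform performs, written against the Hugging Face transformers and peft libraries; the base model name and hyperparameters are placeholders, not the project's actual configuration.

```python
# Generic LoRA fine-tuning sketch with Hugging Face transformers + peft.
# This is NOT the Alpaca-CoT entry point; it only illustrates the idea.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "your-base-model"  # placeholder: substitute a real causal-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters so only a small fraction of weights is trained.
config = LoraConfig(
    r=8,                                  # adapter rank (placeholder)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# ...then tokenize the instruction-tuning records and train, e.g. with
# transformers.Trainer over the converted dataset.
```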
Key Developments and Updates
- April 22, 2024: New datasets for table instruction fine-tuning were added, featuring a broader range of instruction templates and supporting diverse tasks like table-text generation and table structure understanding.
- May 5, 2023: Project launch.
Background and Initial Findings
Research indicates that advanced LLMs like ChatGPT have already made significant strides in handling table data, supporting tasks such as:
- Table drawing and modification (rendering and editing tables on request)
- Table-based question answering (extracting details from tables; see the probing sketch after this list)
- Text-to-table and table-to-text conversions
- Table fact verification (checking the consistency between table data and given statements)
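These capabilities are typically probed by serializing a table into the prompt and asking a question about it. The snippet below is a hedged sketch of such a probe using the OpenAI Python client purely as an example harness; the client, model name, and prompt wording are assumptions and not part of this project.

```python
# Probing a chat model with a Markdown-serialized table (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
table = "| Quarter | Revenue |\n| --- | --- |\n| Q1 | 120 |\n| Q2 | 150 |"
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"{table}\n\nWhich quarter had higher revenue?",
    }],
)
print(resp.choices[0].message.content)
```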
However, there are still notable challenges and limitations:
- Formatting Constraints: Current LLMs primarily support simple tables in Markdown format, struggling with more complex table structures like those with merged cells.
- Task-Specific Limitations: While basic table processing tasks are within reach, more advanced capabilities, such as complex numerical reasoning in table question answering, require further development.
- Lack of Open Training Data: Current LLMs have not had their training data publicly disclosed, especially concerning table-based tasks.
Future Plans
The project plans to continuously expand its collection and formatting of diverse table types and task datasets, release open-source table-oriented LLMs, and conduct thorough testing and analysis of their capabilities.
Methodology: Representing Tables for LLMs
A critical aspect of the project is determining how to represent tables as text sequences for LLM learning. Various methods are explored, including:
- Markdown: Best suited for simple tables without merged cells.
- HTML: Ideal for complex tables with merging and other advanced formatting needs.
- LaTeX: An alternative representation commonly used in academic contexts.
These methods ensure that table data is presented in a way conducive to the LLM's comprehension, allowing it to more effectively learn from and perform tasks with table data.
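To illustrate the trade-off, the sketch below renders the same small table in both formats; note how the merged "Japan" cell requires HTML's rowspan, which Markdown has no syntax for. The helper function and sample data are illustrative, not project code.

```python
# Contrasting two text serializations of the same table (illustrative).
rows = [
    ["Country", "City", "Population"],
    ["Japan", "Tokyo", "13.96M"],
    ["Japan", "Osaka", "2.69M"],
]

def to_markdown(rows):
    """Serialize a flat table (no merged cells) as a Markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

# In HTML, the repeated "Japan" value can be merged with rowspan,
# a structure Markdown cannot express:
html_table = """<table>
  <tr><th>Country</th><th>City</th><th>Population</th></tr>
  <tr><td rowspan="2">Japan</td><td>Tokyo</td><td>13.96M</td></tr>
  <tr><td>Osaka</td><td>2.69M</td></tr>
</table>"""

print(to_markdown(rows))
print(html_table)
```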
Conclusion
The Tabular LLM project is a significant step toward advancing LLMs' capabilities with tabular data. By harnessing diverse datasets and fine-tuning techniques, the project seeks to open new avenues for research and application in the realm of table intelligence. Researchers and developers are encouraged to participate, contribute, and further explore the potential of integrating LLMs with table processing tasks.