Project Introduction: Text2SQL-Data
The Text2SQL-Data repository collects data and tools for building and evaluating systems that translate natural language sentences into SQL queries. It was developed as part of research on improving text-to-SQL evaluation methodology, presented in a paper by Catherine Finegan-Dollak et al. at ACL 2018.
Features of the Repository
The repository provides the following resources for several domains:
- Sentences with annotated variables
- Corresponding SQL queries
- Comprehensive database schemas
- Actual databases
The collection includes improved versions of existing datasets as well as a new dataset developed by the authors. Separate files describe the datasets, the systems, and the tools used in the project.
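To make the pairing of variable-annotated sentences and SQL queries concrete, here is a minimal Python sketch of how one record might be expanded into plain question/query pairs. The record layout, the field names ("sentences", "text", "variables", "sql"), and the helper names fill and expand are assumptions made for illustration; check the repository's own data description for the actual file format.

```python
import re

# Hypothetical record. The field names and values are assumptions for
# illustration, not a verified copy of the repository's file format.
example = {
    "sentences": [
        {"text": "Who teaches course0 ?", "variables": {"course0": "EECS 280"}},
        {"text": "Which instructor runs course0 ?", "variables": {"course0": "EECS 281"}},
    ],
    "sql": 'SELECT instructor FROM course WHERE name = "course0" ;',
}


def fill(template, variables):
    """Replace each variable placeholder in the template with its value."""
    if not variables:
        return template
    # Try longer names first and require word boundaries, so that a name
    # such as "course1" never matches inside "course10".
    names = sorted(variables, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    # A single pass means substituted values are never rescanned.
    return pattern.sub(lambda m: variables[m.group(1)], template)


def expand(record):
    """Yield one (question, sql) pair per annotated sentence."""
    for sentence in record["sentences"]:
        variables = sentence["variables"]
        yield fill(sentence["text"], variables), fill(record["sql"], variables)


if __name__ == "__main__":
    for question, sql in expand(example):
        print(question)
        print("  ", sql)
```

Keeping the placeholders in both the sentence and the query lets a single SQL template serve many concrete questions, which is why the substitution is applied to both strings with the same variable mapping.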
Evolution of the Dataset
The dataset has undergone several updates:
- Version 4: Introduced data fixes.
- Version 3: Added data from the Spider and WikiSQL datasets, along with data fixes.
- Version 2: Corrected variables that were defined incorrectly in questions.
- Version 1: The dataset used for the original ACL 2018 paper.
Usage and Citation
Researchers using this dataset are asked to cite the original ACL paper and the original sources of the individual datasets. They are also encouraged to state the version number of the dataset used in their work. Citation examples for specific datasets, such as Academic, Advising, and ATIS, are provided along with the corresponding BibTeX entries.
Contributions and Improvements
The authors have fixed numerous bugs in the data but acknowledge that not all issues have been resolved. They invite the community to submit pull requests for any bugs discovered. Updates follow a systematic approach that balances ongoing data improvements against the need for consistency when comparing systems, and a list of known issues is maintained to guide prospective contributors.
Acknowledgments
The project acknowledges support from IBM under a dedicated contract. The findings and opinions expressed in the work are those of the authors and do not necessarily reflect the views of IBM.
By offering structured data and tools, the Text2SQL-Data repository aims to improve systems that convert natural language into SQL, a key task at the intersection of machine learning and database management.