Tabular Transformers for Modeling Multivariate Time Series
TabFormer is a project that applies tabular transformers to model multivariate time series data. Its core ideas are presented in the paper "Tabular Transformers for Modeling Multivariate Time Series," published at the ICASSP 2021 conference.
Project Overview
- Modules for Hierarchical Transformers: specialized modules that apply hierarchical transformers to tabular data, encoding each row at the field level before modeling the sequence of rows (see the sketch after this list).
- Synthetic Credit Card Transaction Dataset: a synthetic dataset of credit card transactions, released to facilitate analysis and benchmarking on tabular time series.
- Adaptive Softmax Enhancement: a modified adaptive softmax that correctly handles masked tokens in the dataset.
- Customized Data Collator: a modified version of DataCollatorForLanguageModeling, adapted for tabular data.
- Integration with HuggingFace Transformers: the modules are built on the HuggingFace transformers library, which keeps the project accessible and extensible.
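To make the hierarchical design concrete, below is a minimal sketch of the field-level encoding step, assuming PyTorch; the class name, vocabulary size, and pooling choice are illustrative assumptions, not the repository's actual implementation:

import torch.nn as nn

class FieldEncoder(nn.Module):
    """Encode the fields of each table row into one row embedding (illustrative)."""

    def __init__(self, vocab_size=1000, field_hs=64, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, field_hs)
        layer = nn.TransformerEncoderLayer(d_model=field_hs, nhead=num_heads)
        self.field_transformer = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, field_ids):
        # field_ids: (batch, seq_len, num_fields) of integer field tokens
        b, t, f = field_ids.shape
        x = self.embed(field_ids.view(b * t, f))       # each row becomes a short "sentence"
        x = self.field_transformer(x.transpose(0, 1))  # (num_fields, b*t, field_hs)
        row_vecs = x.mean(dim=0)                       # pool fields into one vector per row
        return row_vecs.view(b, t, -1)                 # row embeddings for the sequence model

The resulting row embeddings are then consumed by a sequence-level transformer (BERT- or GPT2-style) over the time dimension, which forms the second level of the hierarchy.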
Technology Framework
To successfully run TabFormer, the following software versions are recommended:
- Python: 3.7
- PyTorch: 1.6.0
- HuggingFace Transformers: 3.2.0
- scikit-learn: 0.23.2
- Pandas: 1.1.2
These can be installed from the pre-configured YAML setup file via:
conda env create -f setup.yml
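After the environment is created, activate it before running any of the commands below (assuming setup.yml names the environment tabformer; check the file's name field to confirm):

conda activate tabformer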
Datasets
Credit Card Transaction Dataset
The repository includes a synthetic credit card transaction dataset of 24 million records across 12 fields, stored under ./data/credit_card/. Git LFS (Large File Storage) is required to fetch this dataset; a direct download link is also provided to circumvent LFS bandwidth limitations.
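Once fetched, the data can be sanity-checked with pandas; the archive and CSV names below are assumptions about the repository layout, not guaranteed paths:

import tarfile
import pandas as pd

# Hypothetical file names; check ./data/credit_card/ for the actual archive.
with tarfile.open("./data/credit_card/transactions.tgz") as tar:
    tar.extractall("./data/credit_card/")

df = pd.read_csv("./data/credit_card/card_transaction.v1.csv")
print(df.shape)  # expect roughly 24 million rows and 12 columns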
PRSA Dataset
A separate air-quality dataset, the PRSA dataset, can be downloaded from Kaggle. Place the downloaded files in the ./data/prsa/ directory.
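A quick, hedged way to verify the placement with pandas, assuming the Kaggle release ships one CSV per monitoring station:

import glob
import pandas as pd

# Assumes the PRSA CSVs were placed under ./data/prsa/ as described above.
frames = [pd.read_csv(path) for path in glob.glob("./data/prsa/*.csv")]
prsa = pd.concat(frames, ignore_index=True)
print(prsa.shape)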
Model Training
Tabular BERT Model: This model can be trained using the command:
python main.py --do_train --mlm --field_ce --lm_type bert \
--field_hs 64 --data_type [prsa/card] \
--output_dir [output_dir]
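For example, a concrete invocation on the PRSA data (the output path here is arbitrary):

python main.py --do_train --mlm --field_ce --lm_type bert \
    --field_hs 64 --data_type prsa \
    --output_dir ./output_prsa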
Tabular GPT2 Model: To train on transactions from specific users, run:
python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card \
--data_root [path_to_data] --user_ids [user-id] \
--output_dir [output_dir]
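For example, to train on a single user (the user ID 0 here is a placeholder, not a known ID in the dataset):

python main.py --do_train --lm_type gpt2 --field_ce --flatten --data_type card \
    --data_root ./data/credit_card/ --user_ids 0 \
    --output_dir ./output_gpt2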
Command Options
- --data_type: selects between the 'prsa' and 'card' datasets.
- --mlm: enables masked language modeling; used with BERT.
- --field_hs: hidden size for the field-level transformer.
- --lm_type: chooses between the 'bert' and 'gpt2' models.
- --user_ids: restricts the card dataset to transactions from the specified user IDs.
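For reference, a minimal argparse sketch mirroring the documented flags; the defaults and help strings are assumptions, not the repository's actual argument parser:

import argparse

# Illustrative option declarations; see the repository's own parser for the real defaults.
parser = argparse.ArgumentParser(description="TabFormer training options (sketch)")
parser.add_argument("--data_type", choices=["prsa", "card"], default="card")
parser.add_argument("--mlm", action="store_true", help="masked language modeling (BERT)")
parser.add_argument("--field_hs", type=int, default=64, help="field-level transformer hidden size")
parser.add_argument("--lm_type", choices=["bert", "gpt2"], default="bert")
parser.add_argument("--user_ids", nargs="*", default=None, help="restrict card data to these user IDs")
args = parser.parse_args()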
Citing the Project
If you reference this project, please use the following citation:
@inproceedings{padhi2021tabular,
title={Tabular transformers for modeling multivariate time series},
author={Padhi, Inkit and Schiff, Yair and Melnyk, Igor and Rigotti, Mattia and Mroueh, Youssef and Dognin, Pierre and Ross, Jerret and Nair, Ravi and Altman, Erik},
booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={3565--3569},
year={2021},
organization={IEEE},
url={https://ieeexplore.ieee.org/document/9414142}
}
TabFormer demonstrates how tabular transformers can be applied to complex multivariate time series, pointing to promising directions for future research and applications in tabular data modeling.