Introducing TorchData
What is TorchData?
TorchData is an evolving project that extends PyTorch's torch.utils.data.DataLoader and torch.utils.data.Dataset/IterableDataset to make data loading more scalable and efficient. The enhancements are developed in the TorchData repository, which provides tools and updates aimed at improving performance and functionality for users who load and process data for model training with PyTorch.
As of June 2024, important changes are underway: the TorchData team has decided to refocus on iteratively enhancing the existing DataLoader rather than continuing with DataPipes and DataLoaderV2. These two components are marked as deprecated in version 0.8.0 (July 2024) and will subsequently be removed. Users who rely on them are advised to pin torchdata to version 0.9.0 or earlier until they can migrate away from these components.
Stateful DataLoader
One of the primary additions in TorchData is the StatefulDataLoader, a drop-in replacement for torch.utils.data.DataLoader that adds the state_dict and load_state_dict methods. These methods enable mid-epoch checkpointing, which is crucial for long training runs and for recovering from interruptions without restarting the process. The StatefulDataLoader also supports tracking progress and custom state in the data loading pipeline, such as token buffers and random number generator (RNG) states.
For a practical demonstration, see the Stateful DataLoader main page and the examples provided in a Colab notebook.
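To make the idea of mid-epoch checkpointing concrete, here is a minimal pure-Python sketch. It does not use TorchData: the class ResumableBatches and its fields are hypothetical, and it only borrows the state_dict and load_state_dict method names described above. It saves the shuffled order, the read position, and the RNG state, then restores them into a fresh object.

```python
import random


class ResumableBatches:
    """Hypothetical stand-in for a stateful loader (not the TorchData class)."""

    def __init__(self, data, batch_size, seed=0):
        self.data = list(data)
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self._order = None  # shuffled indices of the epoch in progress
        self._pos = 0       # index of the next sample to hand out

    def __iter__(self):
        # Start a new epoch unless a restored, unfinished epoch is pending.
        if self._order is None:
            self._order = list(range(len(self.data)))
            self.rng.shuffle(self._order)
            self._pos = 0
        while self._pos < len(self._order):
            idx = self._order[self._pos:self._pos + self.batch_size]
            self._pos += len(idx)  # advance before yielding the batch
            yield [self.data[i] for i in idx]
        self._order = None  # epoch finished

    def state_dict(self):
        order = None if self._order is None else list(self._order)
        return {"order": order, "pos": self._pos, "rng": self.rng.getstate()}

    def load_state_dict(self, state):
        order = state["order"]
        self._order = None if order is None else list(order)
        self._pos = state["pos"]
        self.rng.setstate(state["rng"])


# Consume two batches, checkpoint, then finish the epoch.
loader = ResumableBatches(range(10), batch_size=2, seed=123)
it = iter(loader)
first = [next(it), next(it)]
checkpoint = loader.state_dict()
rest_original = list(it)

# A fresh object (even with a different seed) resumes from the checkpoint.
resumed = ResumableBatches(range(10), batch_size=2, seed=999)
resumed.load_state_dict(checkpoint)
rest_resumed = list(resumed)
assert rest_resumed == rest_original
```

Because the position is advanced before each batch is yielded, a checkpoint taken between batches resumes at the next unseen batch rather than repeating the last one.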
Installation
Each TorchData release is paired with a specific PyTorch release and a range of Python versions. Here is the compatibility matrix:
torch | torchdata | python |
---|---|---|
master / nightly | main / nightly | >=3.9, <=3.12 |
2.5.0 | 0.9.0 | >=3.9, <=3.12 |
2.4.0 | 0.8.0 | >=3.8, <=3.12 |
2.0.0 | 0.6.0 | >=3.8, <=3.11 |
1.13.1 | 0.5.1 | >=3.7, <=3.10 |
1.12.1 | 0.4.1 | >=3.7, <=3.10 |
1.12.0 | 0.4.0 | >=3.7, <=3.10 |
1.11.0 | 0.3.0 | >=3.7, <=3.10 |
To install TorchData locally, users can use either pip or conda:
- Using pip:
  pip install torchdata
- Using conda:
  conda install -c pytorch torchdata
For those who want the latest changes, a nightly build of TorchData is available and can be installed as follows:
- Using pip:
  pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/cpu
- Using conda:
  conda install torchdata -c pytorch-nightly
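After installing, it can be useful to confirm that the installed torch and torchdata versions form one of the pairs in the compatibility matrix above. The following is a small sketch using only the standard library; the helper names (installed_version, expected_torchdata, check_pairing) are hypothetical, and the lookup table is copied by hand from the stable rows of the matrix, so it needs updating whenever the matrix changes.

```python
from importlib.metadata import PackageNotFoundError, version

# torch release -> torchdata release, copied from the matrix above
# (nightly builds are intentionally left out).
COMPATIBLE = {
    "2.5.0": "0.9.0",
    "2.4.0": "0.8.0",
    "2.0.0": "0.6.0",
    "1.13.1": "0.5.1",
    "1.12.1": "0.4.1",
    "1.12.0": "0.4.0",
    "1.11.0": "0.3.0",
}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def expected_torchdata(torch_version):
    """Return the torchdata release listed for a torch version, else None."""
    # Drop local build suffixes such as "+cu121" before the lookup.
    return COMPATIBLE.get(torch_version.split("+")[0])


def check_pairing():
    """Describe whether the installed torch/torchdata pair is in the matrix."""
    torch_v = installed_version("torch")
    data_v = installed_version("torchdata")
    if torch_v is None or data_v is None:
        return f"torch={torch_v}, torchdata={data_v}: a package is missing"
    expected = expected_torchdata(torch_v)
    if expected is None:
        return f"torch {torch_v} is not listed in the matrix"
    if data_v.split("+")[0] == expected:
        return f"torch {torch_v} with torchdata {data_v} matches the matrix"
    return f"torch {torch_v} expects torchdata {expected}, found {data_v}"


print(check_pairing())
```

The check reads package metadata rather than importing torch or torchdata, so it runs quickly and still works when one of the packages is missing.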
Contributing
TorchData is open to contributions from the community. Users who wish to participate can find guidance in the CONTRIBUTING documentation.
Beta Usage and Feedback
The project team encourages early adopters to provide feedback and help shape the future of TorchData. Feedback can be given by raising issues or suggesting improvements.
License
TorchData is distributed under the BSD license, and the full license text is available in the LICENSE file.