Safetensors
Introduction
Safetensors is a project developed by Hugging Face for storing tensor data securely and efficiently, particularly in machine learning. Its core aim is to provide a safe alternative to traditional methods like pickle, which can execute arbitrary code when loading data. The project supports Python and Rust and has a straightforward design focused on fast, zero-copy data access.
Installation
Using Pip
Python users can install safetensors with the pip package manager:
pip install safetensors
From Source
Developers who prefer to build from source need Rust installed on their system. The steps are as follows:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Update Rust, clone the repository, and install the Python bindings in editable mode
rustup update
git clone https://github.com/huggingface/safetensors
cd safetensors/bindings/python
pip install setuptools_rust
pip install -e .
Getting Started
Once installed, using safetensors to save and load tensors is simple. Below is a Python example demonstrating its usage:
import torch
from safetensors import safe_open
from safetensors.torch import save_file

# Save a dictionary of named tensors to a single file
tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024)),
}
save_file(tensors, "model.safetensors")

# Load the tensors back one key at a time
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
Format Specifications
Safetensors uses a simple file format designed for safety and efficiency. A file consists of three parts, in order:
- An 8-byte unsigned little-endian integer giving the size of the header in bytes.
- The header: a JSON string that describes each tensor, including its data type, shape, and start and end byte offsets within the byte buffer.
- The byte buffer, which holds the tensor data and fills the rest of the file.
The format guarantees that the byte ranges of different tensors do not overlap, and it supports zero-copy access, which matters when handling large files.
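The layout can be made concrete with a small sketch that writes and re-reads a file by hand, using only the Python standard library. It assumes the size prefix is an 8-byte little-endian unsigned integer and that each header entry uses the fields dtype, shape, and data_offsets; the helper names write_minimal_file and read_header are hypothetical and are not part of the safetensors API. It is an illustration of the layout, not a replacement for the library.

```python
import json
import os
import struct
import tempfile


def write_minimal_file(path, name, values):
    """Write one 1-D float32 tensor as: size prefix, JSON header, byte buffer."""
    data = struct.pack(f"<{len(values)}f", *values)  # raw little-endian float32 bytes
    header = {
        name: {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [0, len(data)],  # relative to the start of the byte buffer
        }
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # assumed 8-byte size prefix
        f.write(header_bytes)                          # JSON header
        f.write(data)                                  # byte buffer


def read_header(path):
    """Read only the size prefix and the JSON header, not the tensor data."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size).decode("utf-8"))
    return header_size, header


# Demo: write a three-element tensor to a temporary file and inspect its header.
demo_path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
write_minimal_file(demo_path, "weight1", [1.0, 2.0, 3.0])
header_size, header = read_header(demo_path)
print(header_size, header)
```

Because the header is read on its own, a reader can learn every tensor's name, shape, and location before touching any tensor data.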
Advantages Over Other Formats
Safetensors stands out among existing tensor formats by balancing safety, speed, and ease of use. Key advantages include:
- Safe: Unlike pickle, Safetensors ensures tensors can be loaded without risk of executing arbitrary code.
- Zero-copy and Lazy Loading: Facilitates efficient data loading and manipulation without additional memory overhead.
- No File Size Limit: Unlike some alternatives with file size restrictions, Safetensors supports virtually unlimited file sizes.
- Fast Loading: Offers quick tensor loading times, especially beneficial in distributed systems and multi-GPU setups.
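To illustrate why a header with byte offsets makes lazy loading possible, the following sketch memory-maps a file and decodes a single tensor without reading the others. It is a standard-library illustration under stated assumptions (an 8-byte little-endian size prefix, float32 data, and the hypothetical helpers build_demo_file and load_one), not a description of how the safetensors library is implemented internally.

```python
import json
import mmap
import os
import struct
import tempfile


def build_demo_file(path):
    """Write two small float32 tensors, "a" and "b", in the header + buffer layout."""
    a = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    b = struct.pack("<2f", 5.0, 6.0)
    header = {
        "a": {"dtype": "F32", "shape": [4], "data_offsets": [0, 16]},
        "b": {"dtype": "F32", "shape": [2], "data_offsets": [16, 24]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + a + b)


def load_one(path, name):
    """Decode a single float32 tensor straight from the memory-mapped file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_size,) = struct.unpack_from("<Q", mm, 0)
        header = json.loads(mm[8:8 + header_size].decode("utf-8"))
        start, end = header[name]["data_offsets"]
        buffer_start = 8 + header_size  # the byte buffer begins right after the header
        count = (end - start) // 4      # float32 values are 4 bytes each
        # Only the pages backing this tensor need to be read from disk.
        return struct.unpack_from(f"<{count}f", mm, buffer_start + start)


lazy_path = os.path.join(tempfile.mkdtemp(), "lazy.safetensors")
build_demo_file(lazy_path)
b_values = load_one(lazy_path, "b")
print(b_values)
```

In a multi-GPU or distributed setup, each worker could use the same idea to read only the tensors, or parts of tensors, that it is responsible for.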
Comparisons with Other Formats
Each of the common alternatives has a drawback that Safetensors avoids:
- Pickle: Remains unsafe because loading a file can execute arbitrary code.
- HDF5: Generally safe, but discouraged for certain applications like TensorFlow due to security and usability concerns.
- Protobuf: Has file size limitations and is less flexible.
- NumPy (npz): Vulnerable to zip bombs and not zero-copy.
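The pickle problem is easy to demonstrate with a deliberately harmless payload. In this sketch, the hypothetical Payload class tells pickle to call eval on a string when the data is loaded; a real attacker could substitute any callable and arguments. Nothing here is specific to Safetensors, it only shows the behavior that the format is designed to avoid.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form runs code when it is loaded."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") during loading instead of
        # reconstructing a Payload instance.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)  # evaluates the expression chosen by the file's author
print(loaded)
```

A format whose files contain only a JSON header and raw bytes has no equivalent hook, which is why loading it cannot trigger code execution in this way.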
Additional Benefits
Safetensors also helps prevent denial-of-service attacks by enforcing strict limits when parsing the header and by keeping memory use bounded: because tensor byte ranges do not overlap, loading a file should not require more memory than the file itself occupies. It supports fast tensor loading, often faster than PyTorch alternatives, and enables lazy loading, which is important in parallel and distributed computing environments.
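The kind of checks involved can be sketched as follows. This is a minimal illustration, not the library's validation code: the header size cap of roughly 100 MB is an assumption made for this sketch and the library's exact limit may differ, the reserved __metadata__ key is assumed to hold free-form metadata rather than a tensor, and the names make_blob and validate are hypothetical.

```python
import json
import struct

# Assumed cap for illustration only; the library's exact limit may differ.
MAX_HEADER_SIZE = 100_000_000


def make_blob(header, data):
    """Assemble size prefix + JSON header + byte buffer in memory."""
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def validate(blob):
    """Reject oversized headers and out-of-bounds or overlapping tensor ranges."""
    if len(blob) < 8:
        raise ValueError("file too small to contain the header size")
    (header_size,) = struct.unpack_from("<Q", blob, 0)
    if header_size > MAX_HEADER_SIZE:
        raise ValueError("header too large")
    if 8 + header_size > len(blob):
        raise ValueError("header size exceeds file length")
    header = json.loads(blob[8:8 + header_size].decode("utf-8"))
    buffer_len = len(blob) - 8 - header_size
    spans = sorted(
        tuple(info["data_offsets"])
        for name, info in header.items()
        if name != "__metadata__"  # assumed reserved key, not a tensor
    )
    cursor = 0
    for start, end in spans:
        if start < cursor:
            raise ValueError("tensor byte ranges overlap")
        if end < start or end > buffer_len:
            raise ValueError("tensor byte range is out of bounds")
        cursor = end
    return header


good = make_blob(
    {"a": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]},
     "b": {"dtype": "U8", "shape": [4], "data_offsets": [4, 8]}},
    bytes(8),
)
bad = make_blob(
    {"a": {"dtype": "U8", "shape": [6], "data_offsets": [0, 6]},
     "b": {"dtype": "U8", "shape": [4], "data_offsets": [4, 8]}},
    bytes(8),
)
print(sorted(validate(good)))
```

Calling validate(bad) raises ValueError because the two ranges share bytes 4 and 5, so a malformed file is rejected before any tensor data is interpreted.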
Conclusion
Safetensors is a well-rounded library offering numerous benefits for machine learning practitioners and developers. With its focus on security, efficiency, and ease of use, it fills a critical gap left by existing tensor file formats, particularly for those working with PyTorch.