Torch-TensorRT
Introduction
Torch-TensorRT is a powerful tool for accelerating PyTorch models on NVIDIA platforms. By bringing TensorRT optimizations to PyTorch, it can speed up inference by up to 5x compared to standard eager execution, often with just a single line of code, giving users a seamless and efficient way to optimize their models for deployment.
Installation
To get started with Torch-TensorRT, users can choose to install stable versions directly from PyPI using:
pip install torch-tensorrt
Alternatively, for those interested in the latest updates, nightly versions are available from the PyTorch package index:
pip install --pre torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu124
For a more comprehensive setup, Torch-TensorRT is available as part of the NVIDIA NGC PyTorch Container, which comes pre-packaged with all necessary dependencies and example notebooks.
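After installation, a quick way to verify the setup is to import the package and confirm that a CUDA device is visible. This is a minimal sanity-check sketch and assumes a CUDA-capable GPU is present:
import torch
import torch_tensorrt

# Confirm the package imports and report the GPU that compiled engines will target
print(torch_tensorrt.__version__)
print(torch.cuda.get_device_name(0))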
Quickstart
Option 1: torch.compile
Torch-TensorRT can be used wherever torch.compile is employed. Here is how you can start:
import torch
import torch_tensorrt
model = MyModel().eval().cuda() # Define your model here
x = torch.randn((1, 3, 224, 224)).cuda() # Define the input data
optimized_model = torch.compile(model, backend="tensorrt")
optimized_model(x) # Compiles on the first run
optimized_model(x) # This execution will be fast!
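Compilation settings can also be passed to the tensorrt backend through torch.compile's options argument. The sketch below is illustrative; the option names shown (enabled_precisions and min_block_size) are assumptions based on the Dynamo backend's settings and should be checked against the installed release:
import torch
import torch_tensorrt

model = MyModel().eval().cuda() # Define your model here
x = torch.randn((1, 3, 224, 224)).cuda() # Define the input data

optimized_model = torch.compile(
    model,
    backend="tensorrt",
    options={
        "enabled_precisions": {torch.float, torch.half},  # allow FP16 kernels (assumed option name)
        "min_block_size": 2,  # minimum ops per TensorRT sub-graph (assumed option name)
    },
)
optimized_model(x) # Compiles on the first run with the settings above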
Option 2: Export
For those looking to optimize models ahead-of-time or deploy them in a C++ environment, Torch-TensorRT provides an export-style workflow. This allows models to be serialized for deployment without Python dependencies.
- Step 1: Optimize + Serialize
import torch
import torch_tensorrt
model = MyModel().eval().cuda() # Define your model here
inputs = [torch.randn((1, 3, 224, 224)).cuda()] # Define your inputs
trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
torch_tensorrt.save(trt_gm, "trt.ep", inputs=inputs) # For PyTorch runtime
torch_tensorrt.save(trt_gm, "trt.ts", output_format="torchscript", inputs=inputs) # For C++ deployment
- Step 2: Deploy
- Deployment in PyTorch:
import torch
import torch_tensorrt

inputs = [torch.randn((1, 3, 224, 224)).cuda()] # Define inputs
# Run in a new Python session if required
model = torch.export.load("trt.ep").module()
model(*inputs)
- Deployment in C++:
#include "torch/script.h" #include "torch_tensorrt/torch_tensorrt.h" auto trt_mod = torch::jit::load("trt.ts"); auto input_tensor = [...]; // Populate with input data auto results = trt_mod.forward({input_tensor});
Platform Support
Torch-TensorRT supports a variety of platforms. It fully supports Linux AMD64 with GPU, has partial support for Windows GPU using the Dynamo backend, and offers native compilation support for Linux aarch64 platforms on JetPack-4.4+. However, it does not currently support Linux ppc64le with GPU.
Dependencies
For effective operation, Torch-TensorRT depends on Bazel 6.3.2, Libtorch 2.5.0.dev, CUDA 12.4, and TensorRT 10.3.0.26. Other versions may work, but these are the combinations verified by its test cases.
Deprecation Policy
Torch-TensorRT follows a clear deprecation policy, introduced in version 2.3 and aligned with semantic versioning. Deprecated APIs continue to function and emit warnings for a 6-month migration period, giving developers time to transition before the APIs are removed.
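During the migration window it can help to surface deprecation warnings explicitly, for example in CI, so deprecated calls are caught before removal. This is a generic Python sketch rather than a Torch-TensorRT API; the commented call is a hypothetical placeholder:
import warnings

# Treat DeprecationWarning as an error so deprecated calls fail fast in CI
warnings.simplefilter("error", DeprecationWarning)

# some_deprecated_api(...)  # hypothetical placeholder for a deprecated call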
Contributing
Contributors are welcome to review the project's contribution guidelines in the CONTRIBUTING.md file to get involved in the development process.
License
The project is licensed under the BSD-3-Clause license. Details can be found in the LICENSE file.