Torch-TensorRT
Introduction
Torch-TensorRT is a powerful tool for accelerating PyTorch models on NVIDIA platforms. By bringing TensorRT optimizations to PyTorch, it can speed up inference by up to 5x compared to standard eager execution, often with just a single line of code, giving users a seamless and efficient way to optimize their models for deployment.
Installation
To get started with Torch-TensorRT, users can choose to install stable versions directly from PyPI using:
pip install torch-tensorrt
Alternatively, for those interested in the latest updates, nightly versions are available from the PyTorch package index:
pip install --pre torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu124
For a more comprehensive setup, Torch-TensorRT is available as part of the NVIDIA NGC PyTorch Container, which comes pre-packaged with all necessary dependencies and example notebooks.
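After installation, a quick way to verify the setup is to import the package and confirm that a CUDA device is visible. This is a minimal sanity-check sketch and assumes a CUDA-capable GPU is present:
import torch
import torch_tensorrt

# Confirm the package imports and report the GPU that compiled engines will target
print(torch_tensorrt.__version__)
print(torch.cuda.get_device_name(0))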
Quickstart
Option 1: torch.compile
Torch-TensorRT can be used wherever torch.compile is employed. Here is how you can start:
import torch
import torch_tensorrt
model = MyModel().eval().cuda() # Define your model here
x = torch.randn((1, 3, 224, 224)).cuda() # Define the input data
optimized_model = torch.compile(model, backend="tensorrt")
optimized_model(x) # Compiles on the first run
optimized_model(x) # This execution will be fast!
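Compilation settings can also be passed to the tensorrt backend through torch.compile's options argument. The sketch below is illustrative; the option names shown (enabled_precisions and min_block_size) are assumptions based on the Dynamo backend's settings and should be checked against the installed release:
import torch
import torch_tensorrt

model = MyModel().eval().cuda() # Define your model here
x = torch.randn((1, 3, 224, 224)).cuda() # Define the input data

optimized_model = torch.compile(
    model,
    backend="tensorrt",
    options={
        "enabled_precisions": {torch.float, torch.half},  # allow FP16 kernels (assumed option name)
        "min_block_size": 2,  # minimum ops per TensorRT sub-graph (assumed option name)
    },
)
optimized_model(x) # Compiles on the first run with the settings above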
Option 2: Export
For those looking to optimize models ahead-of-time or deploy them in a C++ environment, Torch-TensorRT provides an export-style workflow. This allows models to be serialized for deployment without Python dependencies.
- Step 1: Optimize + Serialize
import torch
import torch_tensorrt
model = MyModel().eval().cuda() # Define your model here
inputs = [torch.randn((1, 3, 224, 224)).cuda()] # Define your inputs
trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
torch_tensorrt.save(trt_gm, "trt.ep", inputs=inputs) # For PyTorch runtime
torch_tensorrt.save(trt_gm, "trt.ts", output_format="torchscript", inputs=inputs) # For C++ deployment
- Step 2: Deploy
- Deployment in PyTorch:
import torch
import torch_tensorrt

inputs = [torch.randn((1, 3, 224, 224)).cuda()] # Define inputs
# Run in a new Python session if required
model = torch.export.load("trt.ep").module()
model(*inputs)
- Deployment in C++:
#include "torch/script.h" #include "torch_tensorrt/torch_tensorrt.h" auto trt_mod = torch::jit::load("trt.ts"); auto input_tensor = [...]; // Populate with input data auto results = trt_mod.forward({input_tensor});
Platform Support
Torch-TensorRT supports a variety of platforms. It fully supports Linux AMD64 with GPU, has partial support for Windows GPU using the Dynamo backend, and offers native compilation support for Linux aarch64 platforms on JetPack-4.4+. However, it does not currently support Linux ppc64le with GPU.
Dependencies
For effective operation, Torch-TensorRT depends on Bazel 6.3.2, Libtorch 2.5.0.dev, CUDA 12.4, and TensorRT 10.3.0.26. Other versions may work, but these are the combinations verified by its test cases.
Deprecation Policy
Torch-TensorRT follows a clear deprecation policy, introduced in version 2.3 and aligned with semantic versioning. Deprecated APIs continue to function and emit warnings for a 6-month migration period, giving developers time to transition before the APIs are removed.
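During the migration window it can help to surface deprecation warnings explicitly, for example in CI, so deprecated calls are caught before removal. This is a generic Python sketch rather than a Torch-TensorRT API; the commented call is a hypothetical placeholder:
import warnings

# Treat DeprecationWarning as an error so deprecated calls fail fast in CI
warnings.simplefilter("error", DeprecationWarning)

# some_deprecated_api(...)  # hypothetical placeholder for a deprecated call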
Contributing
Contributors are welcome to review the project's contribution guidelines in the CONTRIBUTING.md file to get involved in the development process.
License
The project is licensed under the BSD-3-Clause license. Details can be found in the LICENSE file.