Latte
Latte is a PyTorch project for video generation with Latent Diffusion Transformers. It extracts spatio-temporal tokens from input videos and uses a series of Transformer blocks to model the video distribution in latent space, improving video quality on datasets such as FaceForensics and Taichi-HD. The project includes efficient model variants and an extension to text-to-video generation, and it reports strong benchmark results. Latte is also integrated into diffusers, which lowers the GPU requirements and makes video generation more accessible.
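To make the token-extraction and Transformer-block idea concrete, here is a minimal PyTorch sketch. It is not the project's actual code: the class names `LatentVideoTokenizer` and `SpatioTemporalBlock`, the tensor sizes, and the choice to alternate spatial and temporal attention are all assumptions made for illustration. It uses the stock `nn.TransformerEncoderLayer` rather than a diffusion-conditioned block, so it shows only how latent frames can be split into patch tokens and attended over space and then over time.

```python
import torch
import torch.nn as nn


class LatentVideoTokenizer(nn.Module):
    """Hypothetical helper: embeds each latent frame as a grid of patch tokens."""

    def __init__(self, in_channels=4, patch_size=2, dim=32):
        super().__init__()
        # A strided convolution maps every non-overlapping patch to one token.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, latents):
        # latents: (batch, frames, channels, height, width)
        b, f, c, h, w = latents.shape
        x = self.proj(latents.reshape(b * f, c, h, w))  # (b*f, dim, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)                # (b*f, tokens, dim)
        return x.reshape(b, f, x.shape[1], x.shape[2])  # (b, frames, tokens, dim)


class SpatioTemporalBlock(nn.Module):
    """Hypothetical block: spatial attention within frames, then temporal attention across frames."""

    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=4 * dim, dropout=0.0, batch_first=True
        )
        self.temporal = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=4 * dim, dropout=0.0, batch_first=True
        )

    def forward(self, tokens):
        b, f, n, d = tokens.shape
        # Spatial pass: each frame is a sequence of its n patch tokens.
        x = self.spatial(tokens.reshape(b * f, n, d)).reshape(b, f, n, d)
        # Temporal pass: each patch position is a sequence over the f frames.
        x = x.transpose(1, 2).reshape(b * n, f, d)
        x = self.temporal(x).reshape(b, n, f, d).transpose(1, 2)
        return x  # (batch, frames, tokens, dim)


# Usage with made-up sizes: 2 clips, 4 latent frames, 4 channels, 8x8 latents.
latents = torch.randn(2, 4, 4, 8, 8)
tokens = LatentVideoTokenizer()(latents)   # 8x8 latents with 2x2 patches give 16 tokens per frame
out = SpatioTemporalBlock()(tokens)
print(tuple(tokens.shape), tuple(out.shape))
```

Factorizing attention this way keeps each sequence short (patches per frame, or frames per patch position) instead of attending over every token in the clip at once, which is one common way to keep video Transformers affordable.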