Speech-Driven Animation: A Plain Language Introduction
The Speech-Driven Animation project brings together speech processing and animation. It uses a temporal generative adversarial network (GAN) to create realistic facial animations from audio input, a capability with applications in fields such as entertainment, communication, and human-computer interaction.
Overview of the Project
At the heart of the Speech-Driven Animation project is an end-to-end facial synthesis model. Given an audio clip, the model generates the corresponding facial movements, effectively creating an animation that 'speaks' the words from the audio file. The approach is described in the paper cited at the end of this article, authored by Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic.
Getting Started with the Models
Initially, the models used in this project were hosted on Git Large File Storage (LFS). However, due to high demand and storage limits, they have been moved to Google Drive. To use them, download the models from the Google Drive link provided by the project and place them in the sda/data/ directory.
Installation Process
To get started, install the library by running pip from the project's root directory (the directory containing its setup files):
$ pip install .
Running an Animation Example
Creating animations with this project is straightforward. First instantiate the VideoAnimator class; then pass it a still image and an audio clip (or the paths to these files), and the program will generate a video. Concrete examples are given under Example Usage below.
Choosing and Using Models
The animation models in this project have been pre-trained on several datasets: GRID, TCD-TIMIT, CREMA-D, and LRW. By default, the GRID model is used, but you can choose another model by specifying it when instantiating the VideoAnimator class:
import sda
va = sda.VideoAnimator(gpu=0, model_path="crema")
However, note that not all pretrained models are currently available; for instance, the model trained on LRW has not been uploaded yet.
Example Usage
For practical usage, the project can handle inputs either via file paths or using numpy arrays:
- Using file paths:
import sda
va = sda.VideoAnimator(gpu=0)
vid, aud = va("example/image.bmp", "example/audio.wav")
- Using NumPy arrays:
import sda
from PIL import Image
import scipy.io.wavfile as wav
va = sda.VideoAnimator(gpu=0)
fs, audio_clip = wav.read("example/audio.wav")
still_frame = Image.open("example/image.bmp")
vid, aud = va(still_frame, audio_clip, fs=fs)
Once your video is created, you can save it using the library's functions:
va.save_video(vid, aud, "generated.mp4")
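Because the animator is instantiated once and then called per clip, a single instance can be reused to produce several videos. Here is a brief sketch of that pattern (the extra audio file names are hypothetical):
import sda

va = sda.VideoAnimator(gpu=0)  # load the model once

# Hypothetical clips, all animated from the same still image
for clip in ["example/audio1.wav", "example/audio2.wav"]:
    vid, aud = va("example/image.bmp", clip)
    va.save_video(vid, aud, clip.replace(".wav", ".mp4"))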
Utilizing Encoders for Classification
The project also provides encoders for both audio and video, which can be used to produce features for classification tasks. For instance, the audio encoder combines an audio-frame encoder with a recurrent neural network (RNN), so that individual frames are embedded and the sequence is modelled over time.
import sda
encoder, info = sda.get_audio_feature_extractor(gpu=0)
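The library returns this extractor ready to use, but to make the description above concrete, here is a conceptual sketch of such an architecture in PyTorch. This is an illustration only, not the library's actual implementation; the layer sizes, the use of a GRU, and the frame length are assumptions.
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    # Conceptual sketch: embed each raw audio frame, then model the
    # sequence of embeddings with an RNN. All sizes are illustrative.
    def __init__(self, frame_len=3200, embed_dim=256, hidden_dim=256):
        super().__init__()
        # Per-frame encoder: maps one raw audio frame to an embedding.
        self.frame_encoder = nn.Sequential(
            nn.Linear(frame_len, embed_dim),
            nn.ReLU(),
        )
        # RNN aggregates the frame embeddings over time.
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, num_frames, frame_len) of raw audio samples
        b, t, _ = frames.shape
        embedded = self.frame_encoder(frames.reshape(b * t, -1)).reshape(b, t, -1)
        features, _ = self.rnn(embedded)
        return features  # (batch, num_frames, hidden_dim)

# Example: 2 clips, 25 frames of 3200 samples each (0.2 s at 16 kHz)
extractor = AudioFeatureExtractor()
features = extractor(torch.randn(2, 25, 3200))
print(features.shape)  # torch.Size([2, 25, 256])
Features of this kind, taken per frame or pooled over time, can then be fed to a downstream classifier.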
Citation for Academic Use
Researchers who find this project useful for their work are encouraged to cite the original paper:
@inproceedings{vougioukas2019end,
title={End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs.},
author={Vougioukas, Konstantinos and Petridis, Stavros and Pantic, Maja},
booktitle={CVPR Workshops},
pages={37--40},
year={2019}
}
In summary, the Speech-Driven Animation project offers a robust platform for generating realistic facial animations driven by audio input, making it a valuable tool for various applications.