Introduction to the Digital Life DL-B Project
Overview
Digital Life DL-B is an open-source digital persona solution that integrates ChatGLM, Wav2Lip, and so-vits-svc. The project was completed in mid-March 2023 and has not been optimized or updated since its initial development. It is currently entered in a competition that advances to the provincial stage in late June. Further developments are underway, specifically DL-C and DL-D, but detailed code and insights will not be shared publicly until the competition concludes. After the competition, the project will be continued by AI学社 (AI Society), with user-friendly packages and frameworks for easier implementation.
Technical Framework
Digital Life DL-B combines three primary components:
- ChatGLM: A powerful language model designed for conversational tasks.
- Wav2Lip: A model that synchronizes lip movements with speech inputs for realistic facial animations.
- so-vits-svc: A voice conversion technology allowing nuanced speech synthesis.
The original code is straightforward, reflecting the developer's background as an undergraduate in finance rather than software engineering. Subsequent improvements are planned after the project's transition to AI学社.
Hardware and Software Requirements
The project was developed using a platform with the following specifications:
- Graphics Card: RTX 3060 12G
- CPU: Intel i5-12400F
- RAM: 16 GB
- Storage: 30 GB
For software, the project is tested on Python 3.9.13 (64-bit). Dependencies are installed with pip install -r requirements.txt, and a separate Python 3.8 environment is needed specifically for So-VITS.
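Because two interpreter versions are involved, a quick sanity check before installing dependencies can save debugging time. The helper below is an illustrative sketch (not part of the project) that compares the running interpreter against the tested versions: 3.9.x for the main DL-B environment, 3.8.x for So-VITS.

```python
import sys

def check_python(expected_major: int, expected_minor: int) -> bool:
    """Return True if the running interpreter matches the expected major.minor."""
    return sys.version_info[:2] == (expected_major, expected_minor)

if __name__ == "__main__":
    # The main DL-B environment is tested on Python 3.9.13; So-VITS needs 3.8.
    if check_python(3, 9):
        print("OK: interpreter matches the tested main environment")
    else:
        print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
              "but DL-B is tested on Python 3.9.13")
```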
Model Training
ChatGLM
Users can fine-tune ChatGLM according to their needs. Tsinghua University provides an extensive P-tuning guide, with practical illustrations such as a tuning example based on Chinese opera. For users with poor network conditions, the models can be downloaded in advance and loaded from a local path to save time.
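One way to handle the local-download workflow is to resolve the model source before calling transformers' `from_pretrained`: use a local directory if it contains a usable copy, otherwise fall back to the Hugging Face hub ID. This is a minimal sketch; the directory name `./chatglm-6b-local` is a hypothetical example.

```python
from pathlib import Path

def resolve_model_source(local_dir: str, hub_id: str = "THUDM/chatglm-6b") -> str:
    """Prefer a locally downloaded copy of the model; fall back to the hub ID.

    The returned string can be passed directly to
    transformers.AutoModel.from_pretrained(..., trust_remote_code=True).
    """
    path = Path(local_dir)
    # A usable local copy must contain at least the model's config file.
    if path.is_dir() and (path / "config.json").exists():
        return str(path)
    return hub_id

# Usage (hypothetical local directory):
# source = resolve_model_source("./chatglm-6b-local")
# model = AutoModel.from_pretrained(source, trust_remote_code=True)
```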
Training on personal data involves adjusting the train.sh and evaluate.sh scripts: replace the file paths with ones pointing to the user's JSON datasets, and set the appropriate configuration for the data format and input-output lengths. Multi-turn dialogue data is also supported, enabling a more interactive experience.
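The P-tuning scripts read the dataset as JSON lines, one record per line; the field names must match the prompt/response column arguments configured in train.sh. The helper below is a sketch for writing such a file, assuming the "content"/"summary" key names used in the official ChatGLM P-tuning example.

```python
import json
from pathlib import Path

def write_dataset(records, out_path, prompt_key="content", response_key="summary"):
    """Write (prompt, response) pairs as JSON lines for P-tuning.

    The key names must match the prompt/response column arguments set in
    train.sh; "content"/"summary" follow the official ChatGLM example.
    """
    with Path(out_path).open("w", encoding="utf-8") as f:
        for prompt, response in records:
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps({prompt_key: prompt, response_key: response},
                               ensure_ascii=False) + "\n")

# Usage:
# write_dataset([("你是谁？", "我是DL-B。")], "train.json")
```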
So-VITS-SVC
So-VITS is a popular and mature project with ample tutorials available, especially on Bilibili, a widely used Chinese video-sharing platform, so it is not covered in detail here. Users need a few additional model files and can get started quickly with the available pre-trained models.
Wav2Lip
Wav2Lip enables realistic lip-syncing, and alternative checkpoints of varying quality are available. Users can improve the output by collecting short video clips, ideally in .mp4 format at 720p or 480p, to capture facial data for more authentic results.
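Collected clips can be normalized to a consistent resolution before they are fed to Wav2Lip. The sketch below only builds an ffmpeg command line (assuming ffmpeg is installed) so the flags can be inspected before running; it is an illustrative helper, not part of the project.

```python
import subprocess

def build_ffmpeg_cmd(src: str, dst: str, height: int = 720) -> list:
    """Build an ffmpeg command that rescales a clip to the given height
    (e.g. 720 or 480) while keeping the aspect ratio, re-encoding to mp4."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",  # -2 keeps the width divisible by 2
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]

# Usage (requires ffmpeg on PATH):
# subprocess.run(build_ffmpeg_cmd("raw_clip.mov", "face_720p.mp4"), check=True)
```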
Code Customization
Certain modifications are necessary in the core scripts:
- Adjust the file paths in main_demo.py and the other scripts to point to the personalized models and data.
- Make these adjustments according to the specific instructions in each script to ensure the project runs smoothly.
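Rather than editing paths in several scripts individually, one option is a small central configuration module that main_demo.py and the other scripts import. The names and directory layout below are hypothetical placeholders, not the project's actual variables.

```python
from pathlib import Path

# Hypothetical central path configuration; adjust every entry to your layout.
BASE_DIR = Path(".").resolve()

PATHS = {
    "chatglm_checkpoint": BASE_DIR / "output" / "chatglm-ptuning",
    "sovits_model": BASE_DIR / "models" / "sovits" / "G_latest.pth",
    "wav2lip_checkpoint": BASE_DIR / "models" / "wav2lip_gan.pth",
    "face_video": BASE_DIR / "data" / "face_720p.mp4",
}

def missing_paths(paths=PATHS):
    """Return the names of configured paths that do not exist yet, so a
    misconfiguration fails fast instead of halfway through the pipeline."""
    return [name for name, p in paths.items() if not Path(p).exists()]

# Usage:
# for name in missing_paths():
#     print(f"Missing: {name} -> {PATHS[name]}")
```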
Conclusion
The Digital Life DL-B project is a promising digital persona solution, uniquely combining voice, text, and lip-sync technologies. While it's currently in its initial phase, further development and user-friendly enhancements are on the horizon, making it an interesting area for tech enthusiasts to explore and contribute to.
Ultimately, more detailed documentation and expanded resources will be available once the project matures post-competition, encouraging broader participation and community-driven innovation.