Introduction to the Digital Life DL-B Project
Overview
Digital Life DL-B is an open-source digital persona solution that integrates ChatGLM, Wav2Lip, and so-vits-svc. The project was completed in mid-March 2023 and has not been optimized or updated since its initial development. It is currently entered in a competition that advances to the provincial stage in late June. Further developments are underway, specifically DL-C and DL-D, but detailed code and insights will not be shared publicly until the competition concludes. After the competition, the project will be continued by AI学社 (AI Society), with user-friendly packages and frameworks for easier implementation.
Technical Framework
Digital Life DL-B combines three primary components:
- ChatGLM: A powerful language model designed for conversational tasks.
- Wav2Lip: A model that synchronizes lip movements with speech inputs for realistic facial animations.
- so-vits-svc: A voice conversion technology allowing nuanced speech synthesis.
The original code is straightforward, reflecting the developer's background as an undergraduate in finance rather than software engineering. Subsequent improvements are planned after the project's transition to AI学社.
Hardware and Software Requirements
The project was developed using a platform with the following specifications:
- Graphics Card: RTX 3060 12G
- CPU: Intel i5-12400F
- RAM: 16 GB
- Storage: 30 GB
For software, the project is tested on Python 3.9.13 (64-bit). Dependencies are installed with pip install -r requirements.txt, and a separate Python 3.8 environment is needed specifically for So-VITS.
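Because two interpreter versions are involved, a quick sanity check before installing dependencies can save debugging time. The helper below is an illustrative sketch (not part of the project) that compares the running interpreter against the tested versions: 3.9.x for the main DL-B environment, 3.8.x for So-VITS.

```python
import sys

def check_python(expected_major: int, expected_minor: int) -> bool:
    """Return True if the running interpreter matches the expected major.minor."""
    return sys.version_info[:2] == (expected_major, expected_minor)

if __name__ == "__main__":
    # The main DL-B environment is tested on Python 3.9.13; So-VITS needs 3.8.
    if check_python(3, 9):
        print("OK: interpreter matches the tested main environment")
    else:
        print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
              "but DL-B is tested on Python 3.9.13")
```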
Model Training
ChatGLM
Users can fine-tune ChatGLM according to their needs. Tsinghua University provides an extensive P-tuning guide, with practical illustrations such as a tuning example based on Chinese opera. For users with poor network conditions, the models can be downloaded in advance and loaded from a local path to save time.
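One way to handle the local-download workflow is to resolve the model source before calling transformers' `from_pretrained`: use a local directory if it contains a usable copy, otherwise fall back to the Hugging Face hub ID. This is a minimal sketch; the directory name `./chatglm-6b-local` is a hypothetical example.

```python
from pathlib import Path

def resolve_model_source(local_dir: str, hub_id: str = "THUDM/chatglm-6b") -> str:
    """Prefer a locally downloaded copy of the model; fall back to the hub ID.

    The returned string can be passed directly to
    transformers.AutoModel.from_pretrained(..., trust_remote_code=True).
    """
    path = Path(local_dir)
    # A usable local copy must contain at least the model's config file.
    if path.is_dir() and (path / "config.json").exists():
        return str(path)
    return hub_id

# Usage (hypothetical local directory):
# source = resolve_model_source("./chatglm-6b-local")
# model = AutoModel.from_pretrained(source, trust_remote_code=True)
```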
Training on personal data involves adjusting the train.sh and evaluate.sh scripts: replace the file paths with ones pointing to the user's JSON datasets, and set the appropriate configuration for the data format and input-output lengths. Multi-turn dialogue data is also supported, enabling a more interactive experience.
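The P-tuning scripts read the dataset as JSON lines, one record per line; the field names must match the prompt/response column arguments configured in train.sh. The helper below is a sketch for writing such a file, assuming the "content"/"summary" key names used in the official ChatGLM P-tuning example.

```python
import json
from pathlib import Path

def write_dataset(records, out_path, prompt_key="content", response_key="summary"):
    """Write (prompt, response) pairs as JSON lines for P-tuning.

    The key names must match the prompt/response column arguments set in
    train.sh; "content"/"summary" follow the official ChatGLM example.
    """
    with Path(out_path).open("w", encoding="utf-8") as f:
        for prompt, response in records:
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps({prompt_key: prompt, response_key: response},
                               ensure_ascii=False) + "\n")

# Usage:
# write_dataset([("你是谁？", "我是DL-B。")], "train.json")
```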
So-VITS-SVC
So-VITS is a popular and mature project with ample tutorials available, especially on Bilibili, a widely used Chinese video-sharing platform, so it is not covered in detail here. Users need a few additional model files and can get started quickly with the available pre-trained models.
Wav2Lip
Wav2Lip enables realistic lip-syncing, and alternative checkpoints of varying quality are available. Users can improve the output by collecting short video clips, ideally in .mp4 format at 720p or 480p, to capture facial data for more authentic results.
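Collected clips can be normalized to a consistent resolution before they are fed to Wav2Lip. The sketch below only builds an ffmpeg command line (assuming ffmpeg is installed) so the flags can be inspected before running; it is an illustrative helper, not part of the project.

```python
import subprocess

def build_ffmpeg_cmd(src: str, dst: str, height: int = 720) -> list:
    """Build an ffmpeg command that rescales a clip to the given height
    (e.g. 720 or 480) while keeping the aspect ratio, re-encoding to mp4."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",  # -2 keeps the width divisible by 2
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]

# Usage (requires ffmpeg on PATH):
# subprocess.run(build_ffmpeg_cmd("raw_clip.mov", "face_720p.mp4"), check=True)
```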
Code Customization
Certain modifications are necessary in the core scripts:
- Adjust the file paths in main_demo.py and the other scripts to point to the personalized models and data.
- Make these adjustments according to the specific instructions in each script to ensure the project runs smoothly.
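Rather than editing paths in several scripts individually, one option is a small central configuration module that main_demo.py and the other scripts import. The names and directory layout below are hypothetical placeholders, not the project's actual variables.

```python
from pathlib import Path

# Hypothetical central path configuration; adjust every entry to your layout.
BASE_DIR = Path(".").resolve()

PATHS = {
    "chatglm_checkpoint": BASE_DIR / "output" / "chatglm-ptuning",
    "sovits_model": BASE_DIR / "models" / "sovits" / "G_latest.pth",
    "wav2lip_checkpoint": BASE_DIR / "models" / "wav2lip_gan.pth",
    "face_video": BASE_DIR / "data" / "face_720p.mp4",
}

def missing_paths(paths=PATHS):
    """Return the names of configured paths that do not exist yet, so a
    misconfiguration fails fast instead of halfway through the pipeline."""
    return [name for name, p in paths.items() if not Path(p).exists()]

# Usage:
# for name in missing_paths():
#     print(f"Missing: {name} -> {PATHS[name]}")
```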
Conclusion
The Digital Life DL-B project is a promising digital persona solution, uniquely combining voice, text, and lip-sync technologies. While it's currently in its initial phase, further development and user-friendly enhancements are on the horizon, making it an interesting area for tech enthusiasts to explore and contribute to.
Ultimately, more detailed documentation and expanded resources will be available once the project matures post-competition, encouraging broader participation and community-driven innovation.