Introducing the Text2Video Project
The Text2Video project is an innovative tool designed to transform written text into engaging videos that can be saved onto a local device. Originally conceived to enhance the visual reading experience of novels, this tool offers a unique blend of imagery, audio, and text to deliver a rich multimedia representation of any given content.
Implementation Overview
The process that drives Text2Video can be broken down into several key steps:
- Text Segmentation: The tool begins by parsing the input text into manageable segments. Currently, this is done by splitting the text into sentences, using punctuation marks as delimiters (a minimal sketch of this step follows the overview).
- Image and Sound Generation: For each sentence, the tool generates a corresponding image and audio clip. Image generation is powered by the open-source Stable Diffusion model, while text-to-speech conversion is handled by Edge-TTS (see the TTS sketch below).
- Advanced Textual Prompts: To improve the quality and relevance of the generated images, the tool uses a large language model to create "midjourney-like" prompts. These prompts are then fed into a model hosted on Hugging Face to produce the final visuals.
- Video Compilation: Using OpenCV, the generated images are stitched together into an MP4 video, with each sentence rendered as a subtitle at the bottom of the frame (see the compilation sketch below).
- Audio Integration: The length of each audio clip determines how long its image stays on screen. FFmpeg then merges the audio track with the video so that narration, subtitles, and imagery stay in sync; the same sketch below shows this merge step.
The end result is a fully synchronized video that narrates the original text with corresponding images and audio.
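To illustrate the segmentation step, the sketch below splits text into sentences on common sentence-ending punctuation. The regular expression and function name are illustrative assumptions, not code taken from the project.

import re

def split_into_sentences(text):
    # Split after English or Chinese sentence-ending punctuation, dropping empty pieces.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_into_sentences("The rain stopped. She opened the door!")
print(sentences)  # ['The rain stopped.', 'She opened the door!']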
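The text-to-speech step can be reproduced with the edge-tts Python package; the voice name and output path below are placeholders rather than the project's defaults.

import asyncio
import edge_tts

async def synthesize(sentence, out_path, voice="en-US-AriaNeural"):
    # Edge-TTS synthesizes the sentence and saves it as an MP3 file.
    communicate = edge_tts.Communicate(sentence, voice)
    await communicate.save(out_path)

asyncio.run(synthesize("The rain stopped.", "sentence_0.mp3"))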
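The compilation and audio-integration steps can be approximated as follows: OpenCV writes the frames with each sentence drawn as a subtitle, and FFmpeg muxes the narration in afterwards. The frame size, durations, and file names are assumptions for the sketch, not values from the project.

import subprocess
import cv2

FPS, WIDTH, HEIGHT = 24, 1280, 720
writer = cv2.VideoWriter("silent.mp4", cv2.VideoWriter_fourcc(*"mp4v"), FPS, (WIDTH, HEIGHT))

# One (image, sentence, duration-in-seconds) triple per segment.
for image_path, sentence, seconds in [("frame_0.jpg", "The rain stopped.", 3.0)]:
    frame = cv2.resize(cv2.imread(image_path), (WIDTH, HEIGHT))
    # Draw the sentence as a subtitle near the bottom of the frame.
    cv2.putText(frame, sentence, (40, HEIGHT - 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    for _ in range(int(seconds * FPS)):
        writer.write(frame)
writer.release()

# Mux the narration (e.g. the concatenated Edge-TTS clips) into the silent video.
subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", "narration.mp3",
                "-c:v", "copy", "-c:a", "aac", "-shortest", "output.mp4"], check=True)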
Quick Start with Docker
The quickest way to start the Text2Video tool is via Docker:
docker-compose up --build
Local Development Environment
For local development, the tool targets macOS with Python 3.10.12. FFmpeg must be installed because it handles the audio and video integration; you can verify the installation with:
ffmpeg -version
Development prerequisites can be installed using:
pip install -r requirements.txt
Enhancing Image Quality with Prompts
To improve image quality, you can configure an OpenAI-compatible API key; the tool uses it to turn each sentence into a richer image prompt. The base URL can point to a proxy or to a compatible provider (the example below uses Moonshot):
OPEN_AI_API_KEY="your open ai api key"
OPEN_AI_BASE_URL="https://api.moonshot.cn/v1"
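As a minimal sketch of how this key and base URL could be used to produce a "midjourney-like" prompt with the OpenAI Python SDK, the snippet below is illustrative only; the model name and the instruction wording are assumptions, not the project's own.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPEN_AI_API_KEY"],
                base_url=os.environ.get("OPEN_AI_BASE_URL"))

def sentence_to_prompt(sentence):
    # Ask the chat model to rewrite a sentence as a detailed image-generation prompt.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; use whatever model the configured endpoint serves
        messages=[
            {"role": "system", "content": "Rewrite the sentence as a vivid, midjourney-style image prompt."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content

print(sentence_to_prompt("The rain stopped and the street lights flickered on."))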
Configuration for Hugging Face API
To use Hugging Face's image generation models, an API token is required. It can be created in your Hugging Face account settings (Access Tokens) and should be added to the environment file as follows:
API_TOKEN="your huggingface api token"
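For reference, a text-to-image request against the Hugging Face Inference API can look like the sketch below; the model id and file name are examples and may differ from what the project actually uses.

import os
import requests

MODEL_ID = "stabilityai/stable-diffusion-2-1"  # example model, not necessarily the project's default
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# For text-to-image models, the inference API responds with raw image bytes.
response = requests.post(API_URL, headers=headers,
                         json={"inputs": "a rain-soaked street at dusk, cinematic lighting"})
response.raise_for_status()
with open("frame_0.jpg", "wb") as f:
    f.write(response.content)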
Optional AI Models
The tool can also use the Pollinations-AI service, which requires no token and is based on OpenAI's DALL·E.
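Assuming Pollinations' public prompt-addressed image endpoint, a token-free request can be as simple as fetching the image for a URL-encoded prompt; the endpoint and file name below are illustrative and may change.

import requests
from urllib.parse import quote

prompt = "a rain-soaked street at dusk, cinematic lighting"
# Pollinations serves a generated image directly at a prompt-addressed URL, no API token needed.
url = f"https://image.pollinations.ai/prompt/{quote(prompt)}"
image = requests.get(url, timeout=120)
image.raise_for_status()
with open("frame_pollinations.jpg", "wb") as f:
    f.write(image.content)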
Installation of FFmpeg
Merging the generated audio into the video requires FFmpeg, so it must be installed before running the tool.
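On macOS, the development platform mentioned above, FFmpeg is most commonly installed with Homebrew:
brew install ffmpeg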
Usage Instructions
Once everything is set up, you can start the application with:
python3.10 app.py
Then open the web interface in your browser at:
http://127.0.0.1:5001/
Support and Sponsorship
Users who find the project useful are encouraged to support it; when donating, please include your GitHub username in the note. For further engagement and discussion, follow the author's WeChat public account: 老码沉思录.
License
The Text2Video project is released under the MIT License, ensuring freedom to use, modify, and distribute the tool.