Introducing the Text2Video Project
The Text2Video project is an innovative tool designed to transform written text into engaging videos that can be saved onto a local device. Originally conceived to enhance the visual reading experience of novels, this tool offers a unique blend of imagery, audio, and text to deliver a rich multimedia representation of any given content.
Implementation Overview
The process that drives Text2Video can be broken down into several key steps:
- Text Segmentation: The tool begins by parsing the input text into manageable segments. Currently, this is done by splitting the text into sentences, using punctuation marks as delimiters (a minimal sketch of this step follows the overview).
- Image and Sound Generation: For each sentence, the tool generates a corresponding image and audio clip. Image generation is powered by the open-source Stable Diffusion model, while text-to-speech conversion is handled by Edge-TTS (see the TTS sketch below).
- Advanced Textual Prompts: To improve the quality and relevance of the generated images, the tool uses a large language model to create "midjourney-like" prompts. These prompts are then fed into a model hosted on Hugging Face to produce the final visuals.
- Video Compilation: Using OpenCV, the generated images are stitched together into an MP4 video, with each sentence rendered as a subtitle at the bottom of the frame (see the compilation sketch below).
- Audio Integration: The length of each audio clip determines how long its image stays on screen. FFmpeg then merges the audio track with the video so that narration, subtitles, and imagery stay in sync; the same sketch below shows this merge step.
The end result is a fully synchronized video that narrates the original text with corresponding images and audio.
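To illustrate the segmentation step, the sketch below splits text into sentences on common sentence-ending punctuation. The regular expression and function name are illustrative assumptions, not code taken from the project.

import re

def split_into_sentences(text):
    # Split after English or Chinese sentence-ending punctuation, dropping empty pieces.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_into_sentences("The rain stopped. She opened the door!")
print(sentences)  # ['The rain stopped.', 'She opened the door!']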
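The text-to-speech step can be reproduced with the edge-tts Python package; the voice name and output path below are placeholders rather than the project's defaults.

import asyncio
import edge_tts

async def synthesize(sentence, out_path, voice="en-US-AriaNeural"):
    # Edge-TTS synthesizes the sentence and saves it as an MP3 file.
    communicate = edge_tts.Communicate(sentence, voice)
    await communicate.save(out_path)

asyncio.run(synthesize("The rain stopped.", "sentence_0.mp3"))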
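The compilation and audio-integration steps can be approximated as follows: OpenCV writes the frames with each sentence drawn as a subtitle, and FFmpeg muxes the narration in afterwards. The frame size, durations, and file names are assumptions for the sketch, not values from the project.

import subprocess
import cv2

FPS, WIDTH, HEIGHT = 24, 1280, 720
writer = cv2.VideoWriter("silent.mp4", cv2.VideoWriter_fourcc(*"mp4v"), FPS, (WIDTH, HEIGHT))

# One (image, sentence, duration-in-seconds) triple per segment.
for image_path, sentence, seconds in [("frame_0.jpg", "The rain stopped.", 3.0)]:
    frame = cv2.resize(cv2.imread(image_path), (WIDTH, HEIGHT))
    # Draw the sentence as a subtitle near the bottom of the frame.
    cv2.putText(frame, sentence, (40, HEIGHT - 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    for _ in range(int(seconds * FPS)):
        writer.write(frame)
writer.release()

# Mux the narration (e.g. the concatenated Edge-TTS clips) into the silent video.
subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", "narration.mp3",
                "-c:v", "copy", "-c:a", "aac", "-shortest", "output.mp4"], check=True)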
Quick Start with Docker
The quickest way to start the Text2Video tool is via Docker:
docker-compose up --build
Local Development Environment
For local development, the tool targets macOS with Python 3.10.12. FFmpeg must be installed because it handles the audio and video integration; you can verify the installation with:
ffmpeg -version
Development prerequisites can be installed using:
pip install -r requirements.txt
Enhancing Image Quality with Prompts
To improve image quality, you can configure an OpenAI-compatible API key; the tool uses it to turn each sentence into a richer image prompt. The base URL can point to a proxy or to a compatible provider (the example below uses Moonshot):
OPEN_AI_API_KEY="your open ai api key"
OPEN_AI_BASE_URL="https://api.moonshot.cn/v1"
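As a minimal sketch of how this key and base URL could be used to produce a "midjourney-like" prompt with the OpenAI Python SDK, the snippet below is illustrative only; the model name and the instruction wording are assumptions, not the project's own.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPEN_AI_API_KEY"],
                base_url=os.environ.get("OPEN_AI_BASE_URL"))

def sentence_to_prompt(sentence):
    # Ask the chat model to rewrite a sentence as a detailed image-generation prompt.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; use whatever model the configured endpoint serves
        messages=[
            {"role": "system", "content": "Rewrite the sentence as a vivid, midjourney-style image prompt."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content

print(sentence_to_prompt("The rain stopped and the street lights flickered on."))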
Configuration for Hugging Face API
To use Hugging Face's image generation models, an API token is required. It can be created in your Hugging Face account settings (Access Tokens) and should be added to the environment file as follows:
API_TOKEN="your huggingface api token"
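For reference, a text-to-image request against the Hugging Face Inference API can look like the sketch below; the model id and file name are examples and may differ from what the project actually uses.

import os
import requests

MODEL_ID = "stabilityai/stable-diffusion-2-1"  # example model, not necessarily the project's default
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

# For text-to-image models, the inference API responds with raw image bytes.
response = requests.post(API_URL, headers=headers,
                         json={"inputs": "a rain-soaked street at dusk, cinematic lighting"})
response.raise_for_status()
with open("frame_0.jpg", "wb") as f:
    f.write(response.content)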
Optional AI Models
The tool can also use the Pollinations-AI service, which requires no token and is based on OpenAI's DALL·E.
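Assuming Pollinations' public prompt-addressed image endpoint, a token-free request can be as simple as fetching the image for a URL-encoded prompt; the endpoint and file name below are illustrative and may change.

import requests
from urllib.parse import quote

prompt = "a rain-soaked street at dusk, cinematic lighting"
# Pollinations serves a generated image directly at a prompt-addressed URL, no API token needed.
url = f"https://image.pollinations.ai/prompt/{quote(prompt)}"
image = requests.get(url, timeout=120)
image.raise_for_status()
with open("frame_pollinations.jpg", "wb") as f:
    f.write(image.content)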
Installation of FFmpeg
Merging the generated audio into the video requires FFmpeg, so it must be installed before running the tool.
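On macOS, the development platform mentioned above, FFmpeg is most commonly installed with Homebrew:
brew install ffmpeg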
Usage Instructions
Once everything is set up, you can start the application with:
python3.10 app.py
Then open the web interface in your browser at:
http://127.0.0.1:5001/
Support and Sponsorship
Users who find the project useful are encouraged to support it; when donating, please include your GitHub username in the note. For further engagement and discussion, follow the author's WeChat public account: 老码沉思录.
License
The Text2Video project is released under the MIT License, ensuring freedom to use, modify, and distribute the tool.