Project Introduction: Magic-Doc
Magic-Doc is a lightweight open-source tool that converts several document file types to Markdown. It supports PPT, PPTX, DOC, DOCX, and PDF files, and it can convert both local files and remote files stored on Amazon S3.
Installation & Prerequisites
To use Magic-Doc, ensure you have Python 3.10 installed on your machine. Depending on your operating system, additional dependencies may be required.
- Linux/OSX Users: Install LibreOffice using your package manager (e.g., apt-get, yum, or brew).
- Windows Users: Download and install LibreOffice, then add the LibreOffice program directory to your system's PATH environment variable.
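Before installing Magic-Doc, it can help to confirm that LibreOffice is actually reachable from PATH. The following is a minimal sketch, not part of Magic-Doc: the helper name find_libreoffice is hypothetical, and the executable names soffice and libreoffice are assumptions about how LibreOffice is installed on your system.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first candidate executable found on PATH, or None.

    The default candidate names are assumptions; adjust them to match
    your LibreOffice installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    print(location or "LibreOffice executable not found on PATH")
```

If the script reports that nothing was found, revisit the package installation (Linux/OSX) or the PATH entry (Windows) before continuing.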
After setting up the prerequisites, install Magic-Doc with pip (the commands below install it under the package name fairy-doc), using one of the following commands:
- For CPU version:
pip install fairy-doc[cpu]
- For GPU version:
pip install fairy-doc[gpu]
Features & Usage Example
Magic-Doc exposes a DocConverter class that handles both local and S3 files. The examples below show each case:
Local File Conversion
from magic_doc.docconv import DocConverter
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
S3 File Conversion
from magic_doc.docconv import DocConverter, S3Config
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
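In both cases the call is the same: convert takes a path and a timeout and returns the Markdown content together with the time the conversion took. Only the s3_config passed to DocConverter differs between local and S3 files.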
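When converting many files, one failing document should not stop the rest. The sketch below is an illustration rather than part of Magic-Doc: convert_many and needs_s3_config are hypothetical helpers, and convert_many only assumes it is given a callable with the same shape as converter.convert in the examples above (a path plus conv_timeout, returning a Markdown string and a time cost).

```python
def needs_s3_config(path):
    """Return True for paths that use the s3:// scheme shown in the S3 example."""
    return path.startswith("s3://")


def convert_many(convert, paths, conv_timeout=300):
    """Convert each path with the given callable and collect results and failures.

    `convert` is expected to accept (path, conv_timeout=...) and return a
    (markdown_content, time_cost) tuple, as in the examples above.
    """
    results = {}
    failures = {}
    for path in paths:
        try:
            markdown_content, time_cost = convert(path, conv_timeout=conv_timeout)
        except Exception as exc:  # record the error and continue with the next file
            failures[path] = exc
        else:
            results[path] = (markdown_content, time_cost)
    return results, failures
```

With Magic-Doc installed, you would pass converter.convert as the first argument, for example convert_many(converter.convert, ["a.pptx", "b.docx"]). If any path satisfies needs_s3_config, build the converter with an S3Config as in the S3 example.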
Performance
Conversion speed varies widely by file type. In a test environment with an AMD EPYC 7742 64-Core Processor and an NVIDIA A100 GPU on CentOS 7, the tool achieved the following throughput:
File Type | Speed |
---|---|
PDF (digital) | 347 pages/s |
PDF (ocr) | 2.7 pages/s |
PPT | 20 pages/s |
PPTX | 149 pages/s |
DOC | 600 pages/s |
DOCX | 1482 pages/s |
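The table can be used for a rough time estimate of a batch job. The sketch below simply divides page counts by the reported throughput. The dictionary keys and the estimate_seconds helper are hypothetical names introduced here, and the result only applies to hardware comparable to the test environment above.

```python
# Throughput in pages per second, copied from the table above.
PAGES_PER_SECOND = {
    "pdf_digital": 347,
    "pdf_ocr": 2.7,
    "ppt": 20,
    "pptx": 149,
    "doc": 600,
    "docx": 1482,
}


def estimate_seconds(page_counts):
    """Estimate total conversion time for a mapping of file type to page count."""
    return sum(
        pages / PAGES_PER_SECOND[file_type]
        for file_type, pages in page_counts.items()
    )


if __name__ == "__main__":
    job = {"pdf_ocr": 1000, "docx": 1482}
    print(f"Estimated time: {estimate_seconds(job):.1f} s")
```

For instance, 1000 scanned PDF pages at 2.7 pages/s take roughly 370 seconds, while 1482 DOCX pages take about one second, so OCR work dominates any mixed batch.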
Acknowledgments
Magic-Doc owes its development to several foundational technologies and libraries, including Antiword, LibreOffice, PyMuPDF, and PaddleOCR, without which this tool would not be possible.
Involvement and Contribution
The team behind Magic-Doc welcomes contributions and collaboration. They offer platforms like Discord and WeChat for users and contributors to connect and discuss improvements or issues.
Licensing and Citation
Magic-Doc is available under the Apache 2.0 License, allowing users to use and modify the tool as needed. For academic or professional use, you can cite the tool using the citation provided in the project documentation.
In short, Magic-Doc converts PPT, PPTX, DOC, DOCX, and PDF files to Markdown, works with both local and S3-hosted files, and reaches high throughput for every format except OCR-based PDF conversion.