llm-ebook-summary - Efficiently Generate Detailed Summaries for E-books

Bulleted Notes Book Summaries

Introduction

The llm-ebook-summary project creates detailed bulleted-note summaries of books and long texts. It focuses on ebooks in epub and pdf formats that contain appropriate Table of Contents (ToC) metadata, which allows chapters to be extracted automatically from most books and then divided into manageable chunks of roughly 2000 tokens. Fallback options are provided for documents that lack this metadata.
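The chunking step can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `chunk_text` is hypothetical, and whitespace-separated words stand in for tokens, so counts from a real tokenizer would differ.

```python
def chunk_text(text: str, max_tokens: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_tokens words.

    Assumption: one whitespace-separated word approximates one token.
    """
    chunks: list[str] = []
    current: list[str] = []  # paragraphs collected for the chunk in progress
    count = 0                # approximate token count of `current`

    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        # A single paragraph longer than max_tokens is kept whole,
        # so it becomes one oversized chunk.
        current.append(para)
        count += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of chunks that vary in size.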

Main Idea

The core idea of this project is to break a document into smaller parts rather than tackle the entire content at once. Because each subsection is summarized on its own, the responses are more precise than a single-page summary of the whole book. Users can pose specific questions to each of these smaller parts, applying the same queries across all sections, which improves understanding of the material without overwhelming the model with too much text at one time.
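As a rough sketch of this idea, the loop below applies every question to every chunk. The names `ask_every_chunk` and `ask` are hypothetical: `ask` stands for whatever function sends a prompt to a model and returns its reply, and the prompt layout is an assumption rather than the project's real template.

```python
from typing import Callable


def ask_every_chunk(
    chunks: list[str],
    questions: list[str],
    ask: Callable[[str], str],
) -> list[dict]:
    """Pose each question to each chunk and collect the answers."""
    results = []
    for index, chunk in enumerate(chunks):
        for question in questions:
            # Assumed prompt layout: the question, a blank line, then the chunk.
            prompt = f"{question}\n\n{chunk}"
            results.append(
                {"chunk": index, "question": question, "answer": ask(prompt)}
            )
    return results
```

Passing the model call in as a parameter keeps the loop independent of any particular backend and makes it easy to try with a stub function.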

Comparison with RAG

Like Retrieval Augmented Generation (RAG) systems, this project divides the document into segments that fit within a model's context window. It differs by posing the same questions to every segment of the document rather than trying to identify the best part to query. This makes it possible to leverage the full potential of Large Language Models (LLMs) without depending on multiple third-party applications.

Setup and Usage

Before starting, users need to ensure Python 3.11.9 is installed, using tools like conda or pyenv for version management. The setup involves installing dependencies, downloading necessary models, and updating the configuration file _config.yaml, specifying defaults for prompts and model configurations.

E-book Conversion

Users can convert their ebooks into chunked CSV or TXT files using an automated script if the document is in epub or pdf format. The book2text.py script can split these ebooks by chapter or section, producing output files ready for processing into summaries.

Summary Generation

Once the ebook is prepared, the sum.py script generates summaries of the processed text. The script requires choosing input types (CSV or TXT) and offers options to customize model and prompt usage. The output includes rendered markdown summaries and a CSV file with detailed information about each processed text segment.
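The two kinds of output can be pictured with a short sketch built on the standard `csv` module. The column names (`title`, `text`, `summary`) and both function names are assumptions made for illustration; the actual columns written by sum.py may differ.

```python
import csv
from typing import Iterable, TextIO

# Hypothetical column layout, chosen only for this example.
FIELDNAMES = ["title", "text", "summary"]


def write_summary_csv(rows: Iterable[dict], out: TextIO) -> None:
    """Write one CSV row per processed text segment."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def render_markdown(rows: Iterable[dict]) -> str:
    """Render each segment as a heading followed by its summary."""
    return "\n\n".join(f"## {row['title']}\n\n{row['summary']}" for row in rows)
```

When writing to a real file, open it with `newline=""`, as the `csv` module documentation recommends, so that line endings are not translated twice.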

Models

The project relies on downloadable models from platforms like Ollama and HuggingFace. These models are pre-trained and available in several sizes, which gives flexibility in handling different types of content.

Ebook ToC Check

Users can verify if their ebooks have clickable ToCs with tools like Firefox or Brave. While epub files generally handle this gracefully, occasional exceptions necessitate checking for proper formatting.
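For epub files, a scripted check can complement the manual one. The sketch below relies on two facts about the format: an epub is a ZIP archive, and its ToC is usually either an NCX file (EPUB 2) or a manifest item marked with the `nav` property (EPUB 3). The function name `epub_has_toc` is hypothetical, and the check only detects that a ToC is declared, not that its entries are complete or useful.

```python
import re
import zipfile

# Matches a manifest attribute such as properties="nav" or properties="nav scripted".
NAV_PROPERTY = re.compile(r'properties\s*=\s*["\'][^"\']*\bnav\b')


def epub_has_toc(path_or_file) -> bool:
    """Return True if the epub archive declares an NCX or nav ToC."""
    with zipfile.ZipFile(path_or_file) as archive:
        names = archive.namelist()
        # EPUB 2 style: a dedicated .ncx file holds the ToC.
        if any(name.lower().endswith(".ncx") for name in names):
            return True
        # EPUB 3 style: the package (.opf) manifest flags a navigation document.
        for name in names:
            if name.lower().endswith(".opf"):
                opf = archive.read(name).decode("utf-8", errors="replace")
                if NAV_PROPERTY.search(opf):
                    return True
    return False
```

A file that fails this check is a candidate for the fallback options, although a passing file can still have a sparse or badly structured ToC.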

Disclaimer

The project emphasizes user responsibility in ensuring accurate summaries. Users should be cautious of references and other potential issues that may impact summary quality, such as mismatched content lengths or anomalies in the structure.

Inspiration

The project was inspired by the need for efficient book summarization to support psychological theory and practice research. Originally, manual attempts to summarize books were time-consuming. Through learning and applying LLM fine-tuning, the project founder developed a tool that significantly streamlines the content curation process, transforming it into a fast and reliable method for creating custom datasets from a wide range of source materials.