Introduction to Textbook Quality
The Textbook Quality project generates high-quality pretraining data written with the precision and depth of a textbook. It can produce very large volumes of educational content; one example output is 70 million tokens long. It runs multiple generation processes in parallel and works with either the OpenAI API or any OpenAI-compatible API you have available. Topics can be generated from scratch or built from seed topics that you supply.
Features and Functionality
The project uses retrieval to improve the quality of the generated content. By default, it uses Serply for search. You can switch to SerpAPI or disable retrieval entirely, depending on your needs.
The project's core is also designed to be extensible. You can add your own connectors (or "adaptors") to work with new APIs or data retrieval systems.
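To make the adaptor idea concrete, here is a minimal sketch of a pluggable retrieval registry. Every name in it (SEARCH_ADAPTORS, register_adaptor, retrieve, and the "static" backend) is hypothetical and chosen for illustration; the project's actual adaptor interface may look quite different.

```python
from typing import Callable, Dict, List

# Hypothetical registry: these names are illustrative, not the project's API.
SearchFn = Callable[[str], List[str]]
SEARCH_ADAPTORS: Dict[str, SearchFn] = {}


def register_adaptor(name: str) -> Callable[[SearchFn], SearchFn]:
    """Decorator that records a search function under a backend name."""
    def decorator(fn: SearchFn) -> SearchFn:
        SEARCH_ADAPTORS[name] = fn
        return fn
    return decorator


@register_adaptor("none")
def no_retrieval(query: str) -> List[str]:
    # Retrieval disabled: return no supporting passages.
    return []


@register_adaptor("static")
def static_notes(query: str) -> List[str]:
    # Stand-in for a real search service: look the query up in a local dict.
    notes = {"python": ["Python is a dynamically typed language."]}
    return notes.get(query.lower(), [])


def retrieve(backend: str, query: str) -> List[str]:
    """Dispatch a query to the adaptor registered for `backend`."""
    if backend not in SEARCH_ADAPTORS:
        raise ValueError(f"unknown backend: {backend!r}")
    return SEARCH_ADAPTORS[backend](query)
```

With a registry like this, supporting a new service would mean writing one function and registering it under a new name, without touching the code that calls retrieve.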
Installation Instructions
Prerequisites
Before you can dive into the Textbook Quality project, make sure you have the following:
- Python 3.9 or later (3.11 is recommended).
- A PostgreSQL database must be set up. Mac users can install PostgreSQL via Homebrew with:
brew install postgres
Setup Steps
To get the project up and running, follow these steps:
- Create a new database in PostgreSQL by executing:
psql postgres -c "create database textbook;"
- Clone the repository with:
git clone https://github.com/VikParuchuri/textbook_quality.git
- Navigate to the directory:
cd textbook_quality
- Install dependencies using Poetry:
poetry install
- Apply the initial development migrations with:
invoke migrate-dev
Configuration
The first configuration step is to create a local.env file in the project's root directory to hold your secret keys. Alternatively, you can set the same values as environment variables.
The settings below control which LLM and retrieval services the project uses:
Utilizing OpenAI and Retrieval
- Insert your OpenAI API key in the format OPENAI_KEY=sk-xxxxxx.
- Provide a key for your preferred search service: either a SERPLY_KEY or a SERPAPI_KEY.
- Choose the matching search backend with SEARCH_BACKEND=serply or SEARCH_BACKEND=serpapi.
By default, the system uses gpt-3.5. If you have access to gpt-4, you can adjust your environment variables to use it instead.
Alternative API Configurations
If you use an alternative OpenAI-compatible API:
- Set a placeholder value for OPENAI_KEY and set OPENAI_BASE_URL to your API's URL.
- Specify your model using LLM_TYPE, LLM_INSTRUCT_TYPE, and LLM_EXTENDED_TYPE.
- Adjust the context length and finetuning settings to match your model's specifications.
Operating Without Retrieval
Set SEARCH_BACKEND=none to disable search features.
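The settings above depend on one another: the chosen search backend determines which API key is needed. As a rough sketch, a small check like the following could catch a mismatched local.env before a long run starts. The helper names parse_env and missing_settings are hypothetical and not part of the project, and the simple KEY=value parsing is an assumption about the file format.

```python
from typing import Dict, List, Optional


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# API key each search backend needs, following the settings described above.
REQUIRED_SEARCH_KEY: Dict[str, Optional[str]] = {
    "serply": "SERPLY_KEY",
    "serpapi": "SERPAPI_KEY",
    "none": None,
}


def missing_settings(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are missing or unrecognized."""
    problems: List[str] = []
    if not settings.get("OPENAI_KEY"):
        problems.append("OPENAI_KEY")
    # Serply is described as the default backend, so assume it when unset.
    backend = settings.get("SEARCH_BACKEND", "serply")
    if backend not in REQUIRED_SEARCH_KEY:
        problems.append("SEARCH_BACKEND")
    else:
        needed = REQUIRED_SEARCH_KEY[backend]
        if needed and not settings.get(needed):
            problems.append(needed)
    return problems


example = """
OPENAI_KEY=sk-xxxxxx
SEARCH_BACKEND=serpapi
"""
print(missing_settings(parse_env(example)))  # prints ['SERPAPI_KEY']
```

Here the example selects serpapi without supplying SERPAPI_KEY, so that key is reported. Setting SEARCH_BACKEND=none would make the same file pass.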
Usage Overview
Textbook Quality provides three core scripts for different stages of content generation:
Topic Generation
- Create topics from scratch by specifying a subject, an output file, and the number of iterations.
Example: python topic_generator.py "computer science with python" python_cs_titles.json --iterations 50
Topic Augmentation
- Use existing topics as seeds and expand on them, optionally restricting the domain.
Example: python topic_augmentor.py python_titles.json python_topics.json --domain python
Textbook Generation
From Titles:
- Generate textbooks from a topic list. The script runs multiple workers in parallel to speed up generation.
Example: python book_generator.py topics.json books.jsonl --workers 5
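The worker-based approach can be pictured with a small sketch. This is not the project's implementation: generate_book is a hypothetical placeholder for a real LLM call, the record fields are invented for illustration, and the use of threads rather than processes is an assumption.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def generate_book(topic: str) -> Dict[str, str]:
    # Hypothetical stand-in for an LLM call; a real worker would call the API.
    return {"topic": topic, "text": f"Placeholder chapter about {topic}."}


def generate_all(topics: List[str], workers: int = 5) -> List[str]:
    """Generate one record per topic in parallel.

    Returns one JSON string per topic (JSONL lines), in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input topics.
        books = list(pool.map(generate_book, topics))
    return [json.dumps(book) for book in books]


lines = generate_all(["Recursion in Python", "Sorting algorithms"], workers=2)
print("\n".join(lines))
```

Because each topic is independent and most of the time is spent waiting on API responses, raising the worker count tends to increase throughput until rate limits become the bottleneck.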
From Outlines:
- Generate textbooks from detailed outlines specified in a JSONL format.
For a polished structure, the project includes tools to create clean tables of contents.
Development and Debugging
Contributions are welcome. You can add new large language model adaptors, refine the retrieval methods, or add new generator tasks. To see error messages while debugging, set DEBUG=true.
Textbook Quality is a versatile, extensible platform for generating educational content, adaptable to a range of user needs and system capabilities.