1FileLLM: Efficient Data Aggregation for LLM Ingestion

1FileLLM is a command-line tool for building detailed, information-rich prompts for large language models (LLMs). It aggregates and preprocesses data from a variety of sources into a single text file and copies the result to your clipboard for immediate use.

Features

1FileLLM provides the following features:

Automatic Source Detection: Identifies the type of input from the path, URL, or identifier provided.
Wide Range of Input Support: Processes local files and directories; GitHub repositories, pull requests, and issues; academic papers from arXiv; YouTube transcripts; web documentation; and Sci-Hub papers via DOI or PMID.
Multi-format Capability: Handles multiple file formats, including Jupyter Notebooks (.ipynb) and PDFs.
Web Crawling: Extracts content from linked web pages up to a specified depth.
Integration with Sci-Hub: Automatically downloads research papers using DOIs or PMIDs.
Text Preprocessing: Produces a compressed version of the text by cleaning and reformatting it, removing stopwords, and converting it to lowercase.
Clipboard Copying: Automatically copies the uncompressed text to the clipboard for quick pasting into LLMs.
Token Count Reporting: Provides a report on the token count for both compressed and uncompressed outputs.
XML Output Format: Wraps the output in XML tags so that an LLM can tell the individual sources and content types apart.
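
The text preprocessing and token count features listed above can be illustrated with a minimal sketch. This is not 1FileLLM's own code: the function names, the tiny stopword set, and the whitespace-based count are all assumptions made for illustration, and a real tokenizer would report different numbers.

```python
import re

# Tiny illustrative stopword set (an assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def compress_text(text: str) -> str:
    """Lowercase the text, strip punctuation, drop stopwords, collapse whitespace."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    words = [word for word in cleaned.split() if word not in STOPWORDS]
    return " ".join(words)


def rough_token_count(text: str) -> int:
    """Whitespace word count, used here only as a crude stand-in for a tokenizer."""
    return len(text.split())


if __name__ == "__main__":
    sample = "The tool compresses and formats the text of a README."
    compressed = compress_text(sample)
    print(compressed)
    print(rough_token_count(sample), "->", rough_token_count(compressed))
```

Comparing the two counts gives a quick sense of how much the compressed output saves relative to the uncompressed one.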

Installation

Prerequisites

Install the required dependencies with:

pip install -U -r requirements.txt

Optionally, create and activate a virtual environment first so that the dependencies are isolated from the system Python:

python -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt

GitHub Personal Access Token

To access private GitHub repositories, generate a personal access token and configure it as described in the 1FileLLM documentation.
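
As a rough illustration of how such a token is typically supplied, the sketch below reads it from an environment variable and builds headers for a GitHub API request. The variable name GITHUB_TOKEN and the helper github_headers are assumptions for this example; check the 1FileLLM documentation for the actual configuration.

```python
import os


def github_headers(env_var: str = "GITHUB_TOKEN") -> dict:
    """Build GitHub API headers, adding Authorization only when a token is set.

    The environment variable name is an assumption, not a documented setting.
    """
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get(env_var, "").strip()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


if __name__ == "__main__":
    # Report only whether a token was found; never print the token itself.
    print("Authorization" in github_headers())
```

Keeping the token in the environment rather than in the source file avoids committing it to version control by accident.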

Usage

To use the tool, run the script and enter a path, URL, or identifier when prompted:

python onefilellm.py

Alternatively, pass the URL or path as a command-line argument to skip the prompt:

python onefilellm.py https://github.com/jimmc414/1filellm
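
The two invocation styles above amount to "use the argument if one was given, otherwise ask". A minimal sketch of that pattern follows; the function name resolve_input and the prompt wording are hypothetical and are not taken from onefilellm.py.

```python
import sys


def resolve_input(argv, prompt=input) -> str:
    """Return the first command-line argument, or ask the user when none is given.

    argv is expected in sys.argv form: the script name first, then arguments.
    The prompt parameter exists so the function can be exercised without a terminal.
    """
    if len(argv) > 1:
        return argv[1].strip()
    return prompt("Enter a path, URL, DOI, or PMID: ").strip()


# Typical call from a script's entry point (left as a comment so that
# nothing blocks on input when this sketch is imported or executed):
#     source = resolve_input(sys.argv)
```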

Expected Inputs and Outputs

1FileLLM accepts several kinds of input. Each is converted to text, wrapped in XML, and copied to the clipboard:

Local File: Writes the file's contents to the text output.
Local Directory: Walks the directory and combines its files into a single text file.
GitHub Repository/PR/Issue URLs: Collects the repository files, or the pull request or issue content, as text.
Academic Paper URLs: Extracts the text of the PDF paper.
YouTube URLs: Retrieves the video transcript as text.
Webpage URLs: Scrapes the page content as text.
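
The mapping from input to handler listed above depends on detecting the input type first. The following sketch shows one way such detection could work; the function name detect_source, the returned labels, and the matching rules are assumptions for illustration and may differ from what 1FileLLM actually does.

```python
import os
import re
from urllib.parse import urlparse


def detect_source(value: str) -> str:
    """Classify an input string as a URL type, a local path, a DOI, or a PMID."""
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host == "github.com":
            if re.search(r"/pull/\d+", parsed.path):
                return "github_pull_request"
            if re.search(r"/issues/\d+", parsed.path):
                return "github_issue"
            return "github_repo"
        if host == "arxiv.org":
            return "arxiv_paper"
        if host in ("youtube.com", "youtu.be"):
            return "youtube_transcript"
        return "web_page"
    # Existing local paths take priority over identifier patterns.
    if os.path.isdir(value):
        return "local_directory"
    if os.path.isfile(value):
        return "local_file"
    # DOIs start with "10.", a registrant code, a slash, and a suffix.
    if re.fullmatch(r"10\.\d{4,9}/\S+", value):
        return "doi"
    # PMIDs are plain integers.
    if re.fullmatch(r"\d{1,8}", value):
        return "pmid"
    return "unknown"


if __name__ == "__main__":
    # The DOI below is a placeholder, not a real paper.
    for example in ("https://github.com/jimmc414/1filellm", "10.1000/xyz123", "."):
        print(example, "->", detect_source(example))
```

Checking for existing local paths before the identifier patterns keeps a file whose name happens to be all digits from being mistaken for a PMID.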

The tool writes three output files: uncompressed_output.txt (the full aggregated text), compressed_output.txt (the text after preprocessing), and processed_urls.txt (the URLs processed during web crawling).

Configuration

The allowed file types and the web crawling depth can be changed by editing the corresponding values in the source code.
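
To make the two settings concrete, here is a small sketch of how an extension allowlist and a depth limit might be expressed and applied. The constant names, their default values, and the convention that the starting page has depth 0 are assumptions for illustration; the real names and defaults are in the tool's source.

```python
import os

# Hypothetical settings; edit these to change what is collected.
ALLOWED_EXTENSIONS = {".py", ".txt", ".md", ".ipynb"}
MAX_CRAWL_DEPTH = 2


def is_allowed_file(path: str, allowed=ALLOWED_EXTENSIONS) -> bool:
    """True when the file extension (compared case-insensitively) is in the allowed set."""
    return os.path.splitext(path)[1].lower() in allowed


def should_follow(depth: int, max_depth: int = MAX_CRAWL_DEPTH) -> bool:
    """True while a link at the given depth is still within the crawl limit.

    Depth 0 is taken to be the starting page (an assumption of this sketch).
    """
    return depth <= max_depth


if __name__ == "__main__":
    print(is_allowed_file("notes/README.MD"), is_allowed_file("logo.png"))
    print(should_follow(2), should_follow(3))
```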

XML Output Format

Output is wrapped in XML tags that clearly delineate each content type. This structure helps an LLM tell the sources apart and tends to improve prompt performance.
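
A minimal sketch of this kind of wrapping is shown below. The tag name source, its type and path attributes, and the function wrap_source are hypothetical and not 1FileLLM's actual schema. The sketch escapes the content to keep the XML well-formed; a tool aimed purely at LLM readability might reasonably leave the content unescaped.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_source(kind: str, location: str, content: str) -> str:
    """Wrap one piece of extracted text in a tag that records its type and origin.

    quoteattr adds the surrounding quotes and escapes attribute values;
    escape handles &, <, and > inside the content.
    """
    opening = f"<source type={quoteattr(kind)} path={quoteattr(location)}>"
    return f"{opening}\n{escape(content)}\n</source>"


if __name__ == "__main__":
    print(wrap_source("local_file", "notes.txt", "a < b & c"))
```

Several wrapped sources can then be concatenated into a single prompt, with each tag telling the model where one source ends and the next begins.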

Recent Updates

Recent updates include XML output for better LLM interaction, direct command-line input, and improved error handling.

By gathering content from many kinds of sources into a single prompt-ready text file, 1FileLLM streamlines data aggregation for anyone working with large language models.