Introduction to the GPT4 Paper Assistant Project
The GPT4 Paper Assistant is a user-friendly toolkit designed to help researchers and enthusiasts keep up with new publications on ArXiv, a popular repository for research papers in fields such as computer science, physics, and mathematics. The project uses GPT4, a powerful language model, together with author tracking to identify papers likely to interest each user. It runs daily, forwarding its findings to Slack via a bot or publishing them to a static GitHub Pages site.
A simple working example, showing each day's selected papers for the computer science category "cs.CL", is linked from the project repository.
Cost Efficiency
The tool is designed to be cost-effective. For instance, a full scan of the "cs.CL" category cost only $0.07 on February 7, 2024.
Changelog Highlights
- On February 15, 2024, several bugs were fixed, including author-parsing errors, title-filtering cost issues, and the handling of days with no new papers.
- On February 7, 2024, the tool was updated for changes in ArXiv's RSS format, and title filtering was introduced to further reduce costs.
How to Start Using the GPT4 Paper Assistant
This guide provides step-by-step instructions to set up and run the scanner.
Running on GitHub Actions
- Clone or fork the repository into a new GitHub repo and enable scheduled workflows.
- Customize `config/paper_topics.txt` with the specific paper topics you wish to monitor.
- Adapt `config/authors.txt` to include authors of interest, identified by their Semantic Scholar author IDs.
- Specify the desired ArXiv categories in `config/config.ini` (a hypothetical sketch of these files follows this list).
- Set your OpenAI key (`OAI_KEY`) as a GitHub secret.
- Configure your repository settings to build GitHub Pages using actions.
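As a rough illustration, the tracked-author file and the category setting might look like the sketch below. Both the `name, Semantic Scholar ID` layout and the section/key names are assumptions for illustration; the comments in the bundled files document the exact syntax.

```
# config/authors.txt (hypothetical layout): one author per line,
# a display name followed by that author's Semantic Scholar author ID
Jane Doe, 2109381
John Smith, 1741101

# config/config.ini (hypothetical excerpt): the ArXiv categories to scan
[FILTERING]
arxiv_category = cs.CL, cs.LG
```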
Once configured, the tool runs daily, posting findings on Slack and a GitHub Pages site.
Optional (but recommended):
- Obtain and use a Semantic Scholar API key to speed up author searches.
- Set up a Slack bot and retrieve an OAuth key to allow paper notifications through Slack.
- Create a Slack channel for the bot, obtain its channel ID, and set it as a GitHub secret.
- Fine-tune filtering options in `config/config.ini`.
- Consider setting your GitHub repository to private to keep its scheduled workflows from being automatically disabled due to inactivity.
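For orientation, posting to Slack from a bot ultimately reduces to a single API call. The snippet below is a minimal sketch using the official `slack_sdk` package, not the project's actual posting code; the environment-variable names are assumptions.

```
import os
from slack_sdk import WebClient  # pip install slack_sdk

# Minimal sketch of a bot post, not the project's actual code.
client = WebClient(token=os.environ["SLACK_KEY"])  # assumed variable name
client.chat_postMessage(
    channel=os.environ["SLACK_CHANNEL_ID"],  # assumed variable name
    text="Today's ArXiv picks: ...",
)
```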
The scheduled workflow runs daily at 1 PM UTC, posting updates to Slack and publishing to GitHub Pages.
Running Locally
Local execution follows the same setup steps but requires manual environment setup:
- Declare the required keys and IDs as environment variables.
- Run the scanner with `python main.py`, as in the sketch below.
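A minimal local run might look like the following. `OAI_KEY` matches the secret name used above; the Slack variable names are illustrative assumptions, so check `main.py` for the exact names it reads.

```
export OAI_KEY="sk-..."             # OpenAI API key
export SLACK_KEY="xoxb-..."         # Slack bot OAuth token (assumed name)
export SLACK_CHANNEL_ID="C0123456"  # Slack channel ID (assumed name)
python main.py
```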
Additional Considerations:
You might choose to save outputs only locally instead of pushing to Slack; select the desired output endpoints in `config/config.ini`, as sketched below.
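The exact option names are not reproduced here, so the following is a hypothetical excerpt showing the kind of output switches to look for:

```
# config/config.ini (hypothetical excerpt; option names are illustrative)
[OUTPUT]
# write a markdown digest and raw results locally
dump_md = true
dump_json = true
# disable Slack posting for local-only runs
push_to_slack = false
```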
Quick Setup Tips
- The tool requires minimal computational resources, making it well suited to a low-cost AWS VM.
- Add the following crontab entry for automatic execution (note that cron uses the machine's local time zone, so adjust the hour if you want to match the 1 PM UTC schedule used on GitHub Actions):

```
0 13 * * * python ~/arxiv_scanner/main.py
```
Crafting paper_topics.txt
The `paper_topics.txt` file houses the topics you're interested in following. Example topics might be:
- Methodological improvements in RLHF or instruction-following.
- Advances in test set contamination or membership inference methods for language models.
- Significant performance advancements in diffusion language models.
These examples emphasize specific research interests, which aids the tool in accurately identifying relevant papers.
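Concretely, the file is plain text that is handed to GPT as part of the filtering prompt. The layout below is a hypothetical sketch; the "Relevant:"/"Not relevant:" clarifier lines are an assumed convention for steering the filter, not a required syntax.

```
1. New methodological improvements to RLHF or instruction-following.
   - Relevant: concrete fine-tuning or alignment techniques for language models.
   - Not relevant: straightforward applications of existing RLHF pipelines.
2. Advances in test set contamination or membership inference for language models.
3. Significant performance advancements in diffusion language models.
```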
Mechanism Details
The tool operates as follows, with a hypothetical sketch of the final scoring step after this list:
- Retrieve the day's papers from ArXiv's RSS feed, keeping only new (non-updated) entries.
- Perform an initial author match against Semantic Scholar, using the scores listed in `authors.txt`.
- Run a GPT relevance check based on the specified criteria:
  - Papers whose authors carry no notable score are filtered out first.
  - GPT then evaluates each remaining paper's relevance and novelty against the specified topics.
- Papers are rated, ranked by overall score, and published accordingly.
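The sketch below illustrates what the final scoring and ranking step amounts to. It is not the project's actual code: the field names, the 1-10 score ranges, and the additive combination rule are all assumptions for illustration.

```
# Hypothetical sketch of the final scoring/ranking step, not the project's
# actual code. Assumes the GPT filter has already assigned relevance and
# novelty scores, and that author matches contribute a bonus.
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    relevance: int       # GPT-assigned match to paper_topics.txt (assumed 1-10)
    novelty: int         # GPT-assigned novelty (assumed 1-10)
    author_bonus: int = 0  # from authors.txt matches

def overall_score(p: Paper) -> int:
    # Assumed combination rule: reward papers that are both relevant and novel.
    return p.relevance + p.novelty + p.author_bonus

papers = [
    Paper("Sample-efficient RLHF", relevance=9, novelty=7),
    Paper("Diffusion LM survey", relevance=6, novelty=3, author_bonus=2),
]
for p in sorted(papers, key=overall_score, reverse=True):
    print(f"{overall_score(p):2d}  {p.title}")
```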
How to Contribute
The project uses `ruff` for linting and formatting. Contributors should install the repository's pre-commit hooks so these checks run automatically.
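Assuming the repository ships a standard pre-commit configuration (an assumption; check for a `.pre-commit-config.yaml` in the repo root), the usual setup is:

```
pip install pre-commit
pre-commit install
```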
Testing GPT Filters
The filtering logic can be tested independently of the full pipeline, which makes it possible to benchmark relevance judgments and to iterate on the GPT filter prompts.
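As one way to do this, the sketch below measures precision and recall of a stand-in filter against a tiny hand-labeled sample; `gpt_filter` and the labeled abstracts are placeholders, not names from the project.

```
# Hypothetical benchmark sketch for the GPT filter; the filter itself is
# stubbed out here -- swap in the project's real filtering call.
def gpt_filter(abstract: str) -> bool:
    # placeholder: pretend anything mentioning RLHF is relevant
    return "RLHF" in abstract

labeled = [  # (abstract snippet, hand label: is the paper relevant?)
    ("We improve RLHF sample efficiency ...", True),
    ("A survey of graph databases ...", False),
    ("Detecting test set contamination ...", True),
]
preds = [gpt_filter(a) for a, _ in labeled]
actual = [label for _, label in labeled]

tp = sum(p and a for p, a in zip(preds, actual))
fp = sum(p and not a for p, a in zip(preds, actual))
fn = sum(not p and a for p, a in zip(preds, actual))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```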
Additional Information
Originally crafted by Tatsunori Hashimoto, the project is available under the Apache 2.0 license, with acknowledgments to Chenglei Si for testing contributions.