Introduction to the GPT4 Paper Assistant Project
The GPT4 Paper Assistant is a user-friendly toolkit designed to help researchers and enthusiasts keep up with new publications on ArXiv, a popular repository for research papers in fields such as computer science, physics, and mathematics. The project uses GPT4, a powerful language model, together with author tracking to identify papers likely to interest each user. It runs daily, forwarding its findings to Slack via a bot or publishing them to a static GitHub Pages site.
A simple working example, showing each day's selected papers for the computer science category "cs.CL", is linked from the project repository.
Cost Efficiency
The tool is designed to be cost-effective. For instance, a full scan of the "cs.CL" category cost only $0.07 on February 7, 2024.
Changelog Highlights
- On February 15, 2024, several bugs were fixed, including author-parsing errors, title-filtering cost issues, and the handling of days with no new papers.
- On February 7, 2024, the tool was updated for changes in ArXiv's RSS format, and title filtering was introduced to further reduce costs.
How to Start Using the GPT4 Paper Assistant
This guide provides step-by-step instructions to set up and run the scanner.
Running on GitHub Actions
- Clone or fork the repository into a new GitHub repo and enable scheduled workflows.
- Customize `config/paper_topics.txt` with the specific paper topics you wish to monitor.
- Adapt `config/authors.txt` to include authors of interest, identified by their Semantic Scholar author IDs.
- Specify the desired ArXiv categories in `config/config.ini` (a hypothetical sketch of these files follows this list).
- Set your OpenAI key (`OAI_KEY`) as a GitHub secret.
- Configure your repository settings to build GitHub Pages using actions.
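As a rough illustration, the tracked-author file and the category setting might look like the sketch below. Both the `name, Semantic Scholar ID` layout and the section/key names are assumptions for illustration; the comments in the bundled files document the exact syntax.

```
# config/authors.txt (hypothetical layout): one author per line,
# a display name followed by that author's Semantic Scholar author ID
Jane Doe, 2109381
John Smith, 1741101

# config/config.ini (hypothetical excerpt): the ArXiv categories to scan
[FILTERING]
arxiv_category = cs.CL, cs.LG
```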
Once configured, the tool runs daily, posting findings on Slack and a GitHub Pages site.
Optional (but recommended):
- Obtain and use a Semantic Scholar API key to speed up author searches.
- Set up a Slack bot and retrieve an OAuth key to allow paper notifications through Slack.
- Create a Slack channel for the bot, obtain its channel ID, and set it as a GitHub secret.
- Fine-tune filtering options in `config/config.ini`.
- Consider setting your GitHub repository to private to keep its scheduled workflows from being automatically disabled due to inactivity.
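For orientation, posting to Slack from a bot ultimately reduces to a single API call. The snippet below is a minimal sketch using the official `slack_sdk` package, not the project's actual posting code; the environment-variable names are assumptions.

```
import os
from slack_sdk import WebClient  # pip install slack_sdk

# Minimal sketch of a bot post, not the project's actual code.
client = WebClient(token=os.environ["SLACK_KEY"])  # assumed variable name
client.chat_postMessage(
    channel=os.environ["SLACK_CHANNEL_ID"],  # assumed variable name
    text="Today's ArXiv picks: ...",
)
```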
The scheduled workflow runs daily at 1 PM UTC, posting updates to Slack and publishing to GitHub Pages.
Running Locally
Local execution follows the same setup steps but requires manual environment setup:
- Declare the required keys and IDs as environment variables.
- Run the scanner with `python main.py`, as in the sketch below.
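A minimal local run might look like the following. `OAI_KEY` matches the secret name used above; the Slack variable names are illustrative assumptions, so check `main.py` for the exact names it reads.

```
export OAI_KEY="sk-..."             # OpenAI API key
export SLACK_KEY="xoxb-..."         # Slack bot OAuth token (assumed name)
export SLACK_CHANNEL_ID="C0123456"  # Slack channel ID (assumed name)
python main.py
```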
Additional Considerations:
You might choose to save outputs only locally instead of pushing to Slack; select the desired output endpoints in `config/config.ini`, as sketched below.
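The exact option names are not reproduced here, so the following is a hypothetical excerpt showing the kind of output switches to look for:

```
# config/config.ini (hypothetical excerpt; option names are illustrative)
[OUTPUT]
# write a markdown digest and raw results locally
dump_md = true
dump_json = true
# disable Slack posting for local-only runs
push_to_slack = false
```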
Quick Setup Tips
- The tool requires minimal computational resources, making it well suited to a low-cost AWS VM.
- Add the following crontab entry for automatic execution (note that cron uses the machine's local time zone, so adjust the hour if you want to match the 1 PM UTC schedule used on GitHub Actions):

```
0 13 * * * python ~/arxiv_scanner/main.py
```
Crafting paper_topics.txt
The `paper_topics.txt` file houses the topics you're interested in following. Example topics might be:
- Methodological improvements in RLHF or instruction-following.
- Advances in test set contamination or membership inference methods for language models.
- Significant performance advancements in diffusion language models.
These examples emphasize specific research interests, which aids the tool in accurately identifying relevant papers.
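Concretely, the file is plain text that is handed to GPT as part of the filtering prompt. The layout below is a hypothetical sketch; the "Relevant:"/"Not relevant:" clarifier lines are an assumed convention for steering the filter, not a required syntax.

```
1. New methodological improvements to RLHF or instruction-following.
   - Relevant: concrete fine-tuning or alignment techniques for language models.
   - Not relevant: straightforward applications of existing RLHF pipelines.
2. Advances in test set contamination or membership inference for language models.
3. Significant performance advancements in diffusion language models.
```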
Mechanism Details
The tool operates as follows, with a hypothetical sketch of the final scoring step after this list:
- Retrieve the day's papers from ArXiv's RSS feed, keeping only new (non-updated) entries.
- Perform an initial author match against Semantic Scholar, using the scores listed in `authors.txt`.
- Run a GPT relevance check based on the specified criteria:
  - Papers whose authors carry no notable score are filtered out first.
  - GPT then evaluates each remaining paper's relevance and novelty against the specified topics.
- Papers are rated, ranked by overall score, and published accordingly.
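The sketch below illustrates what the final scoring and ranking step amounts to. It is not the project's actual code: the field names, the 1-10 score ranges, and the additive combination rule are all assumptions for illustration.

```
# Hypothetical sketch of the final scoring/ranking step, not the project's
# actual code. Assumes the GPT filter has already assigned relevance and
# novelty scores, and that author matches contribute a bonus.
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    relevance: int       # GPT-assigned match to paper_topics.txt (assumed 1-10)
    novelty: int         # GPT-assigned novelty (assumed 1-10)
    author_bonus: int = 0  # from authors.txt matches

def overall_score(p: Paper) -> int:
    # Assumed combination rule: reward papers that are both relevant and novel.
    return p.relevance + p.novelty + p.author_bonus

papers = [
    Paper("Sample-efficient RLHF", relevance=9, novelty=7),
    Paper("Diffusion LM survey", relevance=6, novelty=3, author_bonus=2),
]
for p in sorted(papers, key=overall_score, reverse=True):
    print(f"{overall_score(p):2d}  {p.title}")
```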
How to Contribute
The project uses `ruff` for linting and formatting. Contributors should install the repository's pre-commit hooks so these checks run automatically.
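Assuming the repository ships a standard pre-commit configuration (an assumption; check for a `.pre-commit-config.yaml` in the repo root), the usual setup is:

```
pip install pre-commit
pre-commit install
```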
Testing GPT Filters
The filtering logic can be tested independently of the full pipeline, which makes it possible to benchmark relevance judgments and to iterate on the GPT filter prompts.
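As one way to do this, the sketch below measures precision and recall of a stand-in filter against a tiny hand-labeled sample; `gpt_filter` and the labeled abstracts are placeholders, not names from the project.

```
# Hypothetical benchmark sketch for the GPT filter; the filter itself is
# stubbed out here -- swap in the project's real filtering call.
def gpt_filter(abstract: str) -> bool:
    # placeholder: pretend anything mentioning RLHF is relevant
    return "RLHF" in abstract

labeled = [  # (abstract snippet, hand label: is the paper relevant?)
    ("We improve RLHF sample efficiency ...", True),
    ("A survey of graph databases ...", False),
    ("Detecting test set contamination ...", True),
]
preds = [gpt_filter(a) for a, _ in labeled]
actual = [label for _, label in labeled]

tp = sum(p and a for p, a in zip(preds, actual))
fp = sum(p and not a for p, a in zip(preds, actual))
fn = sum(not p and a for p, a in zip(preds, actual))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```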
Additional Information
Originally crafted by Tatsunori Hashimoto, the project is available under the Apache 2.0 license, with acknowledgments to Chenglei Si for testing contributions.