arxiv-translator - Korean Translations of ArXiv Papers using Expert OCR Solutions

Arxiv Translation Project

Introduction to the Arxiv Translation Project

The Arxiv Translation Project publishes Korean-translated web pages of arXiv papers. It is aimed at Korean researchers and scholars who find it hard to keep up with the rapidly growing corpus of arXiv papers, and it gives them a faster, more accessible way to review the findings and discussions in those papers.

Objectives and Motivation

The main objective of the project is to help Korean-speaking scholars keep pace with the latest research. The initial idea was to translate Ar5iv pages, but Ar5iv’s updates lag substantially: changes appear roughly a month after a paper is first published. Ar5iv also converts only the first version of a paper to HTML and ignores later versions. These limitations led to the decision to build a custom solution for content extraction and translation. The translations are a convenience; for accuracy and a thorough understanding, consulting the original papers is still advised.

Technical Approach

To handle the variety of PDF layouts, the project uses the Nougat OCR library for text extraction. Nougat converts complex PDF documents into text that can then be translated and formatted for web presentation. Because document formatting varies widely, extraction is not always clean.
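The extract-then-translate flow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the Nougat command-line form `nougat <pdf> -o <dir>` and that the output file is named after the PDF with an `.mmd` suffix, and the `translate` callable is a hypothetical stand-in for whatever translation backend the project uses.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def build_nougat_command(pdf_path: Path, out_dir: Path) -> List[str]:
    """Build the Nougat CLI call that writes extracted markup into out_dir."""
    # Assumption: the CLI is invoked as `nougat <pdf> -o <dir>`.
    return ["nougat", str(pdf_path), "-o", str(out_dir)]


def extract_text(pdf_path: Path, out_dir: Path) -> str:
    """Run Nougat on one PDF and return the extracted markup as a string."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_nougat_command(pdf_path, out_dir), check=True)
    # Assumption: the output is named after the PDF, with an .mmd suffix.
    return (out_dir / f"{pdf_path.stem}.mmd").read_text(encoding="utf-8")


def translate_document(text: str, translate: Callable[[str], str]) -> str:
    """Translate paragraph by paragraph, keeping the blank-line structure."""
    paragraphs = text.split("\n\n")
    # Empty paragraphs are passed through so the layout is preserved.
    return "\n\n".join(translate(p) if p.strip() else p for p in paragraphs)
```

Translating one paragraph at a time keeps each request small and means a single badly extracted paragraph does not affect the rest of the page.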

Paper List and Access

The project hosts a growing list of translated papers across various topics, which can be accessed directly from the project’s repository. Each entry gives the arXiv ID, the paper title, and links to both the original arXiv page and the translated page hosted in the repository.

Here are a few notable entries:

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively (arXiv ID: 2404.19705v2)
RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing (arXiv ID: 2404.19543)
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv ID: 2404.14219v1)
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey (arXiv ID: 2404.11584v1)

The project documentation encourages users to open these links in a new window for easier navigation and a better reading experience.
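The fields of a list entry (ID, title, original link, translated link) can be modeled as in the sketch below. It is illustrative only: the `PaperEntry` class and the `translated/<id>.html` path are hypothetical and do not reflect the repository's real layout. The `https://arxiv.org/abs/<id>` form is the standard arXiv abstract URL, and the optional `vN` suffix identifies a specific version of a paper.

```python
import re
from dataclasses import dataclass
from typing import Optional

# New-style arXiv identifier: YYMM.NNNNN with an optional version suffix.
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")


@dataclass
class PaperEntry:
    arxiv_id: str
    title: str

    @property
    def version(self) -> Optional[int]:
        """Return the version number, or None when the ID carries no suffix."""
        match = ARXIV_ID.match(self.arxiv_id)
        if match is None:
            raise ValueError(f"not a new-style arXiv ID: {self.arxiv_id!r}")
        return int(match.group(2)) if match.group(2) else None

    @property
    def original_url(self) -> str:
        """Link to the abstract page on arXiv."""
        return f"https://arxiv.org/abs/{self.arxiv_id}"

    @property
    def translated_path(self) -> str:
        """Hypothetical location of the translated page in the repository."""
        return f"translated/{self.arxiv_id}.html"
```

Keeping the version in the ID matters here because, unlike Ar5iv, the project is meant to track versions beyond the first.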

Conclusion and Recommendations

The Arxiv Translation Project makes current research more accessible to non-English-speaking scholars, Korean readers in particular. The translations are a convenient reference point; for an in-depth and precise understanding, users should refer to the original publications.