financial-datasets - Create Financial Question and Answer Sets with Python and Large Language Models

Financial Datasets 🧪

Overview

Financial Datasets is an open-source Python library for creating financial question-and-answer datasets with Large Language Models (LLMs). It generates realistic datasets from financial documents such as 10-K and 10-Q filings, PDFs, and other financial texts.

Usage

The library turns financial text into datasets of question-and-answer pairs. The examples below show three ways to generate one:

Example #1 - Generate from Any Text

If you already have a list of texts, this option offers the most flexibility. Pass the texts to the generator to build a customized dataset:

```
from financial_datasets.generator import DatasetGenerator

# Your list of texts
texts = ...

# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")

# Generate dataset from texts
dataset = generator.generate_from_texts(
    texts=texts,
    max_questions=100,
)
```
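One way to prepare the `texts` list is sketched below, assuming the source material is a single long string that you want to split on blank lines into smaller passages. The helper `split_into_chunks` and its `max_chars` parameter are hypothetical names used for illustration; they are not part of the library.

```python
def split_into_chunks(document: str, max_chars: int = 2000) -> list[str]:
    """Group the paragraphs of `document` into chunks of at most `max_chars` characters.

    Hypothetical helper, not part of financial-datasets. A single paragraph
    longer than `max_chars` is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs left by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The returned list can then be passed as `texts=split_into_chunks(document)` to `generate_from_texts`. Paragraph-based splitting is only one option; a suitable chunk size depends on the model and the documents.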

Example #2 - Generate from PDF

If you have a PDF document, you can generate a dataset directly from its URL:

```
from financial_datasets.generator import DatasetGenerator

# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")

# Generate dataset from PDF url
dataset = generator.generate_from_pdf(
    url="https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    max_questions=100,
)
```
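The examples pass the API key as a string literal. A minimal sketch of reading it from an environment variable instead is shown below, so the key stays out of source control. The helper `load_api_key` is a hypothetical name, and the `OPENAI_API_KEY` variable name is an assumption based on the usual OpenAI convention; whether the library reads that variable on its own is not stated here.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    Hypothetical helper, not part of financial-datasets.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the generator."
        )
    return key
```

The result can be passed straight to the constructor shown above, for example `DatasetGenerator(model="gpt-4-turbo", api_key=load_api_key())`.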

Example #3 - Generate from 10-K

To work from a specific company’s filings, generate a dataset from its ticker symbol and a year. The generator uses the company’s 10-K report for that year, and the optional item_names argument restricts it to the listed Items:

```
from financial_datasets.generator import DatasetGenerator

# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")

# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
    item_names=["Item 1", "Item 7"],  # optional - specify Item names to use
)
```
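To cover several companies or years, the call above can be wrapped in a loop. The sketch below assumes only the `generate_from_10K` keyword arguments shown in the example; the wrapper `generate_for_filings` is a hypothetical name, and it accepts any object that exposes that method.

```python
from typing import Any, Iterable, Optional


def generate_for_filings(
    generator: Any,
    tickers: Iterable[str],
    years: Iterable[int],
    max_questions: int = 100,
    item_names: Optional[list[str]] = None,
) -> dict[tuple[str, int], Any]:
    """Call generator.generate_from_10K once per (ticker, year) pair.

    Hypothetical wrapper, not part of financial-datasets. Returns a dict
    mapping (ticker, year) to whatever generate_from_10K returned.
    """
    years = list(years)  # materialize so the years can be reused for every ticker
    results: dict[tuple[str, int], Any] = {}
    for ticker in tickers:
        for year in years:
            kwargs: dict[str, Any] = {
                "ticker": ticker,
                "year": year,
                "max_questions": max_questions,
            }
            if item_names is not None:
                kwargs["item_names"] = item_names  # optional, as in the example above
            results[(ticker, year)] = generator.generate_from_10K(**kwargs)
    return results
```

For example, `generate_for_filings(generator, ["AAPL", "MSFT"], [2022, 2023])` would issue four calls. Each call goes to the LLM, so cost and run time grow with the number of pairs.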

Installation

There are multiple ways to install the Financial Datasets library:

Using pip

The simplest way to install the library is with pip:

```
pip install financial-datasets
```

Using Poetry

For those who prefer Poetry for managing dependencies, the library can be added using:

```
poetry add financial-datasets
```

From the Repository

To install directly from the repository, follow these steps:

Clone the repository:

```
git clone https://github.com/virattt/financial-datasets.git
```

Navigate to the project directory:
```
cd financial-datasets
```
Install dependencies using Poetry:
```
poetry install
```

Now the library is ready to use in your Python projects.
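A quick way to confirm the installation from Python is sketched below. It uses only the standard library and assumes the distribution name matches the `financial-datasets` name used in the pip command; the helper `installed_version` is a hypothetical name.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "financial-datasets") -> Optional[str]:
    """Return the installed version of `distribution`, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version()` in the environment where you ran the install should return a version string; `None` suggests the package went into a different environment.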

Contributing

Contributions from the community are welcome. If you find a problem or have a suggestion for improvement, please open an issue or submit a pull request.

License

The Financial Datasets project is licensed under the MIT License.

Contributors

For a list of contributors and their contributions, visit the project's contributors page on GitHub.

With these tools and instructions, users can effectively create insightful financial datasets tailored to their needs.