Financial Datasets 🧪
Overview
Financial Datasets is an open-source Python library that uses Large Language Models (LLMs) to generate question-and-answer datasets from financial documents, including 10-K and 10-Q filings, PDFs, and other financial texts.
Usage
Financial Datasets produces question-and-answer pairs from financial text. The examples below show three ways to generate a dataset: from a list of texts, from a PDF URL, and from a 10-K filing.
Example #1 - Generate from Any Text
If you already have a list of texts, this option offers the most flexibility: pass the texts to the generator and it builds a dataset from them.
from financial_datasets.generator import DatasetGenerator
# Your list of texts
texts = ...
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")
# Generate dataset from texts
dataset = generator.generate_from_texts(
    texts=texts,
    max_questions=100,
)
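If the source material is one long document rather than a ready-made list, one way to build the texts list is to split the document into paragraph-based chunks first. The sketch below is only an illustration and assumes texts is a list of strings: split_into_chunks and its max_chars limit are hypothetical helpers, not part of Financial Datasets, and the sample text is invented for the demo.

```python
from __future__ import annotations


def split_into_chunks(document: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    Hypothetical helper, not part of Financial Datasets. A single paragraph
    longer than max_chars is kept whole as its own chunk.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding the paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Invented sample text, used only to demonstrate the helper.
sample = (
    "Revenue grew 8% year over year.\n\n"
    "Operating margin was 30%.\n\n"
    "Cash totaled $29 billion."
)
texts = split_into_chunks(sample, max_chars=60)
```

The resulting texts list can then be passed to generate_from_texts as in the example above. Chunking by paragraph keeps related sentences together, which is a design choice of this sketch rather than a requirement of the library.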
Example #2 - Generate from PDF
If the source is a PDF, you can generate a dataset directly from the PDF's URL:
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")
# Generate dataset from PDF url
dataset = generator.generate_from_pdf(
    url="https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    max_questions=100,
)
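Because the PDF example takes a URL, it can be worth sanity-checking the URL before spending API calls on it. The following is a minimal sketch using only the standard library; looks_like_pdf_url is a hypothetical helper, not part of Financial Datasets, and the check is a heuristic (some servers serve PDFs from URLs that do not end in .pdf).

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Return True if url is an http(s) URL whose path ends in .pdf.

    Hypothetical helper, not part of Financial Datasets. This inspects the
    URL text only; it does not download or validate the file.
    """
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pdf")
    )


pdf_url = "https://www.berkshirehathaway.com/letters/2023ltr.pdf"
is_pdf = looks_like_pdf_url(pdf_url)
```

If the check fails, the URL may still point to a valid PDF, so treat a False result as a prompt to look closer rather than a hard error.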
Example #3 - Generate from 10-K
To work from a specific company's annual report, pass the company's ticker symbol and the year of the 10-K filing. The optional item_names argument specifies which items of the filing to use:
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-turbo", api_key="your-openai-key")
# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
    item_names=["Item 1", "Item 7"],  # optional - specify Item names to use
)
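To cover several companies or years, the same keyword arguments can be prepared in bulk. The sketch below only builds the argument dictionaries and does not call the library; build_10k_jobs is a hypothetical helper, and the MSFT ticker and the year 2022 are illustrative values that do not come from this README.

```python
from itertools import product


def build_10k_jobs(tickers, years, max_questions=100, item_names=None):
    """Build one keyword-argument dict per (ticker, year) pair.

    Hypothetical helper, not part of Financial Datasets. The keys mirror the
    arguments used in the generate_from_10K example above.
    """
    jobs = []
    for ticker, year in product(tickers, years):
        job = {"ticker": ticker, "year": year, "max_questions": max_questions}
        if item_names is not None:
            # Copy the list so that jobs do not share one mutable object.
            job["item_names"] = list(item_names)
        jobs.append(job)
    return jobs


jobs = build_10k_jobs(
    tickers=["AAPL", "MSFT"],  # MSFT is an illustrative addition
    years=[2022, 2023],        # 2022 is an illustrative addition
    item_names=["Item 1", "Item 7"],
)

# Each entry could then be passed on as keyword arguments, for example:
# dataset = generator.generate_from_10K(**jobs[0])
```

Keeping the argument construction separate from the API calls makes it easy to review what will be requested before any LLM usage is incurred.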
Installation
There are multiple ways to install the Financial Datasets library:
Using pip
The simplest way to install the library is with pip:
pip install financial-datasets
Using Poetry
For those who prefer Poetry for managing dependencies, the library can be added using:
poetry add financial-datasets
From the Repository
To install directly from the repository, follow these steps:
- Clone the repository:
git clone https://github.com/virattt/financial-datasets.git
- Navigate to the project directory:
cd financial-datasets
- Install dependencies using Poetry:
poetry install
Now the library is ready to use in your Python projects.
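After any of the installation routes above, a quick way to confirm that the package is visible to the current Python environment is to query its installed version. This is a minimal sketch using the standard library's importlib.metadata; installed_version is a hypothetical helper name, and the distribution name is taken from the pip command above.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("financial-datasets")
if found is None:
    print("financial-datasets is not installed in this environment")
else:
    print(f"financial-datasets {found} is installed")
```

Run this with the same interpreter (or Poetry environment) that you plan to use for dataset generation, since each environment has its own set of installed packages.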
Contributing
Contributions from the community are welcome. If you find a bug or have a suggestion for improvement, please open an issue or submit a pull request.
License
The Financial Datasets project is licensed under the MIT License.
Contributors
For a list of contributors and their contributions, visit the project's contributors page on GitHub.
With these tools and instructions, users can effectively create insightful financial datasets tailored to their needs.