Yggdrasil Decision Forests: A Comprehensive Guide
Yggdrasil Decision Forests (YDF) is a library for training, evaluating, interpreting, and serving decision forest models: Random Forests, Gradient Boosted Decision Trees, Classification and Regression Trees (CART), and Isolation Forests. It is aimed at data scientists and machine learning practitioners who want one toolkit for the whole model lifecycle.
Installation
YDF is distributed through the Python Package Index (PyPI). To install it, or to upgrade an existing installation to the latest version, run the following command in your terminal:
pip install ydf -U
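To confirm that the installation succeeded, one option is to query the installed package metadata. The sketch below uses only the Python standard library; the helper name `installed_version` is illustrative and not part of YDF.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("ydf")
    if version is None:
        print("ydf is not installed; run: pip install ydf -U")
    else:
        print(f"ydf {version} is installed")
```

Looking up metadata rather than importing the package keeps the check cheap, since it does not load the library itself.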
Basic Usage
YDF is built with ease of use in mind. Here is a simple example to illustrate its capabilities:
import ydf
import pandas as pd
# Load dataset with Pandas
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset/"
train_ds = pd.read_csv(ds_path + "adult_train.csv")
test_ds = pd.read_csv(ds_path + "adult_test.csv")
# Train a Gradient Boosted Trees model
model = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)
# Look at a model (input features, training logs, structure, etc.)
model.describe()
# Evaluate a model (e.g. ROC, accuracy, confusion matrix, confidence intervals)
model.evaluate(test_ds)
# Generate predictions
model.predict(test_ds)
# Analyse a model (e.g. partial dependence plot, variable importance)
model.analyze(test_ds)
# Benchmark the inference speed of a model
model.benchmark(test_ds)
# Save the model
model.save("/tmp/my_model")
This example highlights the simplicity of loading datasets, training models, and performing evaluations and predictions.
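Predictions often need to be turned into class names before they are useful downstream. The sketch below assumes a binary classification label such as `income`, and assumes that `model.predict` returns one positive-class probability per example. The helper `to_labels`, the 0.5 threshold, and the sample probabilities are illustrative and not part of YDF.

```python
from typing import Iterable, List, Sequence


def to_labels(probabilities: Iterable[float],
              classes: Sequence[str],
              threshold: float = 0.5) -> List[str]:
    """Map positive-class probabilities to class names.

    Assumes `classes` holds exactly two entries, ordered [negative, positive].
    """
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    negative, positive = classes
    # A probability at or above the threshold is assigned the positive class.
    return [positive if p >= threshold else negative for p in probabilities]


if __name__ == "__main__":
    # Made-up probabilities standing in for the output of model.predict(test_ds).
    example = [0.08, 0.91, 0.5]
    print(to_labels(example, ["<=50K", ">50K"]))
```

With a trained model, the call might look like `to_labels(model.predict(test_ds), model.label_classes())`, assuming `label_classes()` returns the two class names in negative-then-positive order. It is worth checking that order against the model description before relying on it.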
Advanced Usage with C++ API
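For those who prefer or require C++, YDF also provides a C++ API. The snippet below is abridged: it omits the include directives and namespace qualifiers, and it trains a Random Forest classifier on a CSV file before saving the model to disk: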
auto dataset_path = "csv:train.csv";
// List columns in training dataset
DataSpecification spec;
CreateDataSpec(dataset_path, false, {}, &spec);
// Create a training configuration
TrainingConfig train_config;
train_config.set_learner("RANDOM_FOREST");
train_config.set_task(Task::CLASSIFICATION);
train_config.set_label("my_label");
// Train model
std::unique_ptr<AbstractLearner> learner;
GetLearner(train_config, &learner);
auto model = learner->Train(dataset_path, spec);
// Export model
SaveModel("my_model", model.get());
Documentation and Support
For users who wish to delve deeper, the official documentation provides extensive resources, including a getting started guide, detailed tutorials, and comprehensive API references.
Academic Contribution
If you use YDF in scientific research or publications, please cite the following paper:
@inproceedings{GBBSP23,
author = {Mathieu Guillame{-}Bert and
Sebastian Bruch and
Richard Stotz and
Jan Pfeifer},
title = {Yggdrasil Decision Forests: {A} Fast and Extensible Decision Forests
Library},
booktitle = {Proceedings of the 29th {ACM} {SIGKDD} Conference on Knowledge Discovery
and Data Mining, {KDD} 2023, Long Beach, CA, USA, August 6-10, 2023},
pages = {4068--4077},
year = {2023},
url = {https://doi.org/10.1145/3580305.3599933},
doi = {10.1145/3580305.3599933},
}
Community and Contribution
The YDF project welcomes contributions from the community. Anyone interested in contributing should start with the project's contribution guidelines.
Licensing
YDF is an open-source project released under the Apache License 2.0, which allows for both personal and commercial use.
In short, YDF covers the full decision forest workflow, from training and evaluation through analysis and serving, with both Python and C++ APIs. This makes it a practical choice for education, research, and production use alike.