Introduction to Safe Policy Optimization (SafePO)
Safe Policy Optimization (SafePO) is a benchmark and algorithm suite for Safe Reinforcement Learning (Safe RL) research. Developed by the PKU-Alignment team, SafePO provides reference implementations and standardized evaluation for safe RL algorithms, enabling researchers to study and address safety constraints in reinforcement learning environments.
Key Features of SafePO
Correctness and Reliability
To ensure SafePO’s reliability, the development team reviewed each implementation against its source. Every algorithm in the platform is written to follow the methodology of its original paper, preserving the original mathematical formulation, including details such as gradient flow. Implementations are also compared line by line with established open-source codebases to catch discrepancies and keep results trustworthy.
Extensibility
SafePO’s design is deliberately modular, making it straightforward to integrate new algorithms. By extending the provided base classes, developers can add new functionality and incorporate the distinctive features of different RL algorithms with minimal effort. For instance, implementing Proximal Policy Optimization (PPO) amounts to extending a policy-gradient base class and overriding specific elements such as the clipped surrogate loss, as sketched below.
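As an illustration, here is a minimal sketch of that extension pattern. The class and method names (PolicyGradientBase, compute_surrogate_loss) are illustrative, not SafePO's actual API.

```python
# A minimal sketch of the extension pattern described above. The names below
# are hypothetical and do not reflect SafePO's actual class hierarchy.
import torch


class PolicyGradientBase:
    """Hypothetical base class: subclasses override only the surrogate loss."""

    def compute_surrogate_loss(self, ratio, advantage):
        # Vanilla policy gradient: no clipping of the probability ratio.
        return -(ratio * advantage).mean()


class PPO(PolicyGradientBase):
    """PPO reuses the base machinery and only swaps in the clipped objective."""

    def __init__(self, clip_ratio: float = 0.2):
        self.clip_ratio = clip_ratio

    def compute_surrogate_loss(self, ratio, advantage):
        clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
        return -torch.min(ratio * advantage, clipped * advantage).mean()
```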
Logging and Visualization
SafePO provides extensive logging and visualization support, which is essential for monitoring training progress. Through integrations with TensorBoard and Weights & Biases (WandB), SafePO records more than 40 metrics per run, including quantities such as the KL divergence between policy updates and the variance of episode costs. These tools let researchers track parameters, analyze the training process, and compare and select models.
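As a rough illustration, a training loop might fan its metrics out to both backends as in the sketch below; the metric names and values are placeholders rather than SafePO's exact logging schema.

```python
# Illustrative only: write the same scalar metrics to TensorBoard and WandB.
from torch.utils.tensorboard import SummaryWriter
import wandb

writer = SummaryWriter(log_dir="runs/ppo_lag_example")
wandb.init(project="safepo-example", name="ppo_lag_example", mode="offline")

for epoch in range(100):
    metrics = {
        "Metrics/EpRet": float(epoch),  # placeholder values; a real trainer
        "Metrics/EpCost": 25.0,         # would compute these from rollouts
        "Loss/KL": 0.01,
    }
    for key, value in metrics.items():
        writer.add_scalar(key, value, epoch)
    wandb.log(metrics, step=epoch)

writer.close()
wandb.finish()
```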
Comprehensive Documentation
SafePO’s documentation includes installation guides, troubleshooting tips, and instructions for both beginners and advanced users, along with guidance on ethical and responsible use, so that researchers can make full use of the platform while following best practices.
Overview of Algorithms in SafePO
SafePO bundles a suite of Safe RL algorithms, allowing researchers to experiment across a spectrum of environments. The benchmark includes constrained methods such as PPO-Lag and TRPO-Lag, among others, alongside classic unconstrained baselines such as Policy Gradient (PG) and Natural Policy Gradient (NaturalPG).
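To illustrate what the Lagrangian variants add on top of their unconstrained counterparts, the sketch below shows the generic mechanism: a multiplier is raised while episode costs exceed a limit, and the policy is trained on a cost-penalized advantage. The names and hyperparameter values are illustrative, not SafePO defaults.

```python
# A minimal sketch of the Lagrangian mechanism shared by PPO-Lag and TRPO-Lag.
import torch

cost_limit = 25.0
lagrangian_multiplier = torch.nn.Parameter(torch.tensor(0.0))
lambda_optimizer = torch.optim.Adam([lagrangian_multiplier], lr=0.035)


def update_lagrange_multiplier(mean_episode_cost: float) -> None:
    """Gradient ascent on the dual: grow lambda while costs exceed the limit."""
    lambda_optimizer.zero_grad()
    loss = -lagrangian_multiplier * (mean_episode_cost - cost_limit)
    loss.backward()
    lambda_optimizer.step()
    lagrangian_multiplier.data.clamp_(min=0.0)  # lambda must stay non-negative


def penalized_advantage(adv_reward, adv_cost):
    """Combine reward and cost advantages, normalized by (1 + lambda)."""
    lam = lagrangian_multiplier.item()
    return (adv_reward - lam * adv_cost) / (1.0 + lam)
```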
Supported Environments
SafePO is built around Safety-Gymnasium and supports a diverse set of environments covering safe navigation, safe velocity constraints, multi-agent coordination, and specialized tasks in Safe Isaac Gym. These environments pose distinct safety challenges and are used to test and validate algorithm behavior under constraints.
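For reference, a minimal interaction loop with a Safety-Gymnasium task looks roughly like the following; the task name is just one example, and the key point is that the environment returns a cost signal alongside the usual reward.

```python
# Random-action rollout in a Safety-Gymnasium task, as consumed by SafePO's
# single-agent algorithms. Note the extra `cost` element returned by step().
import safety_gymnasium

env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()
    obs, reward, cost, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```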
Getting Started with SafePO
Pre-requisites and Conda Environment
Setting up SafePO requires installing dependencies such as Safety-Gymnasium, typically inside a dedicated conda environment. Users should also make sure their Python and CUDA versions are compatible before training for best performance.
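A quick sanity check along these lines can be run before training; the exact version requirements should be taken from the SafePO documentation rather than from this sketch.

```python
# Report the interpreter version and whether PyTorch can see a CUDA device.
import sys
import torch

print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```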
Running Benchmarks and Experiments
SafePO provides straightforward commands for benchmarking multiple algorithms concurrently. Users can explore outcomes across different tasks by running predefined scripts for both single-agent and multi-agent scenarios, which makes it easy to compare algorithm performance.
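As a rough sketch, several runs could be launched over tasks and seeds as shown below; the script path and command-line flags are assumptions about the repository layout rather than verified entry points.

```python
# Hypothetical launcher: sweep a training script over tasks and random seeds.
import itertools
import subprocess

tasks = ["SafetyPointGoal1-v0", "SafetyCarGoal1-v0"]
seeds = [0, 1, 2]

for task, seed in itertools.product(tasks, seeds):
    subprocess.run(
        [
            "python", "safepo/single_agent/ppo_lag.py",  # assumed script path
            "--task", task,
            "--seed", str(seed),
        ],
        check=True,
    )
```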
Evaluating and Visualizing Results
Once experiments are complete, built-in plotting and evaluation utilities let users summarize and compare algorithm performance across runs.
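A minimal plotting sketch might look like the following, assuming each run writes a CSV of per-epoch metrics; the file path and column names are placeholders for whatever the logger actually records.

```python
# Plot episode return and episode cost over training from a hypothetical CSV.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("runs/ppo_lag_example/progress.csv")  # placeholder path

fig, (ax_ret, ax_cost) = plt.subplots(1, 2, figsize=(10, 4))
ax_ret.plot(df["epoch"], df["ep_return"])
ax_ret.set(xlabel="epoch", ylabel="episode return", title="Return")
ax_cost.plot(df["epoch"], df["ep_cost"])
ax_cost.axhline(25.0, linestyle="--", label="cost limit (example)")
ax_cost.set(xlabel="epoch", ylabel="episode cost", title="Cost")
ax_cost.legend()
fig.tight_layout()
plt.show()
```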
Ethical and Responsible Use
SafePO is released under the Apache-2.0 license. The project is committed to advancing safety in machine learning and urges users to adhere to legal and ethical standards in their research and applications.
Acknowledgments
SafePO is developed by the PKU-Alignment team at Peking University, drawing on prior open-source work such as Spinning Up, Bullet-Safety-Gym, and Safety-Gym. These contributions reflect a collective effort toward advancing safe AI through open-source projects.
In summary, SafePO consolidates a broad set of safe RL algorithm implementations into a single platform that prioritizes correctness, extensibility, and usability, paving the way toward safer AI systems.