
hh-rlhf

Improve model safety using human feedback and red teaming datasets

Product Description

This repository provides datasets for AI safety research: human preference data on the helpfulness and harmlessness of language model outputs, and red teaming data collected to study and reduce harmful model behavior. The data is distributed as JSONL files containing paired preference texts and records of adversarial red-team interactions, intended to inform safer model training methods based on human feedback. The datasets serve researchers studying model behavior and AI ethics, and cover sensitive topics such as discriminatory language and self-harm. Both datasets are drawn from published studies and are intended to support research into AI safety and model performance.
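For orientation, here is a minimal sketch of how one might inspect a preference-data file, assuming the repository's JSONL splits have been downloaded locally. The file path below is illustrative, and the `chosen`/`rejected` field names refer to the paired preference texts described above.

```python
import gzip
import json

# Illustrative path; adjust to wherever the downloaded split lives on disk.
path = "helpful-base/train.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Each preference record pairs a preferred ("chosen") and a
        # dispreferred ("rejected") conversation transcript.
        print("CHOSEN:", record["chosen"][:200])
        print("REJECTED:", record["rejected"][:200])
        if i >= 2:  # inspect just the first few examples
            break
```

A reward model for RLHF-style training would typically be fit to score the "chosen" transcript above the "rejected" one in each pair; the red teaming files follow a different record layout and should be inspected separately.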
Project Details