
Step-DPO

Optimize Large Language Model Reasoning with Step-wise Preference and Minimal Data

Product Description

Step-DPO improves reasoning in large language models through a step-wise preference optimization framework, trained on a compact dataset of roughly 10K step-wise preference pairs. With this limited data, it raises Qwen2-7B-Instruct's performance by 5.6% on MATH and 2.4% on GSM8K. Applied to Qwen2-72B-Instruct, the method reaches 70.8% on MATH and 94.0% on GSM8K, outperforming models such as GPT-4-1106. Aimed at researchers and developers, Step-DPO ships with a demo and detailed documentation to ease implementation and evaluation.
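To make the idea concrete, below is a minimal, hypothetical sketch of a DPO-style loss applied at the level of a single reasoning step rather than a whole response, which is the core of the step-wise preference setup described above. It assumes you already have summed token log-probabilities of the preferred (chosen) and dispreferred (rejected) next step under the policy and a frozen reference model; the function and variable names are illustrative and not taken from the Step-DPO codebase, and beta is just a placeholder hyperparameter.

```python
import torch
import torch.nn.functional as F


def stepwise_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss on a single reasoning step.

    Each tensor holds the summed token log-probabilities of the preferred
    (chosen) or dispreferred (rejected) next step, conditioned on the
    problem plus the shared prefix of earlier correct steps.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and dispreferred step.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()


# Toy usage with made-up log-probabilities for a batch of two examples.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -11.0])
ref_chosen = torch.tensor([-12.0, -10.1])
ref_rejected = torch.tensor([-13.5, -10.7])
print(stepwise_dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The only difference from standard DPO in this sketch is the unit of comparison: preferences are expressed between alternative next steps given a shared correct prefix, rather than between complete answers.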
Project Details