
SWE-bench

Assessing Language Models' Performance on GitHub Issue Resolution

Product Description

SWE-bench is a benchmark for evaluating language models' ability to resolve real-world GitHub issues. It provides a Docker-based, containerized evaluation environment, so assessments are reproducible across different systems. A recent addition is SWE-bench Verified, a subset of 500 problems confirmed as solvable by professional software engineers, developed in collaboration with OpenAI. The project's resources support model training, inference, and task creation, serving NLP and machine-learning applications in software engineering.
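As a rough illustration of what a benchmark instance looks like, the sketch below loads the Verified subset and inspects one task. It assumes the Hugging Face datasets library and the princeton-nlp/SWE-bench_Verified dataset id; field names such as instance_id and problem_statement reflect the public dataset schema and should be checked against the current release.

```python
# Minimal sketch: inspecting SWE-bench Verified tasks.
# Assumes: `pip install datasets` and the public dataset id below.
from datasets import load_dataset

# The Verified subset holds 500 engineer-confirmed instances in its "test" split.
dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Each instance pairs a repository snapshot with a real GitHub issue;
# a model is asked to produce a patch that makes the hidden tests pass.
example = dataset[0]
print(example["instance_id"])               # identifier tying the task to its repo and issue
print(example["problem_statement"][:500])   # the issue text shown to the model
```

Actual evaluation of model-generated patches is handled by the project's Docker-based harness rather than by this snippet, which only shows how the task data is organized.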
Project Details