SWE-bench
SWE-bench is a benchmark for evaluating language models on real-world software engineering tasks: each instance asks a model to resolve an actual GitHub issue from a popular open-source Python repository. Evaluation runs inside Docker containers, making assessments repeatable across systems. A recent addition is SWE-bench Verified, a human-validated subset of 500 problems confirmed solvable by software engineers, released in collaboration with OpenAI. The project's resources support model training, inference, and the creation of new task instances, serving NLP and machine learning research in software engineering.
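As a quick illustration, the benchmark data can be loaded through the Hugging Face `datasets` library. This is a minimal sketch, assuming the published dataset ID `princeton-nlp/SWE-bench_Verified` and its standard field names; check the repository for the current schema.

```python
# Minimal sketch: load SWE-bench Verified and inspect one task instance.
# Requires: pip install datasets
from datasets import load_dataset

# SWE-bench Verified ships as a single "test" split of 500 instances.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["instance_id"])              # unique task identifier
print(example["repo"])                     # source GitHub repository
print(example["problem_statement"][:300])  # the issue text given to the model
```

Model-generated patches are then scored with the Docker-based evaluation harness; in recent versions of the repository the entry point is `python -m swebench.harness.run_evaluation`, which takes the dataset name and a predictions file as arguments.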