
SWE-bench Leaderboards
SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or …
SWE-bench
SWE-bench tests AI systems' ability to solve GitHub issues. We collect 2,294 task instances by crawling Pull Requests and Issues …
SWE-bench Multimodal
Citation: If you use SWE-bench Multimodal in your research, please cite our paper:
SWE-bench Citations
Original SWE-bench Paper: @inproceedings{jimenez2024swebench, title={{SWE}-bench: Can Language Models Resolve Real …
SWE-bench Verified
SWE-bench Verified is a human-filtered subset of 500 instances from SWE-bench, …
SWE-bench Lite
Repository Distribution: SWE-bench Lite distribution across repositories. Compare to the full SWE-bench in Figure 3 of the SWE …
Overview - SWE-bench
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a …
Datasets - SWE-bench
Note that for the test split of the multimodal dataset, the patch, test_patch, and image_assets fields will be empty. Paper's Retrieval …
SWE-bench Multilingual
Support multiple languages. As the SWE-bench Multimodal paper notes, many open-source agent frameworks hardcode Python …
Submit to SWE-bench
Evaluating on SWE-bench: Check out the main SWE-bench repository docs for instructions on how to generate and evaluate …