  1. SWE-bench Multilingual

    SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or …

  2. SWE-bench Leaderboards

    SWE-bench Verified is a human-filtered subset of 500 instances; use the Agent dropdown to compare LMs with mini-SWE-agent or …

  3. SWE-bench

    SWE-bench tests AI systems' ability to solve GitHub issues. We collect 2,294 task instances by crawling Pull Requests and Issues …

  4. SWE-bench Multimodal

    Citation: If you use SWE-bench Multimodal in your research, please cite our paper:

  5. SWE-bench Citations

    Original SWE-bench Paper @inproceedings{ jimenez2024swebench, title={{SWE}-bench: Can Language Models Resolve Real …

  6. SWE-bench Verified

    Overview: SWE-bench Verified is a human-filtered subset of 500 instances from SWE-bench, …

  7. SWE-bench Lite

    Repository Distribution: SWE-bench Lite distribution across repositories. Compare to the full SWE-bench in Figure 3 of the SWE …

  8. Overview - SWE-bench

    SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a …

  9. Datasets - SWE-bench

    Note that for the test split of the multimodal dataset, the patch, test_patch, and image_assets fields will be empty. Paper's Retrieval …

  10. SWE-bench Multilingual

    Support multiple languages. As the SWE-bench Multimodal paper notes, many open-source agent frameworks hardcode Python …

  11. Submit to SWE-bench

    Evaluating on SWE-bench: Check out the main SWE-bench repository docs for instructions on how to generate and evaluate …