Coding Language Performance Bench

25d

MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost

M3 demonstrates that the next phase of agent development will not just be driven by larger datasets, but by efficient ...

VentureBeat

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

InfoQ

Code Arena Launches as a New Benchmark for Real-World AI Coding Performance

GIGAZINE

DeepSWE is a benchmark that prevents cheating by coding AI and allows more accurate measurement of programming performance

In recent years, it has become common for developers to use coding AI in software development, and various benchmarks exist to measure the performance of coding AI. Now, a new benchmark called ...

InfoQ

Claude Sonnet 4.5 Tops SWE-Bench Verified, Extends Coding Focus beyond 30 Hours

Anthropic has released Claude Sonnet 4.5, its most advanced coding model to date, featuring major improvements in agentic tasks, long-horizon task performance, and computer use capabilities. The ...

eWeek

Gemini Beats Claude, GPT in Google’s First Android AI Coding Benchmark

Inc

The Winners (and Losers) of This New Vibe-Coding Benchmark Will Surprise You

In a new benchmark named Vibe Code Bench, OpenAI’s GPT-5.1 achieved the highest level of accuracy in completing a series of software engineering tasks, narrowly beating rival Anthropic’s Claude 4.5 ...

Morning Overview on MSN

Microsoft’s new MAI-Code tool turns plain-English descriptions into working app code

Microsoft has introduced MAI-Code, a tool designed to convert plain-English descriptions into functional application code.

Morning Overview on MSN

Microsoft built its own coding AI to lean less on OpenAI and cut costs for developers

Developers using GitHub Copilot now have access to a coding model built entirely by Microsoft, designed to handle lightweight ...

Geeky Gadgets

How good is ChatGPT-o1-Preview at Coding?

OpenAI’s latest large language model has been specifically designed for reasoning and is capable of generating code to a much higher standard than previous models. The ChatGPT-o1-Preview model ...
