M3 demonstrates that the next phase of agent development will not just be driven by larger datasets, but by efficient ...
As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...
In recent years, it has become common for developers to use coding AI in software development, and various benchmarks exist to measure the performance of coding AI. Now, a new benchmark called ...
Anthropic has released Claude Sonnet 4.5, its most advanced coding model to date, featuring major improvements in agentic tasks, long-horizon task performance, and computer use capabilities. The ...
AI thrives on data, but feeding it the right data is harder than it seems. As enterprises scale their AI initiatives, they face the challenge of managing diverse data pipelines, ensuring proximity to ...
In a new benchmark named Vibe Code Bench, OpenAI’s GPT-5.1 achieved the highest level of accuracy in completing a series of software engineering tasks, narrowly beating rival Anthropic’s Claude 4.5 ...
Microsoft has introduced MAI-Code, a tool designed to convert plain-English descriptions into functional application code.
Developers using GitHub Copilot now have access to a coding model built entirely by Microsoft, designed to handle lightweight ...
OpenAI’s latest large language model has been specifically designed for reasoning and is capable of generating code to a much higher standard than previous models. The ChatGPT-o1-Preview model ...