A new analysis questions whether large language models (LLMs) are genuinely improving at generating mergeable code, despite posting ever-higher scores on popular coding benchmarks such as SWE-bench. The study suggests that passing benchmark tests may not translate into code that would be accepted in real-world software development workflows.

The research points to a growing disconnect between benchmark performance and practical utility in AI-assisted coding. While models demonstrate increasing proficiency at solving isolated coding problems, the quality and maintainability of their generated code may still fall short of what human reviewers would approve for production systems.