Comparing Six LLMs for Real‑World Code Fixes – GPT‑5, Claude Sonnet, Grok and More
Introduction
A recent benchmark from the Kilo Code blog put six leading large language models (LLMs) through three realistic coding challenges. The goal was simple: see which models could spot security‑critical bugs, propose production‑ready fixes, and do so cost‑effectively. The models evaluated were GPT‑5, OpenAI o1, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.
The results highlight a clear trade‑off between raw technical depth and practical maintainability. While every model identified the vulnerabilities, the quality, completeness, and cost of the remedies varied dramatically. Below is a detailed walkthrough of the methodology, the three test cases, and actionable recommendations for engineers choosing an LLM for code review or automated PR checks.
Test Methodology
Kilo Code constructed a consistent test harness to ensure a fair comparison:
- Input: Small, risky code snippets (10‑50 lines) were fed to each model with the identical prompt: “Fix this. No hints, no leading questions.”
- Phase 1 – AI Judging: An automated rubric scored each response on correctness, code quality, completeness, safety‑oriented practices, and performance.
- Phase 2 – Human Validation: Engineers reviewed the AI‑ranked fixes and selected the versions they would actually merge.
This two‑step approach combined objective metrics with real‑world developer judgment, providing a pragmatic view of each model’s usefulness in production pipelines.
Scenario 1 – Node.js Config Merge Vulnerability
Problem: A deep‑merge function inadvertently propagates a malicious `admin` flag from a crafted payload through the prototype chain (classic prototype pollution), mirroring well‑known OWASP patterns.
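The benchmark didn't publish the exact snippet, but the vulnerable pattern looks roughly like this (the function name and payload are illustrative, not from the benchmark):

```typescript
// Illustrative only: a naive recursive merge with no key filtering.
function deepMerge(target: any, source: any): any {
  for (const key in source) {
    if (typeof source[key] === "object" && source[key] !== null) {
      target[key] = deepMerge(target[key] ?? {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// A crafted payload reaches Object.prototype via the "__proto__" key,
// so every plain object in the process suddenly carries `admin: true`.
const payload = JSON.parse('{"__proto__": {"admin": true}}');
deepMerge({}, payload);
console.log(({} as any).admin); // true: the prototype is polluted
```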
Model Outcomes:
- GPT‑5: Implemented layered safeguards: null‑prototype base objects, explicit blocking of risky keys, `hasOwnProperty` checks, and freezing of sensitive objects. The fix was thorough and ready for production (a sketch of these safeguards follows the list).
- OpenAI o1: Delivered clean helper functions, a concise list of prohibited keys, and readable comments. The solution was easy to audit within minutes.
- Claude Sonnet 4.5: Used `Object.create(null)` and key blocking, offering solid protection but slightly less depth than GPT‑5.
- Gemini 2.5 Pro: Applied key filtering and null prototypes but missed some recursive edge cases.
- Claude Opus 4.1: Relied on schemas and type checks, which were effective but added maintenance overhead.
- Grok 4: Focused on simple filtering and omitted `hasOwnProperty` validation, resulting in a weaker fix.
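Since the winning diffs weren't published verbatim, here is a minimal sketch of the layered techniques the top answers share: a deny‑list of dangerous keys, own‑property iteration, and null‑prototype targets. Names like `safeMerge` are illustrative:

```typescript
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function safeMerge(
  target: Record<string, unknown>,
  source: Record<string, unknown>
): Record<string, unknown> {
  // Object.keys yields own enumerable keys only, so inherited
  // properties from a poisoned prototype are never copied.
  for (const key of Object.keys(source)) {
    if (BLOCKED_KEYS.has(key)) continue; // drop prototype-chain escape hatches
    const value = source[key];
    if (isPlainObject(value)) {
      const existing =
        Object.prototype.hasOwnProperty.call(target, key) &&
        isPlainObject(target[key])
          ? (target[key] as Record<string, unknown>)
          : Object.create(null); // null-prototype base: nothing to pollute
      target[key] = safeMerge(existing, value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// The earlier payload is now inert:
safeMerge(Object.create(null), JSON.parse('{"__proto__": {"admin": true}}'));
console.log(({} as any).admin); // undefined
```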
Takeaway: All models caught the flaw, but only GPT‑5 and OpenAI o1 produced fixes that felt production‑ready without excessive complexity.
Scenario 2 – Modern Agent Workflow (2025 Style)
Problem: An AI‑driven agent fetches a web page, interprets its content, and proposes tool calls to a cloud‑management API. Without strict boundaries, instructions embedded in the fetched page can steer the agent (prompt injection), leading to cross‑tenant token leakage and unauthorized changes.
Model Outcomes:
- GPT‑5: Introduced narrow tool scopes, two‑step confirmation rules, strict trust boundaries (credentials never appear in model text), provenance checks on fetched HTML, and role‑based, short‑lived tokens (see the gating sketch after this list).
- OpenAI o1: Matched GPT‑5’s depth, adding shadow‑tenant RBAC analysis, response schema validation, and a configuration that completely removes file‑system access.
- Claude Sonnet 4.5: Covered trust boundaries and provenance tracking but lacked the granular implementation details of GPT‑5.
- Gemini 2.5 Pro: Scoped tools and used schema checks; gating was present but lighter than the top performers.
- Claude Opus 4.1: Employed Zod validation and DOMPurify, providing clear diagrams but fewer layered defenses.
- Grok 4: Referenced the OWASP Top 10 and NIST guidelines with allow‑lists; gating logic remained simple.
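The agent code itself wasn't published, so the following is a minimal sketch of the gating pattern the strongest answers converged on: an explicit tool allow‑list, provenance tracking on anything derived from fetched content, and human confirmation for destructive calls. All type and function names here are hypothetical:

```typescript
// Hypothetical tool inventory for a cloud-management agent.
type ToolName = "listInstances" | "resizeInstance" | "deleteInstance";

interface ToolCall {
  tool: ToolName;
  args: Record<string, string>;
  // Provenance flag: set if untrusted web content was in the model's
  // context when this call was proposed.
  taintedByFetchedContent: boolean;
}

const ALLOWED_TOOLS = new Set<ToolName>(["listInstances", "resizeInstance"]);
const DESTRUCTIVE_TOOLS = new Set<ToolName>(["resizeInstance", "deleteInstance"]);

function gateToolCall(call: ToolCall, humanApproved: boolean): boolean {
  // 1. Narrow scope: anything outside the allow-list is rejected outright.
  if (!ALLOWED_TOOLS.has(call.tool)) return false;
  // 2. Provenance: calls influenced by fetched HTML may never mutate state.
  if (call.taintedByFetchedContent && DESTRUCTIVE_TOOLS.has(call.tool)) {
    return false;
  }
  // 3. Two-step confirmation: destructive calls need explicit human sign-off.
  if (DESTRUCTIVE_TOOLS.has(call.tool) && !humanApproved) return false;
  // Short-lived, role-scoped credentials are attached server-side after this
  // gate passes; tokens never appear in model-visible text.
  return true;
}
```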
Takeaway: For newer, complex patterns, deeper reasoning (as shown by GPT‑5 and OpenAI o1) outweighs simple pattern matching.
Scenario 3 – ImageMagick Command Injection
Problem: An Express API builds a shell command for ImageMagick using user‑supplied font and text. A malicious payload can inject shell operators (e.g., `; rm -rf /`), leading to arbitrary code execution.
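A minimal sketch of the vulnerable pattern (function and variable names are illustrative, not from the benchmark):

```typescript
import { exec } from "node:child_process";

// Illustrative only: user input interpolated straight into a shell string.
function renderCaption(font: string, text: string): void {
  // A text value like 'x"; rm -rf / #' is parsed by the shell, not ImageMagick.
  exec(`convert -font ${font} label:"${text}" out.png`, (err) => {
    if (err) console.error(err);
  });
}
```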
Model Outcomes:
- GPT‑5: Implemented a comprehensive defense: strict allow‑lists, absolute font paths, avoidance of special prefixes, execution via argument vectors (no shell), input via stdin, size/rate caps, and automatic temporary‑file cleanup (a sketch combining these elements follows the list).
- Claude Opus 4.1: Similar thoroughness with `spawn`, allow‑lists, size validation, control‑character filtering, and detailed demos for reviewers.
- Claude Sonnet 4.5: Used `execFile` with strong allow‑lists and rate limiting.
- OpenAI o1: Switched to `execFile` with concise font validation and text sanitization.
- Gemini 2.5 Pro: Adopted `spawn` with allow‑lists and clean validation.
- Grok 4: Explained shell‑parsing pitfalls (semicolon, pipe, ampersand, backticks) and moved to `spawn` with range validation.
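A sketch combining the shared elements of the strongest fixes: `execFile` with an argument vector (so no shell parses the input), an allow‑list mapping font names to absolute paths, a guard against ImageMagick's special `@` prefix, and size/control‑character checks. Paths and limits are illustrative:

```typescript
import { execFile } from "node:child_process";

// Allow-list maps user-facing font names to absolute paths on disk
// (paths are illustrative).
const FONTS: Record<string, string> = {
  dejavu: "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
  liberation: "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
};

function renderCaption(font: string, text: string): Promise<void> {
  const fontPath = FONTS[font];
  if (!fontPath) throw new Error("unknown font");
  // Cap size, reject control characters, and block ImageMagick's "@file"
  // prefix, which would read arbitrary files.
  if (text.length > 200 || /[\x00-\x1f]/.test(text) || text.startsWith("@")) {
    throw new Error("invalid text");
  }
  return new Promise((resolve, reject) => {
    // execFile passes an argument vector directly: shell operators such as
    // ;, |, & and backticks in `text` are inert bytes, not syntax.
    execFile(
      "convert",
      ["-font", fontPath, `label:${text}`, "/tmp/out.png"],
      (err) => (err ? reject(err) : resolve())
    );
  });
}
```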
Takeaway: The best solutions layered safe process execution with strict allow‑lists and rate limits, eliminating shell‑injection vectors.
Cost Analysis
Running all three scenarios across the six models cost approximately $181 in total. The ImageMagick case was the most expensive due to the length of the model outputs. The Node.js merge scenario was the cheapest, averaging $0.60 per evaluation (about $0.10 per model run).
Budget Recommendations:
- For bulk scanning where cost matters, Gemini 2.5 Pro or OpenAI o1 deliver 90‑95% of GPT‑5's quality at roughly 72% lower cost.
- For high‑risk domains (financial, health data, privileged APIs), the extra expense of GPT‑5 is justified by its maximalist guardrails.
- For general OWASP‑style reviews, Claude Sonnet 4.5 offers a strong balance of coverage and affordability.
Pragmatic Recommendations
- Critical Systems: Deploy GPT‑5. Its layered defenses and exhaustive fixes make it worth the premium.
- High‑Volume, Low‑Risk Scans: Choose Gemini 2.5 Pro or OpenAI o1 to achieve near‑top performance with a fraction of the cost.
- Middle Ground: Claude Sonnet 4.5 provides solid protection on familiar patterns while staying budget‑friendly.
- Maintainability Matters: The human reviewers favored OpenAI o1 because its fixes were concise, readable within 15 minutes, and still addressed the most complex scenarios.
The key insight is that the most technically complete solution isn't always the best long‑term choice. A slightly less comprehensive fix that is easy to understand and maintain can be more valuable in a fast‑moving development environment.
Conclusion
The Kilo Code benchmark demonstrates that modern LLMs have reached a level where all six models reliably detect security‑critical bugs. The differentiators now lie in how thorough the fixes are, the depth of layered guardrails, and the total cost of execution.
- GPT‑5 leads on technical depth and safety, ideal for mission‑critical code.
- OpenAI o1 strikes a pragmatic balance of readability, robustness, and cost.
- Gemini 2.5 Pro and Claude Sonnet 4.5 serve as capable workhorses for everyday code hygiene.
When integrating LLMs into your pull‑request workflow, match the model to the mission: prioritize maximal security for high‑impact services, and opt for cost‑effective models where speed and volume dominate.
By treating LLMs as assistive reviewers rather than oracle replacements, engineering teams can harness their strengths while mitigating maintenance overhead—delivering safer code at scale.