Gemini 3 Pro Dominates New Agentic Benchmarks, Surpasses Sonnet and GPT‑5.1 in Coding Tests


Introduction

The latest release of Gemini 3 Pro has quickly become a benchmark‑setter in the AI‑assisted coding arena. In a series of rigorous tests—ranging from classic coding challenges to complex agentic workflows—Gemini 3 Pro not only achieved perfect scores on established benchmarks but also outperformed leading rivals such as Claude Sonnet, Claude Opus, and GPT‑5.1 Codex. This article breaks down the new benchmark suite, the methodology behind the scores, and the practical implications for developers seeking high‑performance, cost‑effective AI assistance.


New Benchmark Suite

To evaluate Gemini 3 Pro beyond the traditional Kingbench 2.0, two additional benchmarks were introduced:

  • GDScript Bench – 60 questions focused on the open‑source Godot game engine’s native scripting language, GDScript. Each task is validated with unit tests and an LLM judge that assesses code quality.
  • Svelte Bench – Designed to measure the model’s ability to generate code for the Svelte framework, also scored via unit tests and an LLM judge.

Both benchmarks aim to expose weaknesses that many large language models (LLMs) exhibit when dealing with niche or domain‑specific languages.
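To make the unit‑test‑plus‑judge pipeline concrete, here is a minimal sketch of how a single benchmark task could be scored. The function names, the pytest invocation, the judge prompt, and the 0–10 scale are illustrative assumptions, not the actual harness behind GDScript Bench or Svelte Bench.

```python
# Illustrative scoring flow: a hard pass/fail gate from unit tests, followed by
# an LLM judge that rates code quality. All names and thresholds here are
# assumptions for the sketch, not the benchmarks' real implementation.
import subprocess


def run_unit_tests(solution_dir: str) -> bool:
    """Run the task's test suite against the generated code (pass/fail gate).

    pytest stands in for whatever runner a given task actually uses; the
    GDScript tasks would need Godot's own test tooling instead.
    """
    result = subprocess.run(["pytest", "-q"], cwd=solution_dir,
                            capture_output=True, text=True)
    return result.returncode == 0


def judge_quality(source_code: str, ask_llm) -> float:
    """Ask an LLM judge for a 0-10 rating of correctness and style."""
    prompt = ("Rate this solution from 0 to 10 for correctness, readability "
              "and idiomatic style. Reply with a single number.\n\n" + source_code)
    return float(ask_llm(prompt))


def score_task(solution_dir: str, source_code: str, ask_llm) -> float:
    """Combine both signals into a single 0-1 task score."""
    if not run_unit_tests(solution_dir):
        return 0.0  # failing the unit tests zeroes out the task
    return judge_quality(source_code, ask_llm) / 10.0
```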


Scoring Methodology and Intelligence Index

Each benchmark produces a raw score that is then combined into an Intelligence Index—a weighted average that emphasizes coding proficiency. The index also incorporates a price‑to‑performance analysis based on actual API usage costs.
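As a rough illustration of how such an index could be assembled, the sketch below computes a weighted average over per‑benchmark scores. The weights and example scores are invented for the sketch; the actual weighting is not disclosed here, so the output does not reproduce the published figures.

```python
# Hypothetical Intelligence Index: a weighted average of per-benchmark scores
# (all on a 0-100 scale). Weights and scores below are made up for the example.

def intelligence_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, normalised by the total weight."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Coding-heavy weighting (assumed) and placeholder scores for one model:
weights = {"Kingbench 2.0": 0.5, "GDScript Bench": 0.25, "Svelte Bench": 0.25}
scores = {"Kingbench 2.0": 90.0, "GDScript Bench": 20.0, "Svelte Bench": 80.0}
print(round(intelligence_index(scores, weights), 1))  # -> 70.0
```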

Model            Intelligence Index   Kingbench 2.0     GDScript Bench   Svelte Bench
Gemini 3 Pro     60.4                 100 % (perfect)   20.8             83.3
Claude Sonnet    37.5                 50 %              15.2             70.1
Claude Opus      34.9                 45 %              14.9             68.4
GPT‑5.1 Codex    31.3                 40 %              13.7             65.0

The price‑to‑performance chart showed that Gemini 3 Pro completed the entire suite for just $2.85, a figure that is notably lower than the cost incurred by Sonnet for comparable runs.
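For context on how a per‑suite cost figure like $2.85 comes about, the snippet below shows the basic arithmetic of converting API token usage into dollars. The token counts and per‑million‑token prices are placeholders for illustration, not the billing data behind the chart.

```python
# Back-of-the-envelope API cost estimate. All numbers below are placeholders
# chosen for illustration; they are not the real usage or pricing figures.
input_tokens = 1_200_000        # prompt tokens sent across the whole suite (assumed)
output_tokens = 350_000         # completion tokens generated across the suite (assumed)
price_in_per_million = 1.25     # USD per 1M input tokens (assumed)
price_out_per_million = 3.50    # USD per 1M output tokens (assumed)

cost = (input_tokens / 1_000_000 * price_in_per_million
        + output_tokens / 1_000_000 * price_out_per_million)
print(f"Estimated suite cost: ${cost:.2f}")  # prints the estimate for these placeholders
```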


Agentic Benchmarks with Kilo Code

Beyond static code generation, the evaluation also covered agentic tasks—scenarios where the model orchestrates a sequence of actions, such as building full applications from a description. All tests were performed using Kilo Code, a popular agentic framework that integrates directly with Gemini 3 Pro via the preview API.

Key Agentic Test Cases

  1. Movie Tracker App – Generated a functional homepage and inner pages. The output was concise and required minimal post‑processing.
  2. Godot FPS Game Extension – Added a step counter and health bar that responded to jumping actions. The model correctly exposed configuration settings for the step target.
  3. Go TUI Calculator – Produced a fully operational terminal UI calculator with accurate arithmetic and smooth navigation.
  4. Svelte Application – Delivered a working but less polished UI compared to Sonnet; nevertheless, the core functionality was intact.
  5. Open‑Code Challenge – On a challenge historically dominated by multi‑model agents such as CodeBuff, Gemini 3 Pro succeeded, handling SVG generation and UI aesthetics without the high cost.
  6. Nuxt App – Generated extensive code that failed to launch due to numerous runtime errors; this failure mirrored the performance of competing models.
  7. Tauri Image Tool – Implemented a robust interface for browsing, cropping, and annotating images, demonstrating strong generation capabilities.

Overall, Gemini 3 Pro achieved a 71.4 % success rate on the agentic leaderboard, breaking the 70 % threshold for the first time and surpassing the previously dominant CodeBuff system.


Availability and Integration

While Gemini 3 Pro is not yet accessible through the public Gemini CLI (both free and pro tiers are on a waitlist), developers can invoke the model via the API or through the Antigravity editor, which offers free access. The model’s integration with Kilo Code required only a simple configuration change to select the preview model.
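For API access, a minimal sketch with the google-genai Python SDK looks like the following. The model identifier gemini-3-pro-preview is an assumption based on the preview naming mentioned above; check the currently published model list before relying on it.

```python
# Minimal call to the Gemini API via the google-genai SDK (pip install google-genai).
# The model ID below is an assumption for the preview release; verify it against
# the current model list.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # replace with a real key, or configure it via environment variable

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Write a small GDScript node that counts the player's steps.",
)
print(response.text)
```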


Implications for Developers

  • Higher Productivity: Achieving perfect scores on classic benchmarks and strong results on agentic tasks suggests that Gemini 3 Pro can handle both isolated code generation and complex workflow orchestration.
  • Cost Efficiency: At under $3 for a full suite of tests, the model presents a compelling value proposition for teams that need scalable AI assistance without inflating budgets.
  • Domain Flexibility: Success on the GDScript and Svelte benchmarks indicates that Gemini 3 Pro can adapt to niche programming environments, a common pain point for many LLMs.
  • Room for Improvement: The Nuxt app failure and occasional hallucinations in longer agentic sequences highlight areas where prompt engineering or system‑level tuning could further enhance reliability.

Conclusion

The comprehensive testing regime demonstrates that Gemini 3 Pro has set a new standard for AI‑driven coding assistance. With perfect performance on Kingbench, the highest scores on the newly introduced GDScript and Svelte benchmarks, and a record‑breaking 71.4 % success rate on agentic tasks, the model outpaces established competitors in both capability and cost.

For developers and organizations looking to integrate AI into their development pipelines, Gemini 3 Pro offers a powerful blend of accuracy, versatility, and affordability—making it a strong candidate for next‑generation coding workflows.
