Gemini 3 Preview Reveals Strong Checkpoints, Pricing Hints and What to Expect
Introduction
Google’s next‑generation large language model, Gemini 3, appears to be on the brink of a public release. A brief listing for Gemini 3.0 Pro on Vertex AI—complete with a tentative “11‑2025” rollout date—suggests the model could drop any day. After weeks of testing a series of internal checkpoints, I’ve compiled a comprehensive recap of what the model can do, where it still falls short, and what the pricing landscape might look like.
The Road to Gemini 3: From AB Tests to Checkpoint Chaos
Early Hints in AI Studio
The first public clue arrived in Google’s AI Studio, where selecting Gemini 2.5 Pro occasionally returned a different checkpoint whose ID begins with 2HTT (referred to as 2HT below). Network logs identified it as Gemini 3.0 Pro. The checkpoint surfaced only once every 40‑50 prompts, but the results were striking:
- Accurate floor‑plan layouts with correctly placed doors and furniture
- An SVG panda eating a burger with proper composition
- A three.js Pokéball rendered with realistic lighting
- A Minecraft‑style scene that set a new benchmark for one‑shot 3D generation
- A butterfly simulation that, while slightly behind GPT‑5, still impressed
- Strong performance on riddles and “AIME‑style” math problems
These results pushed the model to the top of my internal leaderboard, delivering roughly a 25% improvement over Sonnet 4.5.
The “Middle” Checkpoint – ECPT
Google’s next checkpoint, labeled ECPT, felt noticeably nerfed. The output quality dipped across several dimensions:
- Floor‑plan designs lost coherence
- The SVG panda appeared disjointed
- Chess moves were sub‑optimal
- three.js lighting and the Minecraft scene became flat and laggy
Despite these regressions, the model still outperformed Sonnet on most math questions, suggesting the checkpoint was likely a quantized or lower‑reasoning variant intended for broader rollout testing.
The Bounce‑Back: X28 Checkpoint
Community speculation pointed to a new “Pro” checkpoint, later identified as X28. When re‑tested with the original 11‑question suite plus a few extras, X28 delivered a clear step up from 2HT:
- Floor plans became truly realistic, with functional doors, sensible layouts, and dynamic lighting controls.
- The SVG panda now actually ate the burger rather than merely posing.
- three.js Pokéball scenes featured richer backgrounds and refined polish.
- The Minecraft scene added rivers and cleaner illumination.
- The butterfly simulation included rocks, flowers, and fewer clipping artifacts.
- A Rust CLI for image conversion and a Blender script both produced professional‑grade results.
- A degrees‑of‑separation network demo rendered a clean UI without the usual “purple‑vibe” default.
- Tool‑calling via the RU human‑relay showed accurate first‑function selection.
Overall, X28 represented a 5‑10 % improvement over 2HT and a substantial leap over current Sonnet models.
Key Observations Across Checkpoints
- Thinking‑Variant Behavior – The strongest checkpoints exhibit a slower first token followed by steady output, indicating deeper internal deliberation.
- Consistency – High‑end checkpoints generate near‑deterministic results across repeated prompts, a major advantage for developers building reliable applications.
- Design Sensibility – The model selects fonts, spacing, and layout choices that feel handcrafted rather than generic.
- Tool‑Calling – Raw reasoning is solid, but reliable chaining of function calls remains the critical hinge for production agents.
- Nerfed Checkpoints – Likely serve safety, latency, and scaling tests; they are useful but not the breakthrough many hoped for.
Pricing Expectations
- Parity with Sonnet – If Google prices Gemini 3 Pro at a level comparable to Sonnet 4.5, the performance gains justify the cost.
- Premium Pricing – Higher rates would need to be offset by superior tool‑call reliability, higher throughput, and consistent quality over long sessions.
- Aggressive Pricing – A sub‑Sonnet price point could attract a large user base, especially given the now‑mature Gemini ecosystem (CLI, Jules, AI Studio generators).
How Gemini 3 Stacks Up Against Competitors
| Feature | Gemini 3 (strong checkpoints) | Sonnet 4.5 | GPT‑5 | Claude |
|---|---|---|---|---|
| Spatial reasoning & 3‑D one‑shots | ≥ Opus (top tier) | Good but less consistent | Competitive | Good |
| Math & physics‑style simulations | Competitive, sometimes edged by GPT‑5 | Strong | Strong | |
| Consistency across regenerations | High (especially X28/2HT) | Moderate | Moderate | Moderate |
| Tool‑calling reliability | Promising, needs more real‑world testing | Good | Good | Good |
If the public release mirrors the X28 or 2HT checkpoints, Gemini 3 could become the best mainstream model for developers. A launch resembling ECPT would still be an improvement over Sonnet, but not the generational leap many anticipate.
Practical Benchmarking Tips
- Avoid “web‑style” demos – Simple HTML/CSS outputs are easy for any frontier model and don’t reflect true capability.
- Stress 3‑D + Math – Use three.js scenes that require real calculations to expose differences.
- Measure Consistency – Test the same prompt multiple times; note latency to the first token and output stability (see the first sketch after this list).
- Evaluate Tool‑Calling Chains – Verify that the model can plan and execute multi‑step function calls, not just a single API hit (see the second sketch after this list).
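To make the consistency check concrete, here is a minimal Python sketch of the kind of repeat‑prompt harness I have in mind. `stream_model` is a placeholder, not part of any specific SDK: plug in whatever streaming client you use (the google‑genai SDK, an OpenAI‑compatible endpoint, or anything else that yields text chunks).

```python
# Minimal sketch of a repeat-prompt harness for latency and stability checks.
# `stream_model` is a placeholder callable that takes a prompt and yields text chunks.
import time
from difflib import SequenceMatcher
from typing import Callable, Iterable


def run_once(stream_model: Callable[[str], Iterable[str]], prompt: str):
    """Send one prompt, return (time_to_first_token, full_output)."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_model(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # latency until the first streamed chunk
        chunks.append(chunk)
    return ttft, "".join(chunks)


def consistency_report(stream_model, prompt: str, runs: int = 5):
    """Repeat the same prompt and report first-token latency plus pairwise output similarity."""
    results = [run_once(stream_model, prompt) for _ in range(runs)]
    latencies = [t for t, _ in results]
    outputs = [o for _, o in results]
    similarities = [
        SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        for i in range(runs)
        for j in range(i + 1, runs)
    ]
    print(f"time-to-first-token (s): min={min(latencies):.2f} max={max(latencies):.2f}")
    print(f"mean pairwise output similarity: {sum(similarities) / len(similarities):.2f}")
```

A slow first token with high pairwise similarity is roughly what the strongest checkpoints showed: deliberate reasoning up front, near‑deterministic output afterwards.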
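And here is a second sketch for scoring tool‑call chains. `get_tool_calls` is again a placeholder for however you extract function calls from the model (a native tool‑calling API, a human relay, or a parsed transcript); only the scoring logic is shown, and the booking‑flow chain in the comment is a hypothetical example.

```python
# Minimal sketch of a tool-call chain check against an expected sequence of function names.
# `get_tool_calls` is a placeholder: it should return the calls the model planned, in order.
from typing import Callable, List, Tuple

ToolCall = Tuple[str, dict]  # (function name, arguments)


def score_chain(
    get_tool_calls: Callable[[str], List[ToolCall]],
    prompt: str,
    expected: List[str],
) -> dict:
    """Compare the model's planned call sequence against an expected chain of function names."""
    calls = [name for name, _ in get_tool_calls(prompt)]
    first_ok = bool(calls) and calls[0] == expected[0]  # did it pick the right entry point?
    chain_ok = calls == expected                        # did it plan the full chain correctly?
    extra = max(0, len(calls) - len(expected))          # redundant or hallucinated calls
    return {
        "first_call_correct": first_ok,
        "full_chain_correct": chain_ok,
        "extra_calls": extra,
    }


# Hypothetical example: a booking flow should search, check availability, then reserve.
# expected = ["search_flights", "check_availability", "create_booking"]
```

Tracking first‑call accuracy separately from full‑chain accuracy is useful because, as noted above, first‑function selection already looks strong; it is the multi‑step chaining that still needs real‑world validation.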
Conclusion
From the early AB‑test checkpoint 2HT through the dip with ECPT and the strong rebound with X28, the evidence points to a cautiously optimistic outlook for Gemini 3. Should Google ship a model comparable to the X28/2HT checkpoints, developers will finally have a mainstream LLM that combines deep reasoning, design intuition, and reliable tool usage.
Even a nerfed release would still outpace Sonnet for many workflows, but the real breakthrough hinges on the final checkpoint Google chooses for the public preview. Once the model lands in Vertex AI, a full benchmark—including token economics, latency, and tool‑call success rates—will clarify the price‑to‑performance equation.
The future of AI‑driven development looks brighter than ever.