Cursor Composer and SWE‑1.5 Review – Why a $10B Company Released a Subpar Model



Introduction

The AI‑coding assistant market is heating up, and this week two heavyweight players—Cursor and Windsurf—unveiled new models, Cursor Composer and SWE‑1.5. Both claim ultra‑low latency for “agentic” coding, yet the underlying technology and performance raise serious questions. This article breaks down the models’ claimed capabilities, the testing methodology, and why the results may disappoint even the most forgiving users.


Background on the New Models

Cursor Composer

  • Marketed as a “frontier” model that is four times faster than comparable LLMs.
  • Designed for low‑latency, multi‑step coding tasks, with most turns completing in under 30 seconds.
  • Built on an undisclosed open‑weights foundation, allegedly based on a GLM‑4.6‑class model.
  • No public benchmark results have been released, making independent verification difficult.

SWE‑1.5 (Windsurf)

  • Promoted as the faster of the two, delivering up to 950 tokens per second on Cerebras hardware.
  • Trained on an undisclosed open‑source base with proprietary reinforcement‑learning data.
  • Positioned as a high‑throughput alternative for code generation.

Testing Methodology

The evaluation used the official CLI tools provided by each vendor:

  • Cursor Composer – accessed via the Cursor CLI (the editor UI only displayed the older Cheetah model).
  • SWE‑1.5 – accessed through the Windsurf editor.

Both models were tasked with a suite of representative coding challenges, ranging from simple calculators to more complex web‑app prototypes. Execution time, correctness, and error rates were recorded for each task.
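
To make the methodology concrete, below is a minimal sketch of the kind of harness this setup implies: each prompt is handed to a vendor CLI, and wall‑clock time plus completion status are recorded. The command names and flags are placeholders rather than the vendors' documented invocations, and correctness in the actual evaluation was judged manually by exercising the generated apps, so treat this purely as an illustration.

```python
import subprocess
import time

# Placeholder commands -- NOT the documented Cursor/Windsurf invocations.
# Substitute whatever CLI entry points you actually have access to.
AGENTS = {
    "cursor-composer": ["cursor-cli-placeholder", "--prompt"],
    "swe-1.5": ["windsurf-cli-placeholder", "--prompt"],
}

TASKS = [
    "Build a simple calculator web app with working arithmetic.",
    "Scaffold a Svelte app with a login screen and a minimal backend.",
]

def run_task(agent_cmd, prompt, timeout=300):
    """Run one coding task through a CLI agent; record latency and completion."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            agent_cmd + [prompt], capture_output=True, text=True, timeout=timeout
        )
        completed = result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        completed = False
    return {"seconds": round(time.monotonic() - start, 1), "completed": completed}

if __name__ == "__main__":
    for name, cmd in AGENTS.items():
        for task in TASKS:
            print(name, run_task(cmd, task))
```

A harness like this only captures latency and whether the agent finished; functional quality still has to be checked by running the generated apps, which is where both models fell short.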


Performance Overview

Cursor Composer

  • Movie‑tracker app – numerous UI errors; the discover view was broken.
  • Go TUI calculator – functioned correctly, showing the model can handle straightforward logic.
  • Godot game – failed to run; modern models such as GLM‑4.5 and MiniMax handle it easily.
  • OpenCode big task – did not complete.
  • Svelte app – only a login screen appeared; backend errors were pervasive.
  • Tauri Rust image‑cropper – non‑functional.
  • Overall ranking: 11th on the internal leaderboard, trailing behind models like Kilo, MiniMax, and GLM‑4.5.

SWE‑1.5

  • Ranked 19th on the same leaderboard.
  • Could generate a calculator UI but failed to perform calculations.
  • Consistently produced incorrect or incomplete code across the test suite.

Why the Results Matter

  1. Lack of Transparency – Both companies hide the exact base model they fine‑tuned. The observed behavior hints at a GLM‑4.5 or Qwen3‑Coder lineage, but neither vendor provides concrete evidence.
  2. Speed vs. Quality Trade‑off – While SWE‑1.5 achieves higher token‑per‑second throughput, the output quality is often unusable. Speed alone does not compensate for broken code.
  3. Missing Benchmarks – Without community‑accepted evaluations (e.g., HumanEval, MBPP), the claims of “frontier” performance remain unsubstantiated.
  4. Potential Ethical Issues – Deploying a fine‑tuned open‑source model without attribution may violate community norms and, in some jurisdictions, licensing terms.

Technical Analysis

  • Model Selection – The observed behavior aligns more closely with Qwen3‑Coder or an older GLM‑4.5 checkpoint than with a true GLM‑4.6‑class model. The lack of advanced reasoning and tool use suggests insufficient pre‑training alignment.
  • Reinforcement Learning (RL) Impact – The modest gains from RL fine‑tuning are outweighed by the poor base model choice. Proper alignment during pre‑training would be required to see real improvements.
  • Hardware Considerations – Both models run on high‑throughput hardware (Cerebras for SWE‑1.5, unspecified for Cursor). However, newer open models (e.g., MiniMax, GLM‑4.5) already achieve comparable or better speeds on the same hardware, making the speed advantage moot.

Industry Implications

  • Transparency Gap – The refusal to disclose the underlying model undermines trust. Users cannot verify whether the product is a genuine innovation or a re‑branded open‑source checkpoint.
  • Opportunity Cost – Companies with $10 billion valuations could either hire dedicated ML teams to develop proprietary models or, at a minimum, openly credit the base models they are fine‑tuning.
  • Community Reaction – The lack of criticism from the broader AI community suggests a growing complacency around model attribution.

Recommendations for Practitioners

  • Prioritize Proven Open Models – When speed is essential, consider established open weights such as MiniMax, GLM‑4.5, or Mistral‑7B and apply your own fine‑tuning.
  • Validate Before Integration – Run a small benchmark suite (e.g., code generation, tool use, error handling) before adopting a new vendor model; a minimal sketch of such a suite follows this list.
  • Demand Transparency – Insist on clear documentation of the base model, training data, and licensing to avoid legal and performance pitfalls.
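
As a concrete illustration of the "validate before integration" advice, the sketch below scores a candidate model against a few self‑checking tasks; a calculator that renders but cannot calculate, as seen with SWE‑1.5, would fail the first check. The `generate_code` stub is a hypothetical placeholder for whichever model or vendor CLI you are evaluating.

```python
def generate_code(prompt: str) -> str:
    """Placeholder: call the candidate model/CLI and return Python source."""
    raise NotImplementedError("wire this to the model you are evaluating")

# Each task pairs a prompt with an executable assertion over the generated code.
TASKS = [
    {
        "prompt": "Write add(a, b) returning the sum of two numbers.",
        "check": "assert add(2, 3) == 5 and add(-1, 1) == 0",
    },
    {
        "prompt": "Write is_palindrome(s) that ignores letter case.",
        "check": "assert is_palindrome('Level') and not is_palindrome('cursor')",
    },
]

def score(tasks) -> float:
    """Return the fraction of tasks whose generated code passes its assertion."""
    passed = 0
    for task in tasks:
        namespace = {}
        try:
            exec(generate_code(task["prompt"]), namespace)  # load generated code
            exec(task["check"], namespace)                   # run the check
            passed += 1
        except Exception:
            pass  # any error (bad code, failed assertion) counts as a miss
    return passed / len(tasks)

# Example: print(f"pass rate: {score(TASKS):.0%}")
```

A handful of such checks, plus one tool‑use and one error‑handling task, gives a far clearer signal than raw tokens‑per‑second figures.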

Conclusion

Both Cursor Composer and SWE‑1.5 promise lightning‑fast code generation, yet the reality is a collection of fast‑but‑flawed outputs. The models struggle with basic tasks that older open‑source checkpoints handle with ease, and the opaque development process raises ethical concerns. Until the companies either disclose their foundations or deliver a genuinely superior model, developers would be better served by sticking with well‑documented, community‑vetted alternatives.


This article reflects an independent technical assessment and does not endorse any specific product.
