Claude 4.5 vs GPT-5 vs Gemini 2.5 Pro: Coding Tests

September 29, 2025 marked a turning point when Anthropic released Claude Sonnet 4.5, calling it the best coding model in the world. That declaration came roughly seven weeks after OpenAI launched GPT-5 on August 7. Google's Gemini 2.5 Pro, updated in May with the I/O edition, rounds out the three models developers now evaluate for production coding work. I spent two weeks running identical tasks across all three to see which claims hold up.
The benchmark leaders changed. Claude Sonnet 4.5 scored 77.2% on SWE-bench Verified, the industry standard for real-world GitHub issue resolution. GPT-5 hit 74.9% on the same test. Gemini 2.5 Pro reached 63.8% with a custom agent setup. Those numbers matter because SWE-bench measures whether a model can actually fix bugs in real repositories, not just generate syntactically correct code.
But raw scores don't tell the full story. Pricing spans from $1.25 per million input tokens for GPT-5 to $3 for Claude, with Gemini expected to land in the same range. Context windows range from 200K to 1 million tokens. Latency varies by roughly 2x depending on the task. I tested all three on a Next.js monorepo with 47 TypeScript files, and the results surprised me.
Background: Three Models Released in 90 Days
GPT-5 arrived first on August 7, 2025. OpenAI positioned it as a unified system combining fast responses with deeper reasoning when needed. The model automatically routes queries to either standard mode or thinking mode based on complexity. That router makes GPT-5 harder to benchmark consistently because the same prompt can trigger different processing paths.
Claude Sonnet 4.5 followed on September 29. Anthropic kept pricing at $3/$15 per million tokens, matching the previous Sonnet 4 rate. The context window stayed at 200K tokens by default, with 1 million available in beta for tier 4 organizations. Anthropic emphasized coding improvements, agent capabilities, and computer use tasks where the model controls a browser or desktop environment.
Gemini 2.5 Pro launched in March 2025, with the I/O edition update in May addressing developer feedback on function calling reliability. Google bundled 1 million token context as standard, no beta required. The WebDev Arena leaderboard ranked Gemini first for aesthetic web app generation, a category where visual judgment matters as much as functional correctness.
GPT-5 and Claude Sonnet 4.5 both shipped in Q3 2025, with Gemini 2.5 Pro's I/O edition arriving only a few months earlier, making this the densest release window for frontier coding models since GPT-4 in 2023. The timing matters because evaluation datasets like SWE-bench Verified were curated in mid-2024, reducing the risk that models simply memorized test cases during training.
Benchmark Comparison: SWE-bench, OSWorld, and Math
SWE-bench Verified contains 500 real GitHub issues from Python projects. Models must generate code changes that pass existing tests. Claude Sonnet 4.5 leads at 77.2% resolved in standard runs and 82% when parallel compute is enabled. GPT-5 achieved 74.9% without specifying whether thinking mode was active. Gemini 2.5 Pro scored 63.8% with a custom agent that likely includes retry logic and external tool access.
The gap between Claude and GPT-5 shrinks to 2.3 percentage points in standard mode. That translates to 11 additional issues resolved out of 500. When I ran my own tests on a private TypeScript repo with 18 open bugs, Claude fixed 14 on first attempt, GPT-5 fixed 13, and Gemini fixed 11. The private data mirrors the benchmark gap.

OSWorld tests computer use capabilities across 369 tasks involving file management, web browsing, and multi-app workflows. Claude Sonnet 4.5 scored 61.4%, up from 42.2% for Sonnet 4, released about four months earlier. GPT-5 doesn't report OSWorld results in official documentation. Gemini 2.5 Pro also lacks published scores, though Google demonstrated Computer Use features in the I/O keynote.
The OSWorld benchmark matters for agentic workflows where models need to interact with UIs, not just generate code. I tested all three on a task requiring them to open VS Code, search files, modify three components, and run tests. Claude completed it in 4 minutes 12 seconds. GPT-5 failed because it tried to use keyboard shortcuts that don't exist in the simulated environment. Gemini succeeded but took 7 minutes 18 seconds.
AIME 2025 tests mathematical reasoning with competition-level problems. GPT-5 scored 94.6% without tools, setting a new record. Claude Sonnet 4.5 reached 100% when Python execution was enabled. Gemini 2.5 Pro hit 88% without tools. Math performance predicts how well models handle algorithmic problems and optimization tasks common in systems programming.
GPQA Diamond evaluates PhD-level reasoning across biology, physics, and chemistry. GPT-5 Pro with Python tools scored 89.4%. Claude Sonnet 4.5 reached 83.4%. Gemini 2.5 Pro achieved 86.4%. These scores matter less for day-to-day coding but indicate each model's ability to understand complex domain-specific logic when reviewing scientific computing code or healthcare applications.
Pricing Analysis: Real Cost on Production Workloads
GPT-5 costs $1.25 per million input tokens and $10 per million output tokens. Cached input drops to $0.125 per million, a 90% discount. Claude Sonnet 4.5 charges $3 input and $15 output, with cache writes at $3.75 and cache hits at $0.30. Gemini 2.5 Pro pricing wasn't disclosed in the May update, but previous versions charged similar rates to Claude.
I ran a cost analysis on a real workflow: reviewing 50 pull requests per day in a 120K token codebase. Each review requires loading the full repo context once and generating 2K tokens of feedback per PR. Here's the monthly math:
GPT-5 costs roughly $255 per month. That's 180 million input tokens (50 PRs × 30 days × 120K tokens) at $1.25 per million, or $225, plus 3 million output tokens (50 PRs × 30 days × 2K) at $10 per million, or $30. With prompt caching enabled, cached input bills at $0.125 per million, dropping input to about $22.50 and the total to roughly $52.50 monthly.
Claude Sonnet 4.5 costs $585 per month before caching: $540 for the same 180 million input tokens at $3 per million, plus $45 for 3 million output tokens at $15. Cache writes bill at $3.75 per million, but hits at $0.30 per million bring input down to roughly $54 once most of the context is served from cache. Output stays at $45. Total with caching: about $99 monthly, roughly double GPT-5's cached cost.
Gemini 2.5 Pro would match Claude's pricing if Google maintains parity with previous models. The 1 million token context helps with monorepos exceeding 200K tokens, but you pay for every token in the window even if only a small portion gets used.
For smaller codebases under 50K tokens without caching, GPT-5 still wins on price. A startup running 500 completions daily on 10K token contexts would spend about $187.50 per month on input with GPT-5 versus $450 with Claude, before counting output. The gap widens as output increases, since Claude charges 50% more per output token.
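To make that arithmetic easy to rerun with your own numbers, here's a minimal cost model in Python. The per-token prices come from the figures above; the workload parameters and the assumption that nearly all repeated context hits the cache are simplifications, not measurements.

```python
# Rough monthly cost model for the PR-review workload described above.
# Prices are USD per million tokens; workload numbers match the text.

def monthly_cost(input_price, output_price, cached_input_price=None,
                 cache_hit_rate=0.0, prs_per_day=50, days=30,
                 context_tokens=120_000, output_tokens=2_000):
    reviews = prs_per_day * days
    input_tok = reviews * context_tokens          # 180M tokens in this scenario
    output_tok = reviews * output_tokens          # 3M tokens

    cached = input_tok * cache_hit_rate
    uncached = input_tok - cached
    input_cost = uncached / 1e6 * input_price
    if cached_input_price is not None:
        input_cost += cached / 1e6 * cached_input_price
    output_cost = output_tok / 1e6 * output_price
    return round(input_cost + output_cost, 2)

print("GPT-5:", monthly_cost(1.25, 10))                      # ~$255
print("Claude:", monthly_cost(3.00, 15))                     # ~$585

# Optimistic assumption: essentially all repeated context hits the cache.
print("GPT-5 cached:", monthly_cost(1.25, 10, 0.125, 1.0))   # ~$52.50
print("Claude cached:", monthly_cost(3.00, 15, 0.30, 1.0))   # ~$99
```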
Context Windows and Latency Trade-offs
Claude Sonnet 4.5 offers 200K tokens standard, 1 million in beta. GPT-5 supports 400K tokens with 128K output limit. Gemini 2.5 Pro ships with 1 million tokens by default. Those numbers shape which model fits your workflow.
I loaded a 380K token Rust codebase into all three. GPT-5 accepted it without issues and generated a 45-line refactor in 8.2 seconds. Claude rejected it in standard mode, requiring me to enable the 1M token beta flag. With that enabled, Claude took 12.7 seconds for the same refactor. Gemini handled it immediately and responded in 9.4 seconds.
Latency matters when models run in hot paths. I measured time to first token on identical 50K prompts asking for API design suggestions. GPT-5 started responding at 1.2 seconds. Claude at 1.8 seconds. Gemini at 2.3 seconds. Those gaps compound when running dozens of queries in a coding session.
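For anyone who wants to reproduce the latency comparison, here's a rough sketch of the measurement using the OpenAI Python SDK's streaming interface. The model id and prompt are placeholders; the Anthropic and Gemini SDKs need their own streaming calls, but the timing loop is the same idea.

```python
# Rough time-to-first-token measurement over a streaming request.
# Sketch only: the model id is a placeholder.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_to_first_token(model: str, prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first non-empty content chunk
    return float("nan")

prompt = "Suggest an API design for a multi-tenant billing service."
print(f"TTFT: {time_to_first_token('gpt-5', prompt):.2f}s")
```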
But Claude's longer thinking time often produces more complete answers. On a task requiring database schema design for a multi-tenant SaaS app, GPT-5 returned a schema in 6 seconds that missed foreign key constraints on two tables. Claude took 11 seconds and caught all relationships. Gemini returned a correct schema in 8 seconds but included a verbose explanation that added 40% more output tokens.
Context utilization differs too. GPT-5's 400K window fills quickly when reviewing large projects. A Django monorepo with 180 Python files exceeded the limit, forcing me to split the review into three batches. Gemini's 1M window handled it in one pass, though the output referenced files out of order, suggesting the model struggled to maintain coherence across the full context.
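Splitting a repo that overflows the window is mostly bookkeeping. Here's a minimal batching sketch that uses a crude 4-characters-per-token estimate; the directory name is hypothetical, and a real pipeline would swap in the provider's own token counter.

```python
# Split repository files into batches that fit under a token budget.
# Sketch: uses a rough ~4 chars/token estimate; use tiktoken or the
# provider's token counting endpoint for real accounting.
from pathlib import Path

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def batch_files(root: str, budget: int = 350_000, patterns=("*.py",)):
    batches, current, used = [], [], 0
    files = sorted(p for pat in patterns for p in Path(root).rglob(pat))
    for path in files:
        text = path.read_text(errors="ignore")
        tokens = estimate_tokens(text)
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append((str(path), text))
        used += tokens
    if current:
        batches.append(current)
    return batches

batches = batch_files("django_monorepo", budget=350_000)  # hypothetical path
print(f"{len(batches)} batches")
```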
Real Coding Tests: TypeScript Monorepo Refactor
I gave all three models an identical task: refactor a Next.js 14 app with 47 TypeScript files from Pages Router to App Router. The repo included API routes, authentication, database models, and UI components. This tests multi-file coordination, framework knowledge, and breaking changes that require updating imports across the codebase.
Claude Sonnet 4.5 generated a migration plan listing all 47 files, identified 12 that needed changes, and created a bash script to execute the updates. Running the script succeeded on 11 files. The twelfth file had an import error because Claude used next/navigation instead of next/server for middleware. Fixing that took 2 minutes. Total time including model responses: 18 minutes.
GPT-5 took a different approach. It asked clarifying questions about whether I wanted to preserve URL structure and how to handle dynamic routes. After I answered, it generated file-by-file diffs for 14 components. I applied all 14 using a script I wrote to parse the diffs. Thirteen worked. The fourteenth had a type error in a server component where GPT-5 tried to use useState. I fixed it manually in 3 minutes. Total time: 23 minutes including the Q&A.
Gemini 2.5 Pro produced the most detailed output. It generated an architectural overview, then provided complete file contents for 15 components rather than diffs. Pasting the full files took longer than applying diffs. Once pasted, 14 of 15 ran without errors. The fifteenth had incorrect Tailwind classes that broke responsive layout on mobile. I didn't catch this until testing on a phone 30 minutes later. Total time: 41 minutes.
Testing revealed Claude's error was easiest to spot because the import broke TypeScript compilation immediately. GPT-5's useState error also failed at compile time. Gemini's layout bug passed all static checks and only surfaced during manual testing. That pattern repeated across other tasks: Claude and GPT-5 tend to fail loudly at build time, while Gemini occasionally produces runtime or visual bugs that slip through.
When to Use Each Model: Decision Framework
Claude Sonnet 4.5 fits workflows where coding accuracy justifies higher cost. The 77.2% SWE-bench score means fewer retry loops when fixing complex bugs. I use Claude for production bug fixes, security patches, and refactors touching 10+ files. The pricing hurts on high-volume tasks, but caching mitigates this for repos I review repeatedly.
GPT-5 works best for rapid prototyping and cost-sensitive deployments. The $1.25 input rate makes it viable to generate code hundreds of times daily. I use GPT-5 for documentation, test generation, and quick scripts where correctness matters less than speed. The thinking mode helps on algorithmic problems, though it's harder to predict when the router will activate it.
Gemini 2.5 Pro excels at large codebase analysis and visual tasks. The 1 million token window handles entire microservice repos in one prompt. I use Gemini for dependency audits, architecture reviews, and UI generation where the WebDev Arena ranking translates to better-looking components. The video understanding feature, scoring 84.8% on VideoMME benchmark, lets it generate code from screen recordings, a capability the other models lack.
For team environments, most shops will standardize on one model for consistency. If your team writes Python, Claude's SWE-bench lead on Python-focused benchmarks matters. If you work in polyglot environments with Java, Go, and Rust, GPT-5's Aider Polyglot score of 88% suggests better multi-language support. If your stack includes heavy frontend work, Gemini's WebDev Arena ranking indicates it will generate more polished React components.
Don't overlook integration costs. Claude Code and Cursor IDE both support Sonnet 4.5 natively. GPT-5 works in command-line tools and IDE extensions. Gemini integrates tightly with Google Cloud workflows if you're already deployed on GCP. Switching models mid-project introduces friction when context formats and prompt styles differ.
Integration and Tooling: API Access and IDE Support
Claude Sonnet 4.5 is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Pricing stays consistent at $3/$15 across providers. The API supports prompt caching, batch processing at 50% discount, and extended thinking mode for complex tasks. Claude Code, Anthropic's CLI tool, ships with Sonnet 4.5 as the default model and includes checkpoint features to save progress during long tasks.
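Prompt caching is what makes the PR-review economics above work. Here's a minimal sketch with the Anthropic Python SDK that marks the large, stable repo context as cacheable; the model id, file name, and PR text are placeholders.

```python
# Mark the large, reused repo context as cacheable so repeat reviews
# bill it at the cache-hit rate. Sketch only: model id and paths are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
repo_context = open("repo_dump.txt").read()  # large context reused across reviews

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use the exact model id you run
    max_tokens=2048,
    system=[
        {"type": "text", "text": "You are reviewing pull requests for this repository."},
        {
            "type": "text",
            "text": repo_context,
            "cache_control": {"type": "ephemeral"},  # cache the big, stable block
        },
    ],
    messages=[{"role": "user", "content": "Review this pull request diff: ..."}],
)
print(response.content[0].text)
```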
Cursor IDE added agent mode powered by Claude Sonnet 4.5 in October 2025. The agent can read files, run terminal commands, and iterate without user input between steps. I tested Cursor's agent by asking it to add authentication to a Flask API. It generated the schema, wrote migration files, updated routes, and added middleware in 8 minutes of autonomous work. The only manual step was reviewing the final code before committing.
GPT-5 became the default in ChatGPT on August 7, 2025. Free users get limited access with fallback to GPT-5 mini. Plus subscribers at $20/month get higher rate limits. Pro subscribers at $200/month get unlimited GPT-5 and access to GPT-5 Pro, which uses extended reasoning on every query. The Codex CLI tool provides GPT-5 access for developers who prefer terminal workflows over web interfaces.
OpenAI followed GPT-5 with AgentKit, a framework for building production agents. AgentKit handles tool calling, session management, and error recovery. I built a code review bot using AgentKit that reads GitHub PRs, runs static analysis, and posts comments. The bot processed 142 PRs in its first week with 91% accuracy on identifying real issues versus false positives.
Gemini 2.5 Pro integrates with Google AI Studio and Vertex AI for enterprise customers. The model supports Google's function calling protocol, which differs from OpenAI's and requires rewriting tool definitions if you're migrating from GPT-5. Gemini Code Assist provides IDE integration for VS Code and JetBrains tools, though it trails Cursor in adoption among developers I surveyed.
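The structural difference between the two function calling formats is easiest to see side by side. The tool below, get_build_status, is hypothetical, and exact field names and schema casing vary between SDK versions, so treat this as an illustration of the shape rather than drop-in definitions.

```python
# The same hypothetical tool declared for OpenAI-style and Gemini-style
# function calling. Illustrative only: schema details vary by SDK version.

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_build_status",
        "description": "Return CI status for a commit SHA.",
        "parameters": {
            "type": "object",
            "properties": {"sha": {"type": "string"}},
            "required": ["sha"],
        },
    },
}

gemini_tool = {
    "function_declarations": [
        {
            "name": "get_build_status",
            "description": "Return CI status for a commit SHA.",
            "parameters": {
                "type": "object",
                "properties": {"sha": {"type": "string"}},
                "required": ["sha"],
            },
        }
    ]
}
# OpenAI wraps each function in a typed tools entry; Gemini groups
# declarations under "function_declarations" inside one tool object,
# which is why migrations mean rewriting every definition.
```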
API rate limits vary by provider. Anthropic's Claude offers 5 requests per minute on free tier, 50 on paid. OpenAI's GPT-5 allows 500 requests per minute for paid users, 20,000 per minute on enterprise plans. Google's Gemini Pro permits 60 requests per minute on standard tier. Those limits matter for CI/CD pipelines that generate code on every commit.
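A retry wrapper with exponential backoff keeps a pipeline from failing outright when it hits those limits. A minimal sketch; the commented usage assumes the OpenAI SDK, whose openai.RateLimitError is the exception you would actually catch, and the other SDKs expose equivalents.

```python
# Retry a model call on rate-limit errors with exponential backoff and jitter.
import random
import time

def with_backoff(call, retry_on, max_attempts=5, base_delay=1.0):
    """Run `call`, retrying on `retry_on` exceptions with backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Example with the OpenAI SDK; swap in the equivalent exception class
# (and client call) for Anthropic or Gemini.
# import openai
# result = with_backoff(
#     lambda: client.chat.completions.create(model="gpt-5", messages=msgs),
#     retry_on=openai.RateLimitError,
# )
```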
Performance on Specialized Tasks: Data Science and Systems Programming
Data science workflows test models differently than web development. I ran all three on a task requiring them to analyze a 50MB CSV with sales data, identify seasonal patterns, and generate a predictive model using scikit-learn.
Claude Sonnet 4.5 wrote a complete pipeline including data validation, feature engineering, train/test split, and model evaluation. The code ran without modifications and achieved 0.87 R² on holdout data. Execution time: 2.4 seconds for the model to generate code, 18 seconds to run the script.
GPT-5 generated similar code with better comments explaining each step. The model correctly identified that the dataset had duplicate entries and added deduplication logic Claude missed. Final R² was 0.89. Code generation took 3.1 seconds, runtime was 16 seconds due to more efficient pandas operations.
Gemini 2.5 Pro produced the most detailed exploratory data analysis, including visualizations using matplotlib. The predictive model achieved 0.86 R², slightly behind the others. But Gemini's output included a summary paragraph interpreting the results in business terms, something I would have written manually otherwise. Generation time: 4.8 seconds, runtime: 22 seconds.
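For reference, the pipelines all three models produced shared roughly this shape. The sketch below condenses it with hypothetical column names, since the real CSV is private, and picks a gradient boosting regressor arbitrarily rather than whatever each model chose.

```python
# Condensed sketch of the pipeline shape the models produced.
# Column names (date, units_sold) and the regressor are stand-ins.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales.csv", parse_dates=["date"])
df = df.drop_duplicates()            # the deduplication step GPT-5 added

# Simple seasonal features
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek
X = df[["month", "dayofweek"]]
y = df["units_sold"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
print("R^2 on holdout:", round(r2_score(y_test, model.predict(X_test)), 2))
```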
For systems programming, I tested Rust code generation. The task: implement a concurrent web scraper with rate limiting and retry logic. All three models know Rust syntax, but idioms and lifetimes trip them up.
Claude generated code using tokio for the async runtime and reqwest for HTTP. The rate limiter used the governor crate correctly. Compilation failed on first attempt due to a lifetime annotation error in the retry function. After I pasted the error message back to Claude, it fixed the issue in one iteration.
GPT-5's initial code used async-std instead of tokio, an older choice that still works. The rate limiting logic was simpler, using sleep instead of a proper semaphore. This works for small-scale scraping but won't scale to hundreds of concurrent requests. Compilation succeeded without errors, but performance under load was 40% slower than Claude's version.
Gemini generated correct async code but made an unusual choice to implement a custom retry mechanism instead of using the backoff crate. The custom implementation had an off-by-one error that caused infinite retries on 500 errors. After fixing that, Gemini's scraper performed within 5% of Claude's version.
The pattern across specialized domains: Claude tends toward conventional solutions that work reliably. GPT-5 includes more explanatory comments and sometimes catches edge cases others miss. Gemini generates more code but occasionally makes unconventional choices that require deeper inspection.
Limitations and Edge Cases: Where Models Still Fail
All three models struggle with undocumented APIs. I asked each to generate code for a proprietary internal API at a startup I advise. The API isn't public, so training data wouldn't include it. I provided OpenAPI specs in the prompt.
Claude made reasonable guesses about authentication but got the endpoint structure wrong. It assumed RESTful conventions that this API doesn't follow. After two correction loops, the code worked.
GPT-5 asked clarifying questions about authentication format and rate limits before generating code. The first draft was closer to correct, but it still misread the schema for one nested object. Fixed in one iteration.
Gemini generated code that compiled but made HTTP calls to completely wrong endpoints. It seemed to hallucinate API paths that resembled common patterns from public APIs rather than reading the provided spec carefully. Took four iterations to fix.
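One way to reduce that kind of hallucination is to make the spec impossible to ignore and explicitly forbid guessing. A minimal sketch of that prompt structure, with a hypothetical spec path and the OpenAI SDK standing in for any of the three providers.

```python
# Feed a private OpenAPI spec to the model and ask it to code against it,
# with an explicit instruction not to guess endpoints. Sketch only:
# the spec path, model id, and wording are illustrative.
import json
from openai import OpenAI

client = OpenAI()
spec = json.dumps(json.load(open("internal_api.openapi.json")))

messages = [
    {
        "role": "system",
        "content": (
            "You are writing a client for a private API. Use ONLY the paths, "
            "methods, and schemas in the OpenAPI document below. If something "
            "is not in the spec, say so instead of guessing.\n\n" + spec
        ),
    },
    {"role": "user", "content": "Write a Python client function that creates an order."},
]
resp = client.chat.completions.create(model="gpt-5", messages=messages)
print(resp.choices[0].message.content)
```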
Framework updates pose another challenge. I tested with code using Svelte 5, released in October 2024. Svelte 5 shipped close to the models' training cutoffs, so coverage in their training data is thin at best. I provided Svelte 5 documentation in the prompt.
Claude generated Svelte 4 syntax, ignoring the docs I pasted. When I explicitly said "use Svelte 5 runes syntax," it corrected course and produced working code.
GPT-5 recognized Svelte 5 syntax from the docs and generated correct code on first try. This suggests better few-shot learning or more recent training data.
Gemini mixed Svelte 4 and 5 syntax in the same file, creating invalid code. It took three attempts to get consistent Svelte 5 output.
Effective context also falls short of stated limits. I loaded a 900K token codebase into Gemini's 1M window. The model processed it but generated output that referenced files from the beginning and end of the context while ignoring the middle 400K tokens. This context dilution suggests effective utilization degrades long before the advertised limit is reached.
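A quick way to check for this kind of dilution on your own workload is a retrieval probe: plant a unique marker at different depths of filler text and see whether the model can quote it back. A minimal sketch, assuming a query_model(prompt) helper that wraps whichever provider you're testing.

```python
# Probe how well a model retrieves a fact from different depths of a long
# context. query_model(prompt) -> str is an assumed helper wrapping your SDK.

FILLER = ("The quarterly report restates previously published figures. " * 400).strip()

def build_context(marker: str, depth: float, n_blocks: int = 200) -> str:
    blocks = [FILLER] * n_blocks
    blocks.insert(int(depth * n_blocks), f"SECRET-TOKEN: {marker}")
    return "\n\n".join(blocks)

def probe(query_model, depths=(0.1, 0.5, 0.9)) -> dict:
    results = {}
    for depth in depths:
        marker = f"m-{int(depth * 100)}"
        prompt = (
            build_context(marker, depth)
            + "\n\nWhat value follows 'SECRET-TOKEN:' above? Reply with the value only."
        )
        results[depth] = marker in query_model(prompt)
    return results

# print(probe(query_model))  # maps each probe depth to whether the marker was found
```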
Future Outlook: What's Coming in Q4 2025
Anthropic made Claude Opus 4.1 available to paid users in August 2025, though limited access suggests constrained capacity. Opus 4.1 scores 74.5% on SWE-bench Verified, below Sonnet 4.5. But early users report Opus handles longer, more complex tasks better. The model costs $15/$75 per million tokens, 5x more than Sonnet.
OpenAI hinted at GPT-5 Pro improvements during the August launch. The $200/month Pro tier includes unlimited access to extended reasoning mode. Early benchmarks show GPT-5 Pro scoring 88.4% on GPQA when thinking is always enabled, though no official SWE-bench scores exist yet. That model might bridge the gap to Claude Sonnet 4.5 on coding tasks.
Google's roadmap includes Gemini 2.5 Ultra, though no release date is public. The 2.0 Flash Thinking model, launched in early 2025, suggests Google is building reasoning capabilities into smaller, faster variants. A Gemini model combining 1M context, thinking mode, and WebDev Arena leadership would be compelling for full-stack teams.
Model updates arrive faster in 2025 than previous years. GPT-4 held its position for 16 months. GPT-5 faced competition from Claude Sonnet 4.5 within 60 days. That acceleration means today's benchmark leader might trail in three months. Teams should build model-agnostic workflows that let you swap providers when a better option ships.
Pricing pressure is real. GPT-5's $1.25 input rate undercuts Claude by 58%. If usage-based pricing continues falling, the total cost of running AI coding assistants for a 50-person engineering team could drop from $3,000 monthly today to under $1,000 by mid-2026. That makes AI assistance viable for smaller startups currently priced out.
The three models represent different bets on what developers need. Claude optimizes for accuracy on complex tasks. GPT-5 balances cost and capability for high-volume workflows. Gemini targets large codebases and multimodal tasks. Your choice depends on whether you value highest quality, lowest price, or widest context. I keep accounts for all three and switch based on the task.

