
Gemini 2.5 Pro

BIRD Mini-Dev benchmark results for Gemini 2.5 Pro via Google Cloud Vertex AI.

Summary

Provider: Google Cloud Vertex AI
Model: gemini-2.5-pro
Overall EX Accuracy: 64.4%
Overall Soft F1: 64.0%
Error Rate: 3.6% (18 / 500)
Avg Latency: 20.1s per question
Total Benchmark Time: 167.3 minutes
Rank: #1 overall
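
The summary figures hang together arithmetically: the error rate is simply errors divided by total questions, and the total benchmark time is roughly the average latency times the question count. A minimal Python sketch of that check (illustrative only, not part of the benchmark harness; the small difference from 167.3 minutes comes from rounding the average latency):

    # Illustrative consistency check of the summary figures (not the benchmark harness).
    errors, total_questions = 18, 500
    avg_latency_s = 20.1

    error_rate = errors / total_questions                      # 0.036 -> 3.6%
    est_total_minutes = avg_latency_s * total_questions / 60   # ~167.5 min vs. reported 167.3

    print(f"Error rate: {error_rate:.1%}")
    print(f"Estimated total time: {est_total_minutes:.1f} minutes")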

Detailed Scores

Metric                     Overall   Simple (148)   Moderate (250)   Challenging (102)
Execution Accuracy (EX)    64.4%     77.0%          61.2%            53.9%
Soft F1                    64.0%     75.4%          61.7%            53.0%
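
The overall columns are consistent with a question-count-weighted mean of the per-difficulty scores, as in this minimal Python sketch (the weighting is assumed, not taken from the benchmark code):

    # Question-count-weighted mean of per-difficulty EX scores (assumed aggregation).
    counts = {"simple": 148, "moderate": 250, "challenging": 102}
    ex     = {"simple": 77.0, "moderate": 61.2, "challenging": 53.9}

    overall_ex = sum(ex[d] * counts[d] for d in counts) / sum(counts.values())
    print(f"Overall EX: {overall_ex:.1f}%")  # 64.4%, matching the reported overall figure

The same weighting reproduces the 64.0% overall Soft F1 from the per-difficulty Soft F1 scores.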

[Chart: Gemini 2.5 Pro, EX vs Soft F1 breakdown by difficulty]

Analysis

Strengths

  • Highest overall accuracy at 64.4% EX, making it the top-ranked model on the benchmark overall
  • Best moderate-question performance at 61.2% EX — a 7.6-point lead over the next model (Gemini 2.5 Flash at 53.6%)
  • Minimal EX-to-F1 gap (0.4 points) — when this model gets close, it almost always gets it exactly right (see the sketch after this list)
  • Low error rate at 3.6%, producing well-formed SQL the vast majority of the time
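
To make the EX-to-F1 gap concrete: EX gives credit only when the predicted query's execution result matches the gold result exactly, while Soft F1 grants partial credit for overlapping rows, so a small gap means near-misses are rare. The Python sketch below is a simplified illustration over result-row sets, not the official BIRD Mini-Dev scorer:

    # Simplified EX vs Soft F1 over executed result rows
    # (row-set comparison is assumed; this is not the official BIRD scorer).
    def exact_match(pred_rows, gold_rows):
        return float(set(pred_rows) == set(gold_rows))

    def soft_f1(pred_rows, gold_rows):
        pred, gold = set(pred_rows), set(gold_rows)
        overlap = len(pred & gold)
        if overlap == 0:
            return float(pred == gold)  # both empty counts as a match
        precision, recall = overlap / len(pred), overlap / len(gold)
        return 2 * precision * recall / (precision + recall)

    gold_rows = [("Alice", 3), ("Bob", 5)]
    near_miss = [("Alice", 3)]  # partially correct prediction
    print(exact_match(near_miss, gold_rows))        # 0.0 -> no EX credit
    print(round(soft_f1(near_miss, gold_rows), 2))  # 0.67 -> partial Soft F1 credit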

Weaknesses

  • Significantly slower at 20.1s average latency — roughly 3x slower than GPT-5.4 and Gemini 2.5 Flash
  • Total runtime of 167.3 minutes (nearly 3 hours) makes it the longest benchmark run of any model evaluated
  • Not the best on challenging questions — Gemini 2.5 Flash edges it out on the hardest queries (54.9% vs 53.9%)

When to Use

Gemini 2.5 Pro is the right choice when accuracy is the primary concern and latency is acceptable. Ideal for:

  • Offline or batch SQL generation pipelines
  • High-stakes analytical queries where correctness matters most
  • Scenarios where cost/time is secondary to precision

Comparison with Peers

Comparison             EX Difference   Latency Ratio
vs Gemini 2.5 Flash    +3.8 points     3.0x slower
vs GPT-5.4             +9.6 points     2.9x slower
vs GPT-5.4 Mini        +11.2 points    5.6x slower
