Skip to content

Gemini 2.5 Flash

BIRD Mini-Dev benchmark results for Gemini 2.5 Flash via Google Cloud Vertex AI.

Back to Overall Results


Summary

Provider Google Cloud Vertex AI
Model gemini-2.5-flash
Overall EX Accuracy 60.6%
Overall Soft F1 62.1%
Error Rate 2.4% (12 / 500)
Avg Latency 6.7s per question
Total Benchmark Time 56.0 minutes
Rank #2 overall

Detailed Scores

Metric Overall Simple (148) Moderate (250) Challenging (102)
Execution Accuracy (EX) 60.6% 76.4% 53.6% 54.9%
Soft F1 62.1% 75.1% 56.8% 55.9%

Gemini 2.5 Flash — EX vs Soft F1 Breakdown

Analysis

Strengths

  • Best on challenging questions at 54.9% EX — highest among all models, edging out even Gemini 2.5 Pro (53.9%)
  • Excellent reliability with only 12 errors (2.4%), second-lowest error rate
  • Good balance of speed and accuracy — 3x faster than Gemini 2.5 Pro while only 3.8 points behind
  • Soft F1 exceeds EX by 1.5 points, showing it frequently produces partially-correct SQL on harder questions

Weaknesses

  • Moderate question gap — drops to 53.6% on moderate difficulty, 7.6 points behind Gemini 2.5 Pro
  • Not the fastest — GPT-5.4 Mini runs at roughly half the latency (3.6s vs 6.7s)

When to Use

Gemini 2.5 Flash offers the best accuracy-to-speed ratio among Google models. Ideal for:

  • Production text-to-SQL where both speed and accuracy matter
  • Complex analytical queries (best performance on challenging questions)
  • Real-time applications where 20s+ latency from Pro is unacceptable

Comparison with Peers

vs Model EX Difference Latency Ratio
vs Gemini 2.5 Pro -3.8 points 3.0x faster
vs GPT-5.4 +5.8 points 0.96x (similar)
vs GPT-5.4 Mini +7.4 points 1.9x slower

Back to Overall Results