# GPT-5.4

BIRD Mini-Dev benchmark results for GPT-5.4 via OpenAI.
## Summary

| Field | Value |
|---|---|
| Provider | OpenAI |
| Model | gpt-5.4 |
| Overall EX Accuracy | 54.8% |
| Overall Soft F1 | 60.6% |
| Error Rate | 0.6% (3 / 500) |
| Avg Latency | 7.0s per question |
| Total Benchmark Time | 58.2 minutes |
| Rank | #3 overall |
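The summary figures are internally consistent. A quick arithmetic check using only the numbers in the table above (the reported average latency is rounded, so the derived total time differs from 58.2 minutes by a few seconds):

```python
# Sanity-check the summary arithmetic from the table above.
questions = 500
errors = 3
avg_latency_s = 7.0

error_rate = errors / questions                 # 0.006 -> 0.6%
total_minutes = questions * avg_latency_s / 60  # ~58.3 minutes

print(f"error rate: {error_rate:.1%}")    # 0.6%
print(f"total time: {total_minutes:.1f} min")
```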
## Detailed Scores

| Metric | Overall | Simple (148) | Moderate (250) | Challenging (102) |
|---|---|---|---|---|
| Execution Accuracy (EX) | 54.8% | 68.9% | 50.8% | 44.1% |
| Soft F1 | 60.6% | 71.5% | 58.7% | 49.4% |

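The overall columns are the question-weighted averages of the per-difficulty scores (148 simple, 250 moderate, 102 challenging), which can be verified directly:

```python
# Reconstruct the overall scores as question-weighted averages
# of the per-difficulty scores from the table above.
counts = {"simple": 148, "moderate": 250, "challenging": 102}
ex = {"simple": 68.9, "moderate": 50.8, "challenging": 44.1}
f1 = {"simple": 71.5, "moderate": 58.7, "challenging": 49.4}

total = sum(counts.values())  # 500 questions

def weighted(scores):
    return sum(counts[k] * scores[k] for k in counts) / total

print(f"overall EX:      {weighted(ex):.1f}%")  # 54.8
print(f"overall Soft F1: {weighted(f1):.1f}%")  # 60.6
```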
## Analysis

### Strengths

- Most reliable model — only 3 errors across 500 questions (0.6% error rate), the lowest of any model tested
- Largest Soft F1 advantage — F1 exceeds EX by 5.8 points overall, indicating GPT-5.4 frequently generates SQL that is semantically close even when not an exact match
- Consistent across difficulty — gradual degradation from 68.9% (simple) to 44.1% (challenging) without sudden drops
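To make the Soft F1 vs. EX gap concrete: Soft F1 awards partial credit when the predicted query's result table overlaps the gold result table, whereas EX is all-or-nothing. The sketch below is a simplified illustration of that idea, not BIRD Mini-Dev's exact implementation (the official metric has its own row/column matching rules):

```python
from collections import Counter

def soft_f1(pred_rows, gold_rows):
    """Row-overlap F1 between two query result sets (simplified sketch,
    not the official BIRD Mini-Dev implementation)."""
    if not pred_rows or not gold_rows:
        return 1.0 if pred_rows == gold_rows else 0.0
    overlap = sum((Counter(pred_rows) & Counter(gold_rows)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_rows)
    recall = overlap / len(gold_rows)
    return 2 * precision * recall / (precision + recall)

# A near-miss query earns partial credit even though exact match fails:
gold = [("Alice", 30), ("Bob", 25)]
pred = [("Alice", 30)]           # missing one row
print(soft_f1(pred, gold))       # ~0.667, while EX would score 0
```

Under a metric like this, a model that consistently produces almost-correct result sets scores noticeably higher on Soft F1 than on EX, which is the pattern GPT-5.4 shows.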
### Weaknesses

- Lower EX than Gemini models — 9.6 points behind Gemini 2.5 Pro and 5.8 points behind Flash on exact match
- Moderate latency at 7.0s, similar to Gemini 2.5 Flash but roughly 2x slower than GPT-5.4 Mini
## When to Use

GPT-5.4 is the best choice when reliability and low error rates are critical. Ideal for:
- Production pipelines where SQL generation failures are costly
- Applications that benefit from partial-credit accuracy (Soft F1 of 60.6%)
- Teams already on OpenAI infrastructure who want the strongest OpenAI model tested
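For teams in the last category, a minimal sketch of how a text-to-SQL request might be framed. The prompt wording, schema, and helper name are illustrative assumptions, not part of the benchmark; only the model name comes from this report:

```python
def build_sql_request(question: str, schema_ddl: str,
                      model: str = "gpt-5.4") -> dict:
    """Assemble a chat-completion payload for text-to-SQL.

    Illustrative sketch: the system prompt and message layout are
    assumptions, not the benchmark's actual harness.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Translate the user's question into a single "
                        "SQLite query. Use only the tables and columns "
                        "in the provided schema."},
            {"role": "user",
             "content": f"Schema:\n{schema_ddl}\n\nQuestion: {question}"},
        ],
    }

payload = build_sql_request(
    "How many users signed up in 2024?",
    "CREATE TABLE users (id INTEGER, signup_date TEXT);",
)
print(payload["model"])  # gpt-5.4
```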
## Comparison with Peers

| vs Model | EX Difference | GPT-5.4 Latency vs Peer |
|---|---|---|
| vs Gemini 2.5 Pro | -9.6 points | 2.9x faster |
| vs Gemini 2.5 Flash | -5.8 points | 1.04x (similar) |
| vs GPT-5.4 Mini | +1.6 points | 1.9x slower |
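Combined with GPT-5.4's 7.0s average, the latency ratios above imply rough peer latencies. This is a back-of-envelope derivation from rounded ratios, not measured values:

```python
# Implied peer latencies, derived from GPT-5.4's 7.0s average and the
# (rounded) ratios in the comparison table above.
gpt_latency = 7.0  # seconds per question

pro_latency = gpt_latency * 2.9    # GPT-5.4 is 2.9x faster than Pro
flash_latency = gpt_latency * 1.04 # similar to Flash
mini_latency = gpt_latency / 1.9   # GPT-5.4 is 1.9x slower than Mini

print(f"Gemini 2.5 Pro:   ~{pro_latency:.1f}s")    # ~20.3s
print(f"Gemini 2.5 Flash: ~{flash_latency:.1f}s")  # ~7.3s
print(f"GPT-5.4 Mini:     ~{mini_latency:.1f}s")   # ~3.7s
```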