GPT-5.4 Mini

BIRD Mini-Dev benchmark results for GPT-5.4 Mini via OpenAI.

Summary

Metric	Overall	Simple (148)	Moderate (250)	Challenging (102)
Execution Accuracy (EX)	53.2%	70.3%	49.6%	37.3%
Soft F1	57.2%	72.0%	55.0%	41.4%
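As a sanity check on the table, the overall figures can be recomputed from the per-tier figures. The sketch below assumes the overall score is a question-count-weighted mean of the three difficulty tiers (the source does not state how it was aggregated); the `TIERS` dictionary and `weighted_overall` helper are illustrative names, not part of any benchmark tooling.

```python
# Per-tier question counts and scores, copied from the summary table.
TIERS = {
    "simple": {"n": 148, "ex": 70.3, "soft_f1": 72.0},
    "moderate": {"n": 250, "ex": 49.6, "soft_f1": 55.0},
    "challenging": {"n": 102, "ex": 37.3, "soft_f1": 41.4},
}


def weighted_overall(tiers, metric):
    """Question-count-weighted mean of a per-tier percentage.

    Assumption: the overall score is averaged over all questions,
    so each tier contributes in proportion to its size.
    """
    total = sum(t["n"] for t in tiers.values())
    return sum(t["n"] * t[metric] for t in tiers.values()) / total


overall_ex = weighted_overall(TIERS, "ex")
overall_f1 = weighted_overall(TIERS, "soft_f1")
print(f"EX: {overall_ex:.2f}%  Soft F1: {overall_f1:.2f}%")
```

Under that assumption both results land within a tenth of a point of the reported 53.2% and 57.2%. The small residual is expected, since the per-tier inputs are themselves rounded to one decimal place.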

Figure: GPT-5.4 Mini, EX vs Soft F1 breakdown

Fastest model tested, at 3.6 s average latency: nearly 2x faster than GPT-5.4 and 5.6x faster than Gemini 2.5 Pro
Shortest total runtime at 29.8 minutes for all 500 questions
Strong on simple questions at 70.3% EX, competitive with much larger models
Low error rate at 2.2%, similar reliability to Gemini 2.5 Flash
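The latency, runtime, and error-rate figures above can be cross-checked with a little arithmetic. This is a minimal sketch that assumes the 500 questions ran one after another (so total runtime divided by question count approximates per-question latency) and that the error rate is a share of all 500 questions. Neither assumption is confirmed by the source, and the variable names are illustrative.

```python
# Figures copied from the bullets above.
TOTAL_RUNTIME_MIN = 29.8
NUM_QUESTIONS = 500
REPORTED_AVG_LATENCY_S = 3.6
ERROR_RATE_PCT = 2.2

# Assumption: questions were processed sequentially, so
# total runtime / question count approximates mean latency.
implied_latency_s = TOTAL_RUNTIME_MIN * 60 / NUM_QUESTIONS

# Assumption: the error rate is measured over all questions.
implied_errors = round(ERROR_RATE_PCT / 100 * NUM_QUESTIONS)

print(f"Implied latency: {implied_latency_s:.2f} s per question")
print(f"Implied errored questions: {implied_errors}")
```

Under these assumptions the implied latency is about 3.58 s, consistent with the reported 3.6 s average, and a 2.2% error rate corresponds to 11 of the 500 questions.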

Sharp drop on challenging questions: 37.3% EX is 33 percentage points below the simple tier, the largest gap among the top four models
Moderate accuracy gap compared to the Gemini models: 11.2 points behind Gemini 2.5 Pro and 7.4 points behind Gemini 2.5 Flash
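The 33-point figure is a difference in percentage points, not a relative change. The sketch below computes both views from the EX row of the summary table; `tier_drop` is a hypothetical helper written for this illustration.

```python
# EX per difficulty tier, copied from the summary table.
EX = {"simple": 70.3, "moderate": 49.6, "challenging": 37.3}


def tier_drop(scores, easy, hard):
    """Return (percentage-point drop, relative drop in %) from easy to hard."""
    points = scores[easy] - scores[hard]
    relative = points / scores[easy] * 100
    return round(points, 1), round(relative, 1)


points, relative = tier_drop(EX, "simple", "challenging")
print(f"Simple to challenging: {points} points ({relative}% relative)")
```

In relative terms, the model loses a little under half of its simple-tier accuracy on challenging questions.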

GPT-5.4 Mini is the best choice when speed matters most. Ideal for:

Real-time interactive applications where sub-5s latency is required
High-volume batch processing where cost-per-query matters
Simple-to-moderate query workloads where roughly 70% EX on simple questions, and about 50% on moderate ones, is sufficient
Prototyping and development where fast iteration beats peak accuracy