BIRD Benchmark Evaluation¶
Ask RITA includes two built-in evaluation suites based on the BIRD benchmark — the industry-standard benchmark for large-scale database grounded text-to-SQL evaluation.
| Benchmark | Module | Questions | Focus |
|---|---|---|---|
| BIRD Mini-Dev | benchmarks.bird |
500 | Single-turn text-to-SQL accuracy |
| BIRD Mini-Interact | benchmarks.bird_interact |
300 | Multi-turn conversational SQL with clarification |
Table of Contents¶
BIRD Mini-Dev¶
What is BIRD?¶
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL) is a cross-domain benchmark with:
- 12,751+ question-SQL pairs across 95 databases and 37 professional domains
- Real-world "dirty" data values requiring reasoning
- External knowledge evidence for domain-specific questions
- Difficulty levels: Simple (30%), Moderate (50%), Challenging (20%)
This module uses BIRD Mini-Dev (500 instances, 11 SQLite databases) — the recommended development evaluation set.
Setup¶
1. Install Dependencies¶
2. Download BIRD Databases¶
The SQLite databases must be downloaded separately (they are too large to be included in the repository or HuggingFace):
- Download the
minidev.zipfile from the BIRD Google Drive. - Extract the contents into the
benchmarks/bird/data/directory. - Ensure the
dev_databases/folder is located at:benchmarks/bird/data/dev_databases/
Running the Benchmark¶
The benchmark runner supports multiple LLM providers (OpenAI, Vertex AI, Azure OpenAI, Bedrock) and automatically manages the Ask RITA configuration, prompt generation, SQL execution, and metric calculation.
Basic Usage¶
Run the full benchmark (500 questions) with GPT-4o:
Quick Tests and Cost Control¶
Run a quick test on a single database (~30 questions):
poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --db-filter financial
Limit the total number of questions processed:
Run a stratified sample (proportionally across all 11 databases):
poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --stratified --stratified-size 100
Omit external knowledge/evidence from prompts:
Resuming a Failed Run¶
If a run is interrupted, resume from the checkpoint:
This loads predictions_checkpoint.json from the latest matching output directory and skips already-completed questions.
Evaluate-Only Mode¶
Re-evaluate existing predictions without re-running SQL generation:
poetry run python -m benchmarks.bird.run_benchmark \
--evaluate-only \
--predictions ./benchmarks/bird/output/gpt-4o_20250101_120000/predictions.json
Provider-Specific Configurations¶
Vertex AI (Gemini):
poetry run python -m benchmarks.bird.run_benchmark \
--provider vertex_ai \
--model gemini-2.5-pro \
--project-id your-gcp-project-id \
--credentials-path /path/to/service-account.json
Use gcloud CLI authentication instead of a service account:
poetry run python -m benchmarks.bird.run_benchmark \
--provider vertex_ai \
--model gemini-2.5-pro \
--project-id your-gcp-project-id \
--gcloud-cli-auth \
--location us-central1
Azure OpenAI:
poetry run python -m benchmarks.bird.run_benchmark \
--provider azure_openai \
--model gpt-4o \
--api-base https://your-endpoint.openai.azure.com/ \
--deployment-name gpt-4o \
--api-version 2024-08-01-preview
Corporate Proxies (Zscaler):
poetry run python -m benchmarks.bird.run_benchmark \
--provider openai \
--model gpt-4o \
--ca-bundle credentials/zscaler-ca-bundle.pem
CLI Reference¶
| Flag | Default | Description |
|---|---|---|
--provider |
openai |
LLM provider: openai, azure_openai, vertex_ai, bedrock |
--model |
gpt-4o |
Model name |
--api-key |
From env | API key (overrides environment variable) |
--api-base |
— | OpenAI base URL or Azure endpoint |
--api-version |
— | Azure API version |
--deployment-name |
— | Azure deployment name |
--project-id |
— | GCP project ID (Vertex AI) |
--credentials-path |
— | Service account JSON path (Vertex AI) |
--location |
us-central1 |
GCP region (Vertex AI) |
--gcloud-cli-auth |
false |
Use gcloud CLI auth (Vertex AI) |
--ca-bundle |
— | CA bundle PEM path for corporate proxies |
--limit |
— | Max questions to process |
--db-filter |
— | Only process questions for one database |
--stratified |
false |
Use stratified sampling across databases |
--stratified-size |
100 |
Number of questions in stratified sample |
--stratified-seed |
42 |
Random seed for stratified sampling |
--no-evidence |
false |
Omit external knowledge from prompts |
--max-retries |
2 |
SQL generation retries per question |
--timeout |
120 |
Per-question timeout (seconds) |
--resume |
false |
Resume from checkpoint in latest matching output dir |
--evaluate-only |
false |
Skip generation; evaluate existing predictions |
--predictions |
— | Path to predictions JSON (for --evaluate-only) |
--eval-timeout |
30 |
SQL execution timeout during evaluation (seconds) |
--eval-workers |
4 |
Parallel evaluation workers |
--data-dir |
./benchmarks/bird/data |
BIRD data directory |
--output-dir |
./benchmarks/bird/output |
Output parent directory |
--local-data |
false |
Skip HuggingFace download; use local data only |
-v / --verbose |
false |
Debug logging |
Results and Metrics¶
Results are saved to ./benchmarks/bird/output/<model>_<timestamp>/:
predictions.json: Generated SQL in BIRD's official format (submittable to the leaderboard)detailed_results.json: Per-question details including latency, generated SQL, and errorsevaluation_report.json: Final EX and Soft F1 scoresbenchmark_results.xlsx: Spreadsheet comparing baseline scores vs Ask RITA
Evaluation Metrics¶
- EX (Execution Accuracy) — Primary Metric: A prediction scores 1 if executing the predicted SQL returns the exact same result set as the gold SQL (order-independent).
- Soft F1 Score: A more lenient metric that gives partial credit by comparing result sets row-by-row with cell-level matching, tolerating column reordering.
BIRD Mini-Interact¶
What is BIRD-Interact?¶
BIRD Mini-Interact is a conversational text-to-SQL benchmark. Instead of single-turn question-SQL pairs, each task starts with an ambiguous query that requires clarification. The system must:
- Identify the ambiguity in the user's question
- Ask clarifying questions
- Receive responses from a user simulator (LLM-powered)
- Generate the correct SQL after sufficient clarification
The benchmark evaluates both the system's ability to ask good clarifying questions and to produce correct SQL from multi-turn conversations.
- 300 tasks across SQLite databases
- Multiple ambiguity types per task (user ambiguity + knowledge ambiguity)
- User simulator powered by an LLM (default: GPT-4o) that provides natural responses based on ground-truth SQL
Interact Setup¶
1. Install Dependencies¶
2. Data Download¶
The dataset is cloned automatically from HuggingFace on first run:
This requires git and git-lfs to be installed. The data is placed in benchmarks/bird_interact/data/mini-interact/.
3. Ground Truth (Optional)¶
The official ground truth SQL is not publicly available. To obtain it:
- Email
bird.bench25@gmail.comwith subject[mini-interact GT&Test Cases] - Once received, merge it into the tasks file:
poetry run python -m benchmarks.bird_interact \
--provider openai --model gpt-4o \
--gt-path /path/to/ground_truth.jsonl
Without ground truth, the benchmark runs generation but skips evaluation.
Running the Interact Benchmark¶
Basic Usage¶
With a Different User Simulator Model¶
The system LLM and user simulator LLM can be different models:
poetry run python -m benchmarks.bird_interact \
--provider openai --model gpt-4o \
--user-sim-model gpt-4o-mini
Quick Test¶
poetry run python -m benchmarks.bird_interact \
--provider openai --model gpt-4o \
--limit 10 --db-filter financial
Resuming¶
Evaluate-Only¶
Re-evaluate existing results:
poetry run python -m benchmarks.bird_interact \
--evaluate-only \
--results ./benchmarks/bird_interact/output/gpt-4o_interact_20250101/detailed_results.jsonl
Interact CLI Reference¶
| Flag | Default | Description |
|---|---|---|
--provider |
openai |
LLM provider for the Ask RITA system |
--model |
gpt-4o |
System model name |
--user-sim-model |
gpt-4o |
Model for the user simulator |
--api-key |
From env | API key override |
--api-base |
— | OpenAI base URL or Azure endpoint |
--api-version |
— | Azure API version |
--deployment-name |
— | Azure deployment name |
--ca-bundle |
— | CA bundle PEM path |
--limit |
— | Max tasks to process |
--db-filter |
— | Only tasks for one database |
--patience |
3 |
Extra turns beyond task ambiguity count |
--max-retries |
2 |
SQL generation retries |
--timeout |
300 |
Per-task timeout (seconds) |
--resume |
false |
Resume from checkpoint |
--evaluate-only |
false |
Evaluate existing results only |
--results |
— | JSONL path (for --evaluate-only) |
--gt-path |
— | Ground truth JSONL to merge into tasks |
--eval-timeout |
30 |
SQL execution timeout during evaluation |
--data-dir |
./benchmarks/bird_interact/data |
Data directory |
--output-dir |
./benchmarks/bird_interact/output |
Output parent directory |
-v / --verbose |
false |
Debug logging |
User Simulator¶
The user simulator is an LLM-based agent that plays the role of the user in the conversation. It:
- Encodes the system's clarifying question — classifies each question as addressing a
labeledambiguity, anunlabeledambiguity, or asunanswerable - Decodes a natural-language response — uses the ground-truth SQL segments, schema, and ambiguity definitions to generate a helpful but natural user reply
The simulator uses the OpenAI client directly (not through Ask RITA) and can be configured with --user-sim-model and --ca-bundle.
When no ground-truth SQL is available for a task, the simulator returns a generic "use your best judgment" response.
Interact Results and Metrics¶
Results are saved to ./benchmarks/bird_interact/output/<model>_interact_<timestamp>/:
predictions.json: Generated SQL for each taskdetailed_results.jsonl: Per-task details including conversation historyevaluation_report.json: Reward scores and pass ratesrun_summary.json: Run configuration and timingbenchmark_results.xlsx: Spreadsheet with Summary, Per-Task Results, and Conversations sheets
Evaluation Metrics¶
- Reward Score (Primary):
1.0if all test cases pass (or execution accuracy matches gold SQL);0.0otherwise. - Test Pass Rate: Percentage of test cases passed per task (when test cases are available).
- Per-Database Reward: Average reward broken down by database.
Test cases can include:
- Result match: Execute a test SQL and compare result sets with the prediction
- Value check: Verify that expected values appear in the prediction's result set
Automated Multi-Model Benchmarking¶
The run_all_benchmarks.sh script in the benchmarks/bird/ directory automates the process of running the BIRD Mini-Dev benchmark across all supported models (GPT-5.4, GPT-5.4 Mini, GPT-5.4 Nano, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite) and updates the results chart in README.md.
(Note: Running the full suite across 6 models takes approximately 15-18 hours. Ensure you are on a network that does not block the respective LLM APIs.)
See also:
- Configuration Guide — Complete YAML configuration reference
- Supported Platforms — LLM provider setup
- Conversational SQL — Multi-turn chat mode in Ask RITA