BIRD Benchmark Evaluation¶

Ask RITA includes two built-in evaluation suites based on the BIRD benchmark — the industry-standard benchmark for large-scale database grounded text-to-SQL evaluation.

Benchmark	Module	Questions	Focus
BIRD Mini-Dev	`benchmarks.bird`	500	Single-turn text-to-SQL accuracy
BIRD Mini-Interact	`benchmarks.bird_interact`	300	Multi-turn conversational SQL with clarification

BIRD Mini-Dev¶

What is BIRD?¶

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL) is a cross-domain benchmark with:

12,751+ question-SQL pairs across 95 databases and 37 professional domains
Real-world "dirty" data values requiring reasoning
External knowledge evidence for domain-specific questions
Difficulty levels: Simple (30%), Moderate (50%), Challenging (20%)

This module uses BIRD Mini-Dev (500 instances, 11 SQLite databases) — the recommended development evaluation set.

Setup¶

1. Install Dependencies¶

poetry install --with dev

2. Download BIRD Databases¶

The SQLite databases must be downloaded separately (they are too large to be included in the repository or HuggingFace):

Download the minidev.zip file from the BIRD Google Drive.
Extract the contents into the benchmarks/bird/data/ directory.
Ensure the dev_databases/ folder is located at: benchmarks/bird/data/dev_databases/

unzip minidev.zip -d ./benchmarks/bird/data/

Running the Benchmark¶

The benchmark runner supports multiple LLM providers (OpenAI, Vertex AI, Azure OpenAI, Bedrock) and automatically manages the Ask RITA configuration, prompt generation, SQL execution, and metric calculation.

Basic Usage¶

Run the full benchmark (500 questions) with GPT-4o:

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o

Quick Tests and Cost Control¶

Run a quick test on a single database (~30 questions):

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --db-filter financial

Limit the total number of questions processed:

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --limit 10

Run a stratified sample (proportionally across all 11 databases):

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --stratified --stratified-size 100

Omit external knowledge/evidence from prompts:

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --no-evidence

Resuming a Failed Run¶

If a run is interrupted, resume from the checkpoint:

poetry run python -m benchmarks.bird.run_benchmark --provider openai --model gpt-4o --resume

This loads predictions_checkpoint.json from the latest matching output directory and skips already-completed questions.

Evaluate-Only Mode¶

Re-evaluate existing predictions without re-running SQL generation:

poetry run python -m benchmarks.bird.run_benchmark \
  --evaluate-only \
  --predictions ./benchmarks/bird/output/gpt-4o_20250101_120000/predictions.json

Provider-Specific Configurations¶

Vertex AI (Gemini):

poetry run python -m benchmarks.bird.run_benchmark \
  --provider vertex_ai \
  --model gemini-2.5-pro \
  --project-id your-gcp-project-id \
  --credentials-path /path/to/service-account.json

Use gcloud CLI authentication instead of a service account:

poetry run python -m benchmarks.bird.run_benchmark \
  --provider vertex_ai \
  --model gemini-2.5-pro \
  --project-id your-gcp-project-id \
  --gcloud-cli-auth \
  --location us-central1

Azure OpenAI:

poetry run python -m benchmarks.bird.run_benchmark \
  --provider azure_openai \
  --model gpt-4o \
  --api-base https://your-endpoint.openai.azure.com/ \
  --deployment-name gpt-4o \
  --api-version 2024-08-01-preview

Corporate Proxies (Zscaler):

poetry run python -m benchmarks.bird.run_benchmark \
  --provider openai \
  --model gpt-4o \
  --ca-bundle credentials/zscaler-ca-bundle.pem

CLI Reference¶

Flag	Default	Description
`--provider`	`openai`	LLM provider: `openai`, `azure_openai`, `vertex_ai`, `bedrock`
`--model`	`gpt-4o`	Model name
`--api-key`	From env	API key (overrides environment variable)
`--api-base`	—	OpenAI base URL or Azure endpoint
`--api-version`	—	Azure API version
`--deployment-name`	—	Azure deployment name
`--project-id`	—	GCP project ID (Vertex AI)
`--credentials-path`	—	Service account JSON path (Vertex AI)
`--location`	`us-central1`	GCP region (Vertex AI)
`--gcloud-cli-auth`	`false`	Use gcloud CLI auth (Vertex AI)
`--ca-bundle`	—	CA bundle PEM path for corporate proxies
`--limit`	—	Max questions to process
`--db-filter`	—	Only process questions for one database
`--stratified`	`false`	Use stratified sampling across databases
`--stratified-size`	`100`	Number of questions in stratified sample
`--stratified-seed`	`42`	Random seed for stratified sampling
`--no-evidence`	`false`	Omit external knowledge from prompts
`--max-retries`	`2`	SQL generation retries per question
`--timeout`	`120`	Per-question timeout (seconds)
`--resume`	`false`	Resume from checkpoint in latest matching output dir
`--evaluate-only`	`false`	Skip generation; evaluate existing predictions
`--predictions`	—	Path to predictions JSON (for `--evaluate-only`)
`--eval-timeout`	`30`	SQL execution timeout during evaluation (seconds)
`--eval-workers`	`4`	Parallel evaluation workers
`--data-dir`	`./benchmarks/bird/data`	BIRD data directory
`--output-dir`	`./benchmarks/bird/output`	Output parent directory
`--local-data`	`false`	Skip HuggingFace download; use local data only
`-v` / `--verbose`	`false`	Debug logging

Results and Metrics¶

Results are saved to ./benchmarks/bird/output/<model>_<timestamp>/:

predictions.json: Generated SQL in BIRD's official format (submittable to the leaderboard)
detailed_results.json: Per-question details including latency, generated SQL, and errors
evaluation_report.json: Final EX and Soft F1 scores
benchmark_results.xlsx: Spreadsheet comparing baseline scores vs Ask RITA

Evaluation Metrics¶

EX (Execution Accuracy) — Primary Metric: A prediction scores 1 if executing the predicted SQL returns the exact same result set as the gold SQL (order-independent).
Soft F1 Score: A more lenient metric that gives partial credit by comparing result sets row-by-row with cell-level matching, tolerating column reordering.

BIRD Mini-Interact¶

What is BIRD-Interact?¶

BIRD Mini-Interact is a conversational text-to-SQL benchmark. Instead of single-turn question-SQL pairs, each task starts with an ambiguous query that requires clarification. The system must:

Identify the ambiguity in the user's question
Ask clarifying questions
Receive responses from a user simulator (LLM-powered)
Generate the correct SQL after sufficient clarification

The benchmark evaluates both the system's ability to ask good clarifying questions and to produce correct SQL from multi-turn conversations.

300 tasks across SQLite databases
Multiple ambiguity types per task (user ambiguity + knowledge ambiguity)
User simulator powered by an LLM (default: GPT-4o) that provides natural responses based on ground-truth SQL

Interact Setup¶

1. Install Dependencies¶

poetry install --with dev

2. Data Download¶

The dataset is cloned automatically from HuggingFace on first run:

https://huggingface.co/datasets/birdsql/mini-interact

This requires git and git-lfs to be installed. The data is placed in benchmarks/bird_interact/data/mini-interact/.

3. Ground Truth (Optional)¶

The official ground truth SQL is not publicly available. To obtain it:

Email bird.bench25@gmail.com with subject [mini-interact GT&Test Cases]
Once received, merge it into the tasks file:

poetry run python -m benchmarks.bird_interact \
  --provider openai --model gpt-4o \
  --gt-path /path/to/ground_truth.jsonl

Without ground truth, the benchmark runs generation but skips evaluation.

Running the Interact Benchmark¶

Basic Usage¶

poetry run python -m benchmarks.bird_interact --provider openai --model gpt-4o

With a Different User Simulator Model¶

The system LLM and user simulator LLM can be different models:

poetry run python -m benchmarks.bird_interact \
  --provider openai --model gpt-4o \
  --user-sim-model gpt-4o-mini

Quick Test¶

poetry run python -m benchmarks.bird_interact \
  --provider openai --model gpt-4o \
  --limit 10 --db-filter financial

Resuming¶

poetry run python -m benchmarks.bird_interact \
  --provider openai --model gpt-4o --resume

Evaluate-Only¶

Re-evaluate existing results:

poetry run python -m benchmarks.bird_interact \
  --evaluate-only \
  --results ./benchmarks/bird_interact/output/gpt-4o_interact_20250101/detailed_results.jsonl

Interact CLI Reference¶

Flag	Default	Description
`--provider`	`openai`	LLM provider for the Ask RITA system
`--model`	`gpt-4o`	System model name
`--user-sim-model`	`gpt-4o`	Model for the user simulator
`--api-key`	From env	API key override
`--api-base`	—	OpenAI base URL or Azure endpoint
`--api-version`	—	Azure API version
`--deployment-name`	—	Azure deployment name
`--ca-bundle`	—	CA bundle PEM path
`--limit`	—	Max tasks to process
`--db-filter`	—	Only tasks for one database
`--patience`	`3`	Extra turns beyond task ambiguity count
`--max-retries`	`2`	SQL generation retries
`--timeout`	`300`	Per-task timeout (seconds)
`--resume`	`false`	Resume from checkpoint
`--evaluate-only`	`false`	Evaluate existing results only
`--results`	—	JSONL path (for `--evaluate-only`)
`--gt-path`	—	Ground truth JSONL to merge into tasks
`--eval-timeout`	`30`	SQL execution timeout during evaluation
`--data-dir`	`./benchmarks/bird_interact/data`	Data directory
`--output-dir`	`./benchmarks/bird_interact/output`	Output parent directory
`-v` / `--verbose`	`false`	Debug logging

User Simulator¶

The user simulator is an LLM-based agent that plays the role of the user in the conversation. It:

Encodes the system's clarifying question — classifies each question as addressing a labeled ambiguity, an unlabeled ambiguity, or as unanswerable
Decodes a natural-language response — uses the ground-truth SQL segments, schema, and ambiguity definitions to generate a helpful but natural user reply

The simulator uses the OpenAI client directly (not through Ask RITA) and can be configured with --user-sim-model and --ca-bundle.

When no ground-truth SQL is available for a task, the simulator returns a generic "use your best judgment" response.

Interact Results and Metrics¶

Results are saved to ./benchmarks/bird_interact/output/<model>_interact_<timestamp>/:

predictions.json: Generated SQL for each task
detailed_results.jsonl: Per-task details including conversation history
evaluation_report.json: Reward scores and pass rates
run_summary.json: Run configuration and timing
benchmark_results.xlsx: Spreadsheet with Summary, Per-Task Results, and Conversations sheets

Evaluation Metrics¶

Reward Score (Primary): 1.0 if all test cases pass (or execution accuracy matches gold SQL); 0.0 otherwise.
Test Pass Rate: Percentage of test cases passed per task (when test cases are available).
Per-Database Reward: Average reward broken down by database.

Test cases can include:

Result match: Execute a test SQL and compare result sets with the prediction
Value check: Verify that expected values appear in the prediction's result set

Automated Multi-Model Benchmarking¶

The run_all_benchmarks.sh script in the benchmarks/bird/ directory automates the process of running the BIRD Mini-Dev benchmark across all supported models (GPT-5.4, GPT-5.4 Mini, GPT-5.4 Nano, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite) and updates the results chart in README.md.

cd benchmarks/bird
./run_all_benchmarks.sh

(Note: Running the full suite across 6 models takes approximately 15-18 hours. Ensure you are on a network that does not block the respective LLM APIs.)

See also:

Configuration Guide — Complete YAML configuration reference
Supported Platforms — LLM provider setup
Conversational SQL — Multi-turn chat mode in Ask RITA