Schema Enrichment and Caching¶

Ask RITA can enrich your database schema with descriptions, business context, and cross-project metadata to help the LLM generate more accurate SQL. Schema caching reduces latency by avoiding repeated schema introspection.

Table of Contents¶

Overview
Schema Caching
Schema Descriptions
Schema Decorators
Cross-Project Access (BigQuery)
Custom SQL Syntax
Configuration Reference
Troubleshooting

Overview¶

The schema enrichment pipeline processes the raw database schema through a chain of decorators:

flowchart TD
    A["Raw Schema (from DatabaseManager)"] --> B[SchemaFormattingDecorator]
    B -->|"Normalize CREATE TABLE layout"| C[CrossProjectSchemaDecorator]
    C -->|"Add cross-project tables (BigQuery)"| D[HybridDescriptionDecorator]
    D -->|"Merge auto-extracted + manual descriptions"| E[SchemaMetadataDecorator]
    E -->|"Add header with DB type, cache info, warnings"| F["Enriched Schema → LLM"]
    style A fill:#2F5496,color:#fff
    style B fill:#4CAF50,color:#fff
    style C fill:#4CAF50,color:#fff
    style D fill:#4CAF50,color:#fff
    style E fill:#4CAF50,color:#fff
    style F fill:#0288D1,color:#fff

Each decorator is optional and controlled by configuration.

Schema Caching¶

Schema caching stores the enriched schema in memory to avoid expensive database introspection on every query.

Configuration¶

database:
  cache_schema: true              # Enable caching (default: true)
  schema_refresh_interval: 3600   # Cache TTL in seconds (default: 3600)

Behavior¶

First query: Schema is fetched from the database and cached
Subsequent queries: Cached schema is used if within the TTL
After TTL expires: Schema is re-fetched on the next query
Manual control: Use preload_schema(), clear_schema_cache(), and get_cache_status()

API¶

from askrita import SQLAgentWorkflow, ConfigManager

config = ConfigManager("config.yaml")
workflow = SQLAgentWorkflow(config)

# Preload schema into cache (runs at init by default)
workflow.preload_schema()

# Check cache status
status = workflow.get_cache_status()
print(status)
# {
#   "enabled": True,
#   "cached": True,
#   "age_seconds": 42.5,
#   "remaining_seconds": 3557.5,
#   "refresh_interval": 3600,
#   "valid": True
# }

# Force refresh
workflow.clear_schema_cache()

ConfigManager Cache Methods¶

For direct cache control:

config = ConfigManager("config.yaml")

config.should_cache_schema()       # True if cache is valid
config.get_schema_cache()          # Returns cached schema or None
config.set_schema_cache(schema)    # Set cache manually
config.clear_schema_cache()        # Clear cache
config.get_schema_cache_info()     # Detailed cache status dict

Schema Descriptions¶

Add human-readable descriptions to tables and columns to guide the LLM toward better SQL generation.

Configuration¶

database:
  schema_descriptions:
    project_context: |
      This is an e-commerce database with customer orders, products, and inventory.
      Fiscal year starts April 1.

    automatic_extraction:
      enabled: true                    # Extract descriptions from DB metadata
      fallback_to_column_name: true    # Use column name as fallback description
      include_data_types: true         # Include data type info
      extract_comments: true           # Extract column/table comments

    tables:
      orders:
        description: "Customer purchase orders"
        business_purpose: "Tracks all customer transactions"
        override_mode: "supplement"    # "supplement", "override", or "auto_only"
      products:
        description: "Product catalog"
        business_purpose: "Master product data with pricing"

    columns:
      customer_id:
        description: "Unique customer identifier (UUID format)"
        mode: "supplement"             # "supplement", "override", "fallback", or "auto_only"
        business_context: "Links to the CRM system customer record"
      created_at:
        description: "Record creation timestamp in UTC"
        mode: "fallback"
      revenue:
        description: "Net revenue after discounts and returns"
        mode: "override"
        business_context: "Use this for all revenue calculations, not gross_amount"

    business_terms:
      "active customer": "A customer with at least one order in the last 90 days"
      "churn": "A customer with no orders in the last 180 days"
      "AOV": "Average Order Value = total revenue / number of orders"
      "LTV": "Lifetime Value = total revenue from a customer since first purchase"

Description Modes¶

Modes control how manual descriptions interact with automatically extracted descriptions:

Mode	Behavior
`supplement`	Manual description is appended to the auto-extracted description
`override`	Manual description replaces the auto-extracted description entirely
`fallback`	Manual description is used only if no auto-extracted description exists
`auto_only`	Only auto-extracted descriptions are used; manual descriptions are ignored

Automatic Extraction¶

When automatic_extraction.enabled is true, Ask RITA extracts descriptions from database metadata:

BigQuery: Column descriptions from INFORMATION_SCHEMA.COLUMN_FIELD_PATHS (with fallback to direct query)
PostgreSQL: Column comments from pg_catalog
MySQL: Column comments from information_schema

The HybridDescriptionDecorator then merges auto-extracted and manual descriptions according to the configured mode for each column.

Project Context¶

The project_context string is prepended to the schema, giving the LLM high-level understanding of the database:

database:
  schema_descriptions:
    project_context: |
      Healthcare claims database for analytics.
      All dates are in Eastern Time.
      Member IDs are always 10-digit numbers.
      Use ICD-10 codes for diagnosis lookups.

Business Terms Glossary¶

The business_terms dictionary is appended to the schema as a glossary. This helps the LLM understand domain-specific terminology:

database:
  schema_descriptions:
    business_terms:
      "DAU": "Daily Active Users"
      "MAU": "Monthly Active Users"
      "retention rate": "Percentage of users who return within 30 days"

Schema Decorators¶

Schema decorators follow the decorator pattern to build enriched schemas. Each decorator wraps a SchemaProvider and enhances the schema string.

Available Decorators¶

Decorator	Purpose
`SchemaFormattingDecorator`	Normalizes `CREATE TABLE` layout and indentation
`CrossProjectSchemaDecorator`	Adds BigQuery cross-project table metadata
`HybridDescriptionDecorator`	Merges auto-extracted and manual column/table descriptions
`SchemaMetadataDecorator`	Prepends a metadata header (DB type, cache status, warnings)

Supporting Classes¶

Class	Purpose
`AutoDescriptionExtractor`	Static helpers for extracting descriptions from BigQuery, PostgreSQL, MySQL
`DescriptionMerger`	Merges auto + manual descriptions based on column `mode`
`BaseSchemaProvider`	Returns the raw schema string (base of the chain)
`SchemaDecoratorBuilder`	Fluent builder to construct the decorator chain

SchemaDecoratorBuilder¶

The builder provides a fluent API for constructing the decorator chain:

from askrita.sqlagent.database.schema_decorators import SchemaDecoratorBuilder

provider = (
    SchemaDecoratorBuilder(base_schema)
    .with_formatting()                    # Normalize layout
    .with_cross_project_enhancement()     # Add cross-project tables
    .with_hybrid_descriptions()           # Merge descriptions
    .with_metadata()                      # Add header
    .build()
)

enriched_schema = provider.get_schema(config)

Decorators are stacked in the order you add them, with the last one being the outermost wrapper.

Custom Decorators¶

To create a custom decorator, extend SchemaDecorator:

from askrita.sqlagent.database.schema_decorators import SchemaDecorator

class MyCustomDecorator(SchemaDecorator):
    def enhance_schema(self, schema: str, config) -> str:
        return f"-- Custom header\n{schema}"

Cross-Project Access (BigQuery)¶

For BigQuery, Ask RITA can query tables across multiple GCP projects and datasets.

Configuration¶

database:
  connection_string: "bigquery://my-project/my-dataset"
  bigquery_project_id: "my-project"

  cross_project_access:
    enabled: true
    datasets:
      - project_id: "other-project"
        dataset_id: "shared_data"
        include_tables: ["customers", "orders"]  # Optional filter
      - project_id: "analytics-project"
        dataset_id: "reports"
        exclude_tables: ["temp_*"]               # Optional exclusion pattern

    cache_metadata: true
    metadata_refresh_interval: 7200  # Seconds (default: 7200)

How It Works¶

The CrossProjectSchemaDecorator queries INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS for each configured dataset
Table and column metadata is appended to the schema
The SchemaMetadataDecorator adds a cross-project summary to the header
Table names in generated SQL are fully qualified: project.dataset.table

Filtering¶

include_tables: Only include listed tables (exact match)
exclude_tables: Exclude tables matching patterns (supports * wildcards)
If neither is specified, all tables in the dataset are included

Metadata Caching¶

Cross-project metadata is cached separately from the main schema cache:

database:
  cross_project_access:
    cache_metadata: true
    metadata_refresh_interval: 7200  # 2 hours

Validation¶

BigQuery cross-project access uses a chain-of-responsibility pattern (ValidationContext / ValidationStep) to validate connections and permissions before querying metadata.

Custom SQL Syntax¶

For databases with non-standard SQL syntax, you can provide hints to the LLM:

Configuration¶

database:
  sql_syntax:
    dialect_hints:
      - "Use LIMIT instead of TOP for row limiting"
      - "Date functions use DATE_TRUNC('month', column) syntax"
      - "String concatenation uses || operator, not CONCAT()"
    date_format: "YYYY-MM-DD"
    identifier_quoting: "double_quotes"  # or "backticks" or "brackets"
    catalog_separator: "."

The SQLSyntaxConfig passes these hints into the SQL generation prompt, helping the LLM produce valid SQL for your specific database dialect.

Configuration Reference¶

Complete Schema Enrichment Configuration¶

database:
  connection_string: "bigquery://my-project/my-dataset"
  bigquery_project_id: "my-project"

  # Caching
  cache_schema: true
  schema_refresh_interval: 3600

  # Cross-project access (BigQuery only)
  cross_project_access:
    enabled: true
    datasets:
      - project_id: "other-project"
        dataset_id: "shared_data"
    cache_metadata: true
    metadata_refresh_interval: 7200

  # Descriptions
  schema_descriptions:
    project_context: "E-commerce analytics database"
    automatic_extraction:
      enabled: true
      fallback_to_column_name: true
      include_data_types: true
      extract_comments: true
    tables:
      orders:
        description: "Purchase orders"
        business_purpose: "Transaction tracking"
        override_mode: "supplement"
    columns:
      customer_id:
        description: "Unique customer identifier"
        mode: "supplement"
        business_context: "Maps to CRM customer_id"
    business_terms:
      "AOV": "Average Order Value"
      "active customer": "Ordered in the last 90 days"

  # Custom SQL syntax hints
  sql_syntax:
    dialect_hints:
      - "Use LIMIT for row limiting"
    date_format: "YYYY-MM-DD"

Troubleshooting¶

Schema Not Being Cached¶

Symptom: Every query triggers a schema fetch.

Check database.cache_schema: true in your config
Check schema_refresh_interval — if set to 0, cache is always expired
Verify with workflow.get_cache_status()

Descriptions Not Appearing in Schema¶

Symptom: Manual descriptions are not added to the schema.

Ensure schema_descriptions is nested under database in the YAML
Check that automatic_extraction.enabled is true for auto descriptions
Verify column names in columns match the actual database column names
Check the mode — auto_only ignores manual descriptions

Cross-Project Tables Not Found¶

Symptom: Cross-project tables are not in the schema.

Ensure cross_project_access.enabled: true
Check that the service account has bigquery.tables.list and bigquery.tables.get permissions on the target project
Verify project_id and dataset_id are correct
Check include_tables / exclude_tables filters

Schema Too Large for LLM Context¶

Symptom: Token limit errors during SQL generation.

Use cross_project_access filters (include_tables, exclude_tables) to reduce schema size
Focus schema_descriptions on high-value tables only
The framework automatically manages token budgets, but very large schemas may still exceed limits

Stale Schema After DDL Changes¶

Symptom: New tables or columns are not recognized.

workflow.clear_schema_cache()
result = workflow.query("Show me the new_table data")

Or wait for the schema_refresh_interval to expire.

See also:

Configuration Guide — Complete YAML configuration reference
Supported Platforms — Database connection strings
Security — SQL safety and input validation
Example config: example-configs/schema-descriptions-simple.yaml