Database Configuration¶

Configure Ask RITA to connect to PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, or MongoDB.

⚡ New in v0.2.0: All database configurations now support intelligent schema caching with cache_schema: true and schema_refresh_interval for up to 3x faster query performance.

🧠 New in v0.2.1: BigQuery configurations now support hybrid schema descriptions with schema_descriptions for automatic metadata extraction combined with manual enhancements, dramatically improving SQL generation quality.

PostgreSQL Configuration¶

Requirements: - ✅ Mandatory: Connection string with all credentials - ✅ Recommended: Use environment variables for passwords

database:
  # Connection string with environment variables (recommended)
  connection_string: "postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:5432/${DB_NAME}"

  # Optional: Performance settings
  query_timeout: 30                          # Query timeout in seconds
  max_results: 1000                          # Maximum rows to return

  # Schema Caching (New in v0.2.0) - Improves performance by 3x
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 3600              # Cache expiry in seconds (1 hour recommended for production)

Connection string variants:

# Basic
postgresql://${DB_USER}:${DB_PASSWORD}@host:5432/database

# With SSL and timeout
postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:5432/db?sslmode=require&connect_timeout=10

Environment Variables:

export DB_PASSWORD="your-secure-password"
export DB_HOST="your-database-host"
export DB_USER="your-username"
export DB_NAME="your-database-name"

# Use in config:
# connection_string: "postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:5432/${DB_NAME}"

MySQL Configuration¶

database:
  connection_string: "mysql+pymysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/${DB_NAME}"

  # Optional: Performance settings
  query_timeout: 30
  max_results: 1000

  # Schema Caching (New in v0.2.0) - Improves performance by 3x
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 3600              # Cache expiry in seconds (1 hour recommended for production)

Connection string variants:

# Basic
mysql+pymysql://${DB_USER}:${DB_PASSWORD}@host:3306/database

# With SSL and charset
mysql+pymysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/db?charset=utf8mb4&ssl_disabled=false

# MariaDB (same syntax)
mysql+pymysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/database

SQLite Configuration¶

database:
  # File-based SQLite
  connection_string: "sqlite:///./path/to/database.db"

  # Absolute path
  connection_string: "sqlite:////absolute/path/to/database.db"

  # In-memory database (for testing)
  connection_string: "sqlite:///:memory:"

  # Optional: Performance settings
  query_timeout: 30                          # Less relevant for SQLite
  max_results: 1000

  # Schema Caching (New in v0.2.0) - Still beneficial for SQLite
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 1800              # 30 minutes recommended for development

BigQuery Configuration¶

Requirements: - ✅ Mandatory: Project and dataset in connection string - ✅ Choose one: Service account credentials OR gcloud CLI auth

Option 1: Service Account Authentication

database:
  connection_string: "bigquery://your-project-id/your-dataset-id"

  # REQUIRED: Service account credentials
  bigquery_credentials_path: "/path/to/service-account.json"
  bigquery_project_id: "your-project-id"    # Optional: Override project ID

  # Optional: Performance settings
  query_timeout: 60                          # BigQuery queries can be slow
  max_results: 10000                         # BigQuery can handle larger results

  # Schema Caching (v0.2.0) - Highly recommended for BigQuery
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 7200              # 2 hours recommended for enterprise data warehouses

  # Cross-Project Dataset Access (New in v0.2.1)
  cross_project_access:
    enabled: true                            # Enable cross-project functionality
    datasets:
      - "project-a.dataset-name"             # Primary cross-project dataset
      - "project-b.analytics"                # Additional datasets
    include_tables: []                       # Empty = include all tables
    exclude_tables: ["temp_*", "staging_*"] # Patterns to exclude
    cache_metadata: true                     # Cache cross-project metadata
    metadata_refresh_interval: 7200         # Cache expiry in seconds (2 hours)

  # Schema Descriptions Configuration (New in v0.2.1)
  # Hybrid system: automatic extraction + manual enhancements
  schema_descriptions:
    project_context: "Enterprise data warehouse for analytics and reporting"
    automatic_extraction:
      enabled: true                          # Extract descriptions from BigQuery INFORMATION_SCHEMA
      fallback_to_column_name: true         # Generate descriptions from column names if no metadata
      include_data_types: true              # Include data types in descriptions
      extract_comments: true                # Extract existing column/table comments
    columns:
      customer_id:
        description: "Unique customer identifier"
        mode: "supplement"                   # supplement | override | fallback | auto_only
        business_context: "Primary key for customer analytics"
    tables:
      customers:
        description: "Master customer data with demographics"
        business_purpose: "Customer segmentation and personalization"
    business_terms:
      churn: "Customer who hasn't purchased in 90+ days"
      ltv: "Customer Lifetime Value calculation"

Option 2: gcloud CLI Authentication (Recommended for Development)

database:
  connection_string: "bigquery://your-project-id/your-dataset-id"

  # REQUIRED: Use gcloud CLI authentication
  bigquery_gcloud_cli_auth: true

  # Optional: Performance settings
  query_timeout: 60
  max_results: 10000

  # Schema Caching (v0.2.0) - Highly recommended for BigQuery
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 1800              # 30 minutes recommended for development

  # Cross-Project Dataset Access (New in v0.2.1)
  cross_project_access:
    enabled: false                           # Usually disabled for development
    datasets: []                             # Cross-project datasets to access
    include_tables: []                       # Table patterns to include
    exclude_tables: ["temp_*"]               # Table patterns to exclude
    cache_metadata: true                     # Cache metadata for performance
    metadata_refresh_interval: 3600         # Cache expiry: 1 hour for dev

  # Schema Descriptions Configuration (New in v0.2.1)
  schema_descriptions:
    project_context: "Development environment for testing and analytics"
    automatic_extraction:
      enabled: true                          # Enable automatic extraction
      fallback_to_column_name: true         # Generate from column names
      include_data_types: true              # Include data types
      extract_comments: true                # Extract existing comments
    columns:
      customer_id:
        description: "Unique customer identifier"
        mode: "supplement"
        business_context: "Primary key for analytics"
    business_terms:
      churn: "Customer attrition rate"
      retention: "Customer retention metrics"

Environment Variables:

# Option 1: Service Account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 2: gcloud CLI (run this first)
gcloud auth login
gcloud config set project your-project-id

Snowflake Configuration¶

Requirements: - ✅ Mandatory: Account, warehouse, schema, and role in connection string - ✅ Mandatory: Username and password

database:
  # Basic Snowflake connection
  connection_string: "snowflake://${SNOWFLAKE_USER}:${SNOWFLAKE_PASSWORD}@account/database?warehouse=warehouse&schema=schema&role=role"

  # With additional parameters
  connection_string: "snowflake://${SNOWFLAKE_USER}:${SNOWFLAKE_PASSWORD}@${SNOWFLAKE_ACCOUNT}/db?warehouse=WH&schema=PUBLIC&role=ROLE&authenticator=snowflake"

  # With environment variables
  connection_string: "snowflake://${SF_USER}:${SF_PASSWORD}@${SF_ACCOUNT}/${SF_DATABASE}?warehouse=${SF_WAREHOUSE}&schema=${SF_SCHEMA}&role=${SF_ROLE}"

  # Optional: Performance settings
  query_timeout: 120                         # Snowflake queries can be slow
  max_results: 10000                         # Snowflake can handle large results

  # Schema Caching (New in v0.2.0) - Highly recommended for Snowflake
  cache_schema: true                         # Enable schema caching (default: false)
  schema_refresh_interval: 3600              # 1 hour recommended for production data warehouses

Environment Variables:

export SF_USER="your-snowflake-username"
export SF_PASSWORD="your-snowflake-password"
export SF_ACCOUNT="your-account-identifier"
export SF_DATABASE="your-database"
export SF_WAREHOUSE="your-warehouse"
export SF_SCHEMA="your-schema"
export SF_ROLE="your-role"

MongoDB Configuration (New in v0.12.0)¶

Requirements: - ✅ Mandatory: Connection string with database name - ✅ Mandatory: Use NoSQLAgentWorkflow instead of SQLAgentWorkflow

📖 Full Guide: See NoSQL Workflow Guide for comprehensive MongoDB documentation.

database:
  # Local MongoDB
  connection_string: "mongodb://${MONGO_USER}:${MONGO_PASSWORD}@host:27017/database"

  # MongoDB Atlas (cloud)
  connection_string: "mongodb+srv://${MONGO_USER}:${MONGO_PASSWORD}@cluster.mongodb.net/database"

  # With environment variables (recommended)
  connection_string: "mongodb://${MONGO_USER}:${MONGO_PASSWORD}@${MONGO_HOST}:27017/${MONGO_DB}"

  # Performance settings
  query_timeout: 30
  max_results: 1000

  # Schema Caching - Recommended for production
  cache_schema: true
  schema_refresh_interval: 3600              # 1 hour

Environment Variables:

export MONGO_USER="your-username"
export MONGO_PASSWORD="your-password"
export MONGO_HOST="your-host"
export MONGO_DB="your-database"

Usage:

from askrita import NoSQLAgentWorkflow, ConfigManager

config = ConfigManager("example-configs/query-mongodb.yaml")
workflow = NoSQLAgentWorkflow(config)
result = workflow.query("How many orders per month?")
print(result.answer)

Key Differences from SQL Workflow: - Uses NoSQLAgentWorkflow instead of SQLAgentWorkflow - LLM generates MongoDB aggregation pipelines (db.collection.aggregate([...])) instead of SQL - Schema is inferred from document sampling via langchain-mongodb - Safety validation blocks MongoDB write operations ($out, $merge, deleteMany, etc.) - Workflow step names remain the same for configuration compatibility (generate_sql maps to MongoDB query generation)