Long-Text Scorers

Long-form uncertainty quantification implements a three-stage pipeline after response generation:

  1. Response Decomposition: The response \(y\) is decomposed into units (claims or sentences), where a unit is denoted as \(s\).

  2. Unit-Level Confidence Scoring: Each unit \(s\) is assigned a confidence score by a unit-level scoring function with values in \([0, 1]\). Higher scores indicate greater likelihood of factual correctness. Units with scores below a threshold \(\tau\) are flagged as potential hallucinations.

  3. Response-Level Aggregation: Unit scores are combined to provide an overall response confidence.
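The three stages above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not UQLM's implementation: the names `decompose`, `score_response`, and `toy_unit_scorer` are hypothetical, the sentence splitter is a naive regex, the unit scorer is a hard-coded lookup standing in for a real scoring function, and the aggregation is assumed to be a simple mean.

```python
import re
from statistics import mean
from typing import Callable, List, Tuple


def decompose(response: str) -> List[str]:
    """Stage 1: split a response y into sentence-level units s (naive regex splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def score_response(
    response: str,
    unit_scorer: Callable[[str], float],
    tau: float = 0.5,
) -> Tuple[List[Tuple[str, float, bool]], float]:
    """Run all three stages and return (per-unit results, response-level score).

    Each per-unit result is (unit, confidence, flagged), where flagged is True
    when the confidence falls below the threshold tau.
    """
    scored = []
    for unit in decompose(response):
        confidence = unit_scorer(unit)  # Stage 2: unit-level score in [0, 1]
        scored.append((unit, confidence, confidence < tau))
    # Stage 3: aggregate unit scores (assumption: simple mean; 0.0 if no units)
    overall = mean(c for _, c, _ in scored) if scored else 0.0
    return scored, overall


# Hypothetical stand-in for a real unit-level scorer: a fixed lookup table.
_TOY_SCORES = {
    "Paris is the capital of France.": 0.9,
    "The Eiffel Tower was built in 1789.": 0.2,
}


def toy_unit_scorer(unit: str) -> float:
    return _TOY_SCORES.get(unit, 0.5)


if __name__ == "__main__":
    text = "Paris is the capital of France. The Eiffel Tower was built in 1789."
    units, overall = score_response(text, toy_unit_scorer, tau=0.5)
    for unit, confidence, flagged in units:
        print(f"{confidence:.2f} flagged={flagged} {unit}")
    print(f"overall={overall:.2f}")
```

In this toy run, the second sentence scores below \(\tau = 0.5\) and is flagged, while the first is not. In practice the unit scorer would be replaced by one of the scoring methods described in the rest of this page.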

Key Characteristics:

  • Universal Compatibility: Works with any LLM without requiring token probability access

  • Fine-Grained Scoring: Scores at the sentence or claim level to localize likely hallucinations

  • Uncertainty-Aware Decoding: Improves factual precision by dropping high-uncertainty claims
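As a rough illustration of uncertainty-aware decoding, the sketch below drops claims whose confidence falls below a threshold and joins the remainder. The function name `drop_uncertain_claims` is hypothetical, and rebuilding the response by simple concatenation is an assumption made for brevity; a real system may reconstruct the response differently.

```python
from typing import List, Tuple


def drop_uncertain_claims(
    scored_claims: List[Tuple[str, float]],
    tau: float = 0.5,
) -> str:
    """Keep only claims with confidence >= tau and join them in original order."""
    kept = [claim for claim, confidence in scored_claims if confidence >= tau]
    return " ".join(kept)


if __name__ == "__main__":
    claims = [
        ("Paris is the capital of France.", 0.9),
        ("The Eiffel Tower was built in 1789.", 0.2),
    ]
    print(drop_uncertain_claims(claims, tau=0.5))
```

Raising `tau` trades recall for precision: fewer claims survive, but those that do carry higher confidence.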

Trade-offs:

  • Higher Cost: Requires multiple generations per prompt

  • Higher Latency: Multiple generations and comparison calculations increase latency

Long-Text Scoring Methods

UQLM offers three main categories of long-text scoring methods: