Long-Text Scorers#
Long-form uncertainty quantification implements a three-stage pipeline after response generation:
Response Decomposition: The response \(y\) is decomposed into units (claims or sentences), where each unit is denoted \(s\).
Unit-Level Confidence Scoring: A unit-level scoring function assigns each unit a confidence score in \([0, 1]\), where higher scores indicate a greater likelihood of factual correctness. Units with scores below a threshold \(\tau\) are flagged as potential hallucinations.
Response-Level Aggregation: Unit scores are combined to provide an overall response confidence; a minimal sketch of the full pipeline follows below.
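To make the three stages concrete, here is a minimal, library-agnostic sketch in Python. The helper names (`split_into_sentences`, `score_unit`) are illustrative placeholders rather than UQLM's actual API; in practice the unit-level score would come from one of the scoring methods described below.

```python
from typing import Callable, List, Tuple


def split_into_sentences(response: str) -> List[str]:
    """Stage 1 (illustrative): decompose the response y into units s.

    A naive sentence split; a real implementation might use an NLP
    sentence segmenter or an LLM-based claim extractor.
    """
    return [s.strip() for s in response.split(".") if s.strip()]


def score_long_text(
    response: str,
    score_unit: Callable[[str], float],  # unit-level scorer returning values in [0, 1]
    tau: float = 0.5,                    # units scoring below tau are flagged
) -> Tuple[float, List[Tuple[str, float, bool]]]:
    """Stages 2-3 (illustrative): score each unit, flag low-confidence
    units, and aggregate unit scores into a response-level confidence."""
    units = split_into_sentences(response)            # Stage 1: decomposition
    scored = [(s, score_unit(s)) for s in units]      # Stage 2: unit-level confidence
    flagged = [(s, c, c < tau) for s, c in scored]    # flag potential hallucinations
    response_confidence = (                           # Stage 3: aggregation (mean here)
        sum(c for _, c in scored) / len(scored) if scored else 0.0
    )
    return response_confidence, flagged
```

The aggregation shown is a simple mean; other choices, such as taking the minimum unit score, weight the least-confident unit more heavily.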
Key Characteristics:
Universal Compatibility: Works with any LLM without requiring token probability access
Fine-Grained Scoring: Scores at the sentence or claim level to localize likely hallucinations
Uncertainty-Aware Decoding: Improves factual precision by dropping high-uncertainty claims (see the sketch after this list)
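As a hedged illustration of uncertainty-aware decoding, the snippet below drops flagged units and reassembles the remaining text. The `scored_units` input is assumed to be per-unit output like that produced by the pipeline sketch above; the example scores are hypothetical.

```python
from typing import List, Tuple


def drop_uncertain_units(
    scored_units: List[Tuple[str, float]],  # (unit text, confidence in [0, 1]) pairs
    tau: float = 0.5,                        # units scoring below tau are removed
) -> str:
    """Illustrative uncertainty-aware decoding: keep only units whose
    confidence meets the threshold, then reassemble the response."""
    kept = [unit for unit, confidence in scored_units if confidence >= tau]
    return " ".join(kept)


# Example usage with hypothetical unit scores
scored = [("The Eiffel Tower is in Paris.", 0.96), ("It was built in 1820.", 0.12)]
print(drop_uncertain_units(scored, tau=0.5))  # -> "The Eiffel Tower is in Paris."
```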
Trade-offs:
Higher Cost: Requires multiple generations per prompt
Higher Latency: Multiple generations and comparison calculations increase latency
Long-Text Scoring Methods#
There are three main categories of long-text scoring methods offered by UQLM: