Black-Box Scorers

Black-box Uncertainty Quantification (UQ) methods treat the LLM as a black box and estimate response-level confidence by measuring the consistency of multiple responses generated from the same prompt. These scorers are compatible with any LLM and don’t require access to internal model states or token probabilities.

Key Characteristics:

  • Universal Compatibility: Works with any LLM

  • Intuitive: Easy to understand and implement

  • No Internal Access Required: Doesn’t need token probabilities or model internals

Trade-offs:

  • Higher Cost: Requires multiple generations per prompt

  • Slower: Multiple generations and comparison calculations increase latency

Notation:

For a given prompt \(x_i\), these approaches generate \(m\) responses \(\tilde{\mathbf{y}}_i = \{ \tilde{y}_{i1}, \ldots, \tilde{y}_{im}\}\) from the same prompt using a non-zero temperature, then compare these responses to the original response \(y_{i}\).
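To make the notation concrete, here is a minimal sketch of one way a consistency score could be computed from an original response \(y_i\) and its \(m\) sampled responses. It is not this library's API: the names `jaccard_similarity` and `consistency_score` are hypothetical, and token-level Jaccard overlap is an assumption standing in for whichever comparison a given scorer actually uses.

```python
from typing import Callable, Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap in [0, 1]; an illustrative similarity choice."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(
    original: str,
    sampled: Sequence[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """Average similarity between the original response y_i and the m sampled responses."""
    if not sampled:
        raise ValueError("at least one sampled response is required")
    return sum(similarity(original, s) for s in sampled) / len(sampled)


# Hypothetical data: one original response and m = 3 sampled responses.
original_response = "Paris"
sampled_responses = ["Paris", "paris", "Lyon"]
score = consistency_score(original_response, sampled_responses)
print(f"confidence estimate: {score:.3f}")
```

In this sketch a score near 1 means the sampled responses agree with the original, and a score near 0 means they diverge. The `similarity` argument can be replaced with any pairwise function that returns a value in [0, 1].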