Black-Box Scorers#
Black-box Uncertainty Quantification (UQ) methods treat the LLM as a black box and evaluate consistency of multiple responses generated from the same prompt to estimate response-level confidence. These scorers are compatible with any LLM and don’t require access to internal model states or token probabilities.
Key Characteristics:
Universal Compatibility: Works with any LLM
Intuitive: Easy to understand and implement
No Internal Access Required: Doesn’t need token probabilities or model internals
Trade-offs:
Higher Cost: Requires multiple generations per prompt
Slower: Multiple generations and comparison calculations increase latency
Notation:
For a given prompt \(x_i\), these approaches involve generating \(m\) responses \(\tilde{\mathbf{y}}_i = \{ \tilde{y}_{i1},...,\tilde{y}_{im}\}\), using a non-zero temperature, from the same prompt and comparing these responses to the original response \(y_{i}\).