QA-Based Uncertainty Quantification (LUQ)#

Definition#

The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024). It uses an LLM to convert each unit (sentence or claim) of a long-form response into a question for which that unit would be the answer. The method then samples multiple answers to each unit's question and measures their consistency, effectively applying standard black-box uncertainty quantification at the level of each unit. Formally, a claim-QA scorer \(c_g(s;\cdot)\) is defined as follows:

\[c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta\left(y_0^{(s)}, y_j^{(s)}\right),\]

where \(y_0^{(s)}\) is the original unit response, \(\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, \dots, y_m^{(s)}\}\) are \(m\) candidate responses to the unit's question, and \(\eta\) is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
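The scorer \(c_g\) can be sketched in a few lines. Here \(\eta\) is a toy bag-of-words cosine similarity standing in for heavier choices such as NLI contradiction probability or BERTScore F1; all names below are illustrative and not part of the uqlm API:

```python
from collections import Counter
from math import sqrt


def cosine_eta(a: str, b: str) -> float:
    """Toy consistency function: cosine similarity over bag-of-words counts.

    A stand-in for production choices such as NLI contradiction
    probability or BERTScore F1. Returns a value in [0, 1].
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def claim_qa_score(original: str, candidates: list[str], eta=cosine_eta) -> float:
    """c_g: mean consistency between the original unit response y_0
    and the m candidate responses to the unit's question."""
    return sum(eta(original, y_j) for y_j in candidates) / len(candidates)
```

Any pairwise consistency function with range \([0, 1]\) can be dropped in for `eta`, which is why the score itself stays in \([0, 1]\).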

Key Properties:

  • Claim- or sentence-level scoring

  • Lower cost and latency than other long-form scoring methods

  • Score range: \([0, 1]\)

How It Works#

  1. Generate an original response and sampled responses

  2. Decompose original response into units (claims or sentences)

  3. For each claim/sentence, generate one or more questions that have that claim/sentence as the answer

  4. Generate multiple responses for each question generated in step 3

  5. Measure consistency in the LLM responses to the claim/sentence questions to estimate claim/sentence-level confidence
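The five steps above can be sketched end to end. Everything here is a hypothetical stand-in: `fake_llm` replaces a real chat-model call, sentence splitting replaces LLM-based claim decomposition, and token overlap replaces a stronger consistency function:

```python
import re


def fake_llm(prompt: str) -> str:
    """Stub LLM so the sketch runs; swap in a real chat model in practice."""
    return "The Eiffel Tower is in Paris."


def overlap(a: str, b: str) -> float:
    """Toy consistency function: token-set Jaccard overlap in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def qa_uncertainty(prompt: str, llm=fake_llm, m: int = 3) -> dict[str, float]:
    original = llm(prompt)  # 1. generate the original response
    # 2. decompose into units (naive sentence split as a stand-in)
    units = [s.strip() for s in re.split(r"(?<=[.!?])\s+", original) if s.strip()]
    scores = {}
    for unit in units:
        # 3. generate a question whose answer is this unit
        question = llm(f"Write a question whose answer is: {unit}")
        # 4. sample m candidate answers to that question
        candidates = [llm(question) for _ in range(m)]
        # 5. mean consistency = unit-level confidence estimate
        scores[unit] = sum(overlap(unit, c) for c in candidates) / m
    return scores
```

In practice each step that calls `fake_llm` would be a separate prompt to a real model, as LongTextQA does internally.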

Parameters#

When using LongTextQA, specify "semantic_negentropy" (or an alternative scoring function) in the scorers list.

Example#

from uqlm import LongTextQA

# Initialize
ltqa = LongTextQA(
    llm=original_llm,
    claim_decomposition_llm=claim_decomposition_llm,
    scorers=["semantic_negentropy"],
    sampling_temperature=1.0
)

# Generate responses and compute scores
results = await ltqa.generate_and_score(prompts=prompts, num_claim_qa_responses=5)

# Access the claim-level scores
print(results.to_df()["claims_data"])

References#

  • Farquhar, S., Kossen, J., Kuhn, L., & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625-630.

See Also#

  • LongTextQA - Class for QA-Based UQ for long-form generations