QA-Based Uncertainty Quantification (LUQ)
=========================================

.. currentmodule:: uqlm.scorers

Definition
----------

The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024). It uses an LLM to convert each unit (sentence or claim) of a response into a question for which that unit would be the answer. The method then measures consistency across multiple sampled responses to each question, effectively applying standard black-box uncertainty quantification at the unit level.

Formally, a claim-QA scorer :math:`c_g(s;\cdot)` is defined as follows:

.. math::

   c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)}),

where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, \ldots, y_m^{(s)}\}` are :math:`m` candidate responses to the unit's question, and :math:`\eta` is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency. A toy illustration of this scoring formula is given in the Worked Sketch section at the end of this page.

**Key Properties:**

- Claim- or sentence-level scoring
- Lower complexity (cost and latency) than other long-form scoring methods
- Score range: :math:`[0, 1]`

How It Works
------------

1. Generate an original response and sampled responses
2. Decompose the original response into units (claims or sentences)
3. For each claim/sentence, generate one or more questions that have that claim/sentence as the answer
4. Generate multiple responses to each question generated in step 3
5. Measure consistency among the LLM responses to the claim/sentence questions to estimate claim/sentence-level confidence

Parameters
----------

When using :class:`LongTextQA`, specify ``"semantic_negentropy"`` (or an alternative scoring function) in the ``scorers`` list.

Example
-------

.. code-block:: python

   from uqlm import LongTextQA

   # Initialize
   ltqa = LongTextQA(
       llm=original_llm,
       claim_decomposition_llm=claim_decomposition_llm,
       scorers=["semantic_negentropy"],
       sampling_temperature=1.0,
   )

   # Generate responses and compute scores
   results = await ltqa.generate_and_score(prompts=prompts, num_claim_qa_responses=5)

   # Access the claim-level scores
   print(results.to_df()["claims_data"])

References
----------

- Farquhar, S., et al. (2024). `Detecting hallucinations in large language models using semantic entropy <https://www.nature.com/articles/s41586-024-07421-0>`_. *Nature*.

See Also
--------

- :class:`LongTextQA` - Class for QA-based UQ for long-form generations
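
Worked Sketch
-------------

For intuition, the minimal sketch below spells out the scoring formula from the Definition section on toy data. It is not part of the ``uqlm`` API: the ``consistency`` helper (a simple token-overlap similarity standing in for :math:`\eta`) and the ``claim_qa_score`` function are illustrative assumptions only.

.. code-block:: python

   # Toy illustration of the claim-QA score: c(s) = (1/m) * sum_j eta(y_0, y_j).
   # `consistency` is a stand-in for eta (e.g., contradiction probability,
   # cosine similarity, or BERTScore F1); it is not part of uqlm.

   def consistency(y_0: str, y_j: str) -> float:
       """Toy consistency function: Jaccard overlap of token sets, in [0, 1]."""
       a, b = set(y_0.lower().split()), set(y_j.lower().split())
       return len(a & b) / len(a | b) if a | b else 0.0


   def claim_qa_score(original_answer: str, candidate_answers: list[str]) -> float:
       """Average consistency of the original unit answer with m candidate answers."""
       m = len(candidate_answers)
       return sum(consistency(original_answer, y_j) for y_j in candidate_answers) / m


   # One claim's question, answered m = 3 times by the sampled LLM responses
   original = "The Eiffel Tower was completed in 1889."
   candidates = [
       "It was completed in 1889.",
       "The Eiffel Tower was finished in 1889.",
       "Construction ended in 1887.",
   ]
   print(claim_qa_score(original, candidates))  # closer to 1 means more agreement

Higher scores indicate that the sampled answers to a unit's question agree with the original unit, i.e., higher confidence in that claim or sentence; :class:`LongTextQA` performs this aggregation with its configured scorers (e.g., semantic negentropy) rather than the toy similarity used here.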