๐ŸŽฏ Claim-QA Uncertainty Quantification (Long-Text)#

Claim-QA scorers, adapted as a generalization of long-form semantic entropy, are another method for detecting claim-level or sentence-level hallucinations in long-form LLM outputs. These scorers implement the following steps: decompose the response into granular units (sentences or claims), convert each unit into a question to which that unit is the answer, sample LLM responses to each question, and measure consistency among those answers to score the unit. The available scorers, and the papers from which they are adapted, are listed below:

  • Long-form Semantic Entropy (Farquhar et al., 2024)

  • Black-Box Generalizations of Long-form Semantic Entropy

๐Ÿ“Š What Youโ€™ll Do in This Demo#

1. Set up LLM and Prompts: Set up the LLM instance and load example data prompts.

2. Generate LLM Responses and Confidence Scores: Generate responses and compute claim-level confidence scores using the LongTextQA() class.

3. Evaluate Hallucination Detection Performance: Grade claims with the FactScoreGrader class and evaluate claim-level hallucination detection.

โš–๏ธ Advantages & Limitations#

Pros

  • Universal Compatibility: Works with any LLM without requiring token probability access

  • Fine-Grained Scoring: Score at sentence or claim-level to localize likely hallucinations

  • Uncertainty-aware decoding: Improve factual precision by dropping high-uncertainty claims

Cons

  • Higher Cost: Requires multiple generations per prompt

  • Slower: Multiple generations and comparison calculations increase latency

  • Complex: More complex than simpler methods offered by LongTextUQ.

[1]:
import numpy as np

from uqlm import LongTextQA
from uqlm.utils import load_example_dataset, display_response_refinement, claims_dicts_to_lists, plot_model_accuracies
from uqlm.longform import FactScoreGrader

1. Set up LLM and Prompts#

In this demo, we illustrate this approach using the FactScore long-form QA dataset. To implement it with your use case, simply replace the example prompts with your data.

[2]:
# Load example dataset (FactScore)
factscore = load_example_dataset("factscore", n=15)[["hundredw_prompt", "wikipedia_text"]].rename(columns={"hundredw_prompt": "prompt"})
factscore.head()
Loading dataset - factscore...
Processing dataset...
Dataset ready!
[2]:
prompt wikipedia_text
0 Tell me a bio of Suthida within 100 words.\n Suthida Bajrasudhabimalalakshana (Thai: เธชเธกเน€เธ”เน‡เธˆ...
1 Tell me a bio of Miguel รngel Fรฉlix Gallardo w... Miguel รngel Fรฉlix Gallardo (born January 8, 1...
2 Tell me a bio of Iggy Azalea within 100 words.\n Amethyst Amelia Kelly (born 7 June 1990), know...
3 Tell me a bio of Fernando da Costa Novaes with... Fernando da Costa Novaes (April 6, 1927 โ€“ Marc...
4 Tell me a bio of Jan Zamoyski within 100 words.\n Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ...

In this example, we use ChatVertexAI to instantiate our LLM, but any LangChain Chat Model may be used. Be sure to replace with your LLM of choice.

[3]:
from langchain_google_vertexai import ChatVertexAI

gemini_flash = ChatVertexAI(model="gemini-2.5-flash")
gemini_flash_lite = ChatVertexAI(model="gemini-2.5-flash-lite")

2. Generate LLM Responses and Claim/Sentence-Level Confidence Scores#

LongTextQA() - Generate long-text LLM responses, decompose into claims or sentences, create questions for which those claims are the answers, and measure consistency in LLM responses to those questions.#

Sample Image

๐Ÿ“‹ Class Attributes#

  • llm (BaseChatModel, default=None): A langchain llm BaseChatModel. The user is responsible for specifying temperature and other relevant parameters in the constructor of the provided llm object.

  • granularity (str, default="claim"): Specifies whether to decompose and score at claim- or sentence-level granularity. Must be either "claim" or "sentence".

  • scorers (List[str], default=None): Specifies which black-box (consistency) scorers to include. Must be a subset of ['semantic_negentropy', 'noncontradiction', 'exact_match', 'bert_score', 'cosine_sim', 'entailment', 'semantic_sets_confidence']. If None, defaults to ["entailment"].

  • aggregation (str, default="mean"): Specifies how to aggregate claim/sentence-level scores into response-level scores. Must be one of 'min' or 'mean'.

  • response_refinement (bool, default=False): Specifies whether to refine responses with uncertainty-aware decoding. This approach removes claims with confidence scores below the response_refinement_threshold and uses the claim_decomposition_llm to reconstruct the response from the retained claims. For more details, refer to Jiang et al., 2024: https://arxiv.org/abs/2410.20783

  • claim_filtering_scorer (Optional[str], default=None): Specifies which scorer to use to filter claims if response_refinement is True. If not provided, defaults to the first element of self.scorers.

  • claim_decomposition_llm (BaseChatModel, default=None): A langchain llm BaseChatModel used for decomposing responses into individual claims; also used for claim refinement. If granularity="claim" and claim_decomposition_llm is None, the provided llm is used for claim decomposition.

  • question_generator_llm (BaseChatModel, default=None): A langchain llm BaseChatModel used for generating questions from claims or sentences in the claim-QA approach. If None, defaults to claim_decomposition_llm.

  • device (str or torch.device, default=None): Specifies the device the NLI model uses for prediction. If None, detects and uses the best available PyTorch device, prioritizing CUDA (NVIDIA GPU), then MPS (macOS), then CPU.

  • system_prompt (str or None, default="You are a helpful assistant."): Optional argument for the user to provide a custom system prompt for the LLM.

  • max_calls_per_min (int, default=None): Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified.

  • use_n_param (bool, default=False): Specifies whether to use the n parameter of BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large.

  • sampling_temperature (float, default=1): The 'temperature' parameter the LLM uses when generating sampled responses. Must be greater than 0.

  • nli_model_name (str, default="microsoft/deberta-large-mnli"): Specifies which NLI model to use. Must be an acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().

  • max_length (int, default=2000): Specifies the maximum allowed string length of LLM responses for NLI computation. Longer responses are truncated in NLI computations to avoid OutOfMemoryError.

๐Ÿ” Parameter Groups#

๐Ÿง  LLM-Specific

  • llm

  • system_prompt

  • sampling_temperature

๐Ÿ“Š Confidence Scores

  • granularity

  • scorers

  • aggregation

  • num_questions

  • num_claim_qa_responses

  • response_refinement

  • response_refinement_threshold

๐Ÿ–ฅ๏ธ Hardware

  • device

โšก Performance

  • max_calls_per_min

  • use_n_param


[4]:
claimqa = LongTextQA(
    llm=gemini_flash,
    question_generator_llm=gemini_flash,
    aggregation="mean",  # switch to 'min' for more conservative scoring
    response_refinement=True,  # whether to filter out low-confidence claims
    scorers=["noncontradiction"],
    # max_calls_per_min=1000,
)
claim_filtering_scorer is not specified for response_refinement. Defaulting to noncontradiction.

๐Ÿ”„ Class Methods#

LongTextQA.generate_and_score

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts.

Parameters:

  • prompts - (List[str] or List[List[BaseMessage]]) A list of input prompts for the model.

  • num_questions - (int, default=2) Specifies how many questions to generate per claim/sentence.

  • num_claim_qa_responses - (int, default=5) Specifies how many sampled responses to generate per claim/sentence question.

  • response_refinement_threshold - (float, default=1/3) Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.

  • show_progress_bars - (bool, default=True) If True, displays a progress bar while generating and scoring responses.

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

๐Ÿ’ก Best For: Complete end-to-end uncertainty quantification when starting with prompts.

LongTextQA.score

Compute confidence scores on provided LLM responses. Use this method when responses have already been generated elsewhere.

Parameters:

  • prompts - (List[str]) A list of input prompts for the model.

  • responses - (List[str]) A list of LLM responses for the prompts.

  • num_questions - (int, default=2) Specifies how many questions to generate per claim/sentence.

  • num_claim_qa_responses - (int, default=5) Specifies how many sampled responses to generate per claim/sentence question.

  • response_refinement_threshold - (float, default=1/3) Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.

  • show_progress_bars - (bool, default=True) If True, displays a progress bar while scoring responses.

Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata

๐Ÿ’ก Best For: Computing uncertainty scores when responses are already generated elsewhere.

[5]:
results = await claimqa.generate_and_score(prompts=factscore.prompt.to_list(), response_refinement_threshold=0.85)
[6]:
result_df = results.to_df()
result_df.head(5)
[6]:
prompt response noncontradiction claims_data refined_response refined_noncontradiction
0 Tell me a bio of Suthida within 100 words.\n Queen Suthida Bajrasudhabimalalakshana is the ... 0.872727 [{'claim': 'Queen Suthida Bajrasudhabimalalaks... Queen Suthida Bajrasudhabimalalakshana, the cu... 0.985608
1 Tell me a bio of Miguel รngel Fรฉlix Gallardo w... Miguel รngel Fรฉlix Gallardo, known as "El Padr... 0.922575 [{'claim': 'Miguel รngel Fรฉlix Gallardo was kn... Miguel รngel Fรฉlix Gallardo, famously known as... 0.973158
2 Tell me a bio of Iggy Azalea within 100 words.\n Amethyst Amelia Kelly, known professionally as... 0.895390 [{'claim': 'Amethyst Amelia Kelly is known pro... Amethyst Amelia Kelly, known professionally as... 0.986233
3 Tell me a bio of Fernando da Costa Novaes with... Fernando da Costa Novaes (1942-2004) was a hig... 0.797684 [{'claim': 'Fernando da Costa Novaes was born ... Fernando da Costa Novaes was a highly influent... 0.966738
4 Tell me a bio of Jan Zamoyski within 100 words.\n Jan Zamoyski (1542โ€“1605) was a preeminent Poli... 0.947813 [{'claim': 'Jan Zamoyski was born in 1542.', '... Jan Zamoyski, born in 1542 and dying in 1605, ... 0.978016

Response refinement#

Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer) below a specified threshold (specified with response_refinement_threshold) and reconstructing the response from the retained claims.

Sample Image
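The filtering step can be sketched in a few lines of plain Python. This is purely illustrative, not the library's internal implementation; the dicts below mirror the structure of the claims_data entries produced by LongTextQA.

```python
# Illustrative sketch of uncertainty-aware claim filtering (not the library's internal code).
def filter_claims(claims_data, scorer="noncontradiction", threshold=0.85):
    """Keep only the claims whose confidence score meets the threshold."""
    return [c["claim"] for c in claims_data if c[scorer] >= threshold]

example_claims = [
    {"claim": "She began her career as a flight attendant.", "noncontradiction": 1.0},
    {"claim": "She joined the Royal Thai Army.", "noncontradiction": 0.72},
]
print(filter_claims(example_claims))  # ['She began her career as a flight attendant.']
```

In the library, the retained claims are then passed to the claim_decomposition_llm, which rewrites them into a fluent refined response.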

To illustrate how the response refinement operates, letโ€™s view an example. We first view the fine-grained claim-level data, including the claims in the original response, the claim-level confidence scores, and whether each claim was removed during the response refinement process.

[7]:
# View fine-grained claim data for a response
result_df.claims_data[0]
[7]:
[{'claim': 'Queen Suthida Bajrasudhabimalalakshana is the current Queen of Thailand.',
  'removed': False,
  'claim_questions': ['Who is the current Queen of Thailand?'],
  'claim_qa_responses': ['Suthida.'],
  'claim_qa_sampled_responses': [['Suthida.',
    'Suthida.',
    'Suthida.',
    'Suthida.',
    'Suthida']],
  'noncontradiction': 0.997577428817749},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana was born Suthida Tidjai.',
  'removed': True,
  'claim_questions': ["What was Queen Suthida Bajrasudhabimalalakshana's birth name?"],
  'claim_qa_responses': ['Suthida Tidjai'],
  'claim_qa_sampled_responses': [['Nui-Ngam',
    'Suthida Tidjai',
    'Suthida Tidjai',
    'Suthida Tidjai',
    'Suthida Tidjai']],
  'noncontradiction': 0.8045254647731781},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana began her career as a flight attendant.',
  'removed': False,
  'claim_questions': ["What was Queen Suthida Bajrasudhabimalalakshana's first career?"],
  'claim_qa_responses': ['Flight attendant.'],
  'claim_qa_sampled_responses': [['Flight attendant.',
    'Flight attendant.',
    'Flight attendant.',
    'Flight attendant.',
    'Flight attendant.']],
  'noncontradiction': 1.0},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana was a flight attendant for Thai Airways International.',
  'removed': False,
  'claim_questions': ["Identify the exact statement detailing Queen Suthida Bajrasudhabimalalakshana's previous occupation and employer."],
  'claim_qa_responses': ['She was a flight attendant for Thai Airways.'],
  'claim_qa_sampled_responses': [['She was previously a flight attendant for Thai Airways International.',
    'She previously worked as a flight attendant for Thai Airways.',
    'She was a flight attendant for Thai Airways.',
    'She was a flight attendant for Thai Airways.',
    'Flight attendant for Thai Airways.']],
  'noncontradiction': 0.9993116140365601},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana joined the Royal Thai Army.',
  'removed': True,
  'claim_questions': ["What was Queen Suthida Bajrasudhabimalalakshana's initial involvement with the Royal Thai Army?"],
  'claim_qa_responses': ['She was commissioned as a second lieutenant in 2010.'],
  'claim_qa_sampled_responses': [['Flight attendant.',
    'Royal Guard.',
    'Flight attendant for Thai Airways.',
    'Joined as an officer.',
    'She joined in 2010 as a second lieutenant.']],
  'noncontradiction': 0.72225404},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana rose through the ranks of the Royal Thai Army.',
  'removed': True,
  'claim_questions': ["What particular career trajectory characterized Queen Suthida Bajrasudhabimalalakshana's involvement in the Royal Thai Army?"],
  'claim_qa_responses': ["From flight attendant to rapid military promotion, ultimately becoming a general leading the King's security."],
  'claim_qa_sampled_responses': [['Rapid promotions as a royal bodyguard.',
    "From Royal Thai Army officer to General and Deputy Commander of the King's Guard.",
    'Joined Royal Thai Army, rapidly rose through ranks to become a general in the Royal Guard.',
    "Officer to general, commanding the King's Guard Special Operations Unit.",
    "Rapid ascent: bodyguard to General and commander in the King's Guard."]],
  'noncontradiction': 0.65020025},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana served in the Royal Security Command.',
  'removed': True,
  'claim_questions': ["What is the claim asserting Queen Suthida Bajrasudhabimalalakshana's service within the Royal Security Command?"],
  'claim_qa_responses': ['Commander of the Royal Security Command.'],
  'claim_qa_sampled_responses': [['Commander of the Royal Security Command.',
    'Commander of the Royal Security Command.',
    'Royal Guard.',
    'Deputy Commander of the Royal Security Command.',
    'Commander.']],
  'noncontradiction': 0.7191060900688171},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana was made a full General in 2017.',
  'removed': False,
  'claim_questions': ["What is the exact claim describing Queen Suthida Bajrasudhabimalalakshana's military promotion to a full General in 2017?"],
  'claim_qa_responses': ['Promoted to General.'],
  'claim_qa_sampled_responses': [['Appointed General.',
    'She was appointed Commander of the Royal Security Command as a General in October 2017.',
    'No such claim exists for 2017; she was promoted to General in December 2016.',
    'Appointed a full General.',
    'Promoted to a full General in December 2017.']],
  'noncontradiction': 0.96347487},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana became a consort to King Vajiralongkorn.',
  'removed': True,
  'claim_questions': ["What exact declaration details Queen Suthida Bajrasudhabimalalakshana's assumption of a regal spousal role alongside King Vajiralongkorn?"],
  'claim_qa_responses': ['Royal Decree of May 1, 2019.'],
  'claim_qa_sampled_responses': [['Royal proclamation on May 1, 2019, declaring her Queen Consort.',
    'Royal Gazette, May 1, 2019, announcing her marriage to King Vajiralongkorn and elevation to Queen.',
    'Royal Command of May 1, 2019.',
    'Royal Gazette declaration, May 1, 2019.',
    'Royal decree announcing her marriage and elevation to Queen, May 1, 2019, published in the *Royal Gazette*.']],
  'noncontradiction': 0.81256753},
 {'claim': 'King Vajiralongkorn is also known as Rama X.',
  'removed': True,
  'claim_questions': ['Provide the full statement that identifies King Vajiralongkorn by his most commonly used regnal name.'],
  'claim_qa_responses': ['Maha Vajiralongkorn Phra Vajiraklaochaoyuhua.'],
  'claim_qa_sampled_responses': [['Maha Vajiralongkorn Phra Vajiraklaochaoyuhua',
    'King Rama X.',
    'King Rama X',
    'Maha Vajiralongkorn Phra Vajiraklaochaoyuhua.',
    'Maha Vajiralongkorn Phra Vajiraklaochaoyuhua.']],
  'noncontradiction': 0.7375492095947266},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana married King Vajiralongkorn on May 1, 2019.',
  'removed': False,
  'claim_questions': ['State the full and accurate declaration regarding the marriage of Queen Suthida Bajrasudhabimalalakshana to King Vajiralongkorn on May 1, 2019.'],
  'claim_qa_responses': ['Queen Suthida Bajrasudhabimalalakshana married King Vajiralongkorn on May 1, 2019.'],
  'claim_qa_sampled_responses': [['Queen Suthida Bajrasudhabimalalakshana married King Vajiralongkorn on May 1, 2019.',
    'Queen Suthida Bajrasudhabimalalakshana married King Vajiralongkorn on May 1, 2019.',
    'They officially married on May 1, 2019, making her Queen Suthida Bajrasudhabimalalakshana.',
    'Suthida was declared Queen Consort and married to King Vajiralongkorn on May 1, 2019.',
    'Queen Suthida Bajrasudhabimalalakshana married King Vajiralongkorn on May 1, 2019.']],
  'noncontradiction': 0.9993778228759765},
 {'claim': "Queen Suthida Bajrasudhabimalalakshana became Queen just days before King Vajiralongkorn's official coronation.",
  'removed': False,
  'claim_questions': ["How did the timing of Queen Suthida Bajrasudhabimalalakshana becoming Queen relate to King Vajiralongkorn's official coronation?"],
  'claim_qa_responses': ['Shortly before.'],
  'claim_qa_sampled_responses': [['Days before.',
    'Days before.',
    'Days before.',
    'She was appointed Queen days *before* his official coronation.',
    'She was proclaimed Queen days before his coronation.']],
  'noncontradiction': 0.983033},
 {'claim': 'Queen Suthida Bajrasudhabimalalakshana is an influential figure in the Thai monarchy.',
  'removed': False,
  'claim_questions': ["Which statement about Queen Suthida Bajrasudhabimalalakshana's role in the Thai monarchy declares her to be an influential figure?"],
  'claim_qa_responses': ['"She holds the rank of General in the Royal Thai Army."'],
  'claim_qa_sampled_responses': [['Key figure.',
    "Commander of the King's Royal Guard.",
    'She plays a pivotal role in the monarchy.',
    'She is an influential figure in the Thai monarchy.',
    '"Influential figure" implies active participation or significant impact.']],
  'noncontradiction': 0.9564784}]

We can examine a particular claim in the response that was removed because its confidence score was too low. Letโ€™s see how this is reflected in the original vs. the refined response.

[8]:
display_response_refinement(original_text=result_df.response[0], claims_data=result_df.claims_data[0], refined_text=result_df.refined_response[0])
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
                                            Response Refinement Example
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Original Response โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ Queen Suthida Bajrasudhabimalalakshana is the current Queen of Thailand. Born Suthida Tidjai, she began her     โ”‚
โ”‚ career as a flight attendant for Thai Airways International. She later joined the Royal Thai Army, rising       โ”‚
โ”‚ through its ranks, and served in the Royal Security Command. In 2017, she was made a full General. She became a โ”‚
โ”‚ consort to King Vajiralongkorn (Rama X) and married him on May 1, 2019, becoming Queen just days before his     โ”‚
โ”‚ official coronation. She is an influential figure in the Thai monarchy.                                         โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Low-Confidence Claims to be Removed โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ โ€ข Queen Suthida Bajrasudhabimalalakshana was born Suthida Tidjai.                                               โ”‚
โ”‚ โ€ข Queen Suthida Bajrasudhabimalalakshana joined the Royal Thai Army.                                            โ”‚
โ”‚ โ€ข Queen Suthida Bajrasudhabimalalakshana rose through the ranks of the Royal Thai Army.                         โ”‚
โ”‚ โ€ข Queen Suthida Bajrasudhabimalalakshana served in the Royal Security Command.                                  โ”‚
โ”‚ โ€ข Queen Suthida Bajrasudhabimalalakshana became a consort to King Vajiralongkorn.                               โ”‚
โ”‚ โ€ข King Vajiralongkorn is also known as Rama X.                                                                  โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Refined Response โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ Queen Suthida Bajrasudhabimalalakshana, the current and influential Queen of Thailand, began her career as a    โ”‚
โ”‚ flight attendant for Thai Airways International. Her diverse background also includes a military role, as she   โ”‚
โ”‚ was made a full General in 2017. She officially married King Vajiralongkorn on May 1, 2019, becoming Queen just โ”‚
โ”‚ days before his official coronation.                                                                            โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

3. Evaluate Hallucination Detection Performance#

To evaluate hallucination detection performance, we 'grade' the atomic claims in the responses against an answer key. Here, we use UQLM's out-of-the-box FactScoreGrader, which can be used with any LangChain Chat Model. If you are using your own prompts/questions, be sure to update the grading method accordingly.

[9]:
# set up the LLM grader
grader = FactScoreGrader(llm=gemini_flash)

Before grading, claims must be formatted as a list of lists, where each interior list corresponds to a generated response.
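The reshaping performed by the claims_dicts_to_lists utility can be illustrated with plain Python. This is a sketch of the expected output shape under the claims_data structure shown above, not the utility's implementation:

```python
# Sketch: convert per-response lists of claim dicts into key-indexed lists of lists.
claims_data = [
    [{"claim": "A", "noncontradiction": 0.9}, {"claim": "B", "noncontradiction": 0.7}],
    [{"claim": "C", "noncontradiction": 0.95}],
]
keys = ["claim", "noncontradiction"]
lists = {k: [[d[k] for d in response] for response in claims_data] for k in keys}
print(lists["claim"])  # [['A', 'B'], ['C']]
```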

[10]:
# Convert claims to list of lists
claims_data_lists = claims_dicts_to_lists(result_df.claims_data.tolist())
[11]:
# grade original responses against the answer key using the grader
result_df["claim_grades"] = await grader.grade_claims(claim_sets=claims_data_lists["claim"], answers=factscore["wikipedia_text"].to_list())
result_df["answer"] = factscore["wikipedia_text"]
result_df.head(5)
[11]:
prompt response noncontradiction claims_data refined_response refined_noncontradiction claim_grades answer
0 Tell me a bio of Suthida within 100 words.\n Queen Suthida Bajrasudhabimalalakshana is the ... 0.872727 [{'claim': 'Queen Suthida Bajrasudhabimalalaks... Queen Suthida Bajrasudhabimalalakshana, the cu... 0.985608 [True, True, True, True, False, False, True, F... Suthida Bajrasudhabimalalakshana (Thai: เธชเธกเน€เธ”เน‡เธˆ...
1 Tell me a bio of Miguel รngel Fรฉlix Gallardo w... Miguel รngel Fรฉlix Gallardo, known as "El Padr... 0.922575 [{'claim': 'Miguel รngel Fรฉlix Gallardo was kn... Miguel รngel Fรฉlix Gallardo, famously known as... 0.973158 [True, True, True, True, True, True, True, Tru... Miguel รngel Fรฉlix Gallardo (born January 8, 1...
2 Tell me a bio of Iggy Azalea within 100 words.\n Amethyst Amelia Kelly, known professionally as... 0.895390 [{'claim': 'Amethyst Amelia Kelly is known pro... Amethyst Amelia Kelly, known professionally as... 0.986233 [True, True, True, False, True, True, True, Tr... Amethyst Amelia Kelly (born 7 June 1990), know...
3 Tell me a bio of Fernando da Costa Novaes with... Fernando da Costa Novaes (1942-2004) was a hig... 0.797684 [{'claim': 'Fernando da Costa Novaes was born ... Fernando da Costa Novaes was a highly influent... 0.966738 [False, True, False, False, False, False, Fals... Fernando da Costa Novaes (April 6, 1927 โ€“ Marc...
4 Tell me a bio of Jan Zamoyski within 100 words.\n Jan Zamoyski (1542โ€“1605) was a preeminent Poli... 0.947813 [{'claim': 'Jan Zamoyski was born in 1542.', '... Jan Zamoyski, born in 1542 and dying in 1605, ... 0.978016 [True, True, True, True, True, True, True, Tru... Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ...
[12]:
all_claim_scores, all_claim_grades = [], []
for i in range(len(result_df)):
    all_claim_scores.extend(claims_data_lists["noncontradiction"][i])
    all_claim_grades.extend(result_df["claim_grades"][i])

print(f"""Baseline LLM accuracy: {np.mean(all_claim_grades)}""")
Baseline LLM accuracy: 0.6322751322751323

To evaluate fine-grained hallucination detection performance, we compute AUROC of claim-level hallucination detection. Below, we plot the ROC curve and report these results.

[24]:
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true=all_claim_grades, y_score=all_claim_scores)
roc_auc = roc_auc_score(y_true=all_claim_grades, y_score=all_claim_scores)
[25]:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()
../../_images/_notebooks_examples_long_text_qa_demo_29_0.png

Lastly, we evaluate the gains from uncertainty-aware decoding (UAD) by measuring the factual precision over claims at various filtering thresholds.

[27]:
plot_model_accuracies(scores=all_claim_scores, correct_indicators=all_claim_grades, title="LLM Accuracy by Claim Confidence Threshold", display_percentage=True)
../../_images/_notebooks_examples_long_text_qa_demo_32_0.png

Since we selected a threshold of 0.85, we can measure LLM accuracy with and without UAD.

[30]:
thresh = 0.85
filtered_grades, filtered_scores = [], []
for grade, score in zip(all_claim_grades, all_claim_scores):
    if score > thresh:
        filtered_grades.append(grade)
        filtered_scores.append(score)

print(f"Baseline LLM factual precision: {np.mean(all_claim_grades)}")
print(f"UAD-Improved LLM factual precision: {np.mean(filtered_grades)}")
Baseline LLM factual precision: 0.6713091922005571
UAD-Improved LLM factual precision: 0.7306273062730627

4. Scorer Definitions#

Long-form uncertainty quantification implements a three-stage pipeline after response generation:

  1. Response Decomposition: The response \(y\) is decomposed into units (claims or sentences), where each unit is denoted \(s\).

  2. Unit-Level Confidence Scoring: Confidence scores are computed using a function \(c_g(s;\cdot) \in [0, 1]\). Higher scores indicate greater likelihood of factual correctness. Units with scores below a threshold \(\tau\) are flagged as potential hallucinations.

  3. Response-Level Aggregation: Unit scores are combined to provide an overall response confidence.
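The aggregation step (corresponding to the aggregation parameter) can be sketched as follows. Here, aggregate_scores is a hypothetical helper for illustration, not part of the UQLM API:

```python
# Hypothetical helper illustrating response-level aggregation of unit scores.
def aggregate_scores(unit_scores, method="mean"):
    """Combine unit-level confidence scores into a single response-level score."""
    if method == "mean":
        return sum(unit_scores) / len(unit_scores)
    if method == "min":  # more conservative: response is only as confident as its weakest unit
        return min(unit_scores)
    raise ValueError("method must be 'mean' or 'min'")

scores = [0.99, 0.72, 0.95]
print(aggregate_scores(scores, "mean"))  # ≈0.8867
print(aggregate_scores(scores, "min"))   # 0.72
```

The 'min' option is the stricter choice: a single low-confidence claim drags the whole response score down, which is useful when any hallucination is unacceptable.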

The Claim-QA approach demonstrated here is adapted from Farquhar et al., 2024. It uses an LLM to convert each unit (sentence or claim) into a question for which that unit would be the answer, then measures consistency across multiple sampled responses to these questions, effectively applying standard black-box uncertainty quantification at the unit level. Formally, a claim-QA scorer \(c_g(s;\cdot)\) is defined as follows:

\[c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)})\]

where \(y_0^{(s)}\) is the original unit response, \(\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}\) are \(m\) candidate responses to the unitโ€™s question, and \(\eta\) is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
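To make the formula concrete, here is a toy computation of \(c_g\) using exact match as the consistency function \(\eta\). The library's scorers use stronger consistency functions such as NLI-based noncontradiction; this helper is purely illustrative.

```python
# Toy claim-QA scorer: mean exact-match consistency between the original unit answer
# y_0 and the m candidate answers to the same unit question (illustrative only).
def claim_qa_score(original_answer, candidate_answers):
    normalize = lambda s: s.strip().lower().rstrip(".")
    matches = [normalize(a) == normalize(original_answer) for a in candidate_answers]
    return sum(matches) / len(matches)

# Four of the five candidates agree with the original answer, so the score is 0.8.
print(claim_qa_score("Suthida.", ["Suthida.", "Suthida", "Nui-Ngam", "Suthida.", "Suthida."]))  # 0.8
```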

ยฉ 2025 CVS Health and/or one of its affiliates. All rights reserved.