{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Claim-QA Uncertainty Quantification (Long-Text)\n", "\n", "
\n", " Claim-QA scorers, adapted as a generalization of long-form semantic entropy, are another method for detecting claim-level or sentence-level hallucinations in long-form LLM outputs. These scorers implement the following steps: decompose responses into granular units (sentences or claims), convert each claim or sentence to a question, sample LLM responses to those questions, and measure consistency among those answers to score the claim. The available scorers and papers from which they are adapted are below:\n", "
\n", " \n", "* Long-form Semantic Entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))\n", "* Black-Box Generalizations of Long-form Semantic Entropy\n", "\n", "Set up LLM instance and load example data prompts.
\n", "Generate LLM Responses and Confidence Scores
\n", "Generate responses and compute claim-level confidence scores using the LongTextQA() class.
Evaluate Hallucination Detection Performance
\n", "Grade claims with `FactScoreGrader` class and evaluate claim-level hallucination detection.
\n", "| \n", " | prompt | \n", "wikipedia_text | \n", "
|---|---|---|
| 0 | \n", "Tell me a bio of Suthida within 100 words.\\n | \n", "Suthida Bajrasudhabimalalakshana (Thai: สมเด็จ... | \n", "
| 1 | \n", "Tell me a bio of Miguel Ángel Félix Gallardo w... | \n", "Miguel Ángel Félix Gallardo (born January 8, 1... | \n", "
| 2 | \n", "Tell me a bio of Iggy Azalea within 100 words.\\n | \n", "Amethyst Amelia Kelly (born 7 June 1990), know... | \n", "
| 3 | \n", "Tell me a bio of Fernando da Costa Novaes with... | \n", "Fernando da Costa Novaes (April 6, 1927 – Marc... | \n", "
| 4 | \n", "Tell me a bio of Jan Zamoyski within 100 words.\\n | \n", "Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ... | \n", "
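The claim-QA procedure described in the introduction (decompose, ask one question per claim, sample answers, measure agreement) can be illustrated with a minimal, self-contained sketch. It does not call UQLM: the `QUESTIONS` and `SAMPLED` lookups are hypothetical stand-ins for the question-generator and sampling LLM calls, the canned answers are made up, and the agreement measure is a simplified exact-match rate rather than the library's scorer implementations.

```python
from collections import Counter
from typing import Callable, Dict, List


def exact_match_confidence(answers: List[str]) -> float:
    """Fraction of sampled answers that agree with the most frequent answer."""
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return 0.0
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def score_claims(
    claims: List[str],
    make_question: Callable[[str], str],
    sample_answers: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Score each claim by the consistency of answers to its derived question."""
    return {
        claim: exact_match_confidence(sample_answers(make_question(claim)))
        for claim in claims
    }


# Hypothetical stand-ins for the LLM calls (canned, made-up data).
QUESTIONS = {
    "Jan Zamoyski was born in 1542.": "In what year was Jan Zamoyski born?",
    "Jan Zamoyski died in 1605.": "In what year did Jan Zamoyski die?",
}
SAMPLED = {
    "In what year was Jan Zamoyski born?": ["1542", "1542", " 1542", "1542"],
    "In what year did Jan Zamoyski die?": ["1605", "1604", "1605", "1610"],
}

scores = score_claims(list(QUESTIONS), QUESTIONS.__getitem__, SAMPLED.__getitem__)
for claim, score in scores.items():
    print(f"{score:.2f}  {claim}")
```

A claim whose sampled answers all agree receives a score of 1.0, while disagreement among the answers lowers the score. The real scorers replace exact matching with measures such as entailment or noncontradiction.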
| Parameter | \n", "Type & Default | \n", "Description | \n", "
|---|---|---|
| llm | \n", "BaseChatModel, default=None | \n",
" A LangChain `BaseChatModel`. The user is responsible for setting the temperature and other relevant parameters in the constructor of the provided `llm` object. | \n", "
| granularity | \n", "str, default=\"claim\" | \n",
" Specifies whether to decompose and score at claim-level or sentence-level granularity. Must be either \"claim\" or \"sentence\". | \n", "
| scorers | \n", "List[str], default=None | \n",
" Specifies which black-box (consistency) scorers to include. Must be a subset of ['semantic_negentropy', 'noncontradiction', 'exact_match', 'bert_score', 'cosine_sim', 'entailment', 'semantic_sets_confidence']. If None, defaults to [\"entailment\"]. | \n", "
| aggregation | \n", "str, default=\"mean\" | \n",
" Specifies how to aggregate claim/sentence-level scores to response-level scores. Must be one of 'min' or 'mean'. | \n", "
| response_refinement | \n", "bool, default=False | \n",
" Specifies whether to refine responses with uncertainty-aware decoding. This approach removes claims with confidence scores below the response_refinement_threshold and uses the claim_decomposition_llm to reconstruct the response from the retained claims. For more details, refer to Jiang et al., 2024: https://arxiv.org/abs/2410.20783 | \n", "
| claim_filtering_scorer | \n", "Optional[str], default=None | \n",
" Specifies which scorer to use to filter claims if response_refinement is True. If not provided, defaults to the first element of self.scorers. | \n", "
| claim_decomposition_llm | \n", "BaseChatModel, default=None | \n",
" A LangChain `BaseChatModel` used to decompose responses into individual claims. Also used for claim refinement. If granularity=\"claim\" and claim_decomposition_llm is None, the provided `llm` is used for claim decomposition. | \n", "
| question_generator_llm | \n", "BaseChatModel, default=None | \n",
" A LangChain `BaseChatModel` used to generate questions from claims or sentences in the claim-QA approach. If None, defaults to claim_decomposition_llm. | \n", "
| device | \n", "str or torch.device, default=None | \n",
" Specifies the device that the NLI model uses for prediction. If None, the best available PyTorch device is detected and used, prioritizing CUDA (NVIDIA GPU), then MPS (macOS), then CPU. | \n", "
| system_prompt | \n", "str or None, default=\"You are a helpful assistant.\" | \n",
" Optional custom system prompt for the LLM. | \n", "
| max_calls_per_min | \n", "int, default=None | \n",
" Specifies the maximum number of API calls per minute, to avoid rate limit errors. By default, no limit is applied. | \n", "
| use_n_param | \n", "bool, default=False | \n",
" Specifies whether to use the `n` parameter of `BaseChatModel`. Not compatible with all `BaseChatModel` classes. When supported, it speeds up generation substantially when num_responses is large. | \n",
"
| sampling_temperature | \n", "float, default=1 | \n",
" The 'temperature' parameter the LLM uses when generating sampled responses. Must be greater than 0. | \n", "
| nli_model_name | \n", "str, default=\"microsoft/deberta-large-mnli\" | \n",
" Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained(). | \n",
"
| max_length | \n", "int, default=2000 | \n",
" Specifies the maximum allowed string length for LLM responses in NLI computation. Longer responses are truncated in NLI computations to avoid OutOfMemoryError. | \n",
"
🧠 LLM-Specific: `llm`, `system_prompt`, `sampling_temperature`
\n", "📊 Confidence Scores: `granularity`, `scorers`, `aggregation`, `num_questions`, `num_claim_qa_responses`, `response_refinement`, `response_refinement_threshold`
\n", "🖥️ Hardware: `device`
\n", "⚡ Performance: `max_calls_per_min`, `use_n_param`
\n", "| Method | \n", "Description & Parameters | \n", "
|---|---|
| LongTextQA.generate_and_score | \n", "\n",
" Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts. \n", "Parameters: \n", "
Returns: \n",
" 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n",
" \n",
" | \n",
"
| LongTextQA.score | \n", "\n",
" Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated. \n", "Parameters: \n", "
Returns: \n",
" 💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.\n",
" \n",
" | \n",
"
| \n", " | prompt | \n", "response | \n", "noncontradiction | \n", "claims_data | \n", "refined_response | \n", "refined_noncontradiction | \n", "
|---|---|---|---|---|---|---|
| 0 | \n", "Tell me a bio of Suthida within 100 words.\\n | \n", "Queen Suthida Bajrasudhabimalalakshana is the ... | \n", "0.872727 | \n", "[{'claim': 'Queen Suthida Bajrasudhabimalalaks... | \n", "Queen Suthida Bajrasudhabimalalakshana, the cu... | \n", "0.985608 | \n", "
| 1 | \n", "Tell me a bio of Miguel Ángel Félix Gallardo w... | \n", "Miguel Ángel Félix Gallardo, known as \"El Padr... | \n", "0.922575 | \n", "[{'claim': 'Miguel Ángel Félix Gallardo was kn... | \n", "Miguel Ángel Félix Gallardo, famously known as... | \n", "0.973158 | \n", "
| 2 | \n", "Tell me a bio of Iggy Azalea within 100 words.\\n | \n", "Amethyst Amelia Kelly, known professionally as... | \n", "0.895390 | \n", "[{'claim': 'Amethyst Amelia Kelly is known pro... | \n", "Amethyst Amelia Kelly, known professionally as... | \n", "0.986233 | \n", "
| 3 | \n", "Tell me a bio of Fernando da Costa Novaes with... | \n", "Fernando da Costa Novaes (1942-2004) was a hig... | \n", "0.797684 | \n", "[{'claim': 'Fernando da Costa Novaes was born ... | \n", "Fernando da Costa Novaes was a highly influent... | \n", "0.966738 | \n", "
| 4 | \n", "Tell me a bio of Jan Zamoyski within 100 words.\\n | \n", "Jan Zamoyski (1542–1605) was a preeminent Poli... | \n", "0.947813 | \n", "[{'claim': 'Jan Zamoyski was born in 1542.', '... | \n", "Jan Zamoyski, born in 1542 and dying in 1605, ... | \n", "0.978016 | \n", "
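The response-level `noncontradiction` value and the `refined_response` column both derive from claim-level scores. The sketch below shows the two underlying operations in plain Python, under stated assumptions: each entry of `claims_data` is taken to hold the claim text under `'claim'` and its score under a key named after the scorer (only the `'claim'` key is visible in the output above), and the scores and threshold are made up. It is not UQLM's implementation, and the LLM step that rewrites the retained claims into fluent prose is omitted.

```python
from statistics import mean
from typing import Dict, List, Tuple, Union

ClaimRecord = Dict[str, Union[str, float]]


def aggregate(claim_scores: List[float], aggregation: str = "mean") -> float:
    """Collapse claim-level scores into one response-level score."""
    if aggregation not in ("mean", "min"):
        raise ValueError("aggregation must be 'mean' or 'min'")
    if not claim_scores:
        raise ValueError("claim_scores must not be empty")
    return min(claim_scores) if aggregation == "min" else mean(claim_scores)


def split_claims(
    claims_data: List[ClaimRecord], scorer: str, threshold: float
) -> Tuple[List[str], List[str]]:
    """Separate claims into (retained, removed) by comparing scores to a threshold."""
    retained: List[str] = []
    removed: List[str] = []
    for record in claims_data:
        target = retained if float(record[scorer]) >= threshold else removed
        target.append(str(record["claim"]))
    return retained, removed


# Made-up claim records; the 'noncontradiction' key is an assumed layout.
toy_claims: List[ClaimRecord] = [
    {"claim": "Claim A", "noncontradiction": 1.0},
    {"claim": "Claim B", "noncontradiction": 0.25},
    {"claim": "Claim C", "noncontradiction": 0.75},
]
toy_scores = [float(c["noncontradiction"]) for c in toy_claims]

retained, removed = split_claims(toy_claims, "noncontradiction", threshold=0.5)
print("min-aggregated score:", aggregate(toy_scores, "min"))
print("retained:", retained)
print("removed:", removed)
```

Dropping low-scoring claims before aggregating raises the response-level score by construction, which is consistent with `refined_noncontradiction` exceeding `noncontradiction` in every row above.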
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
"\n"
],
"text/plain": [
"\u001b[1;30m───────────────────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
" Response Refinement Example \n",
"\n"
],
"text/plain": [
" \u001b[1;30mResponse Refinement Example\u001b[0m \n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"───────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
"\n"
],
"text/plain": [
"\u001b[1;30m───────────────────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
╭─────────────────────────────────────────────── Original Response ───────────────────────────────────────────────╮\n", "│ Queen Suthida Bajrasudhabimalalakshana is the current Queen of Thailand. Born Suthida Tidjai, she began her │\n", "│ career as a flight attendant for Thai Airways International. She later joined the Royal Thai Army, rising │\n", "│ through its ranks, and served in the Royal Security Command. In 2017, she was made a full General. She became a │\n", "│ consort to King Vajiralongkorn (Rama X) and married him on May 1, 2019, becoming Queen just days before his │\n", "│ official coronation. She is an influential figure in the Thai monarchy. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[33m╭─\u001b[0m\u001b[33m──────────────────────────────────────────────\u001b[0m\u001b[33m \u001b[0m\u001b[1;33mOriginal Response\u001b[0m\u001b[33m \u001b[0m\u001b[33m──────────────────────────────────────────────\u001b[0m\u001b[33m─╮\u001b[0m\n", "\u001b[33m│\u001b[0m Queen Suthida Bajrasudhabimalalakshana is the current Queen of Thailand. Born Suthida Tidjai, she began her \u001b[33m│\u001b[0m\n", "\u001b[33m│\u001b[0m career as a flight attendant for Thai Airways International. She later joined the Royal Thai Army, rising \u001b[33m│\u001b[0m\n", "\u001b[33m│\u001b[0m through its ranks, and served in the Royal Security Command. In 2017, she was made a full General. She became a \u001b[33m│\u001b[0m\n", "\u001b[33m│\u001b[0m consort to King Vajiralongkorn (Rama X) and married him on May 1, 2019, becoming Queen just days before his \u001b[33m│\u001b[0m\n", "\u001b[33m│\u001b[0m official coronation. She is an influential figure in the Thai monarchy. \u001b[33m│\u001b[0m\n", "\u001b[33m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
╭────────────────────────────────────── Low-Confidence Claims to be Removed ──────────────────────────────────────╮\n", "│ • Queen Suthida Bajrasudhabimalalakshana was born Suthida Tidjai. │\n", "│ • Queen Suthida Bajrasudhabimalalakshana joined the Royal Thai Army. │\n", "│ • Queen Suthida Bajrasudhabimalalakshana rose through the ranks of the Royal Thai Army. │\n", "│ • Queen Suthida Bajrasudhabimalalakshana served in the Royal Security Command. │\n", "│ • Queen Suthida Bajrasudhabimalalakshana became a consort to King Vajiralongkorn. │\n", "│ • King Vajiralongkorn is also known as Rama X. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[31m╭─\u001b[0m\u001b[31m─────────────────────────────────────\u001b[0m\u001b[31m \u001b[0m\u001b[1;31mLow-Confidence Claims to be Removed\u001b[0m\u001b[31m \u001b[0m\u001b[31m─────────────────────────────────────\u001b[0m\u001b[31m─╮\u001b[0m\n", "\u001b[31m│\u001b[0m • Queen Suthida Bajrasudhabimalalakshana was born Suthida Tidjai. \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m • Queen Suthida Bajrasudhabimalalakshana joined the Royal Thai Army. \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m • Queen Suthida Bajrasudhabimalalakshana rose through the ranks of the Royal Thai Army. \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m • Queen Suthida Bajrasudhabimalalakshana served in the Royal Security Command. \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m • Queen Suthida Bajrasudhabimalalakshana became a consort to King Vajiralongkorn. \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m • King Vajiralongkorn is also known as Rama X. \u001b[31m│\u001b[0m\n", "\u001b[31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
╭─────────────────────────────────────────────── Refined Response ────────────────────────────────────────────────╮\n", "│ Queen Suthida Bajrasudhabimalalakshana, the current and influential Queen of Thailand, began her career as a │\n", "│ flight attendant for Thai Airways International. Her diverse background also includes a military role, as she │\n", "│ was made a full General in 2017. She officially married King Vajiralongkorn on May 1, 2019, becoming Queen just │\n", "│ days before his official coronation. │\n", "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "\n" ], "text/plain": [ "\u001b[32m╭─\u001b[0m\u001b[32m──────────────────────────────────────────────\u001b[0m\u001b[32m \u001b[0m\u001b[1;32mRefined Response\u001b[0m\u001b[32m \u001b[0m\u001b[32m───────────────────────────────────────────────\u001b[0m\u001b[32m─╮\u001b[0m\n", "\u001b[32m│\u001b[0m Queen Suthida Bajrasudhabimalalakshana, the current and influential Queen of Thailand, began her career as a \u001b[32m│\u001b[0m\n", "\u001b[32m│\u001b[0m flight attendant for Thai Airways International. Her diverse background also includes a military role, as she \u001b[32m│\u001b[0m\n", "\u001b[32m│\u001b[0m was made a full General in 2017. She officially married King Vajiralongkorn on May 1, 2019, becoming Queen just \u001b[32m│\u001b[0m\n", "\u001b[32m│\u001b[0m days before his official coronation. \u001b[32m│\u001b[0m\n", "\u001b[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_response_refinement(original_text=result_df.response[0], claims_data=result_df.claims_data[0], refined_text=result_df.refined_response[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the atomic claims in the responses against an answer key. Here, we use UQLM's out-of-the-box `FactScoreGrader`, which can be used with a [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/). **If you are using your own prompts/questions, be sure to update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set up the LLM grader\n", "grader = FactScoreGrader(llm=gemini_flash)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before grading, the claims must be formatted as a list of lists, where each inner list corresponds to one generated response." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Convert claims to a list of lists\n", "claims_data_lists = claims_dicts_to_lists(result_df.claims_data.tolist())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
| \n", " | prompt | \n", "response | \n", "noncontradiction | \n", "claims_data | \n", "refined_response | \n", "refined_noncontradiction | \n", "claim_grades | \n", "answer | \n", "
|---|---|---|---|---|---|---|---|---|
| 0 | \n", "Tell me a bio of Suthida within 100 words.\\n | \n", "Queen Suthida Bajrasudhabimalalakshana is the ... | \n", "0.872727 | \n", "[{'claim': 'Queen Suthida Bajrasudhabimalalaks... | \n", "Queen Suthida Bajrasudhabimalalakshana, the cu... | \n", "0.985608 | \n", "[True, True, True, True, False, False, True, F... | \n", "Suthida Bajrasudhabimalalakshana (Thai: สมเด็จ... | \n", "
| 1 | \n", "Tell me a bio of Miguel Ángel Félix Gallardo w... | \n", "Miguel Ángel Félix Gallardo, known as \"El Padr... | \n", "0.922575 | \n", "[{'claim': 'Miguel Ángel Félix Gallardo was kn... | \n", "Miguel Ángel Félix Gallardo, famously known as... | \n", "0.973158 | \n", "[True, True, True, True, True, True, True, Tru... | \n", "Miguel Ángel Félix Gallardo (born January 8, 1... | \n", "
| 2 | \n", "Tell me a bio of Iggy Azalea within 100 words.\\n | \n", "Amethyst Amelia Kelly, known professionally as... | \n", "0.895390 | \n", "[{'claim': 'Amethyst Amelia Kelly is known pro... | \n", "Amethyst Amelia Kelly, known professionally as... | \n", "0.986233 | \n", "[True, True, True, False, True, True, True, Tr... | \n", "Amethyst Amelia Kelly (born 7 June 1990), know... | \n", "
| 3 | \n", "Tell me a bio of Fernando da Costa Novaes with... | \n", "Fernando da Costa Novaes (1942-2004) was a hig... | \n", "0.797684 | \n", "[{'claim': 'Fernando da Costa Novaes was born ... | \n", "Fernando da Costa Novaes was a highly influent... | \n", "0.966738 | \n", "[False, True, False, False, False, False, Fals... | \n", "Fernando da Costa Novaes (April 6, 1927 – Marc... | \n", "
| 4 | \n", "Tell me a bio of Jan Zamoyski within 100 words.\\n | \n", "Jan Zamoyski (1542–1605) was a preeminent Poli... | \n", "0.947813 | \n", "[{'claim': 'Jan Zamoyski was born in 1542.', '... | \n", "Jan Zamoyski, born in 1542 and dying in 1605, ... | \n", "0.978016 | \n", "[True, True, True, True, True, True, True, Tru... | \n", "Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ... | \n", "
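With `claim_grades` available, claim-level detection quality can be summarized by how well the confidence scores rank correct claims above incorrect ones. The following is a minimal pure-Python sketch of AUROC computed over flattened claims. The per-claim scores and grades below are made up for illustration, and the notebook's own evaluation may use a different metric or library.

```python
from typing import List, Sequence


def flatten(nested: Sequence[Sequence]) -> List:
    """Flatten per-response lists into one claim-level list."""
    return [item for inner in nested for item in inner]


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a correct claim outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect claim")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up claim-level scores and grades, one inner list per response.
toy_scores = [[0.9, 0.8, 0.3], [0.6, 0.95]]
toy_grades = [[True, True, False], [False, True]]

value = auroc(flatten(toy_scores), flatten(toy_grades))
print(f"claim-level AUROC: {value:.2f}")
```

A value near 1.0 means low-confidence claims are almost always the incorrect ones, while 0.5 corresponds to scores that carry no ranking information.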