{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Semantic Density\n", "\n", "
\n", "This demo illustrates a state-of-the-art uncertainty quantification (UQ) approach known as semantic density. The semantic density method combines elements of black-box UQ (which generates multiple responses from the same prompt) and white-box UQ (which uses token probabilities of those generated responses) to compute density values. Intuitively, semantic density combines both signals to estimate a probability distribution for scoring each response. This method was proposed by Qiu et al. (2024) and is demonstrated in this notebook.\n", "
\n", "Set up LLM instance and load example data prompts.
\n", "Generate LLM Responses and Confidence Scores
\n", "Generate and score LLM responses to the example questions using the SemanticDensity() class.
## Evaluate Hallucination Detection Performance
\n", "Visualize model accuracy at different thresholds of the semantic density score. Compute precision, recall, and F1-score of hallucination detection.
\n", "| \n", " | question | \n", "answer | \n", "
|---|---|---|
| 0 | \n", "How much money, in euros, was the surgeon held... | \n", "120,000 euros | \n", "
| 1 | \n", "What is the name of the former Prime Minister ... | \n", "Jóhanna Sigurðardóttir | \n", "
| 2 | \n", "To whom did Mehbooba Mufti Sayed contest the 2... | \n", "Hasnain Masoodi | \n", "
| 3 | \n", "In which year did Melbourne's Monash Gallery o... | \n", "2023 | \n", "
| 4 | \n", "Who requested the Federal Aviation Administrat... | \n", "The Coast Guard | \n", "
| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A langchain `BaseChatModel`. The user is responsible for specifying temperature and other relevant parameters in the constructor of the provided `llm` object. |
| device | str or torch.device, default="cpu" | Specifies the device the NLI model uses for prediction. Only applies to the 'noncontradiction' scorer. Pass a torch.device to leverage GPU. |
| system_prompt | str or None, default="You are a helpful assistant." | Optional argument for the user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| length_normalize | bool, default=True | Determines whether response probabilities are length-normalized. Recommended to set as True when longer responses are expected. |
| use_n_param | bool, default=False | Specifies whether to use the n parameter of BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large. |
| postprocessor | callable, default=None | A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. |
| sampling_temperature | float, default=1 | The temperature parameter used by the LLM to generate sampled responses. Must be greater than 0. |
| nli_model_name | str, default="microsoft/deberta-large-mnli" | Specifies which NLI model to use. Must be an acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained(). |
| max_length | int, default=2000 | Specifies the maximum allowed string length of LLM responses for NLI computation. Responses longer than this value are truncated in NLI computations to avoid an OutOfMemoryError. |
| return_responses | str, default="all" | If a postprocessor is used, specifies whether to return only postprocessed responses ('postprocessed'), only raw responses ('raw'), or both ('all'). |
- 🧠 LLM-Specific: `llm`, `system_prompt`, `sampling_temperature`
- 📊 Confidence Scores: `length_normalize`, `nli_model_name`, `postprocessor`
- 🖥️ Hardware: `device`
- ⚡ Performance: `max_calls_per_min`, `use_n_param`

| Method | Description & Parameters |
|---|---|
| SemanticDensity.generate_and_score | Generate LLM responses, sampled (candidate) LLM responses, and compute density scores for the provided prompts. Parameters: … Returns: … 💡 Best For: complete end-to-end uncertainty quantification when starting with prompts. |
| SemanticDensity.score | Compute density scores on provided LLM responses. Should only be used if responses and sampled responses have already been generated. Parameters: … Returns: … 💡 Best For: computing uncertainty scores when responses, sampled responses, and logprobs have already been generated elsewhere. |
| \n", " | response | \n", "sampled_responses | \n", "prompt | \n", "semantic_density_value | \n", "multiple_logprob | \n", "
|---|---|---|---|---|---|
| 0 | \n", "€120,000 | \n", "[€120,000, €120,000, €136,000, €120,000, €120,... | \n", "You will be given a question. Return only the ... | \n", "0.865526 | \n", "[[{'token': '€', 'logprob': -4.172499757260084... | \n", "
| 1 | \n", "Jóhanna Sigurðardóttir | \n", "[Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... | \n", "You will be given a question. Return only the ... | \n", "0.992922 | \n", "[[{'token': 'J', 'logprob': -9.536738616588991... | \n", "
| 2 | \n", "Hasnain Masoodi | \n", "[Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... | \n", "You will be given a question. Return only the ... | \n", "0.993829 | \n", "[[{'token': 'Has', 'logprob': -2.5987286790041... | \n", "
| 3 | \n", "2023 | \n", "[2023, 2023, 2022, 2022, 2022] | \n", "You will be given a question. Return only the ... | \n", "0.941380 | \n", "[[{'token': '2', 'logprob': -1.430510337740997... | \n", "
| 4 | \n", "BP and the U.S. Coast Guard | \n", "[BP, BP, BP and the U.S. Coast Guard, BP (Brit... | \n", "You will be given a question. Return only the ... | \n", "0.978186 | \n", "[[{'token': 'BP', 'logprob': -4.20799915445968... | \n", "
| \n", " | response | \n", "sampled_responses | \n", "prompt | \n", "semantic_density_value | \n", "multiple_logprob | \n", "answer | \n", "response_correct | \n", "
|---|---|---|---|---|---|---|---|
| 0 | \n", "€120,000 | \n", "[€120,000, €120,000, €136,000, €120,000, €120,... | \n", "You will be given a question. Return only the ... | \n", "0.865526 | \n", "[[{'token': '€', 'logprob': -4.172499757260084... | \n", "120,000 euros | \n", "True | \n", "
| 1 | \n", "Jóhanna Sigurðardóttir | \n", "[Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... | \n", "You will be given a question. Return only the ... | \n", "0.992922 | \n", "[[{'token': 'J', 'logprob': -9.536738616588991... | \n", "Jóhanna Sigurðardóttir | \n", "True | \n", "
| 2 | \n", "Hasnain Masoodi | \n", "[Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... | \n", "You will be given a question. Return only the ... | \n", "0.993829 | \n", "[[{'token': 'Has', 'logprob': -2.5987286790041... | \n", "Hasnain Masoodi | \n", "True | \n", "
| 3 | \n", "2023 | \n", "[2023, 2023, 2022, 2022, 2022] | \n", "You will be given a question. Return only the ... | \n", "0.941380 | \n", "[[{'token': '2', 'logprob': -1.430510337740997... | \n", "2023 | \n", "True | \n", "
| 4 | \n", "BP and the U.S. Coast Guard | \n", "[BP, BP, BP and the U.S. Coast Guard, BP (Brit... | \n", "You will be given a question. Return only the ... | \n", "0.978186 | \n", "[[{'token': 'BP', 'logprob': -4.20799915445968... | \n", "The Coast Guard | \n", "False | \n", "