# 🎯 Semantic Entropy
Black-box Uncertainty Quantification (UQ) methods treat the LLM as a black box and estimate response-level confidence by evaluating the consistency of multiple responses generated from the same prompt. This demo illustrates a state-of-the-art black-box UQ method known as Semantic Entropy.
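To make the consistency idea concrete, here is a minimal, self-contained sketch of discrete semantic entropy. A real implementation clusters responses by *meaning* using an NLI model; this toy version clusters by normalized exact match, which is an illustrative stand-in, not the library's method:

```python
import math
from collections import Counter

def discrete_semantic_entropy(responses):
    """Entropy of the distribution of responses over meaning clusters.

    Toy clustering: normalize whitespace and case, then group by exact
    match. A real implementation clusters with an NLI entailment check.
    """
    clusters = Counter(" ".join(r.lower().split()) for r in responses)
    n = len(responses)
    # Shannon entropy (natural log) over cluster probabilities.
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Consistent responses yield zero entropy (high confidence);
# disagreeing responses yield positive entropy (lower confidence).
print(discrete_semantic_entropy(["145"] * 5))
print(discrete_semantic_entropy(["$3", "$3", "$9"]))
```

Responses that agree collapse into one cluster, so the entropy is zero; disagreement spreads probability mass across clusters and raises the entropy.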
1. **Set Up LLM and Prompts.** Set up an LLM instance and load the example data prompts.
2. **Generate LLM Responses and Confidence Scores.** Generate and score LLM responses to the example questions using the `SemanticEntropy()` class.
3. **Evaluate Hallucination Detection Performance.** Visualize model accuracy at different thresholds of the black-box UQ confidence scores, and compute the precision, recall, and F1-score of hallucination detection.
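The detection metrics in the last step can be sketched as follows, assuming hallucination detection is framed as flagging any response whose confidence score falls below a chosen threshold. The threshold, scores, and labels below are illustrative, not taken from the demo data:

```python
def hallucination_detection_metrics(confidence_scores, response_correct, threshold=0.9):
    """Precision/recall/F1 for flagging likely hallucinations.

    A response is flagged when its confidence score is below `threshold`;
    the positive class is "response is incorrect".
    """
    tp = fp = fn = 0
    for score, correct in zip(confidence_scores, response_correct):
        flagged = score < threshold
        if flagged and not correct:
            tp += 1          # correctly flagged an incorrect response
        elif flagged and correct:
            fp += 1          # flagged a response that was actually correct
        elif not flagged and not correct:
            fn += 1          # missed an incorrect response
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative confidence scores and correctness labels.
scores = [1.0, 0.95, 0.75, 0.4, 0.2]
correct = [True, True, True, False, False]
print(hallucination_detection_metrics(scores, correct, threshold=0.9))
```

Sweeping the threshold and recomputing these metrics traces out the accuracy-at-threshold curves visualized in the demo.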
| | question | answer |
|---|---|---|
| 0 | There are 87 oranges and 290 bananas in Philip... | 145 |
| 1 | Marco and his dad went strawberry picking. Mar... | 19 |
| 2 | Edward spent $ 6 to buy 2 books each book cost... | 3 |
| 3 | Frank was reading through his favorite book. T... | 198 |
| 4 | There were 78 dollars in Olivia's wallet. She ... | 63 |
| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object. |
| device | str or torch.device, default="cpu" | Specifies the device that the NLI model uses for prediction. Only applies to the 'semantic_negentropy' and 'noncontradiction' scorers. Pass a `torch.device` to leverage GPU. |
| use_best | bool, default=True | Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses, based on semantic entropy clusters. Only used if `scorers` includes 'semantic_negentropy' or 'noncontradiction'. |
| system_prompt | str or None, default="You are a helpful assistant." | Optional argument for the user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| use_n_param | bool, default=False | Specifies whether to use the `n` parameter for `BaseChatModel`. Not compatible with all `BaseChatModel` classes. If used, it speeds up the generation process substantially when `num_responses` is large. |
| postprocessor | callable, default=None | A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. |
| sampling_temperature | float, default=1 | The 'temperature' parameter used by the LLM to generate sampled responses. Must be greater than 0. |
| nli_model_name | str, default="microsoft/deberta-large-mnli" | Specifies which NLI model to use. Must be an acceptable input to `AutoTokenizer.from_pretrained()` and `AutoModelForSequenceClassification.from_pretrained()`. |
| max_length | int, default=2000 | Specifies the maximum allowed string length for LLM responses for NLI computation. Responses longer than this value will be truncated in NLI computations to avoid `OutOfMemoryError`. |
**🧠 LLM-Specific:** `llm`, `system_prompt`, `sampling_temperature`

**📊 Confidence Scores:** `nli_model_name`, `use_best`, `postprocessor`

**🖥️ Hardware:** `device`

**⚡ Performance:** `max_calls_per_min`, `use_n_param`
| Method | Description |
|---|---|
| SemanticEntropy.generate_and_score | Generate LLM responses and sampled (candidate) LLM responses, and compute confidence scores for the provided prompts. 💡 **Best For:** Complete end-to-end uncertainty quantification when starting with prompts. |
| SemanticEntropy.score | Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated. 💡 **Best For:** Computing uncertainty scores when responses are already generated elsewhere. |
| | response | entropy_value | confidence_score | sampled_responses | prompt |
|---|---|---|---|---|---|
| 0 | 145 | 0.000000 | 1.000000 | [145, 145, 145, 145, 145, 145, 145, 145, 145, ... | When you solve this math problem only return t... |
| 1 | 19 pounds | 0.000000 | 1.000000 | [Nineteen pounds. , 19, 19, 19 pounds, 19, 19 ... | When you solve this math problem only return t... |
| 2 | $3 | 0.600166 | 0.749711 | [$ 9, $3, $3.00, $3, $3.00, $3, $ 3.00 ... | When you solve this math problem only return t... |
| 3 | 198 | 0.000000 | 1.000000 | [198, 198, 198, 198, 198, 198, 198, 198, 198, ... | When you solve this math problem only return t... |
| 4 | 63 | 0.000000 | 1.000000 | [63, 63, 63, 63, 63.0, 63 dollars, 63, 63 doll... | When you solve this math problem only return t... |
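The confidence scores above appear consistent with normalizing the entropy by its maximum possible value for the number of responses scored, then flipping the scale so that 1 means fully consistent. A quick arithmetic check on row 2, assuming 11 responses are scored (the original plus 10 sampled candidates; this count and the formula are inferred from the table's numbers, not documented here):

```python
import math

# Row 2 of the results table: entropy 0.600166, confidence 0.749711.
entropy = 0.600166
m = 11  # assumed: original response + 10 sampled candidates

# Hypothesis: confidence = 1 - entropy / ln(m), i.e. entropy divided by
# the maximum possible entropy ln(m), flipped so higher = more confident.
confidence = 1 - entropy / math.log(m)
print(round(confidence, 6))  # matches the table's 0.749711
```

Rows with zero entropy map to a confidence of exactly 1 under this normalization, which also matches the table.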
| | response | entropy_value | confidence_score | sampled_responses | prompt | answer | response_correct |
|---|---|---|---|---|---|---|---|
| 0 | 145 | 0.000000 | 1.000000 | [145, 145, 145, 145, 145, 145, 145, 145, 145, ... | When you solve this math problem only return t... | 145 | True |
| 1 | 19 pounds | 0.000000 | 1.000000 | [Nineteen pounds. , 19, 19, 19 pounds, 19, 19 ... | When you solve this math problem only return t... | 19 | True |
| 2 | $3 | 0.600166 | 0.749711 | [$ 9, $3, $3.00, $3, $3.00, $3, $ 3.00 ... | When you solve this math problem only return t... | 3 | True |
| 3 | 198 | 0.000000 | 1.000000 | [198, 198, 198, 198, 198, 198, 198, 198, 198, ... | When you solve this math problem only return t... | 198 | True |
| 4 | 63 | 0.000000 | 1.000000 | [63, 63, 63, 63, 63.0, 63 dollars, 63, 63 doll... | When you solve this math problem only return t... | 63 | True |