🎯 Semantic Entropy#

Black-box Uncertainty Quantification (UQ) methods treat the LLM as a black box: they estimate response-level confidence by evaluating the consistency of multiple responses generated from the same prompt. This demo illustrates a state-of-the-art black-box UQ method known as Semantic Entropy.
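Before turning to the library, a minimal sketch of the intuition may help: sample several responses, cluster them by meaning, and compute the entropy of the cluster distribution. Low entropy (all samples agree) signals high confidence; high entropy signals a likely hallucination. The sketch below clusters by normalized exact match, a crude stand-in for the NLI-based bidirectional-entailment clustering that semantic entropy actually uses; toy_semantic_entropy is an illustrative helper, not part of uqlm.

import math
from collections import Counter


def toy_semantic_entropy(responses):
    """Entropy of the cluster distribution over sampled responses."""
    # Crude clustering: normalized exact match stands in for the
    # NLI-based bidirectional-entailment clustering over meanings.
    clusters = Counter(r.strip().lower() for r in responses)
    n = len(responses)
    return max(0.0, -sum((c / n) * math.log(c / n) for c in clusters.values()))


print(toy_semantic_entropy(["145", "145", "145"]))  # 0.0 -> agreement, high confidence
print(toy_semantic_entropy(["145", "144", "141"]))  # ~1.1 -> disagreement, low confidence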

📊 What You’ll Do in This Demo#

1. Set Up LLM and Prompts: Set up an LLM instance and load example data prompts.

2. Generate LLM Responses and Confidence Scores: Generate and score LLM responses to the example questions using the SemanticEntropy() class.

3. Evaluate Hallucination Detection Performance: Visualize model accuracy at different thresholds of the semantic entropy confidence scores, and compute precision, recall, and F1-score of hallucination detection.

⚖️ Advantages & Limitations#

Pros

  • Universal Compatibility: Works with any LLM

  • Intuitive: Easy to understand and implement

  • No Internal Access Required: Doesn’t need token probabilities or model internals

Cons

  • Higher Cost: Requires multiple generations per prompt

  • Slower: Multiple generations and comparison calculations increase latency

[2]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

from uqlm.utils import load_example_dataset, math_postprocessor, plot_model_accuracies, Tuner
from uqlm import SemanticEntropy

## 1. Set up LLM and Prompts

In this demo, we will illustrate this approach using a set of math questions from the SVAMP benchmark. To implement with your use case, simply replace the example prompts with your data.

[3]:
# Load example dataset (SVAMP)
svamp = load_example_dataset("svamp", n=100)
svamp.head()
Loading dataset - svamp...
Processing dataset...
Dataset ready!
[3]:
question answer
0 There are 87 oranges and 290 bananas in Philip... 145
1 Marco and his dad went strawberry picking. Mar... 19
2 Edward spent $ 6 to buy 2 books each book cost... 3
3 Frank was reading through his favorite book. T... 198
4 There were 78 dollars in Olivia's wallet. She ... 63
[4]:
# Define prompts
MATH_INSTRUCTION = (
    "When you solve this math problem only return the answer with no additional text.\n"
)
prompts = [MATH_INSTRUCTION + prompt for prompt in svamp.question]

In this example, we use ChatVertexAI to instantiate our LLM, but any LangChain Chat Model may be used. Be sure to replace with your LLM of choice.

[5]:
# import sys
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model="gemini-pro")
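If Vertex AI is not available to you, a hypothetical swap to another provider could look like the commented lines below (ChatOpenAI and the model name are assumptions for illustration; this requires the langchain-openai package and an OPENAI_API_KEY environment variable):

# Hypothetical alternative chat model:
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)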

## 2. Generate Responses and Confidence Scores

SemanticEntropy() - Generate LLM responses and compute consistency-based confidence scores for each response.#

📋 Class Attributes#

| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A langchain llm BaseChatModel. The user is responsible for specifying temperature and other relevant parameters in the constructor of the provided llm object. |
| device | str or torch.device, default="cpu" | Specifies the device that the NLI model uses for prediction. Only applies to the 'semantic_negentropy' and 'noncontradiction' scorers. Pass a torch.device to leverage GPU. |
| use_best | bool, default=True | Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses, based on semantic entropy clusters. Only used if scorers includes 'semantic_negentropy' or 'noncontradiction'. |
| system_prompt | str or None, default="You are a helpful assistant." | Optional argument for the user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| use_n_param | bool, default=False | Specifies whether to use the n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up generation substantially when num_responses is large. |
| postprocessor | callable, default=None | A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. |
| sampling_temperature | float, default=1 | The temperature parameter for the LLM to generate sampled responses. Must be greater than 0. |
| nli_model_name | str, default="microsoft/deberta-large-mnli" | Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained(). |
| max_length | int, default=2000 | Specifies the maximum allowed string length of LLM responses for NLI computation. Longer responses are truncated in NLI computations to avoid OutOfMemoryError. |

🔍 Parameter Groups#

🧠 LLM-Specific

  • llm

  • system_prompt

  • sampling_temperature

📊 Confidence Scores

  • nli_model_name

  • use_best

  • postprocessor

🖥️ Hardware

  • device

⚡ Performance

  • max_calls_per_min

  • use_n_param

💻 Usage Examples#

import torch

# Basic usage with default parameters
se = SemanticEntropy(llm=llm)

# Using GPU acceleration, default scorers
se = SemanticEntropy(llm=llm, device=torch.device("cuda"))

# High-throughput configuration with rate limiting
se = SemanticEntropy(llm=llm, max_calls_per_min=200, use_n_param=True)
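The postprocessor hook from the attributes table can also be set at construction. For the math questions in this demo, for example, math_postprocessor (imported above) could normalize each response to its bare numeric answer before scoring; this is a sketch and is not used in the rest of the demo:

# Normalize responses (e.g., extract numeric answers) before comparison
se = SemanticEntropy(llm=llm, postprocessor=math_postprocessor)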
[6]:
import torch

# Set the torch device
if torch.cuda.is_available():  # NVIDIA GPU
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # macOS
    device = torch.device("mps")
else:
    device = torch.device("cpu")  # CPU
print(f"Using {device.type} device")
Using cuda device
[7]:
se = SemanticEntropy(
    llm=llm,
    max_calls_per_min=250,  # set value to avoid rate limit error
    device=device,  # use if GPU available
)
Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

🔄 Class Methods#


SemanticEntropy.generate_and_score

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts.

Parameters:

  • prompts - (list of str) A list of input prompts for the model.

  • num_responses - (int, default=5) The number of sampled responses used to compute consistency.

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.

SemanticEntropy.score

Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated.

Parameters:

  • responses - (list of str) A list of LLM responses for the prompts.

  • sampled_responses - (list of list of str) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses.

Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata

💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.
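Note the expected input shapes for score(): responses is a flat list with one final response per prompt, while sampled_responses holds one list of candidate responses per prompt. A hypothetical illustration (the strings are made up):

responses = ["145", "19 pounds"]               # one final response per prompt
sampled_responses = [
    ["145", "145", "145"],                     # candidates for prompt 0
    ["19", "Nineteen pounds.", "19 lbs"],      # candidates for prompt 1
]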

[8]:
results = await se.generate_and_score(
    prompts=prompts, num_responses=10,
)

# # alternative approach: directly score if responses already generated
# results = se.score(responses=responses, sampled_responses=sampled_responses)
Generating responses...
Generating candidate responses...
Computing confidence scores...
[9]:
result_df = results.to_df()
result_df.head(5)
[9]:
response entropy_value confidence_score sampled_responses prompt
0 145 0.000000 1.000000 [145, 145, 145, 145, 145, 145, 145, 145, 145, ... When you solve this math problem only return t...
1 19 pounds 0.000000 1.000000 [Nineteen pounds. , 19, 19, 19 pounds, 19, 19 ... When you solve this math problem only return t...
2 $3 0.600166 0.749711 [$ 9, $3, $3.00, $3, $3.00, $3, $ 3.00 \n \n \... When you solve this math problem only return t...
3 198 0.000000 1.000000 [198, 198, 198, 198, 198, 198, 198, 198, 198, ... When you solve this math problem only return t...
4 63 0.000000 1.000000 [63, 63, 63, 63, 63.0, 63 dollars, 63, 63 doll... When you solve this math problem only return t...

## 3. Evaluate Hallucination Detection Performance

To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note that the math_postprocessor is specific to our use case (math questions). If you are using your own prompts/questions, update the grading method accordingly.
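For instance, a free-text QA use case might grade responses with a simple normalized string match instead of math_postprocessor (a hypothetical grader; substitute whatever correctness criterion fits your data):

def exact_match_grader(response: str, answer: str) -> bool:
    """Hypothetical grader: case- and whitespace-insensitive exact match."""
    return response.strip().lower() == str(answer).strip().lower()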

[10]:
# Populate correct answers and grade responses
result_df["answer"] = svamp.answer
result_df["response_correct"] = [
    math_postprocessor(r) == a for r, a in zip(result_df["response"], svamp["answer"])
]
result_df.head(5)
[10]:
response entropy_value confidence_score sampled_responses prompt answer response_correct
0 145 0.000000 1.000000 [145, 145, 145, 145, 145, 145, 145, 145, 145, ... When you solve this math problem only return t... 145 True
1 19 pounds 0.000000 1.000000 [Nineteen pounds. , 19, 19, 19 pounds, 19, 19 ... When you solve this math problem only return t... 19 True
2 $3 0.600166 0.749711 [$ 9, $3, $3.00, $3, $3.00, $3, $ 3.00 \n \n \... When you solve this math problem only return t... 3 True
3 198 0.000000 1.000000 [198, 198, 198, 198, 198, 198, 198, 198, 198, ... When you solve this math problem only return t... 198 True
4 63 0.000000 1.000000 [63, 63, 63, 63, 63.0, 63 dollars, 63, 63 doll... When you solve this math problem only return t... 63 True
[11]:
print(f"""Baseline LLM accuracy: {np.mean(result_df["response_correct"])}""")
Baseline LLM accuracy: 0.68

3.1 Filtered LLM Accuracy Evaluation#

Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.

We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs.
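Concretely, filtered accuracy at a threshold t is simply the accuracy over the subset of responses whose confidence score is at least t. The sketch below makes the metric explicit; it is our illustration of the calculation, not necessarily how plot_model_accuracies computes it internally:

def filtered_accuracy(scores, correct, threshold):
    """Accuracy and sample size among responses with confidence >= threshold."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    return (sum(kept) / len(kept) if kept else float("nan")), len(kept)


for t in [0.0, 0.5, 0.9]:
    acc, n = filtered_accuracy(result_df.confidence_score, result_df.response_correct, t)
    print(f"threshold={t:.1f}: accuracy={acc:.2f} over {n} responses")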

[12]:
plot_model_accuracies(
    scores=result_df.confidence_score, correct_indicators=result_df.response_correct
)
[Plot: filtered LLM accuracy and sample size across confidence-score thresholds]

3.2 Precision, Recall, F1-Score of Hallucination Detection#

Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. Using this threshold, we compute the precision, recall, and F1-score of black-box scorer predictions of whether responses are correct.
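Conceptually, tuning amounts to a grid search for the cutoff that maximizes F1. The stand-in below shows the objective being optimized (an assumption about what Tuner.tune_threshold does, not its actual implementation; np and f1_score were imported in the first cell):

def f1_optimal_threshold(y_scores, correct_indicators, num_thresholds=101):
    """Grid-search the confidence threshold that maximizes F1."""
    grid = np.linspace(0, 1, num_thresholds)
    f1s = [
        f1_score(correct_indicators, [int(s > t) for s in y_scores]) for t in grid
    ]
    return grid[int(np.argmax(f1s))]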

[13]:
# instantiate UQLM tuner object for threshold selection
t = Tuner()

# Define score vector and corresponding correct indicators (i.e. ground truth)
y_scores = result_df["confidence_score"]  # confidence score
correct_indicators = result_df.response_correct * 1  # whether each response is actually correct

# Solve for threshold that maximizes F1-score
best_threshold = t.tune_threshold(
    y_scores=y_scores,
    correct_indicators=correct_indicators,
    thresh_objective="fbeta_score",
)
y_pred = [
    (s > best_threshold) * 1 for s in y_scores
]  # predicts whether response is correct based on confidence score
print(f"Semantic entropy F1-optimal threshold: {best_threshold}")
Semantic entropy F1-optimal threshold: 0.54
[14]:
# evaluate precision, recall, and f1-score of semantic entropy predictions of correctness
print(
    f"Semantic entropy precision: {precision_score(y_true=correct_indicators, y_pred=y_pred)}"
)
print(
    f"Semantic entropy recall: {recall_score(y_true=correct_indicators, y_pred=y_pred)}"
)
print(
    f"Semantic entropy f1-score: {f1_score(y_true=correct_indicators, y_pred=y_pred)}"
)
Semantic entropy precision: 0.7951807228915663
Semantic entropy recall: 0.9705882352941176
Semantic entropy f1-score: 0.8741721854304636

© 2025 CVS Health and/or one of its affiliates. All rights reserved.