🎯 Semantic Density
This demo illustrates a state-of-the-art uncertainty quantification (UQ) approach known as semantic density. The semantic density method combines elements of black-box UQ (which generates multiple responses from the same prompt) and white-box UQ (which uses token probabilities of those generated responses) to compute density values. Intuitively, semantic density combines both signals to estimate a probability distribution for scoring each response. This method was proposed by Qiu et al. (2024) and is demonstrated in this notebook.
📊 What You’ll Do in This Demo
1. Set up LLM and prompts: Set up an LLM instance and load the example prompts.
2. Generate LLM Responses and Confidence Scores: Generate and score LLM responses to the example questions using the SemanticDensity() class.
3. Evaluate Hallucination Detection Performance: Visualize model accuracy at different thresholds of the semantic density score. Compute precision, recall, and F1-score of hallucination detection.
⚖️ Advantages & Limitations
Pros
Broad Compatibility: Works with any LLM that exposes token probabilities
Intuitive: Easy to understand and implement
Response-wise: Evaluates the trustworthiness of each response separately
Cons
Higher Cost: Requires multiple generations per prompt
Slower: Multiple generations and comparison calculations increase latency
Internal Access Required: Needs access to token probabilities
[1]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
from uqlm.utils import load_example_dataset, plot_model_accuracies, LLMGrader, Tuner
from uqlm.scorers import SemanticDensity
## 1. Set up LLM and Prompts
In this demo, we will illustrate this approach using a set of short answer questions from the SimpleQA benchmark. To implement with your use case, simply replace the example prompts with your data.
[2]:
# Load example dataset (simpleqa)
simpleqa = load_example_dataset("simpleqa", n=200)
simpleqa.head()
Loading dataset - simpleqa...
Processing dataset...
Dataset ready!
[2]:
| | question | answer |
|---|---|---|
| 0 | How much money, in euros, was the surgeon held... | 120,000 euros |
| 1 | What is the name of the former Prime Minister ... | Jóhanna Sigurðardóttir |
| 2 | To whom did Mehbooba Mufti Sayed contest the 2... | Hasnain Masoodi |
| 3 | In which year did Melbourne's Monash Gallery o... | 2023 |
| 4 | Who requested the Federal Aviation Administrat... | The Coast Guard |
[3]:
# Define prompts
INSTRUCTION = "You will be given a question. Return only the answer as concisely as possible without providing an explanation.\n"
prompts = [INSTRUCTION + prompt for prompt in simpleqa.question]
[4]:
# import sys
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model_name="gemini-2.5-pro")
## 2. Generate responses and confidence scores
SemanticDensity() - Generate LLM responses and compute consistency-based confidence scores for each response.
📋 Class Attributes
| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A LangChain chat model. |
| device | str or torch.device, default="cpu" | Specifies the device that the NLI model uses for prediction. Only applies to the ‘noncontradiction’ scorer. Pass a torch.device to leverage GPU. |
| system_prompt | str or None, default="You are a helpful assistant." | Optional argument for the user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| length_normalize | bool, default=True | Determines whether response probabilities are length-normalized. Recommended to set as True when longer responses are expected. |
| use_n_param | bool, default=False | Specifies whether to use the n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large. |
| postprocessor | callable, default=None | A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. |
| sampling_temperature | float, default=1 | The ‘temperature’ parameter the LLM uses to generate sampled responses. Must be greater than 0. |
| nli_model_name | str, default="microsoft/deberta-large-mnli" | Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained(). |
| max_length | int, default=2000 | Specifies the maximum allowed string length for LLM responses for NLI computation. Responses longer than this value will be truncated in NLI computations to avoid OutOfMemoryError. |
| return_responses | str, default="all" | If a postprocessor is used, specifies whether to return only postprocessed responses, only raw responses, or both. Specified with ‘postprocessed’, ‘raw’, or ‘all’, respectively. |
🔍 Parameter Groups
🧠 LLM-Specific: llm, system_prompt, sampling_temperature
📊 Confidence Scores: length_normalize, nli_model_name, postprocessor
🖥️ Hardware: device
⚡ Performance: max_calls_per_min, use_n_param
💻 Usage Examples
# Basic usage with default parameters
sd = SemanticDensity(llm=llm)
# Using GPU acceleration (requires torch to be imported)
import torch
sd = SemanticDensity(llm=llm, device=torch.device("cuda"))
# High-throughput configuration with rate limiting
sd = SemanticDensity(llm=llm, max_calls_per_min=200, use_n_param=True)
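The `postprocessor` attribute is described above only as a function that takes a string and returns a string. A minimal sketch of such a function follows; the name `strip_answer_prefix` and the specific cleanup rules are hypothetical choices for illustration, not something UQLM prescribes.

```python
import re


def strip_answer_prefix(text: str) -> str:
    """Hypothetical postprocessor: trim whitespace and drop a leading 'Answer:' label."""
    cleaned = text.strip()
    # Remove a leading "Answer:" or "Final answer:" label, case-insensitively.
    cleaned = re.sub(r"^(final answer|answer)\s*:\s*", "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()


print(strip_answer_prefix("  Answer: The Coast Guard "))
```

Assuming the constructor accepts it as documented in the attribute table, the function would be passed as `SemanticDensity(llm=llm, postprocessor=strip_answer_prefix)`.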
[5]:
import torch
# Set the torch device
if torch.cuda.is_available():  # NVIDIA GPU
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # macOS
    device = torch.device("mps")
else:
    device = torch.device("cpu")  # CPU
print(f"Using {device.type} device")
Using cuda device
[6]:
sd = SemanticDensity(
    llm=llm,
    max_calls_per_min=250,  # set value to avoid rate limit error
    device=device,  # use if GPU available
    length_normalize=True,
)
🔄 Class Methods
| Method | Description & Parameters |
|---|---|
| SemanticDensity.generate_and_score | Generate LLM responses, sampled LLM (candidate) responses, and compute density scores for the provided prompts. Parameters (as used in this notebook): prompts, num_responses. Returns: UQResult containing data (prompts, responses, sampled responses, and density score) and metadata. 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts. |
| SemanticDensity.score | Compute density scores on provided LLM responses. Should only be used if responses and sampled responses are already generated. Parameters (as used in this notebook): responses, sampled_responses. Returns: UQResult containing data (responses, sampled responses, and density score) and metadata. 💡 Best For: Computing uncertainty scores when responses, sampled responses, and logprobs are already generated elsewhere. |
[11]:
results = await sd.generate_and_score(prompts=prompts, num_responses=5)
# # alternative approach: directly score if responses already generated
# results = sd.score(responses=responses, sampled_responses=sampled_responses)
[12]:
result_df = results.to_df()
result_df.head(5)
[12]:
| | response | sampled_responses | prompt | semantic_density_value | multiple_logprob |
|---|---|---|---|---|---|
| 0 | €120,000 | [€120,000, €120,000, €136,000, €120,000, €120,... | You will be given a question. Return only the ... | 0.865526 | [[{'token': '€', 'logprob': -4.172499757260084... |
| 1 | Jóhanna Sigurðardóttir | [Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... | You will be given a question. Return only the ... | 0.992922 | [[{'token': 'J', 'logprob': -9.536738616588991... |
| 2 | Hasnain Masoodi | [Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... | You will be given a question. Return only the ... | 0.993829 | [[{'token': 'Has', 'logprob': -2.5987286790041... |
| 3 | 2023 | [2023, 2023, 2022, 2022, 2022] | You will be given a question. Return only the ... | 0.941380 | [[{'token': '2', 'logprob': -1.430510337740997... |
| 4 | BP and the U.S. Coast Guard | [BP, BP, BP and the U.S. Coast Guard, BP (Brit... | You will be given a question. Return only the ... | 0.978186 | [[{'token': 'BP', 'logprob': -4.20799915445968... |
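The `multiple_logprob` column above holds per-token entries of the form `{'token': ..., 'logprob': ...}`. As a rough illustration of what the `length_normalize` option could mean for such entries, the sketch below converts a token list into a sequence probability, assuming length normalization is the geometric mean of the token probabilities. The function name and the example values are made up for illustration and are not taken from UQLM's implementation.

```python
import math


def sequence_probability(token_logprobs, length_normalize=True):
    """Turn a list of {'token', 'logprob'} dicts into a single probability."""
    logprobs = [entry["logprob"] for entry in token_logprobs]
    if not logprobs:
        raise ValueError("token_logprobs must contain at least one token")
    total = sum(logprobs)  # log of the joint probability of the sequence
    if length_normalize:
        total /= len(logprobs)  # mean log-probability, i.e. geometric mean of token probabilities
    return math.exp(total)


# Made-up logprobs for a two-token response.
example = [{"token": "20", "logprob": -0.1}, {"token": "23", "logprob": -0.3}]
print(round(sequence_probability(example, length_normalize=False), 4))
print(round(sequence_probability(example, length_normalize=True), 4))
```

Without normalization, longer responses receive systematically lower probabilities simply because more factors below 1 are multiplied together, which is one reason the option is recommended when longer responses are expected.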
## 3. Evaluate Hallucination Detection Performance
To evaluate hallucination detection performance, we ‘grade’ the responses against an answer key. Here, we use UQLM’s out-of-the-box LLMGrader, which works with any LangChain chat model, but you may replace it with a grading method of your choice. Some notable alternatives are Vectara HHEM and AlignScore. If you are using your own prompts/questions, be sure to update the grading method accordingly.
[14]:
# Populate correct answers and grade responses
gemini_flash = ChatVertexAI(model="gemini-2.5-flash")
grader = LLMGrader(llm=gemini_flash)
result_df["answer"] = simpleqa["answer"]
result_df["response_correct"] = await grader.grade_responses(prompts=simpleqa["question"].to_list(), responses=result_df["response"].to_list(), answers=simpleqa["answer"].to_list())
result_df.head(5)
[14]:
| | response | sampled_responses | prompt | semantic_density_value | multiple_logprob | answer | response_correct |
|---|---|---|---|---|---|---|---|
| 0 | €120,000 | [€120,000, €120,000, €136,000, €120,000, €120,... | You will be given a question. Return only the ... | 0.865526 | [[{'token': '€', 'logprob': -4.172499757260084... | 120,000 euros | True |
| 1 | Jóhanna Sigurðardóttir | [Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... | You will be given a question. Return only the ... | 0.992922 | [[{'token': 'J', 'logprob': -9.536738616588991... | Jóhanna Sigurðardóttir | True |
| 2 | Hasnain Masoodi | [Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... | You will be given a question. Return only the ... | 0.993829 | [[{'token': 'Has', 'logprob': -2.5987286790041... | Hasnain Masoodi | True |
| 3 | 2023 | [2023, 2023, 2022, 2022, 2022] | You will be given a question. Return only the ... | 0.941380 | [[{'token': '2', 'logprob': -1.430510337740997... | 2023 | True |
| 4 | BP and the U.S. Coast Guard | [BP, BP, BP and the U.S. Coast Guard, BP (Brit... | You will be given a question. Return only the ... | 0.978186 | [[{'token': 'BP', 'logprob': -4.20799915445968... | The Coast Guard | False |
[15]:
print(f"""Baseline LLM accuracy: {np.mean(result_df["response_correct"])}""")
Baseline LLM accuracy: 0.54
### 3.1 Filtered LLM Accuracy Evaluation
Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our density scores. Filtered accuracy measures the change in LLM performance when responses with density scores below a specified threshold are excluded. By adjusting the density score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.
We will plot the filtered accuracy across various density score thresholds to visualize the relationship between density and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs.
[16]:
# Semantic Density
plot_model_accuracies(scores=result_df.semantic_density_value, correct_indicators=result_df.response_correct, display_percentage=True)
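To make the filtered accuracy metric concrete, here is a minimal sketch of the calculation for a single threshold: responses scoring below the threshold are dropped, and accuracy is computed on what remains. The helper name `filtered_accuracy` and the toy scores are invented for illustration; this is not the code behind `plot_model_accuracies`.

```python
def filtered_accuracy(scores, correct, threshold):
    """Return (accuracy, sample_size) over responses with score >= threshold."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    if not kept:
        return float("nan"), 0  # nothing survives the filter
    return sum(kept) / len(kept), len(kept)


# Toy data: five density scores and whether each response was graded correct.
scores = [0.95, 0.90, 0.60, 0.40, 0.20]
correct = [True, True, False, True, False]

for threshold in (0.0, 0.5, 0.9):
    accuracy, n = filtered_accuracy(scores, correct, threshold)
    print(f"threshold={threshold:.1f}  accuracy={accuracy:.3f}  sample_size={n}")
```

On this toy data, raising the threshold from 0.0 to 0.9 lifts accuracy from 0.6 to 1.0 while shrinking the retained sample from 5 to 2 responses, which is the coverage versus accuracy trade-off described above.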
### 3.2 Precision, Recall, F1-Score of Hallucination Detection
Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. The threshold is tuned on the first half of the data; on the second half, we use it to compute precision, recall, and F1-score for the semantic density scorer's predictions of whether responses are correct.
[17]:
# instantiate UQLM tuner object for threshold selection
split = len(result_df) // 2
t = Tuner()
correct_indicators = (result_df.response_correct) * 1  # whether each response is actually correct
metric_values = {"Precision": [], "Recall": [], "F1-score": []}
optimal_thresholds = []
# tune threshold on first half
y_scores = result_df["semantic_density_value"]
y_scores_tune = y_scores[0:split]
y_true_tune = correct_indicators[0:split]
best_threshold = t.tune_threshold(y_scores=y_scores_tune, correct_indicators=y_true_tune, thresh_objective="fbeta_score")
y_pred = [(s > best_threshold) * 1 for s in y_scores] # predicts whether response is correct based on confidence score
optimal_thresholds.append(best_threshold)
# evaluate on last half
y_true_eval = correct_indicators[split:]
y_pred_eval = y_pred[split:]
metric_values["Precision"].append(precision_score(y_true=y_true_eval, y_pred=y_pred_eval))
metric_values["Recall"].append(recall_score(y_true=y_true_eval, y_pred=y_pred_eval))
metric_values["F1-score"].append(f1_score(y_true=y_true_eval, y_pred=y_pred_eval))
# print results
header = f"{'Metrics':<25}" + f"{'semantic_density_value':<35}"
print("=" * len(header) + "\n" + header + "\n" + "-" * len(header))
for metric in metric_values.keys():
    print(f"{metric:<25}" + "".join([f"{round(x_, 3):<35}" for x_ in metric_values[metric]]))
print("-" * len(header))
print(f"{'F-1 optimal threshold':<25}" + "".join([f"{round(x_, 3):<35}" for x_ in optimal_thresholds]))
print("=" * len(header))
============================================================
Metrics semantic_density_value
------------------------------------------------------------
Precision 0.765
Recall 0.765
F1-score 0.765
------------------------------------------------------------
F-1 optimal threshold 0.61
============================================================
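For intuition about what F1-based threshold selection involves, the sketch below runs a plain grid search over candidate thresholds and keeps the one with the highest F1-score. It is an illustrative stand-in under simple assumptions (a uniform grid on [0, 1], predictions made with `score > threshold` as in the cell above) and is not a description of how `Tuner.tune_threshold` works internally. The function names and toy data are hypothetical.

```python
def f1_at_threshold(scores, correct, threshold):
    """F1-score when predicting 'correct' for every score above the threshold."""
    preds = [s > threshold for s in scores]
    tp = sum(1 for p, c in zip(preds, correct) if p and c)
    fp = sum(1 for p, c in zip(preds, correct) if p and not c)
    fn = sum(1 for p, c in zip(preds, correct) if not p and c)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores, correct, n_steps=100):
    """Grid search over n_steps + 1 evenly spaced thresholds in [0, 1]."""
    candidates = [i / n_steps for i in range(n_steps + 1)]
    # max() returns the first (lowest) threshold among ties.
    return max(candidates, key=lambda t: f1_at_threshold(scores, correct, t))


# Toy tuning split: density scores and graded correctness.
scores = [0.95, 0.90, 0.60, 0.40, 0.20]
correct = [True, True, False, True, False]

threshold = best_f1_threshold(scores, correct)
print(threshold, round(f1_at_threshold(scores, correct, threshold), 3))
```

As in the notebook cell, a threshold chosen this way should be evaluated on data that was not used for tuning, otherwise the reported metrics will be optimistic.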
## 4. Scorer Definition
Semantic Density
Semantic Density (SD) approximates a probability density function (PDF) in semantic space for estimating response correctness. Given a prompt \(x\) with candidate response \(y_*\), the objective is to construct a PDF that assigns higher density to regions in the semantic space that correspond to correct responses. We begin by sampling \(M\) unique reference responses \(y_i\) (for \(i = 1, 2, \dots, M\)) conditioned on \(x\). For any pair of responses \(y_i, y_j\) with corresponding embeddings \(v_i, v_j\), the semantic distance is estimated as
\[\mathbb{E}\left(\|v_i - v_j\|^2\right) = p_c(y_i, y_j \mid x) + \frac{1}{2}\, p_n(y_i, y_j \mid x)\]
where \(p_c, p_n\) denote the contradiction and neutrality scores returned by a natural language inference (NLI) model, respectively. This estimated distance is incorporated in the kernel function \(K\) to smooth out the reference responses into a continuous distribution. The kernel function value can be obtained as
\[K(v_i, v_j) = \left(1 - \|v_i - v_j\|^2\right) \mathbf{1}_{\|v_i - v_j\| \leq 1}\]
where \(\mathbf{1}\) is the indicator function such that \(\mathbf{1}_{\text{condition}} = 1\) when the condition holds and \(0\) otherwise. The final semantic density score is computed as
\[SD(y_* \mid x) = \frac{1}{\sum_{i=1}^{M} \sqrt[L_i]{p(y_i \mid x)}} \sum_{i=1}^{M} \sqrt[L_i]{p(y_i \mid x)}\, K(v_*, v_i)\]
where \(p(y_i \mid x)\) is the probability the LLM assigns to \(y_i\) and \(L_i\) denotes the length of \(y_i\).
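The definition can be traced numerically with a small sketch. It assumes the squared semantic distance is estimated as \(p_c + \tfrac{1}{2} p_n\), that the kernel equals one minus that squared distance whenever the distance is at most 1 (and 0 otherwise), and that each reference response is weighted by its length-normalized probability. The NLI scores and token logprobs below are made up, and the function names are hypothetical; in practice the scores would come from an NLI model such as the one named in `nli_model_name`.

```python
import math


def kernel_from_nli(p_contradiction, p_neutral):
    """Kernel value from NLI scores, with squared distance p_c + 0.5 * p_n."""
    sq_dist = p_contradiction + 0.5 * p_neutral
    # Indicator: the kernel is zero once the distance exceeds 1.
    return 1.0 - sq_dist if sq_dist <= 1.0 else 0.0


def semantic_density(nli_scores, token_logprobs):
    """Weighted average of kernel values over the reference responses.

    nli_scores: list of (p_contradiction, p_neutral), candidate vs. each reference.
    token_logprobs: list of per-token logprob lists, one per reference.
    """
    # Length-normalized probability: exp(mean logprob) is the L_i-th root of p(y_i | x).
    weights = [math.exp(sum(lp) / len(lp)) for lp in token_logprobs]
    kernels = [kernel_from_nli(pc, pn) for pc, pn in nli_scores]
    return sum(w * k for w, k in zip(weights, kernels)) / sum(weights)


# Made-up inputs: two references agree with the candidate, one contradicts it.
nli_scores = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
token_logprobs = [[-0.1, -0.1], [-0.1, -0.1], [-0.1, -0.1]]
print(round(semantic_density(nli_scores, token_logprobs), 3))
```

With equal weights, the score reduces to the share of references that agree with the candidate, here two out of three. A purely neutral reference (`p_neutral = 1`) would contribute a kernel value of 0.5 rather than 0 or 1.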
© 2025 CVS Health and/or one of its affiliates. All rights reserved.