🎯 BS Detector: Off-the-Shelf Ensemble for LLM Uncertainty#

Ensemble UQ methods combine multiple individual scorers to provide a more robust and accurate uncertainty estimate. This demo illustrates the BS Detector method proposed in Chen & Mueller, 2023. It uses three components:

Two black-box components: exact match rate and noncontradiction probability
One LLM-as-a-Judge component (self-judge)

📊 What You’ll Do in This Demo#

Set up LLM and prompts.

Set up LLM instance and load example data prompts.

Generate LLM Responses and Confidence Scores

Generate and score LLM responses to the example questions using the UQEnsemble() class.

Evaluate Hallucination Detection Performance

Visualize model accuracy at different thresholds of the ensemble score, combining exact match rate, noncontradiction probability, and self-judge. Compute precision, recall, and F1-score of hallucination detection.

⚖️ Advantages & Limitations#

Pros

Universal Compatibility: Works with any LLM
Intuitive: Easy to understand and implement
No Internal Access Required: Doesn’t need token probabilities or model internals

Cons

Higher Cost: Requires multiple generations per prompt
Slower: Multiple generations and comparison calculations increase latency

[1]:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

from uqlm import UQEnsemble
from uqlm.utils import load_example_dataset, plot_model_accuracies, LLMGrader, Tuner

1. Set up LLM and Prompts#

In this demo, we will illustrate this approach using a set of short answer questions from the HotpotQA benchmark. To implement with your use case, simply replace the example prompts with your data.

[2]:

# Load example dataset (hotpotqa)
hotpotqa = load_example_dataset("hotpotqa", n=200)
hotpotqa.head()

Loading dataset - hotpotqa...
Processing dataset...
Dataset ready!

[2]:

	question	answer
0	Were Scott Derrickson and Ed Wood of the same ...	yes
1	What government position was held by the woman...	Chief of Protocol
2	What science fantasy young adult series, told ...	Animorphs
3	Are the Laleli Mosque and Esma Sultan Mansion ...	no
4	The director of the romantic comedy "Big Stone...	Greenwich Village, New York City

[3]:

# Define prompts
INSTRUCTION = "You will be given a question. Return only the answer as concisely as possible without providing an explanation.\n"
prompts = [INSTRUCTION + prompt for prompt in hotpotqa.question]

In this example, we use ChatVertexAI to instantiate our LLM, but any LangChain Chat Model may be used. Be sure to replace with your LLM of choice.

[4]:

# Alternative example (open model with Ollama):
# import sys
# !{sys.executable} -m pip install langchain-ollama

# from langchain_ollama import ChatOllama
# llm = ChatOllama(model="llama3")


# API example:
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI

gemini_flash = ChatVertexAI(model="gemini-2.5-flash")

2. Generate responses and confidence scores#

`UQEnsemble()` - Ensemble of uncertainty scorers#

📋 Class Attributes#

Parameter	Type & Default	Description
llm	BaseChatModeldefault=None	A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object.
device	str or torch.devicedefault=”cpu”	Specifies the device that NLI model use for prediction. Only applies to ‘semantic_negentropy’, ‘noncontradiction’ scorers. Pass a torch.device to leverage GPU.
use_best	booldefault=True	Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses based on semantic entropy clusters. Only used if `scorers` includes ‘semantic_negentropy’ or ‘noncontradiction’.
system_prompt	str or Nonedefault=”You are a helpful assistant.”	Optional argument for user to provide custom system prompt for the LLM.
max_calls_per_min	intdefault=None	Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified.
use_n_param	booldefault=False	Specifies whether to use n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large.
postprocessor	callabledefault=None	A user-defined function that takes a string input and returns a string. Used for postprocessing outputs.
sampling_temperature	floatdefault=1	The ‘temperature’ parameter for LLM model to generate sampled LLM responses. Must be greater than 0.
nli_model_name	strdefault=”microsoft/deberta-large-mnli”	Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().
return_responses	strdefault=”all”	If a postprocessor is used, specifies whether to return only postprocessed responses, only raw responses, or both. Specified with ‘postprocessed’, ‘raw’, or ‘all’, respectively.

🔍 Parameter Groups#

🧠 LLM-Specific

llm
system_prompt
sampling_temperature

📊 Confidence Scores

nli_model_name
use_best
postprocessor

🖥️ Hardware

device

⚡ Performance

max_calls_per_min
use_n_param

💻 Usage Examples#

# Basic usage with default parameters
bsd = UQEnsemble(llm=llm)

# Using GPU acceleration
bsd = UQEnsemble(llm=llm, device=torch.device("cuda"))

# High-throughput configuration with rate limiting
bsd = UQEnsemble(llm=llm, max_calls_per_min=200, use_n_param=True)

[5]:

import torch

# Set the torch device
if torch.cuda.is_available():  # NVIDIA GPU
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # macOS
    device = torch.device("mps")
else:
    device = torch.device("cpu")  # CPU
print(f"Using {device.type} device")

Using cuda device

[6]:

bsd = UQEnsemble(llm=gemini_flash, device=device)

🔄 Class Methods#

Method	Description & Parameters
UQEnsemble.generate_and_score	Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts. Parameters: prompts - (list of str) A list of input prompts for the model. num_responses - (int, default=5) The number of sampled responses used to compute consistency. show_progress_bars - (bool, default=True) If True, displays a progress bar while generating and scoring responses. Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.
UQEnsemble.score	Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated. Parameters: prompts - (list of str) A list of input prompts for the LLM. responses - (list of str) A list of LLM responses for the prompts. sampled_responses - (list of list of str) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses. show_progress_bars - (bool, default=True) If True, displays a progress bar while scoring responses. Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata 💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.

Method

Description & Parameters

UQEnsemble.generate_and_score

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts.

Parameters:

prompts - (list of str) A list of input prompts for the model.
num_responses - (int, default=5) The number of sampled responses used to compute consistency.
show_progress_bars - (bool, default=True) If True, displays a progress bar while generating and scoring responses.

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.

UQEnsemble.score

Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated.

Parameters:

prompts - (list of str) A list of input prompts for the LLM.
responses - (list of str) A list of LLM responses for the prompts.
sampled_responses - (list of list of str) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses.
show_progress_bars - (bool, default=True) If True, displays a progress bar while scoring responses.

Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata

💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.

[7]:

results = await bsd.generate_and_score(prompts=prompts, num_responses=5)

# # alternative approach: directly score if responses already generated
# results = bsd.score(prompts=prompts, responses=responses, sampled_responses=sampled_responses)

[8]:

# preview results
result_df = results.to_df()
result_df.head(5)

[8]:

	response	sampled_responses	prompt	ensemble_score	noncontradiction	exact_match	judge_1
0	Yes	[Yes, Yes, Yes, Yes, Yes]	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0
1	Ambassador	[Ambassador, Chief of Protocol, Ambassador, Am...	You will be given a question. Return only the ...	0.894327	0.911298	0.6	1.0
2	Animorphs	[Animorphs, Animorphs, Animorphs, Animorphs, A...	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0
3	No	[No, No, No, No, No]	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0
4	New York City	[New York City, New York City, New York City, ...	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0

3. Evaluate Hallucination Detection Performance#

To evaluate hallucination detection performance, we ‘grade’ the responses against an answer key. Here, we use UQLM’s out-of-the-box LLM Grader, which can be used with LangChain Chat Model, but you may replace this with a grading method of your choice. Some notable alternatives are Vectara HHEM and AlignScore. If you are using your own prompts/questions, be sure to update the grading method accordingly.

[12]:

# Populate correct answers and grade responses
grader = LLMGrader(llm=gemini_flash)

result_df["answer"] = hotpotqa["answer"]
result_df["response_correct"] = await grader.grade_responses(prompts=hotpotqa["question"].to_list(), responses=result_df["response"].to_list(), answers=hotpotqa["answer"].to_list())
result_df.head(5)

[12]:

	response	sampled_responses	prompt	ensemble_score	noncontradiction	exact_match	judge_1	answer	response_correct
0	Yes	[Yes, Yes, Yes, Yes, Yes]	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0	yes	True
1	Ambassador	[Ambassador, Chief of Protocol, Ambassador, Am...	You will be given a question. Return only the ...	0.894327	0.911298	0.6	1.0	Chief of Protocol	False
2	Animorphs	[Animorphs, Animorphs, Animorphs, Animorphs, A...	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0	Animorphs	True
3	No	[No, No, No, No, No]	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0	no	True
4	New York City	[New York City, New York City, New York City, ...	You will be given a question. Return only the ...	1.000000	1.000000	1.0	1.0	Greenwich Village, New York City	True

[13]:

print(f"""Baseline LLM accuracy: {np.mean(result_df["response_correct"])}""")

Baseline LLM accuracy: 0.635

Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.

We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs.

[14]:

plot_model_accuracies(scores=result_df.ensemble_score, correct_indicators=result_df.response_correct, display_percentage=True)

../../_images/_notebooks_examples_ensemble_off_the_shelf_demo_21_0.png

Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. Using this threshold, we compute precision, recall, and F1-score for black box scorer predictions of whether responses are correct.

[15]:

# instantiate UQLM tuner object for threshold selection
split = len(result_df) // 2
t = Tuner()

correct_indicators = (result_df.response_correct) * 1  # Whether responses is actually correct
metric_values = {"Precision": [], "Recall": [], "F1-score": []}
optimal_thresholds = []
for confidence_score in ["ensemble_score"]:
    # tune threshold on first half
    y_scores = result_df[confidence_score]
    y_scores_tune = y_scores[0:split]
    y_true_tune = correct_indicators[0:split]
    best_threshold = t.tune_threshold(y_scores=y_scores_tune, correct_indicators=y_true_tune, thresh_objective="fbeta_score")

    y_pred = [(s > best_threshold) * 1 for s in y_scores]  # predicts whether response is correct based on confidence score
    optimal_thresholds.append(best_threshold)

    # evaluate on last half
    y_true_eval = correct_indicators[split:]
    y_pred_eval = y_pred[split:]
    metric_values["Precision"].append(precision_score(y_true=y_true_eval, y_pred=y_pred_eval))
    metric_values["Recall"].append(recall_score(y_true=y_true_eval, y_pred=y_pred_eval))
    metric_values["F1-score"].append(f1_score(y_true=y_true_eval, y_pred=y_pred_eval))

# print results
header = f"{'Metrics':<25}" + "".join([f"{scorer_name:<25}" for scorer_name in ["ensemble_score"]])
print("=" * len(header) + "\n" + header + "\n" + "-" * len(header))
for metric in metric_values.keys():
    print(f"{metric:<25}" + "".join([f"{round(x_, 3):<25}" for x_ in metric_values[metric]]))
print("-" * len(header))
print(f"{'F-1 optimal threshold':<25}" + "".join([f"{round(x_, 3):<25}" for x_ in optimal_thresholds]))
print("=" * len(header))

==================================================
Metrics                  ensemble_score
--------------------------------------------------
Precision                0.787
Recall                   0.787
F1-score                 0.787
--------------------------------------------------
F-1 optimal threshold    0.85
==================================================