{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Black-Box Uncertainty Quantification\n", "\n", "
\n", "

\n", " Black-box Uncertainty Quantification (UQ) methods treat the LLM as a black box and evaluate \n", " consistency of multiple responses generated from the same prompt to estimate response-level confidence. This demo provides an illustration \n", " of how to use state-of-the-art black-box UQ methods with uqlm. The following scorers are available:\n", "

\n", " \n", "* Non-Contradiction Probability ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Lin et al., 2025](https://arxiv.org/abs/2305.19187); [Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n", "* Semantic Negentropy (based on [Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/pdf/2302.09664))\n", "* Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))\n", "* BERTScore ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zhang et al., 2020](https://arxiv.org/abs/1904.09675))\n", "* BLEURT ([Sellam et al., 2020](https://arxiv.org/abs/2004.04696))\n", "* Normalized Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/pdf/2412.05563); [HuggingFace](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))\n", "
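To make the consistency idea concrete, here is a minimal sketch in plain Python of the simplest scorer in the list above, Exact Match: the confidence is the fraction of sampled responses that are identical to the original response. The function name `exact_match_rate` and the whitespace stripping are illustrative assumptions, not the uqlm implementation.

```python
from typing import List


def exact_match_rate(response: str, sampled_responses: List[str]) -> float:
    # Fraction of sampled responses identical to the original response.
    # Assumption: surrounding whitespace is ignored before comparing.
    if not sampled_responses:
        raise ValueError('at least one sampled response is required')
    target = response.strip()
    matches = sum(1 for s in sampled_responses if s.strip() == target)
    return matches / len(sampled_responses)


# Toy inputs: three of the four samples agree with the original response.
score = exact_match_rate('63', ['63', '63 ', '64', '63'])
print(score)  # 0.75
```

A response whose samples all agree scores 1.0, while one whose samples all differ scores 0.0; the other scorers replace the strict equality check with softer notions of agreement.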
\n", "\n", "## 📊 What You'll Do in This Demo\n", "\n", "
1. **Set up LLM and prompts.** Set up an LLM instance and load the example prompts.\n",
"2. **Generate LLM responses and confidence scores.** Generate and score LLM responses to the example questions using the `BlackBoxUQ()` class.\n",
"3. **Evaluate hallucination detection performance.** Visualize model accuracy at different thresholds of the various black-box UQ confidence scores. Compute precision, recall, and F1-score of hallucination detection.
\n", "\n", "## ⚖️ Advantages & Limitations\n", "\n", "
\n", "
\n", "

Pros

\n", " \n", "
\n", " \n", "
\n", "

Cons

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "from uqlm import BlackBoxUQ\n", "from uqlm.utils import load_example_dataset, math_postprocessor, plot_model_accuracies, Tuner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Set up LLM and Prompts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we will illustrate this approach using a set of math questions from the [SVAMP benchmark](https://arxiv.org/abs/2103.07191). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset - svamp...\n", "Processing dataset...\n", "Dataset ready!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0There are 87 oranges and 290 bananas in Philip...145
1Marco and his dad went strawberry picking. Mar...19
2Edward spent $ 6 to buy 2 books each book cost...3
3Frank was reading through his favorite book. T...198
4There were 78 dollars in Olivia's wallet. She ...63
\n", "
" ], "text/plain": [ " question answer\n", "0 There are 87 oranges and 290 bananas in Philip... 145\n", "1 Marco and his dad went strawberry picking. Mar... 19\n", "2 Edward spent $ 6 to buy 2 books each book cost... 3\n", "3 Frank was reading through his favorite book. T... 198\n", "4 There were 78 dollars in Olivia's wallet. She ... 63" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load example dataset (SVAMP)\n", "svamp = load_example_dataset(\"svamp\", n=75)\n", "svamp.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define prompts\n", "MATH_INSTRUCTION = (\n", " \"When you solve this math problem only return the answer with no additional text.\\n\"\n", ")\n", "prompts = [MATH_INSTRUCTION + prompt for prompt in svamp.question]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use `ChatVertexAI` to instantiate our LLM, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. Be sure to **replace with your LLM of choice.**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install langchain-google-vertexai\n", "from langchain_google_vertexai import ChatVertexAI\n", "\n", "llm = ChatVertexAI(model=\"gemini-pro\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. 
Generate LLM Responses and Confidence Scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `BlackBoxUQ()` - Generate LLM responses and compute consistency-based confidence scores for each response.\n", "\n", "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/black_box_graphic.png)\n", "\n", "#### 📋 Class Attributes\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Parameter | Type & Default | Description |\n",
"|---|---|---|\n",
"| llm | BaseChatModel, default=None | A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object. |\n",
"| scorers | List[str], default=None | Specifies which black box (consistency) scorers to include. Must be a subset of ['semantic_negentropy', 'noncontradiction', 'exact_match', 'bert_score', 'bleurt', 'cosine_sim']. If None, defaults to [\"semantic_negentropy\", \"noncontradiction\", \"exact_match\", \"cosine_sim\"]. Note that using the \"bleurt\" scorer requires installing the bleurt package (pip install --user git+https://github.com/google-research/bleurt.git). |\n",
"| device | str or torch.device, default=\"cpu\" | Specifies the device that the NLI model uses for prediction. Only applies to the 'semantic_negentropy' and 'noncontradiction' scorers. Pass a torch.device to leverage GPU. |\n",
"| use_best | bool, default=True | Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses based on semantic entropy clusters. Only used if `scorers` includes 'semantic_negentropy' or 'noncontradiction'. |\n",
"| system_prompt | str or None, default=\"You are a helpful assistant.\" | Optional argument for user to provide custom system prompt for the LLM. |\n",
"| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |\n",
"| use_n_param | bool, default=False | Specifies whether to use the n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large. |\n",
"| postprocessor | callable, default=None | A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. |\n",
"| sampling_temperature | float, default=1 | The 'temperature' parameter for the LLM to use when generating sampled LLM responses. Must be greater than 0. |\n",
"| nli_model_name | str, default=\"microsoft/deberta-large-mnli\" | Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained(). |\n",
"| max_length | int, default=2000 | Specifies the maximum allowed string length for LLM responses for NLI computation. Responses longer than this value will be truncated in NLI computations to avoid OutOfMemoryError. |
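
As an illustration of the `postprocessor` argument described above, the sketch below defines a callable that takes a string and returns a string. The helper `extract_last_number` is hypothetical (it is not the `math_postprocessor` shipped with uqlm), and the rule it applies, keeping the last number in the response, is an assumption chosen for numeric answers.

```python
import re


def extract_last_number(text: str) -> str:
    # Hypothetical postprocessor: reduce a free-text response to its last number.
    # Commas are removed first so that 1,234 is read as 1234.
    numbers = re.findall(r'-?[0-9]+(?:[.][0-9]+)?', text.replace(',', ''))
    # Fall back to the stripped text when no number is present.
    return numbers[-1] if numbers else text.strip()


for raw in ['19 pounds', '$ 3.00', 'The answer is 1,234.']:
    print(repr(raw), '->', extract_last_number(raw))
```

A callable like this could be passed as `BlackBoxUQ(llm=llm, postprocessor=extract_last_number)`. Assuming the postprocessed strings are what the scorers compare, responses such as 19 pounds and 19 would then be treated as the same answer.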
\n", "\n", "#### 🔍 Parameter Groups\n", "\n", "
\n", "
\n", "

🧠 LLM-Specific

\n", " \n", "
\n", "
\n", "

📊 Confidence Scores

\n", " \n", "
\n", "
\n", "

🖥️ Hardware

\n", " \n", "
\n", "
\n", "

⚡ Performance

\n", " \n", "
\n", "
\n", "\n", "#### 💻 Usage Examples\n", "\n", "```python\n", "# Basic usage with default parameters\n", "bbuq = BlackBoxUQ(llm=llm)\n", "\n", "# Using GPU acceleration, default scorers\n", "bbuq = BlackBoxUQ(llm=llm, device=torch.device(\"cuda\"))\n", "\n", "# Custom scorer list\n", "bbuq = BlackBoxUQ(llm=llm, scorers=[\"semantic_negentropy\", \"exact_match\", \"cosine_sim\"])\n", "\n", "# High-throughput configuration with rate limiting\n", "bbuq = BlackBoxUQ(llm=llm, max_calls_per_min=200, use_n_param=True) \n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n" ] } ], "source": [ "import torch\n", "\n", "# Set the torch device\n", "if torch.cuda.is_available(): # NVIDIA GPU\n", " device = torch.device(\"cuda\")\n", "elif torch.backends.mps.is_available(): # macOS\n", " device = torch.device(\"mps\")\n", "else:\n", " device = torch.device(\"cpu\") # CPU\n", "print(f\"Using {device.type} device\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']\n", "- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "bbuq = BlackBoxUQ(\n", " llm=llm,\n", " max_calls_per_min=250,\n", " device=device,\n", " scorers=[\"semantic_negentropy\", \"exact_match\", \"cosine_sim\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🔄 Class Methods\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
**BlackBoxUQ.generate_and_score**\n",
"\n",
"Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts.\n",
"\n",
"Parameters:\n",
"\n",
"* `prompts` - (list of str) A list of input prompts for the model.\n",
"* `num_responses` - (int, default=5) The number of sampled responses used to compute consistency.\n",
"\n",
"Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata.\n",
"\n",
"💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n",
"\n",
"**BlackBoxUQ.score**\n",
"\n",
"Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated.\n",
"\n",
"Parameters:\n",
"\n",
"* `responses` - (list of str) A list of LLM responses for the prompts.\n",
"* `sampled_responses` - (list of list of str) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from `responses`.\n",
"\n",
"Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata.\n",
"\n",
"💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.\n",
"
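
The `score` method described above expects `responses` and `sampled_responses` to line up by prompt. The following sketch, which uses a hypothetical helper `check_score_inputs` that is not part of uqlm, shows the expected shapes on toy data and checks them before scoring.

```python
from typing import List


def check_score_inputs(responses: List[str], sampled_responses: List[List[str]]) -> int:
    # Hypothetical pre-check: one list of sampled responses per original response.
    if len(responses) != len(sampled_responses):
        raise ValueError('responses and sampled_responses must have the same length')
    for i, samples in enumerate(sampled_responses):
        if not samples:
            raise ValueError(f'no sampled responses for prompt index {i}')
    return len(responses)


# Toy data: two prompts, three sampled responses each.
responses = ['145', '19 pounds']
sampled_responses = [['145', '145', '144'], ['19 pounds', '19', '19']]
print(check_score_inputs(responses, sampled_responses))  # 2
```

With inputs in this shape, scoring is the call `bbuq.score(responses=responses, sampled_responses=sampled_responses)`.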
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating responses...\n", "Generating candidate responses...\n", "Computing confidence scores...\n" ] } ], "source": [ "results = await bbuq.generate_and_score(\n", " prompts=prompts, \n", " num_responses=10, # for lower cost and latency, use smaller value of num_responses\n", ")\n", "\n", "# # alternative approach: directly score if responses already generated\n", "# results = bbuq.score(responses=responses, sampled_responses=sampled_responses)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
responsesampled_responsespromptexact_matchcosine_simsemantic_negentropy
0145[145 \\n, 145 Each group should contain 145 ban...When you solve this math problem only return t...0.80.9723151.000000
119 pounds[19 pounds, 19, 19 pounds, 19 pounds, 19, 19, ...When you solve this math problem only return t...0.50.9090831.000000
2$3[$4, $4\\n, $3, $ 3.00, $3, $3, $ 3.00, $3.00, ...When you solve this math problem only return t...0.30.9151180.802269
3198[198, ```\\n198\\n```, 198, 198, 198, 198, 198, ...When you solve this math problem only return t...0.90.9904221.000000
463[63, 63, 63, 63, 63, 63, 63, 63, 63, 63]When you solve this math problem only return t...1.01.0000001.000000
\n", "
" ], "text/plain": [ " response sampled_responses \\\n", "0 145 [145 \\n, 145 Each group should contain 145 ban... \n", "1 19 pounds [19 pounds, 19, 19 pounds, 19 pounds, 19, 19, ... \n", "2 $3 [$4, $4\\n, $3, $ 3.00, $3, $3, $ 3.00, $3.00, ... \n", "3 198 [198, ```\\n198\\n```, 198, 198, 198, 198, 198, ... \n", "4 63 [63, 63, 63, 63, 63, 63, 63, 63, 63, 63] \n", "\n", " prompt exact_match cosine_sim \\\n", "0 When you solve this math problem only return t... 0.8 0.972315 \n", "1 When you solve this math problem only return t... 0.5 0.909083 \n", "2 When you solve this math problem only return t... 0.3 0.915118 \n", "3 When you solve this math problem only return t... 0.9 0.990422 \n", "4 When you solve this math problem only return t... 1.0 1.000000 \n", "\n", " semantic_negentropy \n", "0 1.000000 \n", "1 1.000000 \n", "2 0.802269 \n", "3 1.000000 \n", "4 1.000000 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = results.to_df()\n", "result_df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note the `math_postprocessor` is specific to our use case (math questions). **If you are using your own prompts/questions, update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
responsesampled_responsespromptexact_matchcosine_simsemantic_negentropyanswerresponse_correct
0145[145 \\n, 145 Each group should contain 145 ban...When you solve this math problem only return t...0.80.9723151.000000145True
119 pounds[19 pounds, 19, 19 pounds, 19 pounds, 19, 19, ...When you solve this math problem only return t...0.50.9090831.00000019True
2$3[$4, $4\\n, $3, $ 3.00, $3, $3, $ 3.00, $3.00, ...When you solve this math problem only return t...0.30.9151180.8022693True
3198[198, ```\\n198\\n```, 198, 198, 198, 198, 198, ...When you solve this math problem only return t...0.90.9904221.000000198True
463[63, 63, 63, 63, 63, 63, 63, 63, 63, 63]When you solve this math problem only return t...1.01.0000001.00000063True
\n", "
" ], "text/plain": [ " response sampled_responses \\\n", "0 145 [145 \\n, 145 Each group should contain 145 ban... \n", "1 19 pounds [19 pounds, 19, 19 pounds, 19 pounds, 19, 19, ... \n", "2 $3 [$4, $4\\n, $3, $ 3.00, $3, $3, $ 3.00, $3.00, ... \n", "3 198 [198, ```\\n198\\n```, 198, 198, 198, 198, 198, ... \n", "4 63 [63, 63, 63, 63, 63, 63, 63, 63, 63, 63] \n", "\n", " prompt exact_match cosine_sim \\\n", "0 When you solve this math problem only return t... 0.8 0.972315 \n", "1 When you solve this math problem only return t... 0.5 0.909083 \n", "2 When you solve this math problem only return t... 0.3 0.915118 \n", "3 When you solve this math problem only return t... 0.9 0.990422 \n", "4 When you solve this math problem only return t... 1.0 1.000000 \n", "\n", " semantic_negentropy answer response_correct \n", "0 1.000000 145 True \n", "1 1.000000 19 True \n", "2 0.802269 3 True \n", "3 1.000000 198 True \n", "4 1.000000 63 True " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Populate correct answers \n", "result_df[\"answer\"] = svamp.answer\n", "\n", "# Grade responses against correct answers\n", "result_df[\"response_correct\"] = [\n", " math_postprocessor(r) == a for r, a in zip(result_df[\"response\"], svamp[\"answer\"])\n", "]\n", "result_df.head(5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline LLM accuracy: 0.72\n" ] } ], "source": [ "print(f\"\"\"Baseline LLM accuracy: {np.mean(result_df[\"response_correct\"])}\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Filtered LLM Accuracy Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. 
Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.\n", "\n", "We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs. We conduct this analysis separately for each of our scorers. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for confidence_score in [\"semantic_negentropy\", \"exact_match\", \"cosine_sim\"]:\n", " plot_model_accuracies(\n", " scores=result_df[confidence_score],\n", " correct_indicators=result_df.response_correct,\n", " title=f\"LLM Accuracy by {confidence_score} Threshold\",\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Precision, Recall, F1-Score of Hallucination Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. Using this threshold, we compute precision, recall, and F1-score for black box scorer predictions of whether responses are correct." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "semantic_negentropy F1-optimal threshold: 0.59\n", " \n", "semantic_negentropy precision: 0.8333333333333334\n", "semantic_negentropy recall: 0.9259259259259259\n", "semantic_negentropy f1-score: 0.8771929824561403\n", " \n", " \n", "exact_match F1-optimal threshold: 0.2\n", " \n", "exact_match precision: 0.7910447761194029\n", "exact_match recall: 0.9814814814814815\n", "exact_match f1-score: 0.8760330578512396\n", " \n", " \n", "cosine_sim F1-optimal threshold: 0.88\n", " \n", "cosine_sim precision: 0.8305084745762712\n", "cosine_sim recall: 0.9074074074074074\n", "cosine_sim f1-score: 0.8672566371681416\n", " \n", " \n" ] } ], "source": [ "# instantiate UQLM tuner object for threshold selection\n", "t = Tuner()\n", "\n", "correct_indicators = (\n", " result_df.response_correct\n", ") * 1 # Whether responses is actually correct\n", "for confidence_score in [\"semantic_negentropy\", \"exact_match\", \"cosine_sim\"]:\n", " y_scores = result_df[confidence_score] # confidence score\n", "\n", " # Solve for threshold that maximizes F1-score\n", " 
best_threshold = t.tune_threshold(\n", " y_scores=y_scores,\n", " correct_indicators=correct_indicators,\n", " thresh_objective=\"fbeta_score\",\n", " )\n", " y_pred = [\n", " (s > best_threshold) * 1 for s in y_scores\n", " ] # predicts whether response is correct based on confidence score\n", " print(f\"{confidence_score} F1-optimal threshold: {best_threshold}\")\n", " print(\" \")\n", "\n", " # evaluate precision, recall, and f1-score of predictions of correctness\n", " print(\n", " f\"{confidence_score} precision: {precision_score(y_true=correct_indicators, y_pred=y_pred)}\"\n", " )\n", " print(\n", " f\"{confidence_score} recall: {recall_score(y_true=correct_indicators, y_pred=y_pred)}\"\n", " )\n", " print(\n", " f\"{confidence_score} f1-score: {f1_score(y_true=correct_indicators, y_pred=y_pred)}\"\n", " )\n", " print(\" \")\n", " print(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Scorer Definitions\n", "Below we define the scorers offered by the `BlackBoxUQ` class. These scorers exploit variation in LLM responses to the same prompt to measure semantic consistency. All scorers have outputs ranging from 0 to 1, with higher values indicating higher confidence. \n", "\n", "For a given prompt $x_i$, these approaches involves generating $m$ responses $\\tilde{\\mathbf{y}}_i = \\{ \\tilde{y}_{i1},...,\\tilde{y}_{im}\\}$, using a non-zero temperature, from the same prompt and comparing these responses to the original response $y_{i}$. We provide detailed descriptions of each below.\n", "\n", "### Exact Match Rate (`exact_match`)\n", "Exact Match Rate (EMR) computes the proportion of candidate responses that are identical to the original response.\n", "$$ EMR(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m \\mathbb{I}(y_i=\\tilde{y}_{ij}). 
$$\n", "\n", "For more on this scorer, refer to [Cole et al., 2023](https://arxiv.org/abs/2305.14613).\n", "\n", "### Non-Contradiction Probability (`noncontradiction`)\n", "Non-contradiction probability (NCP) computes the mean non-contradiction probability estimated by a natural language inference (NLI) model. This score is formally defined as follows:\n", "\n", "\\begin{equation}\n", " NCP(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m(1 - p_j)\n", "\\end{equation}\n", "where\n", "\n", "\\begin{equation}\n", " p_j = \\frac{\\eta(y_{i}, \\tilde{y}_{ij}) + \\eta(\\tilde{y}_{ij},y_i)}{2}.\n", "\\end{equation}\n", "\n", "Above, $\\eta(\\tilde{y}_{ij},y_i)$ denotes the contradiction probability estimated by the NLI model for response $y_i$ and candidate $\\tilde{y}_{ij}$. For more on this scorer, refer to [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175), [Lin et al., 2025](https://arxiv.org/abs/2305.19187), or [Manakul et al., 2023](https://arxiv.org/abs/2303.08896).\n", "\n", "### Normalized Semantic Negentropy (`semantic_negentropy`)\n", "Normalized Semantic Negentropy (NSN) normalizes the standard computation of discrete semantic entropy to be increasing with higher confidence and have [0,1] support. In contrast to the EMR and NCP, semantic entropy does not distinguish between an original response and candidate responses. Instead, this approach computes a single metric value on a list of responses generated from the same prompt. Under this approach, responses are clustered using an NLI model based on mutual entailment. 
We consider the discrete version of SE, which is computed over the final set of clusters as follows:\n", "\n", "\\begin{equation}\n", " SE(y_i; \\tilde{\\mathbf{y}}_i) = - \\sum_{C \\in \\mathcal{C}} P(C|y_i, \\tilde{\\mathbf{y}}_i)\\log P(C|y_i, \\tilde{\\mathbf{y}}_i),\n", "\\end{equation}\n", "where $P(C|y_i, \\tilde{\\mathbf{y}}_i)$ denotes the probability a randomly selected response $y \\in \\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i $ belongs to cluster $C$, and $\\mathcal{C}$ denotes the full set of clusters of $\\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i$.\n", "\n", "To ensure that we have a normalized confidence score with $[0,1]$ support and with higher values corresponding to higher confidence, we implement the following normalization to arrive at *Normalized Semantic Negentropy* (NSN):\n", "\\begin{equation}\n", " NSN(y_i; \\tilde{\\mathbf{y}}_i) = 1 - \\frac{SE(y_i; \\tilde{\\mathbf{y}}_i)}{\\log m},\n", "\\end{equation}\n", "where $\\log m$ is included to normalize the support.\n", "\n", "### BERTScore (`bert_score`)\n", "Let a tokenized text sequence be denoted as $\\textbf{t} = \\{t_1,...,t_L\\}$ and the corresponding contextualized word embeddings as $\\textbf{E} = \\{\\textbf{e}_1,...,\\textbf{e}_L\\}$, where $L$ is the number of tokens in the text. 
The BERTScore precision, recall, and F1-scores between two tokenized texts $\\textbf{t}, \\textbf{t}'$ are respectively defined as follows:\n", "\n", "\\begin{equation}\n", " BertP(\\textbf{t}, \\textbf{t}') = \\frac{1}{| \\textbf{t}|} \\sum_{t \\in \\textbf{t}} \\max_{t' \\in \\textbf{t}'} \\textbf{e} \\cdot \\textbf{e}'\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " BertR(\\textbf{t}, \\textbf{t}') = \\frac{1}{| \\textbf{t}'|} \\sum_{t' \\in \\textbf{t}'} \\max_{t \\in \\textbf{t}} \\textbf{e} \\cdot \\textbf{e}'\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " BertF(\\textbf{t}, \\textbf{t}') = 2\\frac{ BertP(\\textbf{t}, \\textbf{t}') BertR(\\textbf{t}, \\textbf{t}')}{BertP(\\textbf{t}, \\textbf{t}') + BertR(\\textbf{t}, \\textbf{t}')},\n", "\\end{equation}\n", "where $\\textbf{e}, \\textbf{e}'$ respectively correspond to $t, t'$. We compute our BERTScore-based confidence scores as follows:\n", "\\begin{equation}\n", " BertConf(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m BertF(y_i, \\tilde{y}_{ij}),\n", "\\end{equation}\n", "i.e., the average BERTScore F1 across pairings of the original response with all candidate responses. For more on BERTScore, refer to [Zhang et al., 2020](https://arxiv.org/abs/1904.09675).\n", "\n", "### BLEURT (`bleurt`)\n", "In contrast to the aforementioned scorers, BLEURT is specifically pre-trained and fine-tuned to learn human judgments of text similarity. 
Our BLEURT confidence score is the average BLEURT value across pairings of the original response with all candidate responses:\n", "\n", "\\begin{equation}\n", " BLEURTConf(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m BLEURT(y_i, \\tilde{y}_{ij}).\n", "\\end{equation}\n", "\n", "For more on this scorer, refer to [Sellam et al., 2020](https://arxiv.org/abs/2004.04696).\n", "\n", "\n", "### Normalized Cosine Similarity (`cosine_sim`)\n", "This scorer leverages a sentence transformer to map LLM outputs to an embedding space and measure similarity using those sentence embeddings. Let $\\mathbf{V}: \\mathcal{Y} \\xrightarrow{} \\mathbb{R}^d$ denote the sentence transformer, where $d$ is the dimension of the embedding space. The average cosine similarity across pairings of the original response with all candidate responses is given as follows:\n", "\n", "\\begin{equation}\n", " CS(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m \\frac{\\mathbf{V}(y_i) \\cdot \\mathbf{V}(\\tilde{y}_{ij}) }{ \\lVert \\mathbf{V}(y_i) \\rVert \\lVert \\mathbf{V}(\\tilde{y}_{ij}) \\rVert}.\n", "\\end{equation}\n", "\n", "To ensure a standardized support of $[0, 1]$, we normalize cosine similarity to obtain confidence scores as follows:\n", "\n", "\\begin{equation}\n", " NCS(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{CS(y_i; \\tilde{\\mathbf{y}}_i) + 1}{2}.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "© 2025 CVS Health and/or one of its affiliates. All rights reserved." 
] } ], "metadata": { "environment": { "kernel": "uqlm", "name": "workbench-notebooks.m126", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126" }, "kernelspec": { "display_name": "uqlm", "language": "python", "name": "uqlm" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 4 }