{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 White-Box Uncertainty Quantification\n", "\n", "
\n", "

\n", " White-box Uncertainty Quantification (UQ) methods leverage token probabilities to estimate uncertainty. Multi-generation white-box methods generate multiple responses from the same prompt, combining the sampling approach of black-box UQ with token-probability-based signals. This demo illustrates how to use state-of-the-art white-box UQ methods with uqlm. The following multi-generation scorers are available:\n", "

\n", " \n", "* Monte Carlo Sequence Probability ([Kuhn et al., 2023](https://arxiv.org/abs/2302.09664))\n", "* Consistency and Confidence (CoCoA) ([Vashurin et al., 2025](https://arxiv.org/abs/2502.04964))\n", "* Semantic Negentropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0)) \n", "* Semantic Density ([Qiu et al., 2024](https://arxiv.org/abs/2405.13845))\n", "* P(True) ([Kadavath et al., 2022](https://arxiv.org/abs/2207.05221))\n", "\n", "
\n", "\n", "## 📊 What You'll Do in This Demo\n", "\n", "
\n", "
1
\n", "
\n", "

Set up LLM and prompts.

\n", "

Set up LLM instance and load example data prompts.

\n", "
\n", "
\n", "\n", "
\n", "
2
\n", "
\n", "

Generate LLM Responses and Confidence Scores

\n", "

Generate and score LLM responses to the example questions using the WhiteBoxUQ() class.

\n", "
\n", "
\n", "\n", "
\n", "
3
\n", "
\n", "

Evaluate Hallucination Detection Performance

\n", "

Visualize model accuracy at different thresholds of the various white-box UQ confidence scores. Compute precision, recall, and F1-score of hallucination detection.

\n", "
\n", "
\n", "\n", "## ⚖️ Advantages & Limitations\n", "\n", "
\n", "
\n", "

Pros

\n", " \n", "
\n", " \n", "
\n", "

Cons

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "from uqlm import WhiteBoxUQ\n", "from uqlm.utils import load_example_dataset, math_postprocessor, plot_model_accuracies, Tuner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Set up LLM and Prompts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we will illustrate this approach using a set of math questions from the [gsm8k benchmark](https://github.com/openai/grade-school-math). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset - gsm8k...\n", "Processing dataset...\n", "Dataset ready!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0Natalia sold clips to 48 of her friends in Apr...72
1Weng earns $12 an hour for babysitting. Yester...10
2Betty is saving money for a new wallet which c...5
3Julie is reading a 120-page book. Yesterday, s...42
4James writes a 3-page letter to 2 different fr...624
\n", "
" ], "text/plain": [ " question answer\n", "0 Natalia sold clips to 48 of her friends in Apr... 72\n", "1 Weng earns $12 an hour for babysitting. Yester... 10\n", "2 Betty is saving money for a new wallet which c... 5\n", "3 Julie is reading a 120-page book. Yesterday, s... 42\n", "4 James writes a 3-page letter to 2 different fr... 624" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load example dataset (gsm8k)\n", "gsm8k = load_example_dataset(\"gsm8k\", n=100)\n", "gsm8k.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define prompts\n", "MATH_INSTRUCTION = \"When you solve this math problem only return the answer with no additional text.\\n\"\n", "prompts = [MATH_INSTRUCTION + prompt for prompt in gsm8k.question]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use `AzureChatOpenAI` to instantiate our LLM, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. Be sure to **replace with your LLM of choice.**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install python-dotenv\n", "# !{sys.executable} -m pip install langchain-openai\n", "\n", "# # User to populate .env file with API credentials. In this step, replace with your LLM of choice.\n", "from dotenv import load_dotenv, find_dotenv\n", "from langchain_openai import AzureChatOpenAI\n", "\n", "load_dotenv(find_dotenv())\n", "llm = AzureChatOpenAI(deployment_name=\"gpt-4o\", openai_api_type=\"azure\", openai_api_version=\"2024-02-15-preview\", temperature=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. 
Generate responses and confidence scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `WhiteBoxUQ()` - Generate LLM responses and compute token-probability-based confidence scores for each response.\n", "\n", "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/white_box_graphic.png)\n", "\n", "#### 📋 Class Attributes\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A LangChain `BaseChatModel`. The user is responsible for specifying temperature and other relevant parameters in the constructor of their `llm` object. |
| scorers | List[str], default=None | Specifies which white-box UQ scorers to include. Must be a subset of [\"normalized_probability\", \"min_probability\", \"sequence_probability\", \"max_token_negentropy\", \"mean_token_negentropy\", \"probability_margin\", \"monte_carlo_probability\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\", \"p_true\"]. If None, defaults to [\"normalized_probability\", \"min_probability\"]. |
| system_prompt | str or None, default=\"You are a helpful assistant.\" | Optional argument for the user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| sampling_temperature | float, default=1 | The 'temperature' parameter for the LLM to use when generating sampled LLM responses. Only applies to \"monte_carlo_probability\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\". Must be greater than 0. |
\n", "\n", "#### 🔍 Parameter Groups\n", "\n", "
\n", "
\n", "

🧠 Model-Specific

\n", " \n", "
\n", "
\n", "

📊 Confidence Scores

\n", " \n", "
\n", "
\n", "

⚡ Performance

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "wbuq = WhiteBoxUQ(\n", " llm=llm,\n", " scorers=[\n", " \"monte_carlo_probability\", # requires multiple sampled responses per prompt\n", " \"consistency_and_confidence\", # requires multiple sampled responses per prompt\n", " \"p_true\", # generates one additional response per prompt, acts as logprobs-based self-judge\n", " ],\n", " max_calls_per_min=125,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🔄 Class Methods\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MethodDescription & Parameters
WhiteBoxUQ.generate_and_score\n", "

Generate LLM responses and compute confidence scores for the provided prompts.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (List[str] or List[List[BaseMessage]]) A list of input prompts for the model.
  • \n", "
  • num_responses - (int, default=5) The number of sampled responses to generate for sampling-based white-box UQ methods. Only applies to \"monte_carlo_probability\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\".
  • \n", "
  • show_progress_bars - (bool, default=True) If True, displays a progress bar while generating and scoring responses.
  • \n", "
\n", "

Returns: UQResult containing data (prompts, responses, log probabilities, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n", "
\n", "
WhiteBoxUQ.score\n", "

Compute confidence scores on provided LLM responses and logprobs. Should only be used if responses and sampled responses are already generated with logprobs.

\n", "

Parameters:

\n", "
    \n", "
  • responses - (List[str]) A list of LLM responses for the prompts.
  • \n", "
  • logprob_results - (List[List[str]]) A list of dictionaries, each returned by BaseChatModel.agenerate corresponding to responses.
  • \n", "
  • sampled_responses - (List[List[str]], default=None) A list of lists of sampled LLM responses for each prompt. Used to compute consistency scores by comparing to the corresponding response from responses. Required only for \"monte_carlo_probability\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\" scorers.
  • \n", "
  • sampled_logprob_results - (List[List[str]], default=None) A list of lists of dictionaries, each returned by BaseChatModel.agenerate. These must correspond to sampled_responses. Required only for \"monte_carlo_probability\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\" scorers.
  • \n", "
  • prompts - (List[List[str]], default=None) List of prompts from which responses were generated. Required only for \"p_true\" scorer.
  • \n", "
  • show_progress_bars - (bool, default=True) If True, displays a progress bar while scoring responses.
  • \n", "
\n", "

Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Computing uncertainty scores when responses and logprobs are already generated elsewhere.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ab71f72963e74b979bd369a1b508c123", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "results = await wbuq.generate_and_score(prompts=prompts, num_responses=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
promptresponselogprobsampled_responsessampled_logprobconsistency_and_confidencemonte_carlo_probabilityp_true
0When you solve this math problem only return t...72[{'token': '72', 'bytes': [55, 50], 'logprob':...[72, 72, 72, 72, 72][[{'token': '72', 'bytes': [55, 50], 'logprob'...0.9998190.9999550.377549
1When you solve this math problem only return t...$10[{'token': '$', 'bytes': [36], 'logprob': -0.0...[$10, $10, $10, $10, $10][[{'token': '$', 'bytes': [36], 'logprob': -0....0.9944630.9944150.047430
2When you solve this math problem only return t...$20[{'token': '$', 'bytes': [36], 'logprob': -0.0...[$20, $20, $20, $20, $10][[{'token': '$', 'bytes': [36], 'logprob': -0....0.9230750.8903580.777260
3When you solve this math problem only return t...48[{'token': '48', 'bytes': [52, 56], 'logprob':...[48, 48, 48, 48, 48][[{'token': '48', 'bytes': [52, 56], 'logprob'...0.9947550.9961960.182436
4When you solve this math problem only return t...624[{'token': '624', 'bytes': [54, 50, 52], 'logp...[624, 624 pages., 624, 624, 624][[{'token': '624', 'bytes': [54, 50, 52], 'log...0.9548160.9233050.981987
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 72 \n", "1 When you solve this math problem only return t... $10 \n", "2 When you solve this math problem only return t... $20 \n", "3 When you solve this math problem only return t... 48 \n", "4 When you solve this math problem only return t... 624 \n", "\n", " logprob \\\n", "0 [{'token': '72', 'bytes': [55, 50], 'logprob':... \n", "1 [{'token': '$', 'bytes': [36], 'logprob': -0.0... \n", "2 [{'token': '$', 'bytes': [36], 'logprob': -0.0... \n", "3 [{'token': '48', 'bytes': [52, 56], 'logprob':... \n", "4 [{'token': '624', 'bytes': [54, 50, 52], 'logp... \n", "\n", " sampled_responses \\\n", "0 [72, 72, 72, 72, 72] \n", "1 [$10, $10, $10, $10, $10] \n", "2 [$20, $20, $20, $20, $10] \n", "3 [48, 48, 48, 48, 48] \n", "4 [624, 624 pages., 624, 624, 624] \n", "\n", " sampled_logprob \\\n", "0 [[{'token': '72', 'bytes': [55, 50], 'logprob'... \n", "1 [[{'token': '$', 'bytes': [36], 'logprob': -0.... \n", "2 [[{'token': '$', 'bytes': [36], 'logprob': -0.... \n", "3 [[{'token': '48', 'bytes': [52, 56], 'logprob'... \n", "4 [[{'token': '624', 'bytes': [54, 50, 52], 'log... \n", "\n", " consistency_and_confidence monte_carlo_probability p_true \n", "0 0.999819 0.999955 0.377549 \n", "1 0.994463 0.994415 0.047430 \n", "2 0.923075 0.890358 0.777260 \n", "3 0.994755 0.996196 0.182436 \n", "4 0.954816 0.923305 0.981987 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = results.to_df()\n", "result_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note the `math_postprocessor` is specific to our use case (math questions). 
**If you are using your own prompts/questions, update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
promptresponselogprobsampled_responsessampled_logprobconsistency_and_confidencemonte_carlo_probabilityp_trueanswerresponse_correct
0When you solve this math problem only return t...72[{'token': '72', 'bytes': [55, 50], 'logprob':...[72, 72, 72, 72, 72][[{'token': '72', 'bytes': [55, 50], 'logprob'...0.9998190.9999550.37754972True
1When you solve this math problem only return t...$10[{'token': '$', 'bytes': [36], 'logprob': -0.0...[$10, $10, $10, $10, $10][[{'token': '$', 'bytes': [36], 'logprob': -0....0.9944630.9944150.04743010True
2When you solve this math problem only return t...$20[{'token': '$', 'bytes': [36], 'logprob': -0.0...[$20, $20, $20, $20, $10][[{'token': '$', 'bytes': [36], 'logprob': -0....0.9230750.8903580.7772605False
3When you solve this math problem only return t...48[{'token': '48', 'bytes': [52, 56], 'logprob':...[48, 48, 48, 48, 48][[{'token': '48', 'bytes': [52, 56], 'logprob'...0.9947550.9961960.18243642False
4When you solve this math problem only return t...624[{'token': '624', 'bytes': [54, 50, 52], 'logp...[624, 624 pages., 624, 624, 624][[{'token': '624', 'bytes': [54, 50, 52], 'log...0.9548160.9233050.981987624True
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 72 \n", "1 When you solve this math problem only return t... $10 \n", "2 When you solve this math problem only return t... $20 \n", "3 When you solve this math problem only return t... 48 \n", "4 When you solve this math problem only return t... 624 \n", "\n", " logprob \\\n", "0 [{'token': '72', 'bytes': [55, 50], 'logprob':... \n", "1 [{'token': '$', 'bytes': [36], 'logprob': -0.0... \n", "2 [{'token': '$', 'bytes': [36], 'logprob': -0.0... \n", "3 [{'token': '48', 'bytes': [52, 56], 'logprob':... \n", "4 [{'token': '624', 'bytes': [54, 50, 52], 'logp... \n", "\n", " sampled_responses \\\n", "0 [72, 72, 72, 72, 72] \n", "1 [$10, $10, $10, $10, $10] \n", "2 [$20, $20, $20, $20, $10] \n", "3 [48, 48, 48, 48, 48] \n", "4 [624, 624 pages., 624, 624, 624] \n", "\n", " sampled_logprob \\\n", "0 [[{'token': '72', 'bytes': [55, 50], 'logprob'... \n", "1 [[{'token': '$', 'bytes': [36], 'logprob': -0.... \n", "2 [[{'token': '$', 'bytes': [36], 'logprob': -0.... \n", "3 [[{'token': '48', 'bytes': [52, 56], 'logprob'... \n", "4 [[{'token': '624', 'bytes': [54, 50, 52], 'log... 
\n", "\n", " consistency_and_confidence monte_carlo_probability p_true answer \\\n", "0 0.999819 0.999955 0.377549 72 \n", "1 0.994463 0.994415 0.047430 10 \n", "2 0.923075 0.890358 0.777260 5 \n", "3 0.994755 0.996196 0.182436 42 \n", "4 0.954816 0.923305 0.981987 624 \n", "\n", " response_correct \n", "0 True \n", "1 True \n", "2 False \n", "3 False \n", "4 True " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Populate correct answers\n", "result_df[\"answer\"] = gsm8k.answer\n", "\n", "# Grade responses against correct answers\n", "result_df[\"response_correct\"] = [math_postprocessor(r) == a for r, a in zip(result_df[\"response\"], gsm8k[\"answer\"])]\n", "result_df.head(5)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline LLM accuracy: 0.53\n" ] } ], "source": [ "print(f\"\"\"Baseline LLM accuracy: {np.mean(result_df[\"response_correct\"])}\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Filtered LLM Accuracy Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.\n", "\n", "We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs. We conduct this analysis separately for each of our scorers. 
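
To make the metric concrete, here is a minimal sketch of filtered accuracy on made-up toy data. The helper `filtered_accuracy` is hypothetical (it is not part of uqlm), and it assumes a response is kept when its score is greater than or equal to the threshold; `plot_model_accuracies` may treat ties differently.

```python
import numpy as np


def filtered_accuracy(scores, correct, thresholds):
    # Hypothetical helper, not part of uqlm.
    # Returns one (threshold, accuracy, coverage) tuple per threshold.
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for t in thresholds:
        keep = scores >= t  # assumption: ties at the threshold are kept
        n_kept = int(keep.sum())
        # Accuracy is undefined when every response is filtered out
        acc = float(correct[keep].mean()) if n_kept > 0 else float('nan')
        rows.append((float(t), acc, n_kept / len(scores)))
    return rows


# Toy data, made up for illustration (not taken from the demo results)
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
toy_correct = [True, True, False, True, False, False]

for t, acc, cov in filtered_accuracy(toy_scores, toy_correct, [0.0, 0.5, 0.85]):
    print(f'threshold={t:.2f}  accuracy={acc:.2f}  coverage={cov:.2f}')
```

On this toy data, accuracy rises as the threshold increases while coverage shrinks, which is the trade-off the plots in this section are meant to show.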
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for scorer in [\"monte_carlo_probability\", \"consistency_and_confidence\", \"p_true\"]:\n", " plot_model_accuracies(scores=result_df[scorer], correct_indicators=result_df.response_correct, title=f\"LLM Accuracy by {scorer} Score Threshold\", display_percentage=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Precision, Recall, F1-Score of Hallucination Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we tune the threshold for binarizing confidence scores on the first half of the data, using F1-score as the objective. Using this threshold, we compute precision, recall, and F1-score on the second half for white-box scorer predictions of whether responses are correct." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "========================================================================================================================\n", "Metrics monte_carlo_probability consistency_and_confidence p_true \n", "------------------------------------------------------------------------------------------------------------------------\n", "Precision 0.885 0.909 0.522 \n", "Recall 0.885 0.769 0.923 \n", "F1-score 0.885 0.833 0.667 \n", "------------------------------------------------------------------------------------------------------------------------\n", "F-1 optimal threshold 0.64 0.74 0.02 \n", "========================================================================================================================\n" ] } ], "source": [ "# instantiate UQLM tuner object for threshold selection\n", "split = len(result_df) // 2\n", "t = Tuner()\n", "\n", "correct_indicators = (result_df.response_correct) * 1 # 1 if the response is actually correct, else 0\n", "metric_values = {\"Precision\": [], \"Recall\": [], \"F1-score\": []}\n", "optimal_thresholds = []\n", "for confidence_score in 
wbuq.scorers:\n", " # tune threshold on first half\n", " y_scores = result_df[confidence_score]\n", " y_scores_tune = y_scores[0:split]\n", " y_true_tune = correct_indicators[0:split]\n", " best_threshold = t.tune_threshold(y_scores=y_scores_tune, correct_indicators=y_true_tune, thresh_objective=\"fbeta_score\")\n", "\n", " y_pred = [(s > best_threshold) * 1 for s in y_scores] # predicts whether response is correct based on confidence score\n", " optimal_thresholds.append(best_threshold)\n", "\n", " # evaluate on last half\n", " y_true_eval = correct_indicators[split:]\n", " y_pred_eval = y_pred[split:]\n", " metric_values[\"Precision\"].append(precision_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", " metric_values[\"Recall\"].append(recall_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", " metric_values[\"F1-score\"].append(f1_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", "\n", "# print results\n", "header = f\"{'Metrics':<30}\" + \"\".join([f\"{scorer_name:<30}\" for scorer_name in wbuq.scorers])\n", "print(\"=\" * len(header) + \"\\n\" + header + \"\\n\" + \"-\" * len(header))\n", "for metric in metric_values.keys():\n", " print(f\"{metric:<30}\" + \"\".join([f\"{round(x_, 3):<30}\" for x_ in metric_values[metric]]))\n", "print(\"-\" * len(header))\n", "print(f\"{'F-1 optimal threshold':<30}\" + \"\".join([f\"{round(x_, 3):<30}\" for x_ in optimal_thresholds]))\n", "print(\"=\" * len(header))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Scorer Definitions\n", "White-box UQ scorers leverage token probabilities of the LLM's generated response to quantify uncertainty. All scorers have outputs ranging from 0 to 1, with higher values indicating higher confidence. We define several multi-generation white-box UQ scorers below. \n", "\n", "Let the tokenization of LLM response $y_i$ be denoted as $\\{t_1,...,t_{L_i}\\}$, where $L_i$ denotes the number of tokens in the response. 
Further, let $y_1,...,y_m$ denote $m$ sampled responses generated from the same prompt.\n", "\n", "### Monte Carlo Sequence Probability (`monte_carlo_probability`)\n", "Monte Carlo Sequence Probability (MCSP) computes the average length-normalized sequence probability across sampled responses. \n", "\n", "\n", "\n", "$$ MCSP(y_1,y_2,...,y_m) = \\frac{1}{m} \\sum_{i=1}^m \\prod_{t \\in y_i} p_t^{\\frac{1}{L_i}} $$ \n", "\n", "\n", "For more on this scorer, refer to [Kuhn et al., 2023](https://arxiv.org/abs/2302.09664). \n", "\n", "\n", "### Consistency and Confidence Approach (CoCoA) (`consistency_and_confidence`)\n", "Consistency and Confidence Approach (CoCoA) leverages two distinct signals: 1) similarity between an original response $y_0$ and a set of sampled responses $y_1,...,y_m$, and 2) token probabilities from the original response $y_0$. \n", "\n", "We first get the length-normalized token probability of our original response:\n", "\n", "$$ LNTP(y_0) = \\prod_{t \\in y_0} p_t^{\\frac{1}{L_0}}.$$ \n", "\n", "\n", "We then obtain the average cosine similarity across pairings of the original response with all sampled responses, normalized to a [0,1] scale:\n", "\n", "\n", " $$ NCS(y_0; y_1,...,y_m) = \\frac{1}{m} \\sum_{i=1}^m \\frac{\\cos(y_0; y_i) + 1}{2}.$$ \n", "\n", "\n", "CoCoA is then calculated as the product of these two terms:\n", "\n", "\n", " $$ CoCoA(y_0; y_1,...,y_m) = LNTP(y_0) \\cdot NCS(y_0; y_1,...,y_m).$$ \n", "\n", "\n", "For more on this scorer, refer to [Vashurin et al., 2025](https://arxiv.org/abs/2502.04964).\n", "\n", "### Normalized Semantic Negentropy\n", "Normalized Semantic Negentropy (NSN) normalizes the standard computation of discrete semantic entropy to be increasing with higher confidence and have [0,1] support. Under this approach, responses are clustered using an NLI model based on mutual entailment. 
After obtaining the set of clusters $\\mathcal{C}$, semantic entropy is computed as:\n", "\n", "\n", "$$ SE(y_i; \\tilde{\\mathbf{y}}_i) = - \\sum_{C \\in \\mathcal{C}} P(C|y_i, \\tilde{\\mathbf{y}}_i)\\log P(C|y_i, \\tilde{\\mathbf{y}}_i),$$ \n", "\n", "\n", "where $P(C|y_i, \\tilde{\\mathbf{y}}_i)$ is calculated as the average across response-level sequence probabilities (normalized or otherwise), and $\\mathcal{C}$ denotes the full set of clusters of $\\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i$.\n", "\n", "To ensure that we have a normalized confidence score with $[0,1]$ support and with higher values corresponding to higher confidence, we implement the following normalization to arrive at *Normalized Semantic Negentropy* (NSN):\n", "\n", "$$ NSN(y_i; \\tilde{\\mathbf{y}}_i) = 1 - \\frac{SE(y_i; \\tilde{\\mathbf{y}}_i)}{\\log m},$$ \n", "\n", "where $\\log m$ is included to normalize the support. For more on semantic entropy, refer to [Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0) and [Kuhn et al., 2023](https://arxiv.org/pdf/2302.09664); for more on our normalized version, refer to [Bouchard & Chauhan, 2025](https://arxiv.org/abs/2504.19254).\n", "\n", "### Semantic Density\n", "Semantic Density (SD) approximates a probability density function (PDF) in semantic space for estimating response correctness. Given a prompt $x$ with candidate response $y_*$, the objective is to construct a PDF that assigns higher density to regions in the semantic space that correspond to correct responses. We begin by sampling $M$ unique reference responses $y_i$ (for $i = 1, 2, \\dots, M$) conditioned on $x$. 
For any pair of responses $y_i, y_j$ with corresponding embeddings $v_i, v_j$, the semantic distance is estimated as\n", "\n", "\n", "$$ \\mathbb{E}(\\Vert v_i - v_j \\Vert^2) = p_c(y_i, y_j | x) + \\dfrac{1}{2} \\cdot p_n(y_i, y_j | x)$$ \n", "\n", "\n", "where $p_c, p_n$ denote the contradiction and neutrality scores returned by a natural language inference (NLI) model, respectively. This estimated distance is incorporated in the kernel function $K$ to smooth out the reference responses into a continuous distribution. The kernel function value can be obtained as\n", "\n", "\n", " $$ K(v_*, v_i) = (1 - \\mathbb{E}(\\Vert v_* - v_i \\Vert^2))\\mathbf{1}_{\\mathbb{E}(\\Vert v_* - v_i \\Vert) \\leq 1}$$ \n", "\n", "\n", "where $\\mathbf{1}$ is the indicator function such that $\\mathbf{1}_{\\text{condition}} = 1$ when the condition holds and $0$ otherwise. The final semantic density score is computed as\n", "\n", "\n", " $$ SD(y_* | x) = \\dfrac{1}{\\sum^M_{i=1}\\sqrt[L_i]{p(y_i|x)}}\\sum^M_{i=1}\\sqrt[L_i]{p(y_i|x)}K(v_*, v_i)$$ \n", "\n", "\n", "where $L_i$ denotes the length of $y_i$.\n", "\n", "### P(True) (`p_true`)\n", "\n", "The P(True) scorer presents an LLM with a concatenation of a question and its own previous response. The LLM is asked to classify this statement as \"True\" or \"False.\" We derive this confidence score directly from the model's token probability for answering \"True\" (or equivalently, 1-P(\"False\") if the model answers \"False\"). For more on this scorer, refer to [Kadavath et al., 2022](https://arxiv.org/abs/2207.05221)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "© 2025 CVS Health and/or one of its affiliates. All rights reserved." 
] } ], "metadata": { "environment": { "kernel": "uqlm_my_test", "name": "workbench-notebooks.m126", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126" }, "kernelspec": { "display_name": "uqlm_my_test", "language": "python", "name": "uqlm_my_test" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 4 }