{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 White-Box Uncertainty Quantification\n", "\n", "
\n", "

\n", " White-box Uncertainty Quantification (UQ) methods leverage token probabilities to estimate uncertainty. They are significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs. This demo provides an illustration of how to use state-of-the-art white-box UQ methods with uqlm. The following scorers are available:\n", "

\n", " \n", "* Minimum token probability ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n", "* Length-Normalized Joint Token Probability ([Malinin & Gales, 2021](https://arxiv.org/pdf/2002.07650))\n", "
\n", "\n", "## 📊 What You'll Do in This Demo\n", "\n", "
\n", "
1
\n", "
\n", "

Set up LLM and prompts.

\n", "

Set up LLM instance and load example data prompts.

\n", "
\n", "
\n", "\n", "
\n", "
2
\n", "
\n", "

Generate LLM Responses and Confidence Scores

\n", "

Generate and score LLM responses to the example questions using the WhiteBoxUQ() class.

\n", "
\n", "
\n", "\n", "
\n", "
3
\n", "
\n", "

Evaluate Hallucination Detection Performance

\n", "

Visualize model accuracy at different thresholds of the various white-box UQ confidence scores. Compute precision, recall, and F1-score of hallucination detection.

\n", "
\n", "
\n", "\n", "## ⚖️ Advantages & Limitations\n", "\n", "
\n", "
\n", "

Pros

\n", " \n", "
\n", " \n", "
\n", "

Cons

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "from uqlm import WhiteBoxUQ\n", "from uqlm.utils import load_example_dataset, math_postprocessor, plot_model_accuracies, Tuner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Set up LLM and Prompts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we will illustrate this approach using a set of math questions from the [GSM8K benchmark](https://github.com/openai/grade-school-math). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset - gsm8k...\n", "Processing dataset...\n", "Dataset ready!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | question | answer |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in Apr... | 72 |
| 1 | Weng earns $12 an hour for babysitting. Yester... | 10 |
| 2 | Betty is saving money for a new wallet which c... | 5 |
| 3 | Julie is reading a 120-page book. Yesterday, s... | 42 |
| 4 | James writes a 3-page letter to 2 different fr... | 624 |
\n", "
" ], "text/plain": [ " question answer\n", "0 Natalia sold clips to 48 of her friends in Apr... 72\n", "1 Weng earns $12 an hour for babysitting. Yester... 10\n", "2 Betty is saving money for a new wallet which c... 5\n", "3 Julie is reading a 120-page book. Yesterday, s... 42\n", "4 James writes a 3-page letter to 2 different fr... 624" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load example dataset (gsm8k)\n", "gsm8k = load_example_dataset(\"gsm8k\", n=100)\n", "gsm8k.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define prompts\n", "MATH_INSTRUCTION = (\n", " \"When you solve this math problem only return the answer with no additional text.\\n\"\n", ")\n", "prompts = [MATH_INSTRUCTION + prompt for prompt in gsm8k.question]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use `AzureChatOpenAI` to instantiate our LLM, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. Be sure to **replace with your LLM of choice.**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install python-dotenv\n", "# !{sys.executable} -m pip install langchain-openai\n", "\n", "# # User to populate .env file with API credentials\n", "from dotenv import load_dotenv, find_dotenv\n", "from langchain_openai import AzureChatOpenAI\n", "\n", "load_dotenv(find_dotenv())\n", "llm = AzureChatOpenAI(\n", " deployment_name=os.getenv(\"DEPLOYMENT_NAME\"),\n", " openai_api_key=os.getenv(\"API_KEY\"),\n", " azure_endpoint=os.getenv(\"API_BASE\"),\n", " openai_api_type=os.getenv(\"API_TYPE\"),\n", " openai_api_version=os.getenv(\"API_VERSION\"),\n", " temperature=1, # User to set temperature\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. 
Generate responses and confidence scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `WhiteBoxUQ()` - Generate LLM responses and compute token-probability-based confidence scores for each response.\n", "\n", "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/white_box_graphic.png)\n", "\n", "#### 📋 Class Attributes\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Parameter | Type & Default | Description |
|---|---|---|
| llm | BaseChatModel, default=None | A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of their `llm` object. |
| scorers | List[str], default=None | Specifies which white-box (token-probability-based) scorers to include. Must be subset of {\"normalized_probability\", \"min_probability\"}. If None, defaults to all. |
| system_prompt | str or None, default=\"You are a helpful assistant.\" | Optional argument for user to provide custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
\n", "\n", "#### 🔍 Parameter Groups\n", "\n", "
\n", "
\n", "

🧠 Model-Specific

\n", " \n", "
\n", "
\n", "

📊 Confidence Scores

\n", " \n", "
\n", "
\n", "

⚡ Performance

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [], "source": [ "wbuq = WhiteBoxUQ(llm=llm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🔄 Class Methods\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MethodDescription & Parameters
WhiteBoxUQ.generate_and_score\n", "

Generate LLM responses and compute confidence scores for the provided prompts.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (list of str) A list of input prompts for the model.
\n", "
\n", "

Returns: UQResult containing data (prompts, responses, log probabilities, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating responses...\n" ] } ], "source": [ "results = await wbuq.generate_and_score(prompts=prompts)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | prompt | response | logprob | normalized_probability | min_probability |
|---|---|---|---|---|---|
| 0 | When you solve this math problem only return t... | 72 | [{'token': '72', 'bytes': [55, 50], 'logprob':... | 0.999949 | 0.999949 |
| 1 | When you solve this math problem only return t... | $10 | [{'token': '$', 'bytes': [36], 'logprob': -0.0... | 0.999398 | 0.998797 |
| 2 | When you solve this math problem only return t... | $20 | [{'token': '$', 'bytes': [36], 'logprob': -0.0... | 0.945383 | 0.900076 |
| 3 | When you solve this math problem only return t... | 48 | [{'token': '48', 'bytes': [52, 56], 'logprob':... | 0.996684 | 0.996684 |
| 4 | When you solve this math problem only return t... | 624 | [{'token': '624', 'bytes': [54, 50, 52], 'logp... | 0.999926 | 0.999926 |
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 72 \n", "1 When you solve this math problem only return t... $10 \n", "2 When you solve this math problem only return t... $20 \n", "3 When you solve this math problem only return t... 48 \n", "4 When you solve this math problem only return t... 624 \n", "\n", " logprob normalized_probability \\\n", "0 [{'token': '72', 'bytes': [55, 50], 'logprob':... 0.999949 \n", "1 [{'token': '$', 'bytes': [36], 'logprob': -0.0... 0.999398 \n", "2 [{'token': '$', 'bytes': [36], 'logprob': -0.0... 0.945383 \n", "3 [{'token': '48', 'bytes': [52, 56], 'logprob':... 0.996684 \n", "4 [{'token': '624', 'bytes': [54, 50, 52], 'logp... 0.999926 \n", "\n", " min_probability \n", "0 0.999949 \n", "1 0.998797 \n", "2 0.900076 \n", "3 0.996684 \n", "4 0.999926 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = results.to_df()\n", "result_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note the `math_postprocessor` is specific to our use case (math questions). **If you are using your own prompts/questions, update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | prompt | response | logprob | normalized_probability | min_probability | answer | response_correct |
|---|---|---|---|---|---|---|---|
| 0 | When you solve this math problem only return t... | 72 | [{'token': '72', 'bytes': [55, 50], 'logprob':... | 0.999949 | 0.999949 | 72 | True |
| 1 | When you solve this math problem only return t... | $10 | [{'token': '$', 'bytes': [36], 'logprob': -0.0... | 0.999398 | 0.998797 | 10 | True |
| 2 | When you solve this math problem only return t... | $20 | [{'token': '$', 'bytes': [36], 'logprob': -0.0... | 0.945383 | 0.900076 | 5 | False |
| 3 | When you solve this math problem only return t... | 48 | [{'token': '48', 'bytes': [52, 56], 'logprob':... | 0.996684 | 0.996684 | 42 | False |
| 4 | When you solve this math problem only return t... | 624 | [{'token': '624', 'bytes': [54, 50, 52], 'logp... | 0.999926 | 0.999926 | 624 | True |
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 72 \n", "1 When you solve this math problem only return t... $10 \n", "2 When you solve this math problem only return t... $20 \n", "3 When you solve this math problem only return t... 48 \n", "4 When you solve this math problem only return t... 624 \n", "\n", " logprob normalized_probability \\\n", "0 [{'token': '72', 'bytes': [55, 50], 'logprob':... 0.999949 \n", "1 [{'token': '$', 'bytes': [36], 'logprob': -0.0... 0.999398 \n", "2 [{'token': '$', 'bytes': [36], 'logprob': -0.0... 0.945383 \n", "3 [{'token': '48', 'bytes': [52, 56], 'logprob':... 0.996684 \n", "4 [{'token': '624', 'bytes': [54, 50, 52], 'logp... 0.999926 \n", "\n", " min_probability answer response_correct \n", "0 0.999949 72 True \n", "1 0.998797 10 True \n", "2 0.900076 5 False \n", "3 0.996684 42 False \n", "4 0.999926 624 True " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Populate correct answers\n", "result_df[\"answer\"] = gsm8k.answer\n", "\n", "# Grade responses against correct answers\n", "result_df[\"response_correct\"] = [\n", " math_postprocessor(r) == a for r, a in zip(result_df[\"response\"], gsm8k[\"answer\"])\n", "]\n", "result_df.head(5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline LLM accuracy: 0.51\n" ] } ], "source": [ "print(f\"\"\"Baseline LLM accuracy: {np.mean(result_df[\"response_correct\"])}\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Filtered LLM Accuracy Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. 
By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.\n", "\n", "We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs. We conduct this analysis separately for each of our scorers. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for scorer in [\"normalized_probability\", \"min_probability\"]:\n", " plot_model_accuracies(\n", " scores=result_df[scorer],\n", " correct_indicators=result_df.response_correct,\n", " title=f\"LLM Accuracy by {scorer} Score Threshold\",\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Precision, Recall, F1-Score of Hallucination Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. Using this threshold, we compute precision, recall, and F1-score for white-box scorer predictions of whether responses are correct." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "normalized_probability score F1-optimal threshold: 0.5700000000000001\n", " \n", "normalized_probability precision: 0.8070175438596491\n", "normalized_probability recall: 0.9019607843137255\n", "normalized_probability f1-score: 0.8518518518518519\n", " \n", " \n", "min_probability score F1-optimal threshold: 0.51\n", " \n", "min_probability precision: 0.8076923076923077\n", "min_probability recall: 0.8235294117647058\n", "min_probability f1-score: 0.8155339805825242\n", " \n", " \n" ] } ], "source": [ "# instantiate UQLM tuner object for threshold selection\n", "t = Tuner()\n", "\n", "correct_indicators = (\n", " result_df.response_correct\n", ") * 1 # Whether each response is actually correct\n", "for scorer in [\"normalized_probability\", \"min_probability\"]:\n", " y_scores = result_df[scorer] # confidence score\n", "\n", " # Solve for threshold that maximizes F1-score\n", " best_threshold = t.tune_threshold(\n", " y_scores=y_scores,\n", " correct_indicators=correct_indicators,\n", " thresh_objective=\"fbeta_score\",\n", " )\n", " y_pred = [\n", " (s > best_threshold) * 1 for s in 
y_scores\n", " ] # predicts whether response is correct based on confidence score\n", " print(f\"{scorer} score F1-optimal threshold: {best_threshold}\")\n", " print(\" \")\n", "\n", " # evaluate precision, recall, and f1-score of predictions of correctness\n", " print(\n", " f\"{scorer} precision: {precision_score(y_true=correct_indicators, y_pred=y_pred)}\"\n", " )\n", " print(f\"{scorer} recall: {recall_score(y_true=correct_indicators, y_pred=y_pred)}\")\n", " print(f\"{scorer} f1-score: {f1_score(y_true=correct_indicators, y_pred=y_pred)}\")\n", " print(\" \")\n", " print(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Scorer Definitions\n", "White-box UQ scorers leverage token probabilities of the LLM's generated response to quantify uncertainty. All scorers have outputs ranging from 0 to 1, with higher values indicating higher confidence. We define two white-box UQ scorers below.\n", "\n", "### Length-Normalized Token Probability (`normalized_probability`)\n", "Let the tokenization of LLM response $y_i$ be denoted as $\\{t_1,...,t_{L_i}\\}$, where $L_i$ denotes the number of tokens in the response. Length-normalized token probability (LNTP) computes a length-normalized analog of joint token probability:\n", "\n", "\\begin{equation}\n", " LNTP(y_i) = \\prod_{t \\in y_i} p_t^{\\frac{1}{L_i}},\n", "\\end{equation}\n", "where $p_t$ denotes the token probability for token $t$. Note that this score is equivalent to the geometric mean of token probabilities for response $y_i$. For more on this scorer, refer to [Malinin & Gales, 2021](https://arxiv.org/pdf/2002.07650).\n", "\n", "\n", "### Minimum Token Probability (`min_probability`)\n", "Minimum token probability (MTP) uses the minimum among token probabilities for a given response as a confidence score:\n", "\n", "\\begin{equation}\n", " MTP(y_i) = \\min_{t \\in y_i} p_t,\n", "\\end{equation}\n", "where $t$ and $p_t$ follow the same definitions as above. 
For more on this scorer, refer to [Manakul et al., 2023](https://arxiv.org/abs/2303.08896)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "© 2025 CVS Health and/or one of its affiliates. All rights reserved." ] } ], "metadata": { "environment": { "kernel": "uqlm", "name": "workbench-notebooks.m126", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126" }, "kernelspec": { "display_name": "uqlm", "language": "python", "name": "uqlm" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 4 }