{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Tunable Ensemble for LLM Uncertainty (Advanced)\n", "\n", "
\n", "

\n", "Ensemble UQ methods combine multiple individual scorers to provide a more robust uncertainty estimate. They offer high flexibility and customizability, allowing you to tailor the ensemble to specific use cases. This ensemble can leverage any combination of black-box, white-box, or LLM-as-a-Judge scorers offered by uqlm. Below is a list of the available scorers:\n", "\n", "#### Black-Box (Consistency) Scorers\n", "* Non-Contradiction Probability ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Lin et al., 2025](https://arxiv.org/abs/2305.19187); [Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n", "* Semantic Negentropy (based on [Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/pdf/2302.09664))\n", "* Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))\n", "* BERTScore ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zhang et al., 2020](https://arxiv.org/abs/1904.09675))\n", "* BLEURT ([Sellam et al., 2020](https://arxiv.org/abs/2004.04696))\n", "* Normalized Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/pdf/2412.05563); [HuggingFace](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))\n", "\n", "#### White-Box (Token-Probability-Based) Scorers\n", "* Minimum Token Probability ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n", "* Length-Normalized Joint Token Probability ([Malinin & Gales, 2021](https://arxiv.org/pdf/2002.07650))\n", "\n", "#### LLM-as-a-Judge Scorers\n", "* Categorical LLM-as-a-Judge ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Luo et al., 2023](https://arxiv.org/pdf/2303.15621))\n", "* Continuous LLM-as-a-Judge ([Xiong et al., 2024](https://arxiv.org/pdf/2306.13063))\n", "

\n", "
\n", " \n", "## 📊 What You'll Do in This Demo\n", "\n", "
\n", "
1
\n", "
\n", "

Set up LLM and prompts.

\n", "

Set up LLM instance and load example data prompts.

\n", "
\n", "
\n", "\n", "
\n", "
2
\n", "
\n", "

Tune Ensemble Weights

\n", "

Tune the ensemble weights on a set of tuning prompts. You will execute a single UQEnsemble.tune() method that will generate responses, compute confidence scores, and optimize weights using a provided answer key corresponding to the provided questions.

\n", "
\n", "
\n", "\n", "
\n", "
3
\n", "
\n", "

Generate LLM Responses and Confidence Scores with Tuned Ensemble.

\n", "

Generate and score LLM responses to the example questions using the tuned UQEnsemble() object.

\n", "
\n", "
\n", "\n", "
\n", "
4
\n", "
\n", "

Evaluate Hallucination Detection Performance.

\n", "

Visualize LLM accuracy at different thresholds of the ensemble score that combines various scorers. Compute precision, recall, and F1-score of hallucination detection.

\n", "
\n", "
\n", "\n", "## ⚖️ Advantages & Limitations\n", "\n", "
\n", "
\n", "

Pros

\n", " \n", "
\n", " \n", "
\n", "

Cons

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "from uqlm import UQEnsemble\n", "from uqlm.utils import load_example_dataset, math_postprocessor, plot_model_accuracies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Set up LLM and Prompts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we will illustrate this approach using a set of math questions from the [GSM8K benchmark](https://github.com/openai/grade-school-math). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset - gsm8k...\n", "Processing dataset...\n", "Dataset ready!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0Natalia sold clips to 48 of her friends in Apr...72
1Weng earns $12 an hour for babysitting. Yester...10
2Betty is saving money for a new wallet which c...5
3Julie is reading a 120-page book. Yesterday, s...42
4James writes a 3-page letter to 2 different fr...624
\n", "
" ], "text/plain": [ " question answer\n", "0 Natalia sold clips to 48 of her friends in Apr... 72\n", "1 Weng earns $12 an hour for babysitting. Yester... 10\n", "2 Betty is saving money for a new wallet which c... 5\n", "3 Julie is reading a 120-page book. Yesterday, s... 42\n", "4 James writes a 3-page letter to 2 different fr... 624" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load example dataset (GSM8K)\n", "gsm8k = load_example_dataset(\"gsm8k\", n=100)\n", "gsm8k.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "gsm8k_tune = gsm8k.iloc[0:50]\n", "gsm8k_test = gsm8k.iloc[51:100]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define prompts\n", "MATH_INSTRUCTION = (\n", " \"When you solve this math problem only return the answer with no additional text.\\n\"\n", ")\n", "tune_prompts = [MATH_INSTRUCTION + prompt for prompt in gsm8k_tune.question]\n", "test_prompts = [MATH_INSTRUCTION + prompt for prompt in gsm8k_test.question]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use `ChatVertexAI` and `AzureChatOpenAI` to instantiate our LLMs, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. Be sure to **replace with your LLM of choice.**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install python-dotenv\n", "# !{sys.executable} -m pip install langchain-openai\n", "\n", "# # User to populate .env file with API credentials. 
In this step, replace with your LLM of choice.\n", "from dotenv import load_dotenv, find_dotenv\n", "from langchain_openai import AzureChatOpenAI\n", "\n", "load_dotenv(find_dotenv())\n", "gpt = AzureChatOpenAI(\n", " deployment_name=os.getenv(\"DEPLOYMENT_NAME\"),\n", " openai_api_key=os.getenv(\"API_KEY\"),\n", " azure_endpoint=os.getenv(\"API_BASE\"),\n", " openai_api_type=os.getenv(\"API_TYPE\"),\n", " openai_api_version=os.getenv(\"API_VERSION\"),\n", " temperature=1, # User to set temperature\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install langchain-google-vertexai\n", "from langchain_google_vertexai import ChatVertexAI\n", "\n", "gemini = ChatVertexAI(model=\"gemini-pro\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Tune Ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `UQEnsemble()` - Ensemble of uncertainty scorers\n", "\n", "#### ๐Ÿ“‹ Class Attributes\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ParameterType & DefaultDescription
llmBaseChatModel
default=None
A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object.
scorersList
default=None
Specifies which black-box, white-box, or LLM-as-a-Judge scorers to include in the ensemble. List containing instances of BaseChatModel, LLMJudge, black-box scorer names from ['semantic_negentropy', 'noncontradiction','exact_match', 'bert_score', 'bleurt', 'cosine_sim'], or white-box scorer names from [\"normalized_probability\", \"min_probability\"]. If None, defaults to the off-the-shelf BS Detector ensemble by Chen & Mueller, 2023 which uses components [\"noncontradiction\", \"exact_match\",\"self_reflection\"] with respective weights of [0.56, 0.14, 0.3].
devicestr or torch.device
default=\"cpu\"
Specifies the device that the NLI model uses for prediction. Only applies to the 'semantic_negentropy' and 'noncontradiction' scorers. Pass a torch.device to leverage GPU.
use_bestbool
default=True
Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses based on semantic entropy clusters. Only used if `scorers` includes 'semantic_negentropy' or 'noncontradiction'.
system_promptstr or None
default=\"You are a helpful assistant.\"
Optional argument for user to provide custom system prompt for the LLM.
max_calls_per_minint
default=None
Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified.
use_n_parambool
default=False
Specifies whether to use n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large.
postprocessorcallable
default=None
A user-defined function that takes a string input and returns a string. Used for postprocessing outputs.
sampling_temperaturefloat
default=1
The 'temperature' parameter the LLM uses when generating sampled responses. Must be greater than 0.
weightslist of floats
default=None
Specifies the weight for each component in the ensemble. If None and scorers is not None, defaults to equal weights for each scorer. These weights are updated when the tune method is executed.
nli_model_namestr
default=\"microsoft/deberta-large-mnli\"
Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().
\n", "\n", "#### 🔍 Parameter Groups\n", "\n", "
\n", "
\n", "

🧠 LLM-Specific

\n", " \n", "
\n", "
\n", "

📊 Confidence Scores

\n", " \n", "
\n", "
\n", "

🖥️ Hardware

\n", " \n", "
\n", "
\n", "

⚡ Performance

\n", " \n", "
\n", "
\n", "\n", "#### 💻 Usage Examples\n", "\n", "```python\n", "# Basic usage with default parameters\n", "uqe = UQEnsemble(llm=llm)\n", "\n", "# Using GPU acceleration\n", "uqe = UQEnsemble(llm=llm, device=torch.device(\"cuda\"))\n", "\n", "# Custom scorer list\n", "uqe = UQEnsemble(llm=llm, scorers=[\"bert_score\", \"exact_match\", llm])\n", "\n", "# High-throughput configuration with rate limiting\n", "uqe = UQEnsemble(llm=llm, max_calls_per_min=200, use_n_param=True) \n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n" ] } ], "source": [ "import torch\n", "\n", "# Set the torch device\n", "if torch.cuda.is_available(): # NVIDIA GPU\n", " device = torch.device(\"cuda\")\n", "elif torch.backends.mps.is_available(): # macOS\n", " device = torch.device(\"mps\")\n", "else:\n", " device = torch.device(\"cpu\") # CPU\n", "print(f\"Using {device.type} device\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']\n", "- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "scorers = [\n", " \"exact_match\", # Measures proportion of candidate responses that match original response (black-box)\n", " \"noncontradiction\", # mean non-contradiction probability between candidate responses and original response (black-box)\n", " \"normalized_probability\", # length-normalized joint token probability (white-box)\n", " gpt, # LLM-as-a-judge (self)\n", " gemini, # LLM-as-a-judge (separate LLM)\n", "]\n", "\n", "uqe = UQEnsemble(\n", " llm=gpt,\n", " device=device,\n", " max_calls_per_min=175,\n", " # postprocessor=math_postprocessor,\n", " use_n_param=True, # Set True if using AzureChatOpenAI for faster generation\n", " scorers=scorers,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### 🔄 Class Methods: Tuning\n", "\n", "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uqensemble_tune.png)\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MethodDescription & Parameters
UQEnsemble.tune\n", "

Generate responses from provided prompts, grade responses with provided grader function, and tune ensemble weights. If weights and threshold objectives match, joint optimization will happen. Otherwise, sequential optimization will happen. If an optimization problem has fewer than three choice variables, grid search will happen.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (list of str) A list of input prompts for the model.
  • \n", "
  • ground_truth_answers - (List[str]) A list of ideal (correct) responses.
  • \n", "
  • grader_function - (callable, default=None) A user-defined function that takes a response and a ground truth 'answer' and returns a boolean indicator of whether the response is correct. If not provided, vectara's HHEM is used: https://huggingface.co/vectara/hallucination_evaluation_model
  • \n", "
  • num_responses - (int, default=5) The number of sampled responses used to compute consistency.
  • \n", "
  • thresh_objective - (str, default='fbeta_score') Objective function for threshold optimization via grid search. One of {'fbeta_score', 'accuracy_score', 'balanced_accuracy_score', 'roc_auc', 'log_loss'}.
  • \n", "
  • thresh_bounds - (tuple of floats, default=(0,1)) Bounds to search for threshold.
  • \n", "
  • n_trials - (int, default=100) Indicates how many trials to search over with optuna optimizer
  • \n", "
  • step_size - (float, default=0.01) Indicates step size in grid search, if used.
  • \n", "
  • fscore_beta - (float, default=1) Value of beta in fbeta_score.
  • \n", "
\n", "

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Tuning an optimized ensemble for detecting hallucinations in a specific use case.\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that below, we are providing a grader function that is specific to our use case (math questions). If you are running this example notebook with your own prompts/questions, update the grader function accordingly. Note that the default grader function, `vectara/hallucination_evaluation_model`, is used if no grader function is provided and generally works well across use cases. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [], "source": [ "def grade_response(response: str, answer: str) -> bool:\n", " return (math_postprocessor(response) == answer)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating responses...\n", "Generating candidate responses...\n", "Computing confidence scores...\n", "Generating LLMJudge scores...\n", "Generating LLMJudge scores...\n", "Grading responses with grader function...\n", "Optimizing weights...\n", "Optimizing threshold with grid search...\n" ] } ], "source": [ "tune_results = await uqe.tune(\n", " prompts=tune_prompts, # prompts for tuning (responses will be generated from these prompts)\n", " ground_truth_answers=gsm8k_tune[\"answer\"], # correct answers to 'grade' LLM responses against\n", " grader_function=grade_response, # grader function to grade responses against provided answers\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
promptresponsesampled_responsesensemble_scoreexact_matchnoncontradictionnormalized_probabilityjudge_1judge_2
0When you solve this math problem only return t...72[72, 72, 72, 72, 72]0.9525661.01.0000000.9991881.00.5
1When you solve this math problem only return t...$10[$10, $10, $10, $10, $10]0.8950371.01.0000000.9990190.00.5
2When you solve this math problem only return t...$20[$20, $20, $20, $20, $10]0.7628770.80.8013010.9465841.00.0
3When you solve this math problem only return t...48[48, 48, 48, 48, 48]0.9417981.01.0000000.9960910.01.0
4When you solve this math problem only return t...624[624, 624, 624, 624, 624]0.9999691.01.0000000.9998281.01.0
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 72 \n", "1 When you solve this math problem only return t... $10 \n", "2 When you solve this math problem only return t... $20 \n", "3 When you solve this math problem only return t... 48 \n", "4 When you solve this math problem only return t... 624 \n", "\n", " sampled_responses ensemble_score exact_match noncontradiction \\\n", "0 [72, 72, 72, 72, 72] 0.952566 1.0 1.000000 \n", "1 [$10, $10, $10, $10, $10] 0.895037 1.0 1.000000 \n", "2 [$20, $20, $20, $20, $10] 0.762877 0.8 0.801301 \n", "3 [48, 48, 48, 48, 48] 0.941798 1.0 1.000000 \n", "4 [624, 624, 624, 624, 624] 0.999969 1.0 1.000000 \n", "\n", " normalized_probability judge_1 judge_2 \n", "0 0.999188 1.0 0.5 \n", "1 0.999019 0.0 0.5 \n", "2 0.946584 1.0 0.0 \n", "3 0.996091 0.0 1.0 \n", "4 0.999828 1.0 1.0 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = tune_results.to_df()\n", "result_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Weight for exact_match: 0.14848838407926002\n", "Weight for noncontradiction: 0.5195905565435162\n", "Weight for normalized_probability: 0.17984565170407496\n", "Weight for judge_1: 0.05749867602904491\n", "Weight for judge_2: 0.09457673164410399\n" ] } ], "source": [ "for i, weight in enumerate(uqe.weights):\n", " print(f\"Weight for {uqe.component_names[i]}: {weight}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Generate LLM Responses and Confidence Scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we will generate responses and corresponding confidence scores on a holdout set using the tuned ensemble." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🔄 Class Methods: Generation + Scoring\n", "\n", "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uqensemble_generate_score.png)\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MethodDescription & Parameters
UQEnsemble.generate_and_score\n", "

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (list of str) A list of input prompts for the model.
  • \n", "
  • num_responses - (int, default=5) The number of sampled responses used to compute consistency.
  • \n", "
\n", "

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n", "
\n", "
UQEnsemble.score\n", "

Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (list of str) A list of input prompts for the LLM.
  • \n", "
  • responses - (list of str) A list of LLM responses for the prompts.
  • \n", "
  • sampled_responses - (list of list of str, default=None) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses. Must be provided if using Black-Box scorers.
  • \n", "
  • logprobs_results - (list of logprobs_result, default=None) List of lists of dictionaries, each returned by BaseChatModel.agenerate. Must be provided if using white box scorers.
  • \n", "
\n", "

Returns: UQResult containing data (responses, sampled responses, and confidence scores) and metadata

\n", "
\n", " 💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating responses...\n", "Generating candidate responses...\n", "Computing confidence scores...\n", "Generating LLMJudge scores...\n", "Generating LLMJudge scores...\n" ] } ], "source": [ "test_results = await uqe.generate_and_score(prompts=test_prompts, num_responses=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Again, note that the `grade_response` function is specific to our use case (math questions). **If you are using your own prompts/questions, update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
promptresponsesampled_responsesensemble_scoreexact_matchnoncontradictionnormalized_probabilityjudge_1judge_2response_correct
0When you solve this math problem only return t...160[68, 176, 152, 80, 72]0.0300600.00.0211500.1060420.00.0True
1When you solve this math problem only return t...12[12, 14, 18, 11, 13]0.1586900.20.2406500.0219820.00.0False
2When you solve this math problem only return t...$36[$36, $36, $36, $36, 36]0.8708010.80.9942310.9892871.00.0True
3When you solve this math problem only return t...9[$3, $9, 3, $10, 9]0.3591670.20.4524590.2050471.00.0False
4When you solve this math problem only return t...75%[75%, 75., 75%, 75%, 75%]0.8733140.80.9987180.9902971.00.0True
\n", "
" ], "text/plain": [ " prompt response \\\n", "0 When you solve this math problem only return t... 160 \n", "1 When you solve this math problem only return t... 12 \n", "2 When you solve this math problem only return t... $36 \n", "3 When you solve this math problem only return t... 9 \n", "4 When you solve this math problem only return t... 75% \n", "\n", " sampled_responses ensemble_score exact_match noncontradiction \\\n", "0 [68, 176, 152, 80, 72] 0.030060 0.0 0.021150 \n", "1 [12, 14, 18, 11, 13] 0.158690 0.2 0.240650 \n", "2 [$36, $36, $36, $36, 36] 0.870801 0.8 0.994231 \n", "3 [$3, $9, 3, $10, 9] 0.359167 0.2 0.452459 \n", "4 [75%, 75., 75%, 75%, 75%] 0.873314 0.8 0.998718 \n", "\n", " normalized_probability judge_1 judge_2 response_correct \n", "0 0.106042 0.0 0.0 True \n", "1 0.021982 0.0 0.0 False \n", "2 0.989287 1.0 0.0 True \n", "3 0.205047 1.0 0.0 False \n", "4 0.990297 1.0 0.0 True " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_result_df = test_results.to_df()\n", "test_result_df[\"response_correct\"] = [ \n", " grade_response(r, a) for r, a in zip(test_result_df[\"response\"], gsm8k_test[\"answer\"])\n", "]\n", "test_result_df.head(5)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline LLM accuracy: 0.5714285714285714\n" ] } ], "source": [ "print(f\"\"\"Baseline LLM accuracy: {np.mean(test_result_df[\"response_correct\"])}\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.1 Filtered LLM Accuracy Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our confidence scores. Filtered accuracy measures the change in LLM performance when responses with confidence scores below a specified threshold are excluded. 
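As a rough illustration of that definition, the sketch below computes filtered accuracy and the remaining sample size for a single threshold. It is a minimal, assumption-laden example rather than the code behind the plotting helper: `filtered_accuracy` is a hypothetical name, and the inputs are the five ensemble scores and correctness flags from the preview table above.

```python
import math


def filtered_accuracy(scores, correct, threshold):
    '''Accuracy over responses whose confidence score is at least `threshold`.

    Returns (accuracy, n_kept); accuracy is NaN when nothing survives the filter.
    Hypothetical helper for illustration only.
    '''
    kept = [bool(c) for s, c in zip(scores, correct) if s >= threshold]
    if not kept:
        return math.nan, 0
    return sum(kept) / len(kept), len(kept)


# Ensemble scores and correctness flags from the five preview rows above.
scores = [0.030060, 0.158690, 0.870801, 0.359167, 0.873314]
correct = [True, False, True, False, True]

for threshold in (0.0, 0.5):
    accuracy, n_kept = filtered_accuracy(scores, correct, threshold)
    print(f'threshold={threshold}: accuracy={accuracy:.2f}, n={n_kept}')
```

Raising the threshold trades coverage for accuracy: fewer responses remain, and the ones that remain are those the ensemble scored most confidently.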
By adjusting the confidence score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.\n", "\n", "We will plot the filtered accuracy across various confidence score thresholds to visualize the relationship between confidence and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_model_accuracies(\n", " scores=test_result_df.ensemble_score,\n", " correct_indicators=test_result_df.response_correct,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2 Precision, Recall, F1-Score of Hallucination Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we retrieve the F1-optimal threshold for binarizing confidence scores that was selected during tuning. Using this threshold, we compute precision, recall, and F1-score for the ensemble's predictions of whether responses are correct." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ensemble F1-optimal threshold: 0.36\n" ] } ], "source": [ "# extract optimal threshold\n", "best_threshold = uqe.thresh\n", "\n", "# Define score vector and corresponding correct indicators (i.e. ground truth)\n", "y_scores = test_result_df[\"ensemble_score\"] # confidence score\n", "correct_indicators = (\n", " test_result_df.response_correct\n", ") * 1 # Whether each response is actually correct\n", "y_pred = [\n", " (s > best_threshold) * 1 for s in y_scores\n", "] # predicts whether response is correct based on confidence score\n", "print(f\"Ensemble F1-optimal threshold: {best_threshold}\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ensemble precision: 0.6585365853658537\n", "Ensemble recall: 0.9642857142857143\n", "Ensemble f1-score: 0.782608695652174\n" ] } ], "source": [ "# evaluate precision, recall, and f1-score of ensemble score predictions of correctness\n", "print(\n", " f\"Ensemble precision: {precision_score(y_true=correct_indicators, y_pred=y_pred)}\"\n", ")\n", "print(f\"Ensemble recall: {recall_score(y_true=correct_indicators, y_pred=y_pred)}\")\n", "print(f\"Ensemble f1-score: 
{f1_score(y_true=correct_indicators, y_pred=y_pred)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Scorer Definitions\n", "\n", "### Black-Box Scorers\n", "Black-Box UQ scorers exploit variation in LLM responses to the same prompt to measure semantic consistency. All scorers have outputs ranging from 0 to 1, with higher values indicating higher confidence. \n", "\n", "For a given prompt $x_i$, these approaches involve generating $m$ responses $\\tilde{\\mathbf{y}}_i = \\{ \\tilde{y}_{i1},...,\\tilde{y}_{im}\\}$, using a non-zero temperature, from the same prompt and comparing these responses to the original response $y_{i}$. We provide detailed descriptions of each below.\n", "\n", "#### Exact Match Rate (`exact_match`)\n", "Exact Match Rate (EMR) computes the proportion of candidate responses that are identical to the original response.\n", "$$ EMR(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m \\mathbb{I}(y_i=\\tilde{y}_{ij}). $$\n", "\n", "For more on this scorer, refer to [Cole et al., 2023](https://arxiv.org/abs/2305.14613).\n", "\n", "#### Non-Contradiction Probability (`noncontradiction`)\n", "Non-contradiction probability (NCP) computes the mean non-contradiction probability estimated by a natural language inference (NLI) model. This score is formally defined as follows:\n", "\n", "\\begin{equation}\n", " NCP(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m(1 - p_j)\n", "\\end{equation}\n", "where\n", "\n", "\\begin{equation}\n", " p_j = \\frac{\\eta(y_{i}, \\tilde{y}_{ij}) + \\eta(\\tilde{y}_{ij},y_i)}{2}.\n", "\\end{equation}\n", "\n", "Above, $\\eta(\\tilde{y}_{ij},y_i)$ denotes the contradiction probability estimated by the NLI model for response $y_i$ and candidate $\\tilde{y}_{ij}$. 
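The EMR and NCP formulas above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the uqlm implementation: the function names are hypothetical, and the NLI contradiction probabilities for both orderings of each pair are assumed to be precomputed and passed in as lists.

```python
from typing import Sequence


def exact_match_rate(response: str, candidates: Sequence[str]) -> float:
    '''EMR: share of sampled candidates identical to the original response.'''
    if not candidates:
        raise ValueError('at least one candidate response is required')
    return sum(c == response for c in candidates) / len(candidates)


def noncontradiction_probability(
    eta_forward: Sequence[float], eta_backward: Sequence[float]
) -> float:
    '''NCP: mean of 1 - p_j, where p_j averages both contradiction directions.

    eta_forward[j] is the assumed NLI contradiction probability for
    (response, candidate j); eta_backward[j] is the reverse ordering.
    '''
    if not eta_forward or len(eta_forward) != len(eta_backward):
        raise ValueError('need two non-empty lists of equal length')
    p = [(f + b) / 2 for f, b in zip(eta_forward, eta_backward)]
    return sum(1 - p_j for p_j in p) / len(p)


# Response and samples taken from row 2 of the tuning output shown earlier.
emr = exact_match_rate('$20', ['$20', '$20', '$20', '$20', '$10'])
print(emr)

# Made-up contradiction probabilities, for illustration only.
ncp = noncontradiction_probability([0.0, 0.5], [0.0, 0.5])
print(ncp)
```

The first call reproduces the `exact_match` value of 0.8 reported for that row; the contradiction probabilities in the second call are invented purely to exercise the formula.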
For more on this scorer, refer to [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175), [Lin et al., 2025](https://arxiv.org/abs/2305.19187), or [Manakul et al., 2023](https://arxiv.org/abs/2303.08896).\n", "\n", "#### Normalized Semantic Negentropy (`semantic_negentropy`)\n", "Normalized Semantic Negentropy (NSN) normalizes the standard computation of discrete semantic entropy to be increasing with higher confidence and have [0,1] support. In contrast to EMR and NCP, semantic entropy does not distinguish between an original response and candidate responses. Instead, this approach computes a single metric value on a list of responses generated from the same prompt. Under this approach, responses are clustered using an NLI model based on mutual entailment. We consider the discrete version of semantic entropy (SE), which is computed over the final set of clusters as follows:\n", "\n", "\\begin{equation}\n", " SE(y_i; \\tilde{\\mathbf{y}}_i) = - \\sum_{C \\in \\mathcal{C}} P(C|y_i, \\tilde{\\mathbf{y}}_i)\\log P(C|y_i, \\tilde{\\mathbf{y}}_i),\n", "\\end{equation}\n", "where $P(C|y_i, \\tilde{\\mathbf{y}}_i)$ denotes the probability a randomly selected response $y \\in \\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i $ belongs to cluster $C$, and $\\mathcal{C}$ denotes the full set of clusters of $\\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i$.\n", "\n", "To ensure that we have a normalized confidence score with $[0,1]$ support and with higher values corresponding to higher confidence, we implement the following normalization to arrive at *Normalized Semantic Negentropy* (NSN):\n", "\\begin{equation}\n", " NSN(y_i; \\tilde{\\mathbf{y}}_i) = 1 - \\frac{SE(y_i; \\tilde{\\mathbf{y}}_i)}{\\log (m + 1)},\n", "\\end{equation}\n", "where $\\log (m + 1)$, the maximum entropy attainable over the $m + 1$ responses in $\\{y_i\\} \\cup \\tilde{\\mathbf{y}}_i$, is included to normalize the support.\n", "\n", "#### BERTScore (`bert_score`)\n", "Let a tokenized text sequence be denoted as $\\textbf{t} = \\{t_1,...,t_L\\}$ and the corresponding contextualized word embeddings as $\\textbf{E} = \\{\\textbf{e}_1,...,\\textbf{e}_L\\}$, 
where $L$ is the number of tokens in the text. The BERTScore precision, recall, and F1-scores between two tokenized texts $\\textbf{t}, \\textbf{t}'$ are respectively defined as follows:\n", "\n", "\\begin{equation}\n", " BertP(\\textbf{t}, \\textbf{t}') = \\frac{1}{| \\textbf{t}|} \\sum_{t \\in \\textbf{t}} \\max_{t' \\in \\textbf{t}'} \\textbf{e} \\cdot \\textbf{e}'\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " BertR(\\textbf{t}, \\textbf{t}') = \\frac{1}{| \\textbf{t}'|} \\sum_{t' \\in \\textbf{t}'} \\max_{t \\in \\textbf{t}} \\textbf{e} \\cdot \\textbf{e}'\n", "\\end{equation}\n", "\n", "\\begin{equation}\n", " BertF(\\textbf{t}, \\textbf{t}') = 2\\frac{ BertP(\\textbf{t}, \\textbf{t}') BertR(\\textbf{t}, \\textbf{t}')}{BertP(\\textbf{t}, \\textbf{t}') + BertR(\\textbf{t}, \\textbf{t}')},\n", "\\end{equation}\n", "where $\\textbf{e}, \\textbf{e}'$ are the embeddings corresponding to tokens $t, t'$, respectively. We compute our BERTScore-based confidence scores as follows:\n", "\\begin{equation}\n", " BertConfidence(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m BertF(y_i, \\tilde{y}_{ij}),\n", "\\end{equation}\n", "i.e., the average BERTScore F1 across pairings of the original response with all candidate responses. For more on BERTScore, refer to [Zhang et al., 2020](https://arxiv.org/abs/1904.09675).\n", "\n", "#### BLEURT (`bleurt`)\n", "In contrast to the aforementioned scorers, BLEURT is specifically pre-trained and fine-tuned to learn human judgments of text similarity. 
Our BLEURT confidence score is the average BLEURT value across pairings of the original response with all candidate responses:\n", "\n", "\\begin{equation}\n", " BLEURTConfidence(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m BLEURT(y_i, \\tilde{y}_{ij}).\n", "\\end{equation}\n", "\n", "For more on this scorer, refer to [Sellam et al., 2020](https://arxiv.org/abs/2004.04696).\n", "\n", "\n", "#### Normalized Cosine Similarity (`cosine_sim`)\n", "This scorer leverages a sentence transformer to map LLM outputs to an embedding space and measure similarity using those sentence embeddings. Let $V: \\mathcal{Y} \\xrightarrow{} \\mathbb{R}^d$ denote the sentence transformer, where $d$ is the dimension of the embedding space. The average cosine similarity across pairings of the original response with all candidate responses is given as follows:\n", "\n", "\\begin{equation}\n", " CS(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{1}{m} \\sum_{j=1}^m \\frac{\\mathbf{V}(y_i) \\cdot \\mathbf{V}(\\tilde{y}_{ij}) }{ \\lVert \\mathbf{V}(y_i) \\rVert \\lVert \\mathbf{V}(\\tilde{y}_{ij}) \\rVert}.\n", "\\end{equation}\n", "\n", "To ensure a standardized support of $[0, 1]$, we normalize cosine similarity to obtain confidence scores as follows:\n", "\n", "\\begin{equation}\n", " NCS(y_i; \\tilde{\\mathbf{y}}_i) = \\frac{CS(y_i; \\tilde{\\mathbf{y}}_i) + 1}{2}.\n", "\\end{equation}\n", "\n", "\n", "### White-Box UQ Scorers\n", "White-box UQ scorers leverage token probabilities of the LLM's generated response to quantify uncertainty. All scorers have outputs ranging from 0 to 1, with higher values indicating higher confidence. We define two white-box UQ scorers below.\n", "\n", "#### Length-Normalized Token Probability (`normalized_probability`)\n", "Let the tokenization of LLM response $y_i$ be denoted as $\\{t_1,...,t_{L_i}\\}$, where $L_i$ denotes the number of tokens in the response. 
Length-normalized token probability (LNTP) computes a length-normalized analog of joint token probability:\n", "\n", "\\begin{equation}\n", " LNTP(y_i) = \\prod_{t \\in y_i} p_t^{\\frac{1}{L_i}},\n", "\\end{equation}\n", "where $p_t$ denotes the token probability for token $t$. Note that this score is equivalent to the geometric mean of token probabilities for response $y_i$. For more on this scorer, refer to [Malinin & Gales, 2021](https://arxiv.org/pdf/2002.07650).\n", "\n", "\n", "#### Minimum Token Probability (`min_probability`)\n", "Minimum token probability (MTP) uses the minimum among token probabilities for a given response as a confidence score:\n", "\n", "\\begin{equation}\n", " MTP(y_i) = \\min_{t \\in y_i} p_t,\n", "\\end{equation}\n", "where $t$ and $p_t$ follow the same definitions as above. For more on this scorer, refer to [Manakul et al., 2023](https://arxiv.org/abs/2303.08896).\n", "\n", "### LLM-as-a-Judge Scorers\n", "Under the LLM-as-a-Judge approach, either the same LLM that was used for generating the original responses or a different LLM is asked to form a judgment about a pre-generated response. Below, we define two LLM-as-a-Judge scorer templates. \n", "#### Categorical Judge Template (`true_false_uncertain`)\n", "We follow the approach proposed by [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175) in which an LLM is instructed to score a question-answer concatenation as either *incorrect*, *uncertain*, or *correct* using a carefully constructed prompt. These categories are respectively mapped to numerical scores of 0, 0.5, and 1. We denote this LLM-as-a-Judge scorer as $J: \\mathcal{Y} \\xrightarrow[]{} \\{0, 0.5, 1\\}$. 
Formally, we can write this scorer function as follows:\n", "\n", "\\begin{equation}\n", "J(y_i) = \\begin{cases}\n", " 0 & \\text{LLM states response is incorrect} \\\\\n", " 0.5 & \\text{LLM states that it is uncertain} \\\\\n", " 1 & \\text{LLM states response is correct}.\n", "\\end{cases}\n", "\\end{equation}\n", "\n", "#### Continuous Judge Template (`continuous`)\n", "For the continuous template, the LLM is asked to directly score a question-answer concatenation's correctness on a scale of 0 to 1. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "© 2025 CVS Health and/or one of its affiliates. All rights reserved." ] } ], "metadata": { "environment": { "kernel": "uqlm", "name": "workbench-notebooks.m126", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126" }, "kernelspec": { "display_name": "uqlm", "language": "python", "name": "uqlm" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 4 }