{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 White-Box Uncertainty Quantification\n", "\n", "
\n",
" White-box Uncertainty Quantification (UQ) methods leverage token probabilities to estimate uncertainty. Multi-generation white-box methods generate multiple responses from the same prompt, combining the sampling approach of black-box UQ with token-probability-based singals. This demo provides an illustration of how to use state-of-the-art white-box UQ methods with uqlm. The following multi-generation scorers are available:\n",
"
In this demo, we will:

1. **Set up LLM and Prompts**: set up an LLM instance and load the example data prompts (see the setup sketch below).
2. **Generate LLM Responses and Confidence Scores**: generate and score LLM responses to the example questions using the `WhiteBoxUQ()` class.
3. **Evaluate Hallucination Detection Performance**: visualize model accuracy at different thresholds of the various white-box UQ confidence scores, and compute precision, recall, and F1-score of hallucination detection.
\n", "| \n", " | question | \n", "answer | \n", "
|---|---|---|
| 0 | \n", "Natalia sold clips to 48 of her friends in Apr... | \n", "72 | \n", "
| 1 | \n", "Weng earns $12 an hour for babysitting. Yester... | \n", "10 | \n", "
| 2 | \n", "Betty is saving money for a new wallet which c... | \n", "5 | \n", "
| 3 | \n", "Julie is reading a 120-page book. Yesterday, s... | \n", "42 | \n", "
| 4 | \n", "James writes a 3-page letter to 2 different fr... | \n", "624 | \n", "
| Parameter | \n", "Type & Default | \n", "Description | \n", "
|---|---|---|
| llm | \n", "BaseChatModeldefault=None | \n",
" A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of their `llm` object. | \n", "
| scorers | \n", "List[str]default=None | \n",
" Specifies which white-box UQ scorers to include. Must be subset of [\"normalized_probability\", \"min_probability\", \"sequence_probability\", \"max_token_negentropy\", \"mean_token_negentropy\", \"probability_margin\", \"monte_carlo_negentropy\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\", \"p_true\"]. If None, defaults to [\"normalized_probability\", \"min_probability\"]. | \n", "
| system_prompt | \n", "str or Nonedefault=\"You are a helpful assistant.\" | \n",
" Optional argument for user to provide custom system prompt for the LLM. | \n", "
| max_calls_per_min | \n", "intdefault=None | \n",
" Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. | \n", "
| sampling_temperature | \n", "floatdefault=1 | \n",
" The 'temperature' parameter for LLM to use when generating sampled LLM responses. Only applies to \"monte_carlo_negentropy\", \"consistency_and_confidence\", \"semantic_negentropy\", \"semantic_density\". Must be greater than 0. | \n", "
**Parameter groups:**

- 🧠 **Model-Specific**: `llm`, `system_prompt`, `sampling_temperature`
- 📊 **Confidence Scores**: `scorers`
- ⚡ **Performance**: `max_calls_per_min`

| Method | Description & Parameters |
|---|---|
| `WhiteBoxUQ.generate_and_score` | Generate LLM responses and compute confidence scores for the provided prompts. <br>💡 **Best For:** Complete end-to-end uncertainty quantification when starting with prompts. |
| `WhiteBoxUQ.score` | Compute confidence scores on provided LLM responses and logprobs. Should only be used if responses and sampled responses have already been generated with logprobs. <br>💡 **Best For:** Computing uncertainty scores when responses and logprobs were already generated elsewhere. |
| \n", " | prompt | \n", "response | \n", "logprob | \n", "sampled_responses | \n", "sampled_logprob | \n", "consistency_and_confidence | \n", "monte_carlo_probability | \n", "p_true | \n", "
|---|---|---|---|---|---|---|---|---|
| 0 | \n", "When you solve this math problem only return t... | \n", "72 | \n", "[{'token': '72', 'bytes': [55, 50], 'logprob':... | \n", "[72, 72, 72, 72, 72] | \n", "[[{'token': '72', 'bytes': [55, 50], 'logprob'... | \n", "0.999819 | \n", "0.999955 | \n", "0.377549 | \n", "
| 1 | \n", "When you solve this math problem only return t... | \n", "$10 | \n", "[{'token': '$', 'bytes': [36], 'logprob': -0.0... | \n", "[$10, $10, $10, $10, $10] | \n", "[[{'token': '$', 'bytes': [36], 'logprob': -0.... | \n", "0.994463 | \n", "0.994415 | \n", "0.047430 | \n", "
| 2 | \n", "When you solve this math problem only return t... | \n", "$20 | \n", "[{'token': '$', 'bytes': [36], 'logprob': -0.0... | \n", "[$20, $20, $20, $20, $10] | \n", "[[{'token': '$', 'bytes': [36], 'logprob': -0.... | \n", "0.923075 | \n", "0.890358 | \n", "0.777260 | \n", "
| 3 | \n", "When you solve this math problem only return t... | \n", "48 | \n", "[{'token': '48', 'bytes': [52, 56], 'logprob':... | \n", "[48, 48, 48, 48, 48] | \n", "[[{'token': '48', 'bytes': [52, 56], 'logprob'... | \n", "0.994755 | \n", "0.996196 | \n", "0.182436 | \n", "
| 4 | \n", "When you solve this math problem only return t... | \n", "624 | \n", "[{'token': '624', 'bytes': [54, 50, 52], 'logp... | \n", "[624, 624 pages., 624, 624, 624] | \n", "[[{'token': '624', 'bytes': [54, 50, 52], 'log... | \n", "0.954816 | \n", "0.923305 | \n", "0.981987 | \n", "
| \n", " | prompt | \n", "response | \n", "logprob | \n", "sampled_responses | \n", "sampled_logprob | \n", "consistency_and_confidence | \n", "monte_carlo_probability | \n", "p_true | \n", "answer | \n", "response_correct | \n", "
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "When you solve this math problem only return t... | \n", "72 | \n", "[{'token': '72', 'bytes': [55, 50], 'logprob':... | \n", "[72, 72, 72, 72, 72] | \n", "[[{'token': '72', 'bytes': [55, 50], 'logprob'... | \n", "0.999819 | \n", "0.999955 | \n", "0.377549 | \n", "72 | \n", "True | \n", "
| 1 | \n", "When you solve this math problem only return t... | \n", "$10 | \n", "[{'token': '$', 'bytes': [36], 'logprob': -0.0... | \n", "[$10, $10, $10, $10, $10] | \n", "[[{'token': '$', 'bytes': [36], 'logprob': -0.... | \n", "0.994463 | \n", "0.994415 | \n", "0.047430 | \n", "10 | \n", "True | \n", "
| 2 | \n", "When you solve this math problem only return t... | \n", "$20 | \n", "[{'token': '$', 'bytes': [36], 'logprob': -0.0... | \n", "[$20, $20, $20, $20, $10] | \n", "[[{'token': '$', 'bytes': [36], 'logprob': -0.... | \n", "0.923075 | \n", "0.890358 | \n", "0.777260 | \n", "5 | \n", "False | \n", "
| 3 | \n", "When you solve this math problem only return t... | \n", "48 | \n", "[{'token': '48', 'bytes': [52, 56], 'logprob':... | \n", "[48, 48, 48, 48, 48] | \n", "[[{'token': '48', 'bytes': [52, 56], 'logprob'... | \n", "0.994755 | \n", "0.996196 | \n", "0.182436 | \n", "42 | \n", "False | \n", "
| 4 | \n", "When you solve this math problem only return t... | \n", "624 | \n", "[{'token': '624', 'bytes': [54, 50, 52], 'logp... | \n", "[624, 624 pages., 624, 624, 624] | \n", "[[{'token': '624', 'bytes': [54, 50, 52], 'log... | \n", "0.954816 | \n", "0.923305 | \n", "0.981987 | \n", "624 | \n", "True | \n", "