{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ๐ฏ Tunable Ensemble for LLM Uncertainty (Advanced)\n", "\n", "
\n",
"Ensemble UQ methods combine multiple individual scorers to provide a more robust uncertainty estimate. They offer high flexibility and customizability, allowing you to tailor the ensemble to specific use cases. This ensemble can leverage any combination of black-box, white-box, or LLM-as-a-Judge scorers offered by uqlm
. Below is a list of the available scorers:\n",
"\n",
"#### Black-Box (Consistency) Scorers\n",
"* Non-Contradiction Probability ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Lin et al., 2025](https://arxiv.org/abs/2305.19187); [Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n",
"* Semantic Negentropy (based on [Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/pdf/2302.09664))\n",
"* Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))\n",
"* BERT-score ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zheng et al., 2020](https://arxiv.org/abs/1904.09675))\n",
"* BLUERT ([Sellam et al., 2020](https://arxiv.org/abs/2004.04696))\n",
"* Normalized Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/pdf/2412.05563); [HuggingFace](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))\n",
"\n",
"#### White-Box (Token-Probability-Based) Scorers\n",
"* Minimum token probability ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896))\n",
"* Length-Normalized Joint Token Probability ([Malinin & Gales, 2021](https://arxiv.org/pdf/2002.07650))\n",
"\n",
"#### LLM-as-a-Judge Scorers\n",
"* Categorical LLM-as-a-Judge ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Luo et al., 2023](https://arxiv.org/pdf/2303.15621))\n",
"* Continuous LLM-as-a-Judge ([Xiong et al., 2024](https://arxiv.org/pdf/2306.13063))\n",
"
Set up LLM instance and load example data prompts.
\n", "Tune the ensemble weights on a set of tuning prompts. You will execute a single UQEnsemble.tune()
method that will generate responses, compute confidence scores, and optimize weights using a provided answer key corresponding to the provided questions.
Generate LLM Responses and Confidence Scores with Tuned Ensemble.
\n", "Generate and score LLM responses to the example questions using the tuned UQEnsemble()
object.
Evaluate Hallucination Detection Performance.
\n", "Visualize LLM accuracy at different thresholds of the ensemble score that combines various scorers. Compute precision, recall, and F1-score of hallucination detection.
\n", "\n", " | question | \n", "answer | \n", "
---|---|---|
0 | \n", "Natalia sold clips to 48 of her friends in Apr... | \n", "72 | \n", "
1 | \n", "Weng earns $12 an hour for babysitting. Yester... | \n", "10 | \n", "
2 | \n", "Betty is saving money for a new wallet which c... | \n", "5 | \n", "
3 | \n", "Julie is reading a 120-page book. Yesterday, s... | \n", "42 | \n", "
4 | \n", "James writes a 3-page letter to 2 different fr... | \n", "624 | \n", "
Parameter | \n", "Type & Default | \n", "Description | \n", "
---|---|---|
llm | \n", "BaseChatModeldefault=None | \n",
" A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object. | \n", "
scorers | \n", "Listdefault=None | \n",
" Specifies which black-box, white-box, or LLM-as-a-Judge scorers to include in the ensemble. List containing instances of BaseChatModel, LLMJudge, black-box scorer names from ['semantic_negentropy', 'noncontradiction','exact_match', 'bert_score', 'bleurt', 'cosine_sim'], or white-box scorer names from [\"normalized_probability\", \"min_probability\"]. If None, defaults to the off-the-shelf BS Detector ensemble by Chen & Mueller, 2023 which uses components [\"noncontradiction\", \"exact_match\",\"self_reflection\"] with respective weights of [0.56, 0.14, 0.3]. | \n", "
device | \n", "str or torch.devicedefault=\"cpu\" | \n",
" Specifies the device that NLI model use for prediction. Only applies to 'semantic_negentropy', 'noncontradiction' scorers. Pass a torch.device to leverage GPU. | \n", "
use_best | \n", "booldefault=True | \n",
" Specifies whether to swap the original response for the uncertainty-minimized response among all sampled responses based on semantic entropy clusters. Only used if `scorers` includes 'semantic_negentropy' or 'noncontradiction'. | \n", "
system_prompt | \n", "str or Nonedefault=\"You are a helpful assistant.\" | \n",
" Optional argument for user to provide custom system prompt for the LLM. | \n", "
max_calls_per_min | \n", "intdefault=None | \n",
" Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. | \n", "
use_n_param | \n", "booldefault=False | \n",
" Specifies whether to use n parameter for BaseChatModel . Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses is large. | \n",
"
postprocessor | \n", "callabledefault=None | \n",
" A user-defined function that takes a string input and returns a string. Used for postprocessing outputs. | \n", "
sampling_temperature | \n", "floatdefault=1 | \n",
" The 'temperature' parameter for LLM model to generate sampled LLM responses. Must be greater than 0. | \n", "
weights | \n", "list of floatsdefault=None | \n",
" Specifies weight for each component in ensemble. If None, and scorers is not None, and defaults to equal weights for each scorer. These weights get updated with tune method is executed. | \n",
"
nli_model_name | \n", "strdefault=\"microsoft/deberta-large-mnli\" | \n",
" Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained() . | \n",
"
**Parameters grouped by purpose:**

* 🧠 **LLM-Specific**: `llm`, `system_prompt`, `sampling_temperature`
* 📊 **Confidence Scores**: `scorers`, `weights`, `use_best`, `nli_model_name`, `postprocessor`
* 🖥️ **Hardware**: `device`
* ⚡ **Performance**: `max_calls_per_min`, `use_n_param`
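Putting these parameters together, the ensemble used in this demo can be instantiated as sketched below. The component mix mirrors the score columns in the result tables that follow (`exact_match`, `noncontradiction`, `normalized_probability`, `judge_1`, `judge_2`); the specific judge models are illustrative assumptions.

```python
# A minimal instantiation sketch; judge model choices are illustrative.
from uqlm import UQEnsemble
from langchain_openai import ChatOpenAI

# Two LLM-as-a-Judge components (any BaseChatModel or LLMJudge instances work).
judge_1 = ChatOpenAI(model="gpt-4o-mini", temperature=0)
judge_2 = ChatOpenAI(model="gpt-4o", temperature=0)

uqe = UQEnsemble(
    llm=llm,  # generation LLM from the setup step
    scorers=[
        "exact_match",             # black-box consistency scorer
        "noncontradiction",        # black-box NLI-based scorer
        "normalized_probability",  # white-box token-probability scorer
        judge_1,                   # LLM-as-a-Judge scorers
        judge_2,
    ],
    max_calls_per_min=250,  # optional rate limiting to avoid API errors
)
```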
Method | \n", "Description & Parameters | \n", "
---|---|
UQEnsemble.tune | \n", "\n",
" Generate responses from provided prompts, grade responses with provided grader function, and tune ensemble weights. If weights and threshold objectives match, joint optimization will happen. Otherwise, sequential optimization will happen. If an optimization problem has fewer than three choice variables, grid search will happen. \n", "Parameters: \n", "
Returns: \n",
" ๐ก Best For: Tuning an optimized ensemble for detecting hallucinations in a specific use case.\n",
" \n",
" | \n",
"
\n", " | prompt | \n", "response | \n", "sampled_responses | \n", "ensemble_score | \n", "exact_match | \n", "noncontradiction | \n", "normalized_probability | \n", "judge_1 | \n", "judge_2 | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "When you solve this math problem only return t... | \n", "72 | \n", "[72, 72, 72, 72, 72] | \n", "0.952566 | \n", "1.0 | \n", "1.000000 | \n", "0.999188 | \n", "1.0 | \n", "0.5 | \n", "
1 | \n", "When you solve this math problem only return t... | \n", "$10 | \n", "[$10, $10, $10, $10, $10] | \n", "0.895037 | \n", "1.0 | \n", "1.000000 | \n", "0.999019 | \n", "0.0 | \n", "0.5 | \n", "
2 | \n", "When you solve this math problem only return t... | \n", "$20 | \n", "[$20, $20, $20, $20, $10] | \n", "0.762877 | \n", "0.8 | \n", "0.801301 | \n", "0.946584 | \n", "1.0 | \n", "0.0 | \n", "
3 | \n", "When you solve this math problem only return t... | \n", "48 | \n", "[48, 48, 48, 48, 48] | \n", "0.941798 | \n", "1.0 | \n", "1.000000 | \n", "0.996091 | \n", "0.0 | \n", "1.0 | \n", "
4 | \n", "When you solve this math problem only return t... | \n", "624 | \n", "[624, 624, 624, 624, 624] | \n", "0.999969 | \n", "1.0 | \n", "1.000000 | \n", "0.999828 | \n", "1.0 | \n", "1.0 | \n", "
Method | \n", "Description & Parameters | \n", "
---|---|
UQEnsemble.generate_and_score | \n", "\n",
" Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores for the provided prompts. \n", "Parameters: \n", "
Returns: \n",
" ๐ก Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n",
" \n",
" | \n",
"
UQEnsemble.score | \n", "\n",
" Compute confidence scores on provided LLM responses. Should only be used if responses and sampled responses are already generated. \n", "Parameters: \n", "
Returns: \n",
" ๐ก Best For: Computing uncertainty scores when responses are already generated elsewhere.\n",
" \n",
" | \n",
"
\n", " | prompt | \n", "response | \n", "sampled_responses | \n", "ensemble_score | \n", "exact_match | \n", "noncontradiction | \n", "normalized_probability | \n", "judge_1 | \n", "judge_2 | \n", "response_correct | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "When you solve this math problem only return t... | \n", "160 | \n", "[68, 176, 152, 80, 72] | \n", "0.030060 | \n", "0.0 | \n", "0.021150 | \n", "0.106042 | \n", "0.0 | \n", "0.0 | \n", "True | \n", "
1 | \n", "When you solve this math problem only return t... | \n", "12 | \n", "[12, 14, 18, 11, 13] | \n", "0.158690 | \n", "0.2 | \n", "0.240650 | \n", "0.021982 | \n", "0.0 | \n", "0.0 | \n", "False | \n", "
2 | \n", "When you solve this math problem only return t... | \n", "$36 | \n", "[$36, $36, $36, $36, 36] | \n", "0.870801 | \n", "0.8 | \n", "0.994231 | \n", "0.989287 | \n", "1.0 | \n", "0.0 | \n", "True | \n", "
3 | \n", "When you solve this math problem only return t... | \n", "9 | \n", "[$3, $9, 3, $10, 9] | \n", "0.359167 | \n", "0.2 | \n", "0.452459 | \n", "0.205047 | \n", "1.0 | \n", "0.0 | \n", "False | \n", "
4 | \n", "When you solve this math problem only return t... | \n", "75% | \n", "[75%, 75., 75%, 75%, 75%] | \n", "0.873314 | \n", "0.8 | \n", "0.998718 | \n", "0.990297 | \n", "1.0 | \n", "0.0 | \n", "True | \n", "