🎯 Confidence Score Calibration Demo#
Confidence scores from uncertainty quantification methods may not be well-calibrated probabilities. This demo shows how to use the ScoreCalibrator class to transform raw confidence scores into calibrated probabilities that better reflect the true likelihood of correctness.
📊 What You’ll Do in This Demo#
1. Set up LLM and Prompts: Set up the LLM instance and load example data prompts.
2. Generate LLM Responses and Confidence Scores: Generate and score LLM responses to the example questions using the WhiteBoxUQ() class.
3. Fit Calibrators and Evaluate on Holdout Set: Train confidence score calibrators and evaluate them on a holdout set of prompts.
⚖️ Calibration Methods#
Platt Scaling
- Method: Logistic regression
- Parametric: Assumes a sigmoid-shaped calibration function
- Best for: Small datasets, well-behaved score distributions

Isotonic Regression
- Method: Non-parametric, monotonic
- Flexible: Can handle any monotonic calibration curve
- Best for: Larger datasets, complex score distributions

The sketch below illustrates both approaches on synthetic data.
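To build intuition, the snippet below sketches both approaches with scikit-learn on synthetic raw scores and correctness labels; it illustrates the underlying techniques and is not UQLM's implementation.

```python
# Illustrative sketch (not UQLM's implementation) of Platt scaling and
# isotonic regression on synthetic confidence scores and 0/1 correctness labels.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
raw_scores = rng.uniform(0.5, 1.0, size=200)          # synthetic raw confidence scores
correct = rng.uniform(size=200) < (raw_scores - 0.3)  # synthetic correctness labels

# Platt scaling: parametric sigmoid map learned via logistic regression
platt = LogisticRegression().fit(raw_scores.reshape(-1, 1), correct)
platt_calibrated = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: non-parametric, monotonic map
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, correct)
iso_calibrated = iso.predict(raw_scores)
```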
[1]:
from uqlm import WhiteBoxUQ
from uqlm.calibration import ScoreCalibrator, evaluate_calibration
from uqlm.utils import load_example_dataset, LLMGrader
1. Set up LLM and Prompts#
For this demo, we’ll sample 1500 prompts from the NQ-Open benchmark. The first 1000 prompts will be used to train the calibrators, and the remaining 500 prompts will be used as a test dataset.
[2]:
n_train, n_test = 1000, 500
n_prompts = n_train + n_test
# Load example dataset for prompts/answers (optional, for context)
nq_open = load_example_dataset("nq_open", n=n_prompts)
# Define prompts
QA_INSTRUCTION = "You will be given a question. Return only the answer as concisely as possible without providing an explanation.\n"
prompts = [QA_INSTRUCTION + prompt for prompt in nq_open.question]
train_prompts = prompts[:n_train]
test_prompts = prompts[-n_test:]
Loading dataset - nq_open...
Processing dataset...
Dataset ready!
In this example, we use ChatVertexAI to instantiate our LLM, but any LangChain Chat Model may be used. Be sure to replace with your LLM of choice.
[3]:
# import sys
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="gemini-2.5-flash")
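Any other LangChain chat model could be swapped in here. As a purely hypothetical alternative (assuming the langchain-openai package is installed and an OpenAI API key is configured):

```python
# Hypothetical alternative provider (assumes langchain-openai is installed
# and OPENAI_API_KEY is set in your environment):
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")
```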
2. Compute Confidence Scores#
We generate model responses and associated confidence scores by leveraging the WhiteBoxUQ class. This class generates responses to prompts, while also estimating a confidence score for each response using token probabilities.
[4]:
wbuq = WhiteBoxUQ(llm=llm, scorers=["normalized_probability"])
uq_result = await wbuq.generate_and_score(prompts=train_prompts)
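For intuition about what the normalized_probability score represents: a common length-normalized confidence measure is the geometric mean of the token probabilities, i.e. the exponential of the mean token log-probability. The sketch below illustrates that idea and may differ in detail from UQLM's exact scorer.

```python
# Illustration only: a length-normalized sequence probability computed as the
# geometric mean of token probabilities (exp of the mean token log-probability).
# This may not match UQLM's normalized_probability scorer exactly.
import math

def length_normalized_probability(token_logprobs: list[float]) -> float:
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for a short response
print(length_normalized_probability([-0.04, -0.02, -0.10]))  # ~0.948
```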
To obtain the labels for calibration, we ‘grade’ the responses against an answer key. Here, we use UQLM’s out-of-the-box LLMGrader, which can be used with any LangChain Chat Model, but you may replace this with a grading method of your choice. Some notable alternatives are Vectara HHEM and AlignScore. If you are using your own prompts/questions, be sure to update the grading method accordingly.
[6]:
# set up the LLM grader to grade LLM responses against the ground truth answer key (we need these grades for calibration)
gemini_flash_lite = ChatVertexAI(model="gemini-2.5-flash-lite")
grader = LLMGrader(llm=gemini_flash_lite)
# Convert to dataframe and grade responses against correct answers
result_df = uq_result.to_df()
result_df["response_correct"] = await grader.grade_responses(prompts=nq_open["question"].to_list()[:n_train], responses=result_df["response"].to_list(), answers=nq_open["answer"].to_list()[:n_train])
result_df.head()
[6]:
| | prompt | response | logprob | normalized_probability | response_correct |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | December 14, 1972 | [{'token': 'December', 'logprob': -0.044779419... | 0.980881 | True |
| 1 | You will be given a question. Return only the ... | Bobby Scott and Bob Russell | [{'token': 'Bobby', 'logprob': -0.065702043473... | 0.979054 | True |
| 2 | You will be given a question. Return only the ... | 1 | [{'token': '1', 'logprob': -0.0108721693977713... | 0.989187 | True |
| 3 | You will be given a question. Return only the ... | Super Bowl LII | [{'token': 'Super', 'logprob': -1.586603045463... | 0.672557 | False |
| 4 | You will be given a question. Return only the ... | South Carolina | [{'token': 'South', 'logprob': -1.502075701864... | 0.999992 | True |
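If you would rather not rely on an LLM for grading, a simple string-matching grader can serve as a rough substitute. The helper below is a hypothetical sketch (not part of UQLM) that assumes each answer-key entry is a string or a list of acceptable strings.

```python
# Hypothetical non-LLM grading alternative (not part of UQLM): mark a response
# correct if any acceptable answer appears in it, case-insensitively.
def substring_match_grades(responses: list[str], answers: list) -> list[bool]:
    grades = []
    for response, answer in zip(responses, answers):
        candidates = answer if isinstance(answer, (list, tuple)) else [answer]
        grades.append(any(str(a).lower() in response.lower() for a in candidates))
    return grades

# e.g. substring_match_grades(result_df["response"].to_list(), nq_open["answer"].to_list()[:n_train])
```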
3. Score Calibration#
Confidence scores from uncertainty quantification methods may not be well-calibrated probabilities. You can transform raw confidence scores into calibrated probabilities that better reflect the true likelihood of correctness using the ScoreCalibrator class.
The first step is to train the calibrators, which can be done using the fit or fit_transform method of the ScoreCalibrator class. Instantiate the class by choosing a method for training the calibrators, then call fit_transform, providing the UQResult object from the training dataset along with the correctness indicators.
[7]:
sc = ScoreCalibrator(method="isotonic")
sc.fit_transform(uq_result=uq_result, correct_indicators=result_df.response_correct)
results_df = uq_result.to_df()
results_df.head()
[7]:
| | prompt | response | logprob | normalized_probability | calibrated_normalized_probability |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | December 14, 1972 | [{'token': 'December', 'logprob': -0.044779419... | 0.980881 | 0.628571 |
| 1 | You will be given a question. Return only the ... | Bobby Scott and Bob Russell | [{'token': 'Bobby', 'logprob': -0.065702043473... | 0.979054 | 0.628571 |
| 2 | You will be given a question. Return only the ... | 1 | [{'token': '1', 'logprob': -0.0108721693977713... | 0.989187 | 0.633929 |
| 3 | You will be given a question. Return only the ... | Super Bowl LII | [{'token': 'Super', 'logprob': -1.586603045463... | 0.672557 | 0.490196 |
| 4 | You will be given a question. Return only the ... | South Carolina | [{'token': 'South', 'logprob': -1.502075701864... | 0.999992 | 0.888889 |
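As noted above, fitting and transforming can also be done in two separate steps. The commented snippet below is a hedged sketch that assumes the fit method accepts the same arguments as fit_transform.

```python
# Hedged sketch: equivalent two-step usage, assuming fit shares fit_transform's
# signature (uq_result and correct_indicators).
# sc = ScoreCalibrator(method="isotonic")
# sc.fit(uq_result=uq_result, correct_indicators=result_df.response_correct)
# sc.transform(uq_result)  # adds calibrated_* scores to the UQResult in place
```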
You can evaluate the performance of the calibrated scores using the evaluate_calibration function, which requires the correctness indicators.
[28]:
# Uncomment the following lines to evaluate the calibrated scores on the training data
# metrics = evaluate_calibration(uq_result, result_df.response_correct)
# metrics
Let’s generate responses and compute scores on the test dataset using the wbuq object. This returns a UQResult object containing the test prompts, responses, and confidence scores.
[29]:
test_result = await wbuq.generate_and_score(prompts=test_prompts)
We have now trained a ScoreCalibrator object containing fitted calibrators for each scorer (only normalized_probability in our example, but multiple scorers are supported). We can now call the transform method on a test dataset (the UQResult object from the test prompts), which will update that UQResult object to include calibrated scores.
Note: the transform method updates the UQResult object in place, so that for every ‘score’ it also contains a corresponding ‘calibrated_score’.
[30]:
# Calibrate scores
sc.transform(test_result)
test_result_df = test_result.to_df()
test_result_df.head()
[30]:
| | prompt | response | logprob | normalized_probability | calibrated_normalized_probability |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | Games | [{'token': 'Games', 'logprob': -0.001829901477... | 0.998172 | 0.578947 |
| 1 | You will be given a question. Return only the ... | Amir Johnson | [{'token': 'Am', 'logprob': -2.050269904430024... | 0.999993 | 0.741935 |
| 2 | You will be given a question. Return only the ... | Frank Morris | [{'token': 'Frank', 'logprob': -8.653872646391... | 0.999939 | 0.606061 |
| 3 | You will be given a question. Return only the ... | May 7, 1992 | [{'token': 'May', 'logprob': -0.01998697593808... | 0.996715 | 0.538462 |
| 4 | You will be given a question. Return only the ... | Daisuke Ohata | [{'token': 'D', 'logprob': -8.153352973749861e... | 0.999962 | 0.647059 |
Let’s evaluate the calibrated scores on the test dataset (since we also have the correct answers for the test prompts).
[31]:
# Grade responses against correct answers for test set
test_result_df["response_correct"] = await grader.grade_responses(prompts=nq_open["question"].to_list()[-n_test:], responses=test_result_df["response"].to_list(), answers=nq_open["answer"].to_list()[-n_test:])
test_metrics = evaluate_calibration(test_result, test_result_df.response_correct)
test_metrics
[31]:
| | average_confidence | average_accuracy | calibration_gap | brier_score | log_loss | ece | mce |
|---|---|---|---|---|---|---|---|
| normalized_probability | 0.904642 | 0.48 | 0.424642 | 0.421297 | 2.586512 | 0.428037 | 0.511129 |
| calibrated_normalized_probability | 0.492642 | 0.48 | 0.012642 | 0.233129 | 0.793354 | 0.030675 | 0.500000 |
Note the substantial improvement in calibration quality after transforming the confidence scores with the fitted ScoreCalibrator object.
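To visualize this improvement, you can plot a reliability diagram; the snippet below is a generic matplotlib sketch (not a built-in UQLM plot) applied to the calibrated test-set scores.

```python
# Generic reliability-diagram sketch (not a built-in UQLM plot): bin the
# calibrated confidences and compare per-bin accuracy against the diagonal.
import numpy as np
import matplotlib.pyplot as plt

conf = test_result_df["calibrated_normalized_probability"].to_numpy(dtype=float)
acc = test_result_df["response_correct"].to_numpy(dtype=float)

edges = np.linspace(0.0, 1.0, 11)
centers = 0.5 * (edges[:-1] + edges[1:])
bin_acc = [
    acc[(conf > lo) & (conf <= hi)].mean() if ((conf > lo) & (conf <= hi)).any() else np.nan
    for lo, hi in zip(edges[:-1], edges[1:])
]

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(centers, bin_acc, "o-", label="Calibrated scores")
plt.xlabel("Predicted confidence")
plt.ylabel("Observed accuracy")
plt.legend()
plt.show()
```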
4. Summary#
This calibration analysis demonstrates:
🎯 Key Findings#
Calibration Quality: Use reliability diagrams and metrics like ECE and MCE to assess how well confidence scores reflect true probabilities (a generic ECE sketch is included below)
Method Selection:
Platt Scaling works well for smaller datasets and when the calibration curve is roughly sigmoid-shaped
Isotonic Regression is more flexible and can handle complex, non-parametric calibration curves
Practical Impact: Calibration can significantly improve:
Reliability of confidence scores for decision-making
User trust in model predictions
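For reference, here is a generic sketch of Expected Calibration Error (ECE): scores are grouped into bins, and the gap between average confidence and observed accuracy in each bin is averaged with weights proportional to bin size. UQLM's exact binning may differ.

```python
# Generic ECE sketch (binning details may differ from UQLM's implementation).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# e.g. expected_calibration_error(test_result_df["calibrated_normalized_probability"],
#                                 test_result_df["response_correct"])
```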
© 2025 CVS Health and/or one of its affiliates. All rights reserved.