{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Confidence Score Calibration Demo\n", "\n", "
\n",
Confidence scores from uncertainty quantification methods are not necessarily well-calibrated probabilities. This demo shows how to transform raw confidence scores into calibrated probabilities that better reflect the true likelihood of correctness, using the ScoreCalibrator class.
"
Set up an LLM instance and load the example prompts.

## Generate LLM Responses and Confidence Scores

Generate and score LLM responses to the example questions using the `WhiteBoxUQ()` class.

## Fit Calibrators and Evaluate on Holdout Set

Train confidence score calibrators and evaluate them on a holdout set of prompts.

| | prompt | response | logprob | normalized_probability | response_correct |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | December 14, 1972 | [{'token': 'December', 'logprob': -0.044779419... | 0.980881 | True |
| 1 | You will be given a question. Return only the ... | Bobby Scott and Bob Russell | [{'token': 'Bobby', 'logprob': -0.065702043473... | 0.979054 | True |
| 2 | You will be given a question. Return only the ... | 1 | [{'token': '1', 'logprob': -0.0108721693977713... | 0.989187 | True |
| 3 | You will be given a question. Return only the ... | Super Bowl LII | [{'token': 'Super', 'logprob': -1.586603045463... | 0.672557 | False |
| 4 | You will be given a question. Return only the ... | South Carolina | [{'token': 'South', 'logprob': -1.502075701864... | 0.999992 | True |
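
A common way to turn per-token log-probabilities into a single response-level score is length normalization: take the geometric mean of the token probabilities, which equals the exponential of the mean log-probability. The sketch below assumes `normalized_probability` is derived from `logprob` in this way. The function name and the example values are hypothetical and are not taken from `WhiteBoxUQ`.

```python
import math


def normalized_probability(token_logprobs):
    """Geometric mean of token probabilities, i.e. exp(mean(logprobs))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical token log-probabilities for a three-token response.
example_logprobs = [-0.05, -0.01, 0.0]
score = normalized_probability(example_logprobs)
print(f"normalized probability: {score:.6f}")
```

Because the mean is taken over tokens, long responses are not penalized simply for having more tokens, which a plain product of token probabilities would do.
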

| | prompt | response | logprob | normalized_probability | calibrated_normalized_probability |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | December 14, 1972 | [{'token': 'December', 'logprob': -0.044779419... | 0.980881 | 0.628571 |
| 1 | You will be given a question. Return only the ... | Bobby Scott and Bob Russell | [{'token': 'Bobby', 'logprob': -0.065702043473... | 0.979054 | 0.628571 |
| 2 | You will be given a question. Return only the ... | 1 | [{'token': '1', 'logprob': -0.0108721693977713... | 0.989187 | 0.633929 |
| 3 | You will be given a question. Return only the ... | Super Bowl LII | [{'token': 'Super', 'logprob': -1.586603045463... | 0.672557 | 0.490196 |
| 4 | You will be given a question. Return only the ... | South Carolina | [{'token': 'South', 'logprob': -1.502075701864... | 0.999992 | 0.888889 |
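
One common way to map raw scores to calibrated probabilities is isotonic regression, which fits a non-decreasing step function from scores to observed correctness. The sketch below is a minimal pure-Python pool-adjacent-violators implementation on made-up data. It illustrates the idea only and does not reproduce `ScoreCalibrator`, whose internals are not shown in this demo; `fit_isotonic`, `predict_isotonic`, and the example values are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (uppers, means): block i covers scores up to uppers[i]
    and predicts means[i].
    """
    blocks = []  # each block is [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge while monotonicity is violated or two blocks share a score.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
            or blocks[-2][2] == blocks[-1][2]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def predict_isotonic(uppers, means, score):
    """Look up the calibrated value for a raw score (clamped at both ends)."""
    idx = bisect_left(uppers, score)
    return means[min(idx, len(means) - 1)]


# Hypothetical training data: raw confidence scores and 0/1 correctness labels.
train_scores = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99, 0.999]
train_labels = [0, 1, 0, 0, 1, 0, 1, 1]

uppers, means = fit_isotonic(train_scores, train_labels)
for raw in (0.75, 0.92, 0.995):
    print(raw, "->", round(predict_isotonic(uppers, means, raw), 3))
```

The fitted function is piecewise constant, so several distinct raw scores can map to the same calibrated value. With few training points the steps are coarse, which is one reason to evaluate a calibrator on held-out prompts rather than on the data it was fit on.
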

| | prompt | response | logprob | normalized_probability | calibrated_normalized_probability |
|---|---|---|---|---|---|
| 0 | You will be given a question. Return only the ... | Games | [{'token': 'Games', 'logprob': -0.001829901477... | 0.998172 | 0.578947 |
| 1 | You will be given a question. Return only the ... | Amir Johnson | [{'token': 'Am', 'logprob': -2.050269904430024... | 0.999993 | 0.741935 |
| 2 | You will be given a question. Return only the ... | Frank Morris | [{'token': 'Frank', 'logprob': -8.653872646391... | 0.999939 | 0.606061 |
| 3 | You will be given a question. Return only the ... | May 7, 1992 | [{'token': 'May', 'logprob': -0.01998697593808... | 0.996715 | 0.538462 |
| 4 | You will be given a question. Return only the ... | Daisuke Ohata | [{'token': 'D', 'logprob': -8.153352973749861e... | 0.999962 | 0.647059 |

| | average_confidence | average_accuracy | calibration_gap | brier_score | log_loss | ece | mce |
|---|---|---|---|---|---|---|---|
| normalized_probability | 0.904642 | 0.48 | 0.424642 | 0.421297 | 2.586512 | 0.428037 | 0.511129 |
| calibrated_normalized_probability | 0.492642 | 0.48 | 0.012642 | 0.233129 | 0.793354 | 0.030675 | 0.500000 |
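
Metrics like those in the table above can be computed from a list of confidence scores and correctness labels. The sketch below uses standard definitions and assumes 10 equal-width bins for ECE and MCE; the binning used to produce the table is not stated in this demo, so exact values may differ. `calibration_metrics` and the example inputs are hypothetical.

```python
import math


def calibration_metrics(confidences, correct, n_bins=10, eps=1e-15):
    """Compute common calibration metrics from scores and True/False labels."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = [1.0 if c else 0.0 for c in correct]

    avg_conf = sum(confidences) / n
    avg_acc = sum(labels) / n
    brier = sum((p - y) ** 2 for p, y in zip(confidences, labels)) / n

    # Clip probabilities so log(0) never occurs.
    clipped = [min(max(p, eps), 1.0 - eps) for p in confidences]
    log_loss = -sum(
        y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        for p, y in zip(clipped, labels)
    ) / n

    # Equal-width bins over [0, 1]; a score of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        bin_conf = sum(p for p, _ in members) / len(members)
        bin_acc = sum(y for _, y in members) / len(members)
        gap = abs(bin_acc - bin_conf)
        ece += len(members) / n * gap  # weighted by the bin's share of samples
        mce = max(mce, gap)            # worst gap over non-empty bins

    return {
        "average_confidence": avg_conf,
        "average_accuracy": avg_acc,
        "calibration_gap": abs(avg_conf - avg_acc),
        "brier_score": brier,
        "log_loss": log_loss,
        "ece": ece,
        "mce": mce,
    }


# Hypothetical scores and correctness labels.
metrics = calibration_metrics([0.95, 0.85, 0.75, 0.65], [True, True, False, True])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that `calibration_gap` compares only the overall averages, so it can be near zero even when individual bins are badly off. ECE and MCE look at per-bin gaps, which is why the table reports a small ECE but a large MCE for the calibrated scores.
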