Release Notes#

v0.5.10#

Released on 2026-04-15 - GitHub - PyPI

Full Changelog: v0.5.9...v0.5.10

v0.5.9#

Released on 2026-04-10 - GitHub - PyPI

What's Changed

Full Changelog: v0.5.8...v0.5.9

v0.5.8#

Released on 2026-03-30 - GitHub - PyPI

Highlights

  • fix noncontradiction calculation for longform scorers
  • fix semantic sets confidence bounds
  • dependabot security alerts

What's Changed

Full Changelog: v0.5.7...v0.5.8

v0.5.7#

Released on 2026-03-13 - GitHub - PyPI

What's Changed

Full Changelog: v0.5.6...v0.5.7

v0.5.6#

Released on 2026-03-02 - GitHub - PyPI

Highlights

  • package upgrades from dependabot
  • update badge colors and links
  • update citation information

What's Changed

Full Changelog: v0.5.5...v0.5.6

v0.5.5#

Released on 2026-02-25 - GitHub - PyPI

Highlights

  • replace agenerate with ainvoke where generation is used, since ainvoke appears to be the better-maintained and better-documented LangChain method
  • replace poetry with uv for dependency management
  • update badges

What's Changed

Full Changelog: v0.5.4...v0.5.5

v0.5.4#

Released on 2026-01-30 - GitHub - PyPI

Highlights

1. Add new white-box scorers to UQEnsemble accepted scorers list:

Top-logprobs scorers (3):

  • min_token_negentropy - Minimum negentropy across tokens
  • mean_token_negentropy - Average negentropy across tokens
  • probability_margin - Mean difference between top-2 token probabilities

Sampled-logprobs scorers (4):

  • semantic_negentropy - Entropy based on semantic clustering
  • semantic_density - Density-based confidence measure
  • monte_carlo_probability - Average sequence probability across samples
  • consistency_and_confidence - Cosine similarity × response probability

P(True) scorer (1):

  • p_true - LLM's estimate of P(response is true)
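To make two of the top-logprobs scorer descriptions above concrete, here is a minimal, self-contained sketch of the underlying computations. This is not the library's implementation; in particular, normalizing negentropy by log K is an assumption made for illustration.

```python
import math

def probability_margin(top_logprobs):
    """Mean gap between the two most likely tokens at each position.

    top_logprobs: per-token lists of top-K logprobs (K >= 2).
    Larger gaps suggest higher confidence.
    """
    margins = []
    for token_lps in top_logprobs:
        p = sorted((math.exp(lp) for lp in token_lps), reverse=True)
        margins.append(p[0] - p[1])
    return sum(margins) / len(margins)

def mean_token_negentropy(top_logprobs):
    """Average normalized negentropy (1 - H / log K) across tokens,
    where H is the entropy of the top-K probabilities renormalized
    to sum to 1. The log(K) normalization is an assumption here."""
    scores = []
    for token_lps in top_logprobs:
        p = [math.exp(lp) for lp in token_lps]
        total = sum(p)
        p = [x / total for x in p]
        h = -sum(x * math.log(x) for x in p if x > 0)
        scores.append(1 - h / math.log(len(p)))
    return sum(scores) / len(scores)

# A peaked token distribution yields a high margin and negentropy;
# a flat one yields low values for both:
confident = [[math.log(0.98), math.log(0.01)]]
uncertain = [[math.log(0.5), math.log(0.5)]]
```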

2. Fix embeddings model specification for cosine_sim and consistency_and_confidence, and enable it with WhiteBoxUQ

Corrects a string error in embedding model specification with the sentence_transformer parameter of BlackBoxUQ. Previously, the string was forced to begin with "sentence-transformers/"; now the full string is specified with the parameter.

Previous: sentence_transformer=all-MiniLM-L12-v2 was specified and then "sentence-transformers/" was prepended to the string when storing the class attribute.

Now: sentence_transformer=sentence-transformers/all-MiniLM-L12-v2 is specified. This allows other embedding models whose names don't start with "sentence-transformers/", such as jinaai/jina-embeddings-v2-base-code, to be specified.

Also adds the missing sentence_transformer parameter to WhiteBoxUQ.
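For intuition, cosine_sim compares the embedding of the original response to embeddings of sampled responses. A minimal sketch of that comparison using hypothetical embedding vectors (independent of any embeddings library, and not the library's implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_confidence(original_emb, sampled_embs):
    """Average similarity of the original response's embedding to
    each sampled response's embedding; higher means more consistent
    sampled responses, hence higher confidence."""
    sims = [cosine_similarity(original_emb, e) for e in sampled_embs]
    return sum(sims) / len(sims)
```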

What's Changed

Full Changelog: v0.5.3...v0.5.4

v0.5.3#

Released on 2026-01-20 - GitHub - PyPI

Highlights

  • added new demo notebook to illustrate the langgraph-uqlm integration
  • upgrade package versions per dependabot
  • fix some LaTeX in docs site
  • fix links in readme

What's Changed

New Contributors

Full Changelog: v0.5.2...v0.5.3

v0.5.2#

Released on 2026-01-14 - GitHub - PyPI

Highlights

  • Create uqlm.nli.EntailmentClassifier class for LLM-based entailment classification. This is well-suited for long-text scoring when responses exceed the length that can be handled by the Hugging Face NLI model
  • Update LongTextGraph, LongTextUQ, UnitResponseScorer, GraphScorer, and associated notebooks to allow for LLM-based entailment classification.
  • Update unit tests
  • Misc. docs site cleanup

What's Changed

Full Changelog: v0.5.1...v0.5.2

v0.5.1#

Released on 2026-01-09 - GitHub - PyPI

Highlights

  • fixes rendering of long-form scorer content on the docs site
  • adds missing uqlm/longform subpackage to pyproject.toml so it appears in API reference on docs site
  • misc. docs site cleanup

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0#

Released on 2026-01-08 - GitHub - PyPI

New Methods: Long-Form UQ

Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.

Response Decomposition

We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
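As a rough illustration of the rule-based sentence path, splitting on sentence-ending punctuation is the core idea. This sketch is a simplification, not the library's decomposer, which handles real-world text more robustly:

```python
import re

def decompose_sentences(response: str) -> list[str]:
    """Rule-based sentence decomposition: split on sentence-ending
    punctuation followed by whitespace. Abbreviations, quotes, and
    other edge cases would need extra handling in practice."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]
```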

Scoring methods

We add three families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, and Unit-QA

1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)

These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
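Conceptually, given binary entailment judgments from an NLI model or LLM classifier, the unit-level confidence is an average across sampled responses. A toy sketch of that aggregation (not the library's implementation):

```python
def unit_response_scores(entailment):
    """entailment[i][j] = 1 if sampled response j entails unit i,
    else 0 (an NLI model would supply these judgments). The
    confidence of unit i is its mean entailment across samples."""
    return [sum(row) / len(row) for row in entailment]
```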

2. Matched-Unit (Based on the LUQ-pair method)

These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.

3. Unit-QA (Based on the Longform Semantic Entropy method)

These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are those units given the context, sampling multiple answers, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.

4. Graph-Based (Based on Jiang et al., 2024)

Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.

Response Refinement with Uncertainty-Aware Decoding

Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer parameter) below a specified threshold (specified with response_refinement_threshold parameter) and reconstructing the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
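At a high level, the refinement step can be sketched as follows. This is a simplification of the library's behavior, and the claims and scores used in the test are hypothetical:

```python
def refine_response(claims, scores, threshold=0.5):
    """Drop claims whose confidence score falls below `threshold`
    and reconstruct the response from the retained claims. The real
    implementation may reconstruct the text more carefully."""
    kept = [c for c, s in zip(claims, scores) if s >= threshold]
    return " ".join(kept)
```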


Performance Evaluation

We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a response generated for a FactScore question against the corresponding text of the subject's Wikipedia article.

New docs site pages

We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.

Other changes

  • uqlm.scorers has been refactored into two subpackages: uqlm.scorers.shortform (which contains the existing scorer classes as of v0.4) and uqlm.scorers.longform (which contains classes implementing the scoring methods described above)
  • the readme has been updated to reflect the new long-form scorers, and a new readme has been added inside the examples/ directory to provide more details on the available tutorials
  • various package upgrades to address security vulnerabilities identified by dependabot

Breaking changes

  • normalized_probability has been deprecated from acceptable white-box scorer list in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (default). This also affects the key/column names in the returned UQResult object.
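For reference, sequence_probability with length_normalize=True computes the geometric mean of token probabilities, which is what normalized_probability returned. A minimal sketch of the math (not the library's code):

```python
import math

def sequence_probability(logprobs, length_normalize=True):
    """Joint probability of a generated sequence from per-token
    logprobs. With length_normalize=True this is the geometric mean
    of token probabilities (what normalized_probability returned);
    with False it is the raw product of token probabilities."""
    total = sum(logprobs)
    if length_normalize:
        total /= len(logprobs)
    return math.exp(total)
```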

What's Changed

Full Changelog: v0.4.5...v0.5.0

v0.4.5#

Released on 2025-12-08 - GitHub - PyPI

Highlights

  • fix bug in model name string checking when retrieving logprobs, per issue #284

What's Changed

Full Changelog: v0.4.4...v0.4.5

v0.4.4#

Released on 2025-12-04 - GitHub - PyPI

Highlights

  • Add max_length parameter to WhiteBoxUQ to avoid CUDA OutOfMemoryError
  • Update demo and docstring accordingly

What's Changed

Full Changelog: v0.4.3...v0.4.4

v0.4.3#

Released on 2025-12-03 - GitHub - PyPI

Highlights

  • Automated torch.device selection for all scorers that use NLI or cosine similarity
  • Updates to docstring and demo notebooks to reflect the above change

What's Changed

Full Changelog: v0.4.2...v0.4.3

v0.4.2#

Released on 2025-11-21 - GitHub - PyPI

Highlights

  • Fix broken links to notebooks on docs site
  • Update PyPI readme

What's Changed

Full Changelog: v0.4.1...v0.4.2

v0.4.1#

Released on 2025-11-14 - GitHub - PyPI

Highlights

  • Relax langchain>=1.0.0 restriction to allow for >=0.3.7
  • Fix notebook links on docs site

What's Changed

Full Changelog: v0.4.0...v0.4.1

v0.4.0#

Released on 2025-11-12 - GitHub - PyPI

Highlights

1. Varied tutorials for more model and dataset coverage

We have updated the example notebooks to have broader coverage over LLMs and example datasets.

LLMs

  • Gemini models
  • GPT-4* models
  • o3-mini
  • Qwen
  • Mistral
  • Llama
  • Deepseek

Datasets

  • GSM8K
  • SVAMP
  • PopQA
  • NQ-Open
  • AI2-ARC
  • CSQA
  • SimpleQA
  • HotpotQA
  • Image (multimodal demo)

2. New scorers added

This release includes the addition of 11 new scorers spanning various categories (with accompanying unit tests). Details are provided below.

White-Box scorers

We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.

Single-generation white-box scorers
Sampling-based white-box scorers
Reflexive white-box scorers

Black-Box scorers

We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.

Definitions of new scorers are provided with LaTeX at the end of applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers. The readme has also been updated to reflect the new scorers.

3. New LLMGrader class and updated default grader for UQEnsemble

This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:

  • in the example notebooks for evaluating hallucination detection performance.
  • as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.

4. Option to provide additional context to LLM judges

Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.

5. New datasets available with load_example_dataset

The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.

6. uqlm.nli sub-package

Created uqlm.nli sub-package that contains the following:

  • NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations are respectively moved to uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes.
  • SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)

7. uqlm.white_box sub-package

Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:

  • SingleLogprobsScorer for computing scores that depend only on logprobs from one generated response: normalized probability, sequence probability, minimum probability
  • TopLogprobsScorer for computing scores that depend on top-K logprobs from a generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
  • SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: monte carlo probability, CoCoA, semantic entropy, and semantic density
  • PTrueScorer for implementing the P(True) method

8. Minor changes & future deprecations

  • Renamed NLIScorer -> ConsistencyScorer and moved some methods to uqlm.nli.NLI class
  • normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize=True. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
  • system_prompt and template_ques_ans are deprecated in favor of additional_context parameter
  • default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.4.0

v0.3.1#

Released on 2025-10-23 - GitHub - PyPI

Highlights

  • remove unused uqlm/utils/calibation.py
  • fix docs site to include uqlm/calibration
  • fix bug related to rich progress bars
  • add baseline reference to plot_ranked_auc function

What's Changed

Full Changelog: v0.3.0...v0.3.1

v0.3.0#

Released on 2025-10-01 - GitHub - PyPI

1. Dataset-specific confidence score calibration

  • Introduced the new ScoreCalibrator class for calibrating confidence scores on specific datasets (Platt or Isotonic)
  • Includes evaluate_calibration function for evaluating score calibration with plots and various metrics, including ECE, MCE, Brier Score, Calibration Gap, and log-loss
  • For a detailed walkthrough of this feature, please refer to the demo notebook
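As an illustration of one of the calibration metrics listed above, ECE bins predictions by confidence and averages the accuracy-confidence gap, weighted by bin size. A minimal sketch (not the library's implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence,
    then average |accuracy - mean confidence| across bins, weighted
    by the fraction of predictions in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```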

2. Enabled use of LangChain BaseMessage with prompts argument

  • Added support for List[List[BaseMessage]] alongside the existing List[str] format for prompts argument of generate_and_score method in the following classes:
    • UQEnsemble
    • BlackBoxUQ
    • WhiteBoxUQ
    • SemanticEntropy
  • This enhancement enables uncertainty quantification and hallucination detection with:
    • Multimodal inputs (e.g. image)
    • Chat history
    • Various message types (HumanMessage, AIMessage, SystemMessage)
  • Note: This feature is currently in Beta and is not compatible with LLM judges (LLMPanel or judge components of UQEnsemble)
  • For a detailed walkthrough of this feature, please refer to the demo notebook

3. LLM Judge explanations

  • Enhanced the LLMPanel class to provide explanations alongside scores
  • Judges can now justify their evaluations with detailed reasoning
  • Specified with boolean parameter explanations

4. Benchmark Dataset Extension

  • Added support for the FactScore benchmark dataset via the load_example_dataset function
  • Enables evaluation of long-form question answering capabilities in LLMs

5. Updated utility plotting functions

  • Added plot_ranked_auc option to compute AUPRC (rather than AUROC only) and rank scorers in a color-coded bar plot (as seen in our research paper). Added the missing legend to this function.

6. Bug Fixes

  • Fixed the LiveError issue that occurred with rich progress bars when retrying after code interruption
  • Removed unused images for docs site
  • Added missing unit tests for utility plotting functions
  • Updated demo notebooks to use non-deprecated LLMs (gemini-1.5-flash -> gemini-2.5-flash)

What's Changed

New Contributors

Full Changelog: v0.2.7...v0.3.0

v0.2.7#

Released on 2025-09-12 - GitHub - PyPI

Highlights

  • New utility plotting functions:
    • plot_ranked_auc to compute AUPRC (rather than AUROC only) and rank scorers in a color-coded bar plot (as seen in our research paper)
    • plot_filtered_accuracy to compute scorer-specific filtered LLM accuracy at various confidence thresholds (as seen in our research paper)
  • Automated Docs site build
  • Breaking change: UQResult import statement is changed to the following:
    • Previous import: from uqlm.scorers.baseclass.uncertainty import UQResult
    • New import: from uqlm.utils.results import UQResult

What's Changed

New Contributors

Full Changelog: v0.2.6...v0.2.7

v0.2.6#

Released on 2025-08-28 - GitHub - PyPI

Highlights

  • Remove unused attributes in UQEnsemble that were creating a bug with LLMPanel.score
  • Fix alignment in uqlm.utils.plot_model_accuracies function and enable displaying sample sizes as percentages

What's Changed

Full Changelog: v0.2.5...v0.2.6

v0.2.5#

Released on 2025-08-25 - GitHub - PyPI

Highlights

  • Add missing num_responses parameter to generate_candidate_responses method in BlackBoxUQ, SemanticEntropy, and UQEnsemble.
  • Add missing fields/links to pyproject.toml

What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4#

Released on 2025-08-13 - GitHub - PyPI

Highlights

  • Enable specification of LLM Judge scoring templates in UQEnsemble with scoring_templates argument.
  • Enable specification of postprocessed response return options in UQEnsemble: return only raw responses, return only postprocessed responses, or return both.

What's Changed

Full Changelog: v0.2.3...v0.2.4

v0.2.3#

Released on 2025-08-05 - GitHub - PyPI

Highlights

  • Replaces use of bert_score.score with bert_score.BERTScorer.score for a ~43x speedup. While the former (old approach) re-checks and re-assigns torch.device with each use of score, the latter (updated approach) assigns torch.device only once during instantiation.
  • Creates the option for users to specify whether they want only postprocessed responses, only raw responses, or both versions when they specify a postprocessor. This applies to BlackBoxUQ, UQEnsemble, and SemanticEntropy. To do so, users can respectively specify 'postprocessed', 'raw', or 'all' in the 'return_responses' argument in the constructor of these classes. By default, 'all' is specified.
  • [black] is removed where specified in rich print statements to avoid inconsistent colors in progress bars.

What's Changed

Full Changelog: v0.2.2...v0.2.3

v0.2.2#

Released on 2025-08-01 - GitHub - PyPI

Highlights

  • improved handling of missing logprobs
  • adds warning when logprobs missing
  • removes benign transformers warning for NLI instantiation and BERTScore scoring
  • Add flaky and skip logic for unit tests to avoid benign failures
  • Fix escape character usage in Judge prompt
  • Update version of LangChain per Dependabot suggestion

What's Changed

Full Changelog: v0.2.1...v0.2.2

v0.2.1#

Released on 2025-07-28 - GitHub - PyPI

Highlights

  • If exception is raised during generation (e.g. RateLimitError), the progress bar is stopped to avoid LiveError upon retry.
  • Fix BERTScore printed text
  • Fix Ensemble diagram for dark mode
  • Fixes missing max_calls_per_min being passed to LLMPanel constructor inside of UQEnsemble. After this fix, max_calls_per_min will be applied to ensemble judges as well.
  • Add flaky retry logic using @pytest.mark.flaky(retries=3) to tests that fail due to network issues related to HuggingFace.
  • Fix handling of missing logprobs with multiple responses in UQEnsemble

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0#

Released on 2025-07-25 - GitHub - PyPI

These release notes are for minor release v0.2.0.

New Features

1. Progress bars with rich

This feature enables the use of progress bars when generating LLM responses, scoring responses, and tuning ensemble weights. It introduces rich and ipywidgets as new dependencies.
By default, progress bars are turned on, but users can turn them off by setting show_progress_bars=False in the generate_and_score, score, and tune methods of the scorer classes.

2. Ensemble weights printing

After running the UQEnsemble.tune method, ensemble weights are now printed in a pretty table using rich, sorted from highest to lowest. Users can also display this table for an already-tuned ensemble using the UQEnsemble.print_weights method.

3. Support for Python 3.13

As of v0.2.0, uqlm can now be used with Python 3.13. All previous functionality is supported except for bleurt, which is not compatible with Python 3.13.

4. Ensemble saving and loading

UQEnsemble now offers two new methods: save_config and load_config. These methods offer user-friendly saving and loading of ensemble scorer components and weights.

Example use of ensemble saving:

uqe_tuned_config_file = "uqe_config_tuned.json"
uqe.save_config(uqe_tuned_config_file)

Example use of ensemble loading:

loaded_ensemble = UQEnsemble.load_config("uqe_config_tuned.json")

These methods make it easier to store a tuned ensemble for later use.

5. Token-probability-based Semantic Entropy

The SemanticEntropy class now supports token-probability-based estimates of semantic entropy and associated confidence scores. Note that attribute names in the returned object and column names in the associated dataframe have changed from those in v0.1.
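Conceptually, once responses are clustered by meaning, semantic entropy is the entropy of the probability mass assigned to each cluster. A toy sketch using hypothetical cluster probability masses (not the library's implementation):

```python
import math

def semantic_entropy(cluster_probs):
    """Token-probability-based semantic entropy: given each semantic
    cluster's total sequence-probability mass, normalize the masses
    and compute the entropy over clusters. One cluster holding all
    the mass means zero entropy, i.e. high confidence."""
    total = sum(cluster_probs)
    probs = [p / total for p in cluster_probs]
    return -sum(p * math.log(p) for p in probs if p > 0)
```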

Breaking Changes

1. BLEURT Deprecation

This release deprecates BLEURT as a black-box scorer. The following code will now produce errors:

  • Use of uqlm.black_box.BLEURTScorer
  • Use of "bleurt" in uqlm.scorers.BlackBoxUQ scorers parameter
  • Use of "bleurt" in uqlm.scorers.UQEnsemble scorers parameter

v0.1.9#

Released on 2025-07-22 - GitHub - PyPI

What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8#

Released on 2025-07-02 - GitHub - PyPI

Highlights

  • update version of pillow per Dependabot security alert

What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7#

Released on 2025-06-25 - GitHub - PyPI

Highlights

  • Fixes bug related to floating point precision causing ensemble score greater than 1 (1.00000002). This was throwing an error when certain tuner metrics were being computed. Patched with np.clip.
  • Allow use of brier_score and average_precision with Tuner and UQEnsemble

What's Changed

Full Changelog: v0.1.6...v0.1.7

v0.1.6#

Released on 2025-06-19 - GitHub - PyPI

Highlights

  • Add missing unit tests
  • Update version of urllib3 per Dependabot security alert

What's Changed

Full Changelog: v0.1.5...v0.1.6

v0.1.5#

Released on 2025-06-18 - GitHub - PyPI

Highlights

  • add missing unit tests to achieve 100% code coverage
  • implement auto-linting/formatting with ruff
  • reduce Tuner (and UQEnsemble.tune) latency (no API changes)
  • allow likert option for judges

What's Changed

New Contributors

Full Changelog: v0.1.4...v0.1.5

v0.1.4#

Released on 2025-06-11 - GitHub - PyPI

What's Changed

Full Changelog: v0.1.2...v0.1.4

v0.1.3#

Released on 2025-06-02 - GitHub - PyPI

Highlights

  • upgrade tornado version per dependabot

What's Changed

Full Changelog: v0.1.2...v0.1.3

v0.1.2#

Released on 2025-05-14 - GitHub - PyPI

Highlights

  • streamline workflow for LLMPanel by enabling scoring template specification in the constructor
  • update LLMPanel demo
  • fix typos in readme
  • update readme badges
  • fix bleurt error message typo

What's Changed

Full Changelog: v0.1.1...v0.1.2

v0.1.1#

Released on 2025-05-12 - GitHub - PyPI

Highlights

  • Restore missing argument, thresh_objective, for UQEnsemble

What's Changed

Full Changelog: v0.1.0...v0.1.1

v0.1.0#

Released on 2025-05-06 - GitHub - PyPI

UQLM v0.1.0 Release Notes

Introducing UQLM: Uncertainty Quantification for Language Models. UQLM is a Python library for detecting LLM hallucinations using state-of-the-art uncertainty quantification techniques.

Highlights

Comprehensive Scorer Suite

UQLM offers a versatile suite of response-level scorers, each providing a confidence score to indicate the likelihood of errors or hallucinations. The scorers are categorized into four main types:

🎯 Black-Box Scorers: Assess uncertainty through response consistency, compatible with any LLM.

🎲 White-Box Scorers: Utilize token probabilities for faster and cost-effective uncertainty estimation.

⚖️ LLM-as-a-Judge Scorers: Employ LLMs to evaluate response reliability, customizable through prompt engineering.

🔀 Ensemble Scorers: Combine multiple scorers for robust and flexible uncertainty/confidence estimates.

Installation:

Install the latest version from PyPI with:

pip install uqlm

Documentation and Demos:

Visit our documentation site for detailed instructions, API references, and demo notebooks showcasing various hallucination detection methods.

Associated Research:

Our companion paper provides a technical description of the UQLM scorers and extensive experimental results, introducing a novel, tunable ensemble approach.