Release Notes
v0.5.10
Released on 2026-04-15 - GitHub - PyPI
Full Changelog: v0.5.9...v0.5.10
v0.5.9
Released on 2026-04-10 - GitHub - PyPI
What's Changed
- v0.5.4 updates by @dylanbouchard in #336
- Readme - New images added by @doyajii1 in #340
- Enable automated PyPI publishing via GitHub Actions by @vgyani in #374
- Add configurable claim decomposition prompt templates for long-text UQ by @kaushik-42 in #342
- update landing page by @dylanbouchard in #385
- Patch release: v0.5.9 by @dylanbouchard in #386
Full Changelog: v0.5.8...v0.5.9
v0.5.8
Released on 2026-03-30 - GitHub - PyPI
Highlights
- fix noncontradiction calculation for longform scorers
- fix semantic sets confidence bounds
- dependabot security alerts
What's Changed
- Ruff formatting by @mohitcek in #361
- Bug fixes by @dylanbouchard in #371
- Patch release: v0.5.8 by @dylanbouchard in #372
Full Changelog: v0.5.7...v0.5.8
v0.5.7
Released on 2026-03-13 - GitHub - PyPI
What's Changed
- Fixed doc site inaccuracies by @vgyani in #367
- Patch release: v0.5.7 by @dylanbouchard in #368
Full Changelog: v0.5.6...v0.5.7
v0.5.6
Released on 2026-03-02 - GitHub - PyPI
Highlights
- package upgrades from dependabot
- update badge colors and links
- update citation information
What's Changed
- Badge by @dylanbouchard in #351
- Revise publication links and citation details by @dylanbouchard in #352
- Badge by @dylanbouchard in #353
- Bump nbsphinx from 0.9.6 to 0.9.8 by @dependabot[bot] in #356
- Bump pytest-asyncio from 1.1.1 to 1.3.0 by @dependabot[bot] in #250
- Bump sphinx-gallery from 0.18.0 to 0.20.0 by @dependabot[bot] in #354
- Update rich requirement from <14.0.0,>=13.8.0 to >=13.8.0,<15.0.0 by @dependabot[bot] in #355
- Patch release: v0.5.6 by @dylanbouchard in #357
Full Changelog: v0.5.5...v0.5.6
v0.5.5
Released on 2026-02-25 - GitHub - PyPI
Highlights
- replace agenerate with ainvoke where generation is used, since ainvoke appears to be the better-maintained and better-documented LangChain method
- replace poetry with uv for dependency management
- update badges
What's Changed
- Badge updates by @dylanbouchard in #348
- Switch from Poetry to astral-sh/uv by @vgyani in #349
- Patch release: v0.5.5 by @dylanbouchard in #350
Full Changelog: v0.5.4...v0.5.5
v0.5.4
Released on 2026-01-30 - GitHub - PyPI
Highlights
1. Add new white-box scorers to UQEnsemble accepted scorers list:
Top-logprobs scorers (3):
- min_token_negentropy - Minimum negentropy across tokens
- mean_token_negentropy - Average negentropy across tokens
- probability_margin - Mean difference between top-2 token probabilities
Sampled-logprobs scorers (4):
- semantic_negentropy - Entropy based on semantic clustering
- semantic_density - Density-based confidence measure
- monte_carlo_probability - Average sequence probability across samples
- consistency_and_confidence - Cosine similarity × response probability
P(True) scorer (1):
- p_true - LLM's estimate of P(response is true)
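As a rough illustration of the top-logprobs scorers listed above, the sketch below computes a probability margin and per-token negentropy from toy top-K logprobs. The function names, the toy values, and the negentropy normalization (1 - H / log K over renormalized top-K probabilities) are assumptions made for illustration; they are not taken from the uqlm source.

```python
import math

def probability_margin(top_logprobs_per_token):
    """Mean gap between the two most likely tokens at each position."""
    margins = []
    for logprobs in top_logprobs_per_token:
        probs = sorted((math.exp(lp) for lp in logprobs), reverse=True)
        margins.append(probs[0] - probs[1])
    return sum(margins) / len(margins)

def token_negentropies(top_logprobs_per_token):
    """Per-token negentropy, 1 - H / log(K). Assumed normalization; needs K >= 2."""
    scores = []
    for logprobs in top_logprobs_per_token:
        probs = [math.exp(lp) for lp in logprobs]
        total = sum(probs)
        probs = [p / total for p in probs]  # renormalize over the top-K tokens
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        scores.append(1.0 - entropy / math.log(len(probs)))
    return scores

# Toy top-2 logprobs for a two-token response (made-up values)
toy = [[math.log(0.9), math.log(0.1)], [math.log(0.5), math.log(0.5)]]
margin = probability_margin(toy)
negentropies = token_negentropies(toy)
min_score = min(negentropies)                      # analogous to min_token_negentropy
mean_score = sum(negentropies) / len(negentropies)  # analogous to mean_token_negentropy
```

In this toy case the second token is a coin flip between two candidates, so it drives the minimum negentropy to roughly zero while the first token stays fairly confident.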
2. Fix embeddings model specification for cosine_sim and consistency_and_confidence, enable with WhiteBoxUQ
Corrects a string error in embedding model specification with the sentence_transformer parameter of BlackBoxUQ. Previously, the string was forced to begin with "sentence-transformers/", but now the full string is specified with the parameter.
Previous: sentence_transformer=all-MiniLM-L12-v2 was specified and then "sentence-transformers/" was prepended to the string when storing the class attribute.
Now: sentence_transformer=sentence-transformers/all-MiniLM-L12-v2 is specified. This allows other embedding models that don't start with "sentence-transformers/", such as jinaai/jina-embeddings-v2-base-code, to be specified.
Also adds the missing sentence_transformer parameter for WhiteBoxUQ.
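The change in string handling can be pictured with the two hypothetical helpers below. They only mirror the behavior described above; the helper names are invented for this sketch and do not exist in uqlm.

```python
PREFIX = "sentence-transformers/"

def resolve_embedding_model_previous(sentence_transformer: str) -> str:
    # Behavior before v0.5.4 as described above: the prefix was always prepended.
    return PREFIX + sentence_transformer

def resolve_embedding_model_current(sentence_transformer: str) -> str:
    # Behavior from v0.5.4 on: the full model string is used as given.
    return sentence_transformer

old_style = resolve_embedding_model_previous("all-MiniLM-L12-v2")
new_style = resolve_embedding_model_current("sentence-transformers/all-MiniLM-L12-v2")
other_org = resolve_embedding_model_current("jinaai/jina-embeddings-v2-base-code")
```

Under the previous behavior, a model such as jinaai/jina-embeddings-v2-base-code would have received an unwanted prefix, which is why the full string is now required.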
What's Changed
- v0.5.3 updates by @dylanbouchard in #327
- Fix embedding model specification by @dylanbouchard in #332
- Enable use of new white-box scorers in UQEnsemble by @dylanbouchard in #333
- Feature/enable all white box scorers by @kaushik-42 in #328
- Patch release: v0.5.4 by @dylanbouchard in #334
Full Changelog: v0.5.3...v0.5.4
v0.5.3
Released on 2026-01-20 - GitHub - PyPI
Highlights
- added new demo notebook to illustrate langgraph-uqlm integration
- upgrade package versions per dependabot
- fix some LaTeX in docs site
- fix links in readme
What's Changed
- v0.5.2 updates by @dylanbouchard in #322
- fix latex in docs site by @dylanbouchard in #324
- Added LangGraph demo notebook by @vnnair98 in #323
- Security updates by @dylanbouchard in #325
- Patch release: v0.5.3 by @dylanbouchard in #326
New Contributors
Full Changelog: v0.5.2...v0.5.3
v0.5.2
Released on 2026-01-14 - GitHub - PyPI
Highlights
- Create uqlm.nli.EntailmentClassifier class for LLM-based entailment classification. This is well-suited for long-text scoring when responses exceed the length that can be handled by the Hugging Face NLI model
- Update LongTextGraph, LongTextUQ, UnitResponseScorer, GraphScorer and associated notebooks to allow for LLM-based entailment classification.
- Update unit tests
- Misc. docs site cleanup
What's Changed
- Add LLM-based entailment classification + Docs cleanup by @dylanbouchard in #320
- Patch release: v0.5.2 by @dylanbouchard in #321
Full Changelog: v0.5.1...v0.5.2
v0.5.1
Released on 2026-01-09 - GitHub - PyPI
Highlights
- fixes rendering of long-form scorer content on the docs site
- adds missing uqlm/longform subpackage to pyproject.toml so it appears in API reference on docs site
- misc. docs site cleanup
What's Changed
- v0.5.0 updates by @dylanbouchard in #316
- Add longform subpackage and fix docs links by @dylanbouchard in #317
- fix code block in get started by @dylanbouchard in #318
- Patch release: v0.5.1 by @dylanbouchard in #319
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Released on 2026-01-08 - GitHub - PyPI
New Methods: Long-Form UQ
Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.
Response Decomposition
We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
Scoring methods
We add four families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, Unit-QA, and Graph-Based.
1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)
These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
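A minimal sketch of the Unit-Response idea follows, assuming a generic entailment function. The toy_entails stand-in (a case-insensitive substring check) and all names here are illustrative assumptions; the library itself relies on an NLI model or an LLM-based classifier.

```python
def unit_response_scores(units, sampled_responses, entails):
    """Average, per unit, how often the sampled responses entail that unit."""
    scores = []
    for unit in units:
        per_sample = [entails(response, unit) for response in sampled_responses]
        scores.append(sum(per_sample) / len(per_sample))
    return scores

def toy_entails(premise, hypothesis):
    # Stand-in for an NLI model: 1.0 if the unit appears verbatim in the sample.
    return 1.0 if hypothesis.lower() in premise.lower() else 0.0

units = ["Paris is the capital of France", "Paris has 10 million residents"]
samples = [
    "Paris is the capital of France.",
    "Paris is the capital of France and lies on the Seine.",
    "Lyon is large; Paris has 10 million residents.",
]
scores = unit_response_scores(units, samples, toy_entails)
```

Each entry in scores is a unit-level confidence: the first unit is supported by two of three samples, the second by only one.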

2. Matched-Unit (Based on the LUQ-pair method)
These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.

3. Unit-QA (Based on the Longform Semantic Entropy method)
These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are the claims given context, sampling multiple answers, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.

4. Graph-Based (Based on Jiang et al., 2024)
Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.
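The sketch below illustrates one possible centrality metric on such a bipartite graph: normalized degree centrality of each claim node. Exact set membership stands in for entailment, and both the helper names and the choice of degree centrality are assumptions; the metrics actually used by LongTextGraph may differ.

```python
def build_bipartite_edges(response_claims):
    """Map each unique claim to 0/1 flags, one per response (1 = edge present)."""
    all_claims = sorted(set().union(*response_claims))
    return {
        claim: [int(claim in claims) for claims in response_claims]
        for claim in all_claims
    }

def claim_degree_centrality(edges):
    """Fraction of responses connected to each claim node."""
    return {claim: sum(flags) / len(flags) for claim, flags in edges.items()}

# Toy claim sets: the original response first, then two sampled responses
response_claims = [{"A", "B"}, {"A"}, {"A", "C"}]
edges = build_bipartite_edges(response_claims)
centrality = claim_degree_centrality(edges)
```

Claims supported by many responses (here "A") end up with high centrality, while claims that appear in a single response score low.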

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.
Response Refinement with Uncertainty Aware Decoding
Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer parameter) below a specified threshold (specified with response_refinement_threshold parameter) and reconstructing the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
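The filtering step can be sketched as follows. The function name is hypothetical, and joining retained claims with spaces is only a stand-in for the reconstruction step, which may work differently in the library.

```python
def refine_response(claims, confidence_scores, threshold):
    """Drop claims scoring below the threshold and rebuild text from the rest."""
    kept = [
        claim
        for claim, score in zip(claims, confidence_scores)
        if score >= threshold
    ]
    # Simple reconstruction by concatenation (an assumption for this sketch)
    return " ".join(kept)

claims = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "It is 500 meters tall.",
]
confidence_scores = [0.95, 0.80, 0.20]  # made-up unit-level confidence scores
refined = refine_response(claims, confidence_scores, threshold=0.5)
```

Here the low-confidence third claim is removed, and the refined response is built from the two retained claims.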
Performance Evaluation
We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a generated response to a FactScore question against the corresponding text of the subject's Wikipedia article.
New docs site pages
We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.
Other changes
- uqlm.scorers has now been refactored into two subfolders: uqlm.scorers.shortform (which contains the existing scorer classes as of v0.4) and uqlm.scorers.longform (which contains classes implementing the scoring methods described above)
- the readme has been updated to reflect the new longform scorers, and a new readme has been added inside the examples/ directory to provide more details on the available tutorials
- various package upgrades to address security vulnerabilities identified by dependabot
Breaking changes
- normalized_probability has been deprecated from the list of accepted white-box scorers in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (default). This also affects the key/column names in the returned UQResult object.
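To illustrate why the two scorer names can coincide, the sketch below assumes that length normalization means taking the geometric mean of token probabilities (the exponential of the mean token logprob). This is an assumption for illustration, and the function is not the library's implementation.

```python
import math

def sequence_probability(token_logprobs, length_normalize=True):
    """Joint probability of a response, optionally normalized by its length."""
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)  # mean logprob -> geometric mean of probs
    return math.exp(total)

token_logprobs = [math.log(0.5), math.log(0.5)]  # toy two-token response
normalized = sequence_probability(token_logprobs)                      # about 0.5
unnormalized = sequence_probability(token_logprobs, length_normalize=False)  # about 0.25
```

Without normalization, longer responses are penalized simply for having more tokens; with it, the score reflects average per-token confidence.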
What's Changed
- v0.3 updates by @dylanbouchard in #197
- LLM based NLI + ResponseDecomposer upgrades + restructured prompts by @dskarbrevik in #199
- Minor refactor by @dylanbouchard in #201
- add aggregation method by @dylanbouchard in #202
- Add mode, granularity parameters in place of scorers by @dylanbouchard in #204
- Long-form Semantic Entropy by @mohitcek in #203
- add factscore grader by @dylanbouchard in #207
- Enable more granular score return by @dylanbouchard in #208
- Binary style for NLI class by @dskarbrevik in #206
- update grader by @dylanbouchard in #215
- Longform Feature: evaluate method to compute semantic entropy by @mohitcek in #217
- Refactor ClaimQA class by @mohitcek in #218
- Patch/v0.3.1 by @dylanbouchard in #225
- v0.3.1 updates by @dylanbouchard in #224
- update question template by @dylanbouchard in #227
- Feat: ClaimQA class - multiple questions per factoid/claim by @mohitcek in #228
- Claimqa updates by @dylanbouchard in #235
- v0.4.4 updates by @dylanbouchard in #279
- Merge develop -> longform UQ branch by @dylanbouchard in #282
- v0.4.5 updates by @dylanbouchard in #286
- LongForm UQ by @dylanbouchard in #283
- Created new directories for short-form and long-form responses by @mohitcek in #288
- Refactor uqlm.scorers for shorform vs. longform parent classes by @dylanbouchard in #289
- Issue #244 - Added Scorer Definitions on Docs Site by @vgyani in #287
- Add long-text definition to docs by @dylanbouchard in #298
- Rearrange subpackages by @dylanbouchard in #300
- Rename modules, add UAD scorer specification by @dylanbouchard in #304
- Update notebooks by @dylanbouchard in #308
- Graph based long-form scoring by @dskarbrevik in #307
- Fix links and test by @dylanbouchard in #309
- Add new unit tests by @dylanbouchard in #310
- update uad graphics by @dylanbouchard in #311
- update luq graphic and version by @dylanbouchard in #313
- add qa unit test by @dylanbouchard in #314
- Minor release: v0.5.0 by @dylanbouchard in #315
Full Changelog: v0.4.5...v0.5.0
v0.4.5
Released on 2025-12-08 - GitHub - PyPI
Highlights
- fix bug in model name string checking when retrieving logprobs, per issue #284
What's Changed
- patch release: v0.4.5 by @dylanbouchard in #285
Full Changelog: v0.4.4...v0.4.5
v0.4.4
Released on 2025-12-04 - GitHub - PyPI
Highlights
- adds max_length parameter to WhiteBoxUQ to avoid the CUDA OutOfMemoryError
- updates demo and docstring accordingly
What's Changed
- v0.4.3 updates by @dylanbouchard in #276
- Patch release: v0.4.4 by @dylanbouchard in #277
Full Changelog: v0.4.3...v0.4.4
v0.4.3
Released on 2025-12-03 - GitHub - PyPI
Highlights
- Automated torch.device selection for all scorers that use NLI or cosine similarity
- Updates to docstring and demo notebooks to reflect the above change
What's Changed
- Added automatic device detection by @kaushik-42 in #249
- Update notebooks for auto device by @dylanbouchard in #271
- v0.4.2 updates by @dylanbouchard in #273
- Patch release: v0.4.3 by @dylanbouchard in #274
Full Changelog: v0.4.2...v0.4.3
v0.4.2
Released on 2025-11-21 - GitHub - PyPI
Highlights
- Fix broken links to notebooks on docs site
- Update PyPI readme
What's Changed
- v0.4.1 updates by @dylanbouchard in #267
- add missing example notebooks by @dylanbouchard in #268
- Patch release: v0.4.2 by @dylanbouchard in #269
Full Changelog: v0.4.1...v0.4.2
v0.4.1
Released on 2025-11-14 - GitHub - PyPI
Highlights
- Relax langchain>=1.0.0 restriction to allow for >=0.3.7
- Fix notebook links on docs site
What's Changed
- Support langchain < 1.0.0 by @doyajii1 in #265
- Patch release: v0.4.1 by @dylanbouchard in #266
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Released on 2025-11-12 - GitHub - PyPI
Highlights
1. Varied tutorials for more model and dataset coverage
We have updated the example notebooks to have broader coverage over LLMs and example datasets.
LLMs
- Gemini models
- GPT-4* models
- o3-mini
- Qwen
- Mistral
- Llama
- Deepseek
Datasets
- GSM8K
- SVAMP
- PopQA
- NQ-Open
- AI2-ARC
- CSQA
- SimpleQA
- HotpotQA
- Image (multimodal demo)
2. New scorers added
This release includes the addition of 11 new scorers spanning various categories (with accompanying unit tests). Details are provided below.
White-Box scorers
We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.
Single-generation white-box scorers
- Likelihood margin Farr et al., 2024
- Sequence probability Vashurin et al., 2024
- Mean top-k token entropy Scalena et al., 2025
- Max top-token entropy Scalena et al., 2025
Sampling-based white-box scorers
- Semantic Entropy (logprobs version) Farquhar et al., 2024
- Semantic Density Qiu et al., 2024 (can also be implemented with SemanticDensity from uqlm.scorers)
- Monte carlo predictive entropy Kuhn et al., 2023
- CoCoA Vashurin et al., 2025
Reflexive white-box scorers
- P(True) Kadavath et al., 2022
Black-Box scorers
We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.
- Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
- Entailment Probability (Lin et al., 2025; Chen & Mueller, 2023)
Definitions of new scorers are provided with LaTeX at the end of applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers. The readme has also been updated to reflect the new scorers.
3. New LLMGrader class and updated default grader for UQEnsemble
This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:
- in the example notebooks for evaluating hallucination detection performance.
- as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.
4. Option to provide additional context to LLM judges
Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.
5. New datasets available with load_example_dataset
The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.
6. uqlm.nli sub-package
Created uqlm.nli sub-package that contains the following:
- NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations have moved to the uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes, respectively.
- SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)
7. uqlm.white_box sub-package
Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:
- SingleLogprobsScorer for computing scores that depend only on logprobs from one generated response: normalized probability, sequence probability, minimum probability
- TopLogprobsScorer for computing scores that depend on top-K logprobs from the generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
- SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: monte carlo probability, CoCoA, semantic entropy, and semantic density
- PTrueScorer for implementing the P(True) method
8. Minor changes & future deprecations
- Renamed NLIScorer -> ConsistencyScorer and moved some methods to uqlm.nli.NLI class
- normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
- system_prompt and template_ques_ans are deprecated in favor of additional_context parameter
- default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation
What's Changed
- Add Semantic Density scorer by @dross20 in #209
- Adding HotPotQA and SimpleQA by @dskarbrevik in #210
- Semantic density, docs by @dylanbouchard in #212
- Semantic density notebook by @dross20 in #213
- Semantic density by @dylanbouchard in #214
- v0.3.1 updates by @dylanbouchard in #220
- add judge customization option by @dylanbouchard in #221
- v0.3.1 updates by @dylanbouchard in #223
- New White Box Scorers by @dylanbouchard in #219
- Diversify demos by @dylanbouchard in #232
- Update notebooks by @dylanbouchard in #233
- update demo notebooks by @zeya30 in #229
- Update demo notebooks by @zeya30 in #234
- Llm grader by @dylanbouchard in #238
- update demo notebooks by @dylanbouchard in #239
- Refactor: NLI Subpackage II by @mohitcek in #237
- Feature: Integrate SemanticEntropy and SemanticDensity methods with WhiteBoxUQ class by @mohitcek in #240
- Drop python 3.9 support by @doyajii1 in #242
- Jmlr revisions by @dylanbouchard in #243
- Bump sphinx-autodoc-typehints from 2.2.0 to 2.3.0 by @dependabot[bot] in #230
- Polish notebooks and readme by @dylanbouchard in #246
- Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in #176
- Number of semantic sets scorer by @dylanbouchard in #247
- Improve unit tests code coverage by @zeya30 in #241
- Minor refactor + Improved test coverage by @dylanbouchard in #255
- Minor refactor + updated demos by @dylanbouchard in #256
- Update scorer definitions + fix logprobs bug in SemanticEntropy by @dylanbouchard in #257
- Reuse NLI Scores by @mohitcek in #260
- Improve unit tests code coverage by @zeya30 in #258
- Allow for torch.device in WhitBoxUQ by @dylanbouchard in #261
- Fix logprob bug by @dylanbouchard in #262
- Update docs site by @dylanbouchard in #259
- Release PR: v0.4.0 by @dylanbouchard in #263
New Contributors
Full Changelog: v0.3.1...v0.4.0
v0.3.1
Released on 2025-10-23 - GitHub - PyPI
Highlights
- remove unused uqlm/utils/calibation.py
- fix docs site to include uqlm/calibration
- fix bug related to rich progress bars
- add baseline reference to plot_ranked_auc function
What's Changed
- v0.3.0 updates by @dylanbouchard in #193
- [docsite] Adding missing package name for API page generation by @doyajii1 in #200
- Patch release: v0.3.1 by @dylanbouchard in #205
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Released on 2025-10-01 - GitHub - PyPI
1. Dataset-specific confidence score calibration
- Introduced the new ScoreCalibrator class for calibrating confidence scores on specific datasets (Platt or Isotonic)
- Includes evaluate_calibration function for evaluating score calibration with plots and various metrics, including ECE, MCE, Brier Score, Calibration Gap, and log-loss
- For a detailed walkthrough of this feature, please refer to the demo notebook
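For readers unfamiliar with the calibration metrics named above, here is a small self-contained sketch of the Brier score and expected calibration error (ECE) with equal-width bins. It uses the textbook definitions and is not the evaluate_calibration implementation; the bin count and toy data are assumptions.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

confidences = [0.9, 0.9, 0.1, 0.1]  # toy confidence scores
correct = [1, 1, 0, 0]              # toy correctness labels
brier = brier_score(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

Lower values indicate better calibration for both metrics; a calibrator such as Platt scaling or isotonic regression aims to reduce them on held-out data.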
2. Enabled use of LangChain BaseMessage with prompts argument
- Added support for List[List[BaseMessage]] alongside the existing List[str] format for prompts argument of generate_and_score method in the following classes:
  - UQEnsemble
  - BlackBoxUQ
  - WhiteBoxUQ
  - SemanticEntropy
- This enhancement enables uncertainty quantification and hallucination detection with:
- Multimodal inputs (e.g. image)
- Chat history
- Various message types (HumanMessage, AIMessage, SystemMessage)
- Note: This feature is currently in Beta and is not compatible with LLM judges (LLMPanel or judge components of UQEnsemble)
- For a detailed walkthrough of this feature, please refer to the demo notebook
3. LLM Judge explanations
- Enhanced the LLMPanel class to provide explanations alongside scores
- Judges can now justify their evaluations with detailed reasoning
- Specified with boolean parameter explanations
4. Benchmark Dataset Extension
- Added support for the FactScore benchmark dataset via the load_example_dataset function
- Enables evaluation of long-form question answering capabilities in LLMs
5. Updated utility plotting functions
- Added plot_ranked_auc option to compute AUPRC (rather than the current AUROC only) and rank them in a color-coded bar plot (as seen in our research paper). Added missing legend to this function.
6. Bug Fixes
- Fixed the LiveError issue that occurred with rich progress bars when retrying after code interruption
- Removed unused images for docs site
- Added missing unit tests for utility plotting functions
- Updated demo notebooks to use non-deprecated LLMs (gemini-1.5-flash -> gemini-2.5-flash)
What's Changed
- Add score calibration by @jmabry in #147
- v0.2.7 updates by @dylanbouchard in #171
- Feat: Integrate ScoreCalibration class to existing structure by @mohitcek in #165
- Confidence score calibration by @dylanbouchard in #181
- Enable UQ with multimodal inputs by @dylanbouchard in #182
- Bump sphinx from 7.3.7 to 7.4.7 by @dependabot[bot] in #177
- Removing unused images and set correct switcher json url by @doyajii1 in #184
- update URLs in README to use main branch by @vgyani in #187
- Removed a typo from black_box_demo.ipynb by @kaushik-42 in #188
- Update plot_ranked_auc by @zeya30 in #183
- Enable explanations with LLM judge scores by @NamrataWalanj7 in #178
- fix continuous judge output handling by @dylanbouchard in #189
- Adding factscore dataset by @dskarbrevik in #191
- Minor release: v0.3 by @dylanbouchard in #192
New Contributors
- @jmabry made their first contribution in #147
- @vgyani made their first contribution in #187
- @kaushik-42 made their first contribution in #188
Full Changelog: v0.2.7...v0.3.0
v0.2.7
Released on 2025-09-12 - GitHub - PyPI
Highlights
- New utility plotting functions:
  - plot_ranked_auc to compute AUPRC (rather than the current AUROC only) and rank them in a color-coded bar plot (as seen in our research paper)
  - plot_filtered_accuracy to compute scorer-specific filtered LLM accuracy at various confidence thresholds (as seen in our research paper)
- Automated Docs site build
- Breaking change: UQResult import statement is changed to the following:
  - Previous import: from uqlm.scorers.baseclass.uncertainty import UncertaintyQuantifier
  - New import: from uqlm.utils.results import UQResult
What's Changed
- ci: manage dependencies in CI with poetry for consistency by @trumant in #160
- Feat: Visualization utility functions by @mohitcek in #161
- #29 GitHub actions to automate documentation site build on new release by @dimtsap in #100
- v0.2.6 updates by @dylanbouchard in #168
- Update Utility Visualization function by @mohitcek in #170
- Patch release: v0.2.7 by @dylanbouchard in #169
New Contributors
Full Changelog: v0.2.6...v0.2.7
v0.2.6
Released on 2025-08-28 - GitHub - PyPI
Highlights
- Remove unused attributes in UQEnsemble that were causing a bug with LLMPanel.score
- Fix alignment in uqlm.utils.plot_model_accuracies function and enable displaying sample sizes as percentages
What's Changed
- Refactor plot model accuracies by @mohitcek in #157
- Patch release: v0.2.6 by @dylanbouchard in #159
Full Changelog: v0.2.5...v0.2.6
v0.2.5
Released on 2025-08-25 - GitHub - PyPI
Highlights
- Add missing num_responses parameter to generate_candidate_responses method in BlackBoxUQ, SemanticEntropy, and UQEnsemble.
- Add missing fields/links to pyproject.toml
What's Changed
- v0.2.4 updates by @dylanbouchard in #154
- Add attribute num_responses by @mohitcek in #155
- Patch release: v0.2.5 by @dylanbouchard in #156
Full Changelog: v0.2.4...v0.2.5
v0.2.4
Released on 2025-08-13 - GitHub - PyPI
Highlights
- Enable specification of LLM Judge scoring templates in UQEnsemble with scoring_templates argument.
- Enable specification of postprocessed response return options in UQEnsemble: return only raw responses, return only postprocessed responses, or return both.
What's Changed
- Enable different postprocessing return options and judge scoring templates with UQEnsemble by @dylanbouchard in #145
- Patch release: v0.2.4 by @dylanbouchard in #146
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Released on 2025-08-05 - GitHub - PyPI
Highlights
- Replaces use of bert_score.score with bert_score.BERTScorer.score for a ~43x speedup. While the former (old approach) re-checks and re-assigns torch.device with each use of score, the latter (updated approach) assigns torch.device only once during instantiation.
- Creates the option for users to specify whether they want only postprocessed responses, only raw responses, or both versions when they specify a postprocessor. This applies to BlackBoxUQ, UQEnsemble, and SemanticEntropy. To do so, users can respectively specify 'postprocessed', 'raw', or 'all' in the 'return_responses' argument in the constructor of these classes. By default, 'all' is specified.
- [black] is removed where specified in rich print statements to avoid inconsistent colors in progress bars.
What's Changed
- v0.2.2 updates by @dylanbouchard in #140
- use bert_score class rather than function for 43x speedup by @dylanbouchard in #141
- Enable different handling of raw vs postprocessed responses by @dylanbouchard in #143
- Patch release: v0.2.3 by @dylanbouchard in #144
Full Changelog: v0.2.2...v0.2.3
v0.2.2
Released on 2025-08-01 - GitHub - PyPI
Highlights
- improved handling of missing logprobs
- adds warning when logprobs missing
- removes benign transformers warning for NLI instantiation and BERTScore scoring
- Add flaky and skip logic for unit tests to avoid benign failures
- Fix escape character usage in Judge prompt
- Update version of LangChain per Dependabot suggestion
What's Changed
- Release/v0.2.0 by @dylanbouchard in #118
- Release/v0.2.0 by @dylanbouchard in #128
- v0.2.1 updates by @dylanbouchard in #131
- Bump langchain from 0.3.26 to 0.3.27 by @dependabot[bot] in #121
- Db/missing logprobs unittest skip by @dylanbouchard in #132
- Add missing logprobs warning and update version by @dylanbouchard in #133
- fix escape character in judge prompt by @dylanbouchard in #134
- Fix logprobs syntax error and judge prompt escape character by @dylanbouchard in #137
- suppress benign transformers warnings by @dylanbouchard in #138
- Patch release: v0.2.2 by @dylanbouchard in #139
Full Changelog: v0.2.1...v0.2.2
v0.2.1
Released on 2025-07-28 - GitHub - PyPI
Highlights
- If an exception is raised during generation (e.g. RateLimitError), the progress bar is stopped to avoid LiveError upon retry.
- Fix BERTScore printed text
- Fix Ensemble diagram for dark mode
- Fixes missing max_calls_per_min being passed to LLMPanel constructor inside of UQEnsemble. After this fix, max_calls_per_min will be applied to ensemble judges as well.
- Add flaky retry logic using @pytest.mark.flaky(retries=3) to tests that fail due to network issues related to HuggingFace.
- Fix handling of missing logprobs with multiple responses in UQEnsemble
What's Changed
- Patch release: v0.2.1 by @dylanbouchard in #129
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Released on 2025-07-25 - GitHub - PyPI
These release notes are for minor release v0.2.0.
New Features
1. Progress bars with rich
This feature enables the use of progress bars when generating LLM responses, scoring responses, and tuning ensemble weights. This feature introduced rich and ipywidgets as new dependencies.
By default, progress bars are turned on, but users can turn them off by setting show_progress_bars=False in generate_and_score, score, and tune methods for the scorer classes. Below is a screenshot illustrating the use of rich progress bars with the UQEnsemble.tune method:

2. Ensemble weights printing
After running the UQEnsemble.tune method, ensemble weights are now printed in a pretty table using rich. Ensemble weights are sorted from highest to lowest. See the above screenshot for an example. Users can also display this table with an already tuned ensemble using the UQEnsemble.print_weights method.
3. Support for Python 3.13
As of v0.2.0, uqlm can now be used with Python 3.13. All previous functionality is supported except for bleurt, which is not compatible with Python 3.13.
4. Ensemble saving and loading
UQEnsemble now offers two new methods: save_config and load_config. These methods offer user-friendly saving and loading of the ensemble scorer components and weights.
Example use of ensemble saving:
    uqe_tuned_config_file = "uqe_config_tuned.json"
    uqe.save_config(uqe_tuned_config_file)
Example use of ensemble loading:
    loaded_ensemble = UQEnsemble.load_config("uqe_config_tuned.json")
These methods make it easier to store a tuned ensemble for later use.
5. Token-probability-based Semantic Entropy
The SemanticEntropy class now supports token-probability-based estimates of semantic entropy and associated confidence scores. Note that attribute names in the returned object and column names in the associated dataframe have changed from those in v0.1.
Breaking Changes
1. BLEURT Deprecation
This release deprecates BLEURT as a black-box scorer. The following code will now produce errors:
- Use of uqlm.black_box.BLEURTScorer
- Use of "bleurt" in uqlm.scorers.BlackBoxUQ scorers parameter
- Use of "bleurt" in uqlm.scorers.UQEnsemble scorers parameter
v0.1.9
Released on 2025-07-22 - GitHub - PyPI
What's Changed
- Patch/v0.1.9 by @dylanbouchard in #105
Full Changelog: v0.1.8...v0.1.9
v0.1.8
Released on 2025-07-02 - GitHub - PyPI
Highlights
- update version of pillow per Dependabot security alert
What's Changed
- patch release: v0.1.8 by @dylanbouchard in #77
Full Changelog: v0.1.7...v0.1.8
v0.1.7
Released on 2025-06-25 - GitHub - PyPI
Highlights
- Fixes bug related to floating point precision causing ensemble score greater than 1 (1.00000002). This was throwing an error when certain tuner metrics were being computed. Patched with np.clip.
- Allow use of brier_score and average_precision with Tuner and UQEnsemble
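The clipping fix can be pictured with the standard-library sketch below, which clamps a score into [0, 1]. The helper name is an assumption; np.clip provides the equivalent operation for arrays in the actual patch.

```python
def clip_score(value, lower=0.0, upper=1.0):
    """Clamp a score into [lower, upper], like np.clip does for a scalar."""
    return max(lower, min(upper, value))

raw_ensemble_score = 1.00000002  # the out-of-range value reported above
clipped = clip_score(raw_ensemble_score)
```

Metrics that expect probabilities in [0, 1] can then be computed on the clipped score without raising errors.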
What's Changed
- v0.1.6 updates by @dylanbouchard in #68
- New metrics by @dylanbouchard in #69
- Patch/v0.1.7 by @dylanbouchard in #70
Full Changelog: v0.1.6...v0.1.7
v0.1.6
Released on 2025-06-19 - GitHub - PyPI
Highlights
- Add missing unit tests
- Update version of urllib3 per Dependabot security alert
What's Changed
- Additional unit tests for ResponseGenerator class by @zeya30 in #58
- Additional unit tests for UncertaintyQuantifier class by @zeya30 in #60
- Improving coverage for unit tests by @zeya30 in #53
- Additional unit tests for LLM Panel class by @zeya30 in #62
- Unit tests Tuner class by @mohitcek in #63
- v0.1.5 updates by @dylanbouchard in #65
- Patch release: v0.1.6 by @dylanbouchard in #66
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Released on 2025-06-18 - GitHub - PyPI
Highlights
- add missing unit tests to achieve 100% code coverage
- implement auto-linting/formatting with ruff
- reduce Tuner (and UQEnsemble.tune) latency (no API changes)
- allow likert option for judges
What's Changed
- v0.1.2 updates by @dylanbouchard in #34
- adding Likert scale scoring for LLMJudge class by @zeya30 in #36
- Tuner class: Low Latency by @mohitcek in #39
- Linting CI workflow by @dimtsap in #28
- Za/unit tests by @zeya30 in #50
- Bugfix/ruff linting by @mohitcek in #55
- Additional unit tests for UQensemble class by @mohitcek in #54
- dependabot security fix by @dylanbouchard in #56
- patch release: reduce tuner latency, add unit tests, auto linting by @dylanbouchard in #64
New Contributors
Full Changelog: v0.1.4...v0.1.5
v0.1.4
Released on 2025-06-11 - GitHub - PyPI
What's Changed
- Patch release: v0.1.3 by @dylanbouchard in #48
Full Changelog: v0.1.2...v0.1.4
v0.1.3
Released on 2025-06-02 - GitHub - PyPI
Highlights
- upgrade tornado version per dependabot
What's Changed
- Patch release: v0.1.3 by @dylanbouchard in #48
Full Changelog: v0.1.2...v0.1.3
v0.1.2
Released on 2025-05-14 - GitHub - PyPI
Highlights
- streamline workflow for LLMPanel by enabling scoring template specification in the constructor
- update LLMPanel demo
- fix typos in readme
- update readme badges
- fix bleurt error message typo
What's Changed
- v0.1.0 updates by @dylanbouchard in #16
- v0.1.1 updates by @dylanbouchard in #18
- Update readme, error message by @dylanbouchard in #22
- Simplify LLMPanel workflow by @dylanbouchard in #23
- Patch/v0.1.2 by @dylanbouchard in #26
Full Changelog: v0.1.1...v0.1.2
v0.1.1
Released on 2025-05-12 - GitHub - PyPI
Highlights
- Restore missing argument, thresh_objective, for UQEnsemble
What's Changed
Full Changelog: v0.1.0...v0.1.1
v0.1.0
Released on 2025-05-06 - GitHub - PyPI
UQLM v0.1.0 Release Notes
Introducing UQLM: Uncertainty Quantification for Language Models. UQLM is a Python library for detecting LLM hallucinations using state-of-the-art uncertainty quantification techniques.
Highlights
Comprehensive Scorer Suite
UQLM offers a versatile suite of response-level scorers, each providing a confidence score to indicate the likelihood of errors or hallucinations. The scorers are categorized into four main types:
🎯 Black-Box Scorers: Assess uncertainty through response consistency, compatible with any LLM.
🎲 White-Box Scorers: Utilize token probabilities for faster and cost-effective uncertainty estimation.
⚖️ LLM-as-a-Judge Scorers: Employ LLMs to evaluate response reliability, customizable through prompt engineering.
🔀 Ensemble Scorers: Combine multiple scorers for robust and flexible uncertainty/confidence estimates.
Installation:
Install the latest version from PyPI with:
    pip install uqlm
Documentation and Demos:
Visit our documentation site for detailed instructions, API references, and demo notebooks showcasing various hallucination detection methods. The following demo notebooks are available:
- Black-Box Uncertainty Quantification: A notebook demonstrating hallucination detection with black-box (consistency) scorers.
- White-Box Uncertainty Quantification: A notebook demonstrating hallucination detection with white-box (token probability-based) scorers.
- LLM-as-a-Judge: A notebook demonstrating hallucination detection with LLM-as-a-Judge.
- Tunable UQ Ensemble: A notebook demonstrating hallucination detection with a tunable ensemble of UQ scorers (Bouchard & Chauhan, 2023).
- Off-the-Shelf UQ Ensemble: A notebook demonstrating hallucination detection using BS Detector (Chen & Mueller, 2023) off-the-shelf ensemble.
Associated Research:
Our companion paper provides a technical description of the UQLM scorers and extensive experimental results, introducing a novel, tunable ensemble approach.