Release Notes#

v0.5.10#

Released on 2026-04-15 - GitHub - PyPI

Full Changelog: v0.5.9...v0.5.10

v0.5.9#

Released on 2026-04-10 - GitHub - PyPI

What's Changed

Full Changelog: v0.5.8...v0.5.9

v0.5.8#

Released on 2026-03-30 - GitHub - PyPI

Highlights

  • fix noncontradiction calculation for longform scorers
  • fix semantic sets confidence bounds
  • dependabot security alerts

What's Changed

Full Changelog: v0.5.7...v0.5.8

v0.5.7#

Released on 2026-03-13 - GitHub - PyPI

What's Changed

Full Changelog: v0.5.6...v0.5.7

v0.5.6#

Released on 2026-03-02 - GitHub - PyPI

Highlights

  • package upgrades from dependabot
  • update badge colors and links
  • update citation information

What's Changed

Full Changelog: v0.5.5...v0.5.6

v0.5.5#

Released on 2026-02-25 - GitHub - PyPI

Highlights

  • replace agenerate with ainvoke where generation is used, since ainvoke appears to be the better-maintained and better-documented LangChain method
  • replace poetry with uv for dependency management
  • update badges

What's Changed

Full Changelog: v0.5.4...v0.5.5

v0.5.4#

Released on 2026-01-30 - GitHub - PyPI

Highlights

1. Add new white-box scorers to UQEnsemble accepted scorers list:

Top-logprobs scorers (3):

  • min_token_negentropy - Minimum negentropy across tokens
  • mean_token_negentropy - Average negentropy across tokens
  • probability_margin - Mean difference between top-2 token probabilities

Sampled-logprobs scorers (4):

  • semantic_negentropy - Entropy based on semantic clustering
  • semantic_density - Density-based confidence measure
  • monte_carlo_probability - Average sequence probability across samples
  • consistency_and_confidence - Cosine similarity × response probability

P(True) scorer (1):

  • p_true - LLM's estimate of P(response is true)
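To make two of the top-logprobs scorer descriptions above concrete, here is a minimal, self-contained sketch of the underlying computations. This is not the library's implementation; in particular, normalizing negentropy by log K is an assumption made for illustration.

```python
import math

def probability_margin(top_logprobs):
    """Mean gap between the two most likely tokens at each position.

    top_logprobs: per-token lists of top-K logprobs (K >= 2).
    Larger gaps suggest higher confidence.
    """
    margins = []
    for token_lps in top_logprobs:
        p = sorted((math.exp(lp) for lp in token_lps), reverse=True)
        margins.append(p[0] - p[1])
    return sum(margins) / len(margins)

def mean_token_negentropy(top_logprobs):
    """Average normalized negentropy (1 - H / log K) across tokens,
    where H is the entropy of the top-K probabilities renormalized
    to sum to 1. The log(K) normalization is an assumption here."""
    scores = []
    for token_lps in top_logprobs:
        p = [math.exp(lp) for lp in token_lps]
        total = sum(p)
        p = [x / total for x in p]
        h = -sum(x * math.log(x) for x in p if x > 0)
        scores.append(1 - h / math.log(len(p)))
    return sum(scores) / len(scores)

# A peaked token distribution yields a high margin and negentropy;
# a flat one yields low values for both:
confident = [[math.log(0.98), math.log(0.01)]]
uncertain = [[math.log(0.5), math.log(0.5)]]
```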

2. Fix embeddings model specification for cosine_sim and consistency_and_confidence, and enable it with WhiteBoxUQ

Corrects a string error in embedding model specification with the sentence_transformer parameter of BlackBoxUQ. Previously, the string was forced to begin with "sentence-transformers/"; now the full string is specified with the parameter.

Previous: sentence_transformer=all-MiniLM-L12-v2 was specified and then "sentence-transformers/" was prepended to the string when storing the class attribute.

Now: sentence_transformer=sentence-transformers/all-MiniLM-L12-v2 is specified. This allows other embedding models whose names don't start with "sentence-transformers/", such as jinaai/jina-embeddings-v2-base-code, to be specified.

Also adds the missing sentence_transformer parameter to WhiteBoxUQ.
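For intuition, cosine_sim compares the embedding of the original response to embeddings of sampled responses. A minimal sketch of that comparison using hypothetical embedding vectors (independent of any embeddings library, and not the library's implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_confidence(original_emb, sampled_embs):
    """Average similarity of the original response's embedding to
    each sampled response's embedding; higher means more consistent
    sampled responses, hence higher confidence."""
    sims = [cosine_similarity(original_emb, e) for e in sampled_embs]
    return sum(sims) / len(sims)
```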

What's Changed

Full Changelog: v0.5.3...v0.5.4

v0.5.3#

Released on 2026-01-20 - GitHub - PyPI

Highlights

  • added new demo notebook to illustrate the langgraph-uqlm integration
  • upgrade package versions per dependabot
  • fix some LaTeX in docs site
  • fix links in readme

What's Changed

New Contributors

Full Changelog: v0.5.2...v0.5.3

v0.5.2#

Released on 2026-01-14 - GitHub - PyPI

Highlights

  • Create uqlm.nli.EntailmentClassifier class for LLM-based entailment classification. This is well-suited for long-text scoring when responses exceed the length that can be handled by the Hugging Face NLI model
  • Update LongTextGraph, LongTextUQ, UnitResponseScorer, GraphScorer, and associated notebooks to allow for LLM-based entailment classification.
  • Update unit tests
  • Misc. docs site cleanup

What's Changed

Full Changelog: v0.5.1...v0.5.2

v0.5.1#

Released on 2026-01-09 - GitHub - PyPI

Highlights

  • fixes rendering of long-form scorer content on the docs site
  • adds missing uqlm/longform subpackage to pyproject.toml so it appears in API reference on docs site
  • misc. docs site cleanup

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0#

Released on 2026-01-08 - GitHub - PyPI

New Methods: Long-Form UQ

Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.

Response Decomposition

We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
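As a rough illustration of the rule-based sentence path, splitting on sentence-ending punctuation is the core idea. This sketch is a simplification, not the library's decomposer, which handles real-world text more robustly:

```python
import re

def decompose_sentences(response: str) -> list[str]:
    """Rule-based sentence decomposition: split on sentence-ending
    punctuation followed by whitespace. Abbreviations, quotes, and
    other edge cases would need extra handling in practice."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]
```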

Scoring methods

We add three families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, and Unit-QA

1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)

These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
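Conceptually, given binary entailment judgments from an NLI model or LLM classifier, the unit-level confidence is an average across sampled responses. A toy sketch of that aggregation (not the library's implementation):

```python
def unit_response_scores(entailment):
    """entailment[i][j] = 1 if sampled response j entails unit i,
    else 0 (an NLI model would supply these judgments). The
    confidence of unit i is its mean entailment across samples."""
    return [sum(row) / len(row) for row in entailment]
```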

2. Matched-Unit (Based on the LUQ-pair method)

These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.

3. Unit-QA (Based on the Longform Semantic Entropy method)

These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are those units given the context, sampling multiple answers, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.

4. Graph-Based (Based on Jiang et al., 2024)

Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.

Response Refinement with Uncertainty-Aware Decoding

Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer parameter) below a specified threshold (specified with response_refinement_threshold parameter) and reconstructing the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
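At a high level, the refinement step can be sketched as follows. This is a simplification of the library's behavior, and the claims and scores used in the test are hypothetical:

```python
def refine_response(claims, scores, threshold=0.5):
    """Drop claims whose confidence score falls below `threshold`
    and reconstruct the response from the retained claims. The real
    implementation may reconstruct the text more carefully."""
    kept = [c for c, s in zip(claims, scores) if s >= threshold]
    return " ".join(kept)
```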


Performance Evaluation

We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a response generated for a FactScore question against the corresponding text of the subject's Wikipedia article.

New docs site pages

We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.

Other changes

  • uqlm.scorers has been refactored into two subpackages: uqlm.scorers.shortform (which contains the existing scorer classes as of v0.4) and uqlm.scorers.longform (which contains classes implementing the scoring methods described above)
  • the readme has been updated to reflect the new long-form scorers, and a new readme has been added inside the examples/ directory to provide more details on the available tutorials
  • various package upgrades to address security vulnerabilities identified by dependabot

Breaking changes

  • normalized_probability has been deprecated from acceptable white-box scorer list in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (default). This also affects the key/column names in the returned UQResult object.
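For reference, sequence_probability with length_normalize=True computes the geometric mean of token probabilities, which is what normalized_probability returned. A minimal sketch of the math (not the library's code):

```python
import math

def sequence_probability(logprobs, length_normalize=True):
    """Joint probability of a generated sequence from per-token
    logprobs. With length_normalize=True this is the geometric mean
    of token probabilities (what normalized_probability returned);
    with False it is the raw product of token probabilities."""
    total = sum(logprobs)
    if length_normalize:
        total /= len(logprobs)
    return math.exp(total)
```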

What's Changed

Full Changelog: v0.4.5...v0.5.0

v0.4.5#

Released on 2025-12-08 - GitHub - PyPI

Highlights

  • fix bug in model name string checking when retrieving logprobs, per issue #284

What's Changed

Full Changelog: v0.4.4...v0.4.5

v0.4.4#

Released on 2025-12-04 - GitHub - PyPI

Highlights

  • Add max_length parameter to WhiteBoxUQ to avoid CUDA OutOfMemoryError
  • Update demo and docstring accordingly

What's Changed

Full Changelog: v0.4.3...v0.4.4

v0.4.3#

Released on 2025-12-03 - GitHub - PyPI

Highlights

  • Automated torch.device selection for all scorers that use NLI or cosine similarity
  • Updates to docstring and demo notebooks to reflect the above change

What's Changed

Full Changelog: v0.4.2...v0.4.3

v0.4.2#

Released on 2025-11-21 - GitHub - PyPI

Highlights

  • Fix broken links to notebooks on docs site
  • Update PyPI readme

What's Changed

Full Changelog: v0.4.1...v0.4.2

v0.4.1#

Released on 2025-11-14 - GitHub - PyPI

Highlights

  • Relax langchain>=1.0.0 restriction to allow for >=0.3.7
  • Fix notebook links on docs site

What's Changed

Full Changelog: v0.4.0...v0.4.1

v0.4.0#

Released on 2025-11-12 - GitHub - PyPI

Highlights

1. Varied tutorials for more model and dataset coverage

We have updated the example notebooks to have broader coverage over LLMs and example datasets.

LLMs

  • Gemini models
  • GPT-4* models
  • o3-mini
  • Qwen
  • Mistral
  • Llama
  • Deepseek

Datasets

  • GSM8K
  • SVAMP
  • PopQA
  • NQ-Open
  • AI2-ARC
  • CSQA
  • SimpleQA
  • HotpotQA
  • Image (multimodal demo)

2. New scorers added

This release includes the addition of 11 new scorers spanning various categories (with accompanying unit tests). Details are provided below.

White-Box scorers

We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.

Single-generation white-box scorers
Sampling-based white-box scorers
Reflexive white-box scorers

Black-Box scorers

We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.

Definitions of new scorers are provided with LaTeX at the end of applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers. The readme has also been updated to reflect the new scorers.

3. New LLMGrader class and updated default grader for UQEnsemble

This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:

  • in the example notebooks for evaluating hallucination detection performance.
  • as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.

4. Option to provide additional context to LLM judges

Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.

5. New datasets available with load_example_dataset

The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.

6. uqlm.nli sub-package

Created uqlm.nli sub-package that contains the following:

  • NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations are respectively moved to uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes.
  • SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)

7. uqlm.white_box sub-package

Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:

  • SingleLogprobsScorer for computing scores that depend only on logprobs from one generated response: normalized probability, sequence probability, minimum probability
  • TopLogprobsScorer for computing scores that depend on top-K logprobs from a generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
  • SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: monte carlo probability, CoCoA, semantic entropy, and semantic density
  • PTrueScorer for implementing the P(True) method

8. Minor changes & future deprecations

  • Renamed NLIScorer -> ConsistencyScorer and moved some methods to uqlm.nli.NLI class
  • normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize=True. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
  • system_prompt and template_ques_ans are deprecated in favor of additional_context parameter
  • default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.4.0

v0.3.1#

Released on 2025-10-23 - GitHub - PyPI

Highlights

  • remove unused uqlm/utils/calibation.py
  • fix docs site to include uqlm/calibration
  • fix bug related to rich progress bars
  • add baseline reference to plot_ranked_auc function

What's Changed

Full Changelog: v0.3.0...v0.3.1

v0.3.0#

Released on 2025-10-01 - GitHub - PyPI

1. Dataset-specific confidence score calibration

  • Introduced the new ScoreCalibrator class for calibrating confidence scores on specific datasets (Platt or Isotonic)
  • Includes evaluate_calibration function for evaluating score calibration with plots and various metrics, including ECE, MCE, Brier Score, Calibration Gap, and log-loss
  • For a detailed walkthrough of this feature, please refer to the demo notebook
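As an illustration of one of the calibration metrics listed above, ECE bins predictions by confidence and averages the accuracy-confidence gap, weighted by bin size. A minimal sketch (not the library's implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence,
    then average |accuracy - mean confidence| across bins, weighted
    by the fraction of predictions in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```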

2. Enabled use of LangChain BaseMessage with prompts argument

  • Added support for List[List[BaseMessage]] alongside the existing List[str] format for prompts argument of generate_and_score method in the following classes:
    • UQEnsemble
    • BlackBoxUQ
    • WhiteBoxUQ
    • SemanticEntropy
  • This enhancement enables uncertainty quantification and hallucination detection with:
    • Multimodal inputs (e.g. image)
    • Chat history
    • Various message types (HumanMessage, AIMessage, SystemMessage)
  • Note: This feature is currently in Beta and is not compatible with LLM judges (LLMPanel or judge components of UQEnsemble)
  • For a detailed walkthrough of this feature, please refer to the demo notebook

3. LLM Judge explanations

  • Enhanced the LLMPanel class to provide explanations alongside scores
  • Judges can now justify their evaluations with detailed reasoning
  • Specified with boolean parameter explanations

4. Benchmark Dataset Extension

  • Added support for the FactScore benchmark dataset via the load_example_dataset function
  • Enables evaluation of long-form question answering capabilities in LLMs

5. Updated utility plotting functions

  • Added plot_ranked_auc option to compute AUPRC (rather than AUROC only) and rank scorers in a color-coded bar plot (as seen in our research paper). Added the missing legend to this function.

6. Bug Fixes

  • Fixed the LiveError issue that occurred with rich progress bars when retrying after code interruption
  • Removed unused images for docs site
  • Added missing unit tests for utility plotting functions
  • Updated demo notebooks to use non-deprecated LLMs (gemini-1.5-flash -> gemini-2.5-flash)

What's Changed

New Contributors

Full Changelog: v0.2.7...v0.3.0

v0.2.7#

Released on 2025-09-12 - GitHub - PyPI

Highlights

  • New utility plotting functions:
    • plot_ranked_auc to compute AUPRC (rather than AUROC only) and rank scorers in a color-coded bar plot (as seen in our research paper)
    • plot_filtered_accuracy to compute scorer-specific filtered LLM accuracy at various confidence thresholds (as seen in our research paper)
  • Automated Docs site build
  • Breaking change: UQResult import statement is changed to the following:
    • Previous import: from uqlm.scorers.baseclass.uncertainty import UQResult
    • New import: from uqlm.utils.results import UQResult

What's Changed

New Contributors

Full Changelog: v0.2.6...v0.2.7

v0.2.6#

Released on 2025-08-28 - GitHub - PyPI

Highlights

  • Remove unused attributes in UQEnsemble that were creating a bug with LLMPanel.score
  • Fix alignment in uqlm.utils.plot_model_accuracies function and enable displaying sample sizes as percentages

What's Changed

Full Changelog: v0.2.5...v0.2.6

v0.2.5#

Released on 2025-08-25 - GitHub - PyPI

Highlights

  • Add missing num_responses parameter to generate_candidate_responses method in BlackBoxUQ, SemanticEntropy, and UQEnsemble.
  • Add missing fields/links to pyproject.toml

What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4#

Released on 2025-08-13 - GitHub - PyPI

Highlights

  • Enable specification of LLM Judge scoring templates in UQEnsemble with scoring_templates argument.
  • Enable specification of postprocessed response return options in UQEnsemble: return only raw responses, return only postprocessed responses, or return both.

What's Changed

Full Changelog: v0.2.3...v0.2.4

v0.2.3#

Released on 2025-08-05 - GitHub - PyPI

Highlights

  • Replaces use of bert_score.score with bert_score.BERTScorer.score for a ~43x speedup. While the former (old approach) re-checks and re-assigns torch.device with each use of score, the latter (updated approach) assigns torch.device only once during instantiation.
  • Creates the option for users to specify whether they want only postprocessed responses, only raw responses, or both versions when they specify a postprocessor. This applies to BlackBoxUQ, UQEnsemble, and SemanticEntropy. To do so, users can respectively specify 'postprocessed', 'raw', or 'all' in the 'return_responses' argument in the constructor of these classes. By default, 'all' is specified.
  • [black] is removed where specified in rich print statements to avoid inconsistent colors in progress bars.

What's Changed

Full Changelog: v0.2.2...v0.2.3

v0.2.2#

Released on 2025-08-01 - GitHub - PyPI

Highlights

  • improved handling of missing logprobs
  • adds warning when logprobs missing
  • removes benign transformers warning for NLI instantiation and BERTScore scoring
  • Add flaky and skip logic for unit tests to avoid benign failures
  • Fix escape character usage in Judge prompt
  • Update version of LangChain per Dependabot suggestion

What's Changed

Full Changelog: v0.2.1...v0.2.2

v0.2.1#

Released on 2025-07-28 - GitHub - PyPI

Highlights

  • If exception is raised during generation (e.g. RateLimitError), the progress bar is stopped to avoid LiveError upon retry.
  • Fix BERTScore printed text
  • Fix Ensemble diagram for dark mode
  • Fixes missing max_calls_per_min being passed to LLMPanel constructor inside of UQEnsemble. After this fix, max_calls_per_min will be applied to ensemble judges as well.
  • Add flaky retry logic using @pytest.mark.flaky(retries=3) to tests that fail due to network issues related to HuggingFace.
  • Fix handling of missing logprobs with multiple responses in UQEnsemble

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0#

Released on 2025-07-25 - GitHub - PyPI

These release notes are for minor release v0.2.0.

New Features

1. Progress bars with rich

This feature enables the use of progress bars when generating LLM responses, scoring responses, and tuning ensemble weights. It introduces rich and ipywidgets as new dependencies.
By default, progress bars are turned on, but users can turn them off by setting show_progress_bars=False in the generate_and_score, score, and tune methods of the scorer classes.

2. Ensemble weights printing

After running the UQEnsemble.tune method, ensemble weights are now printed in a pretty table using rich, sorted from highest to lowest. Users can also display this table for an already-tuned ensemble using the UQEnsemble.print_weights method.

3. Support for Python 3.13

As of v0.2.0, uqlm can now be used with Python 3.13. All previous functionality is supported except for bleurt, which is not compatible with Python 3.13.

4. Ensemble saving and loading

UQEnsemble now offers two new methods: save_config and load_config. These methods offer user-friendly saving and loading of ensemble scorer components and weights.

Example use of ensemble saving:

uqe_tuned_config_file = "uqe_config_tuned.json"
uqe.save_config(uqe_tuned_config_file)

Example use of ensemble loading:

loaded_ensemble = UQEnsemble.load_config("uqe_config_tuned.json")

These methods make it easier to store a tuned ensemble for later use.

5. Token-probability-based Semantic Entropy

The SemanticEntropy class now supports token-probability-based estimates of semantic entropy and associated confidence scores. Note that attribute names in the returned object and column names in the associated dataframe have changed from those in v0.1.
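Conceptually, once responses are clustered by meaning, semantic entropy is the entropy of the probability mass assigned to each cluster. A toy sketch using hypothetical cluster probability masses (not the library's implementation):

```python
import math

def semantic_entropy(cluster_probs):
    """Token-probability-based semantic entropy: given each semantic
    cluster's total sequence-probability mass, normalize the masses
    and compute the entropy over clusters. One cluster holding all
    the mass means zero entropy, i.e. high confidence."""
    total = sum(cluster_probs)
    probs = [p / total for p in cluster_probs]
    return -sum(p * math.log(p) for p in probs if p > 0)
```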

Breaking Changes

1. BLEURT Deprecation

This release deprecates BLEURT as a black-box scorer. The following code will now produce errors:

  • Use of uqlm.black_box.BLEURTScorer
  • Use of "bleurt" in uqlm.scorers.BlackBoxUQ scorers parameter
  • Use of "bleurt" in uqlm.scorers.UQEnsemble scorers parameter

v0.1.9#

Released on 2025-07-22 - GitHub - PyPI

What's Changed

Full Changelog: v0.1.8...v0.1.9

v0.1.8#

Released on 2025-07-02 - GitHub - PyPI

Highlights

  • update version of pillow per Dependabot security alert

What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7#

Released on 2025-06-25 - GitHub - PyPI

Highlights

  • Fixes bug related to floating point precision causing ensemble score greater than 1 (1.00000002). This was throwing an error when certain tuner metrics were being computed. Patched with np.clip.
  • Allow use of brier_score and average_precision with Tuner and UQEnsemble

What's Changed

Full Changelog: v0.1.6...v0.1.7

v0.1.6#

Released on 2025-06-19 - GitHub - PyPI

Highlights

  • Add missing unit tests
  • Update version of urllib3 per Dependabot security alert

What's Changed

Full Changelog: v0.1.5...v0.1.6

v0.1.5#

Released on 2025-06-18 - GitHub - PyPI

Highlights

  • add missing unit tests to achieve 100% code coverage
  • implement auto-linting/formatting with ruff
  • reduce Tuner (and UQEnsemble.tune) latency (no API changes)
  • allow likert option for judges

What's Changed

New Contributors

Full Changelog: v0.1.4...v0.1.5

v0.1.4#

Released on 2025-06-11 - GitHub - PyPI

What's Changed

Full Changelog: v0.1.2...v0.1.4

v0.1.3#

Released on 2025-06-02 - GitHub - PyPI

Highlights

  • upgrade tornado version per dependabot

What's Changed

Full Changelog: v0.1.2...v0.1.3

v0.1.2#

Released on 2025-05-14 - GitHub - PyPI

Highlights

  • streamline workflow for LLMPanel by enabling scoring template specification in the constructor
  • update LLMPanel demo
  • fix typos in readme
  • update readme badges
  • fix bleurt error message typo

What's Changed

Full Changelog: v0.1.1...v0.1.2

v0.1.1#

Released on 2025-05-12 - GitHub - PyPI

Highlights

  • Restore missing argument, thresh_objective, for UQEnsemble

What's Changed

Full Changelog: v0.1.0...v0.1.1

v0.1.0#

Released on 2025-05-06 - GitHub - PyPI

UQLM v0.1.0 Release Notes

Introducing UQLM: Uncertainty Quantification for Language Models. UQLM is a Python library for detecting LLM hallucinations using state-of-the-art uncertainty quantification techniques.

Highlights

Comprehensive Scorer Suite

UQLM offers a versatile suite of response-level scorers, each providing a confidence score to indicate the likelihood of errors or hallucinations. The scorers are categorized into four main types:

🎯 Black-Box Scorers: Assess uncertainty through response consistency, compatible with any LLM.

🎲 White-Box Scorers: Utilize token probabilities for faster and cost-effective uncertainty estimation.

⚖️ LLM-as-a-Judge Scorers: Employ LLMs to evaluate response reliability, customizable through prompt engineering.

🔀 Ensemble Scorers: Combine multiple scorers for robust and flexible uncertainty/confidence estimates.

Installation:

Install the latest version from PyPI with:

pip install uqlm

Documentation and Demos:

Visit our documentation site for detailed instructions, API references, and demo notebooks showcasing various hallucination detection methods.

Associated Research:

Our companion paper provides a technical description of the UQLM scorers and extensive experimental results, introducing a novel, tunable ensemble approach.