uqlm.scorers.ensemble.UQEnsemble#

class uqlm.scorers.ensemble.UQEnsemble(llm=None, scorers=None, device=None, postprocessor=None, system_prompt='You are a helpful assistant.', max_calls_per_min=None, use_n_param=False, thresh=0.5, weights=None, nli_model_name='microsoft/deberta-large-mnli', use_best=True, sampling_temperature=1.0, max_length=2000, verbose=False)#

Bases: UncertaintyQuantifier

__init__(llm=None, scorers=None, device=None, postprocessor=None, system_prompt='You are a helpful assistant.', max_calls_per_min=None, use_n_param=False, thresh=0.5, weights=None, nli_model_name='microsoft/deberta-large-mnli', use_best=True, sampling_temperature=1.0, max_length=2000, verbose=False)#

Class for detecting bad and speculative answers from a pretrained Large Language Model (LLM hallucination detection).

Parameters:
  • llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel. User is responsible for specifying temperature and other relevant parameters to the constructor of their llm object.

  • scorers (List containing instances of BaseChatModel, LLMJudge, black-box scorer names from ['semantic_negentropy', 'noncontradiction', 'exact_match', 'bert_score', 'bleurt', 'cosine_sim'], or white-box scorer names from ['normalized_probability', 'min_probability'], default=None) – Specifies which UQ components to include. If None, defaults to the off-the-shelf BS Detector ensemble by Chen and Mueller (2023) [1], which uses the components ['noncontradiction', 'exact_match', 'self_reflection'] with respective weights of [0.56, 0.14, 0.3].

  • device (str or torch.device, default="cpu") – Specifies the device that the NLI model uses for prediction. Only applies to the 'semantic_negentropy' and 'noncontradiction' scorers. Pass a torch.device to leverage GPU.

  • postprocessor (callable, default=None) – A user-defined function that takes a string input and returns a string. Used for postprocessing outputs.

  • use_best (bool, default=True) – Specifies whether to swap the original response for the uncertainty-minimized response based on semantic entropy clusters.

  • sampling_temperature (float, default=1.0) – The temperature parameter used by the LLM to generate sampled responses. Must be greater than 0.

  • system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for the user to provide a custom system prompt.

  • max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid a rate-limit error. By default, no limit is applied.

  • use_n_param (bool, default=False) – Specifies whether to use the n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If supported, it substantially speeds up generation when num_responses > 1.

  • weights (list of floats, default=None) – Specifies the weight for each component in the ensemble. If None and scorers is not None, each component receives equal weight. If scorers is None, defaults to the off-the-shelf BS Detector ensemble by Chen and Mueller (2023) [1], which uses the components ['noncontradiction', 'exact_match', 'self_reflection'] with respective weights of [0.56, 0.14, 0.3].

  • nli_model_name (str, default="microsoft/deberta-large-mnli") – Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained()

  • max_length (int, default=2000) – Specifies the maximum allowed string length. Responses longer than this value will be truncated to avoid OutOfMemoryError

  • verbose (bool, default=False) – Specifies whether to print the index of response currently being scored.
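The thresh and weights parameters interact as follows: component scores are combined into a weighted average, and the threshold converts the resulting confidence into a reliability decision. A minimal pure-Python sketch of that idea (not the uqlm implementation; the function name and inputs are hypothetical):

```python
# Illustrative sketch, not the uqlm implementation: an ensemble
# confidence score as a weighted average of component scorer outputs,
# with `thresh` converting it into a reliability decision.
from typing import Dict


def ensemble_confidence(component_scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-component confidence scores in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * component_scores[name] for name in weights) / total


# Default BS Detector components and weights from the parameter docs above.
weights = {"noncontradiction": 0.56, "exact_match": 0.14, "self_reflection": 0.3}
scores = {"noncontradiction": 0.9, "exact_match": 0.8, "self_reflection": 0.7}

confidence = ensemble_confidence(scores, weights)
is_reliable = confidence >= 0.5  # the `thresh` parameter (default 0.5)
```

Because the default weights sum to 1.0, the confidence here is simply 0.56·0.9 + 0.14·0.8 + 0.3·0.7.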

Methods

__init__([llm, scorers, device, ...])

Class for detecting bad and speculative answers from a pretrained Large Language Model (LLM hallucination detection).

generate_and_score(prompts[, num_responses])

Generate LLM responses from provided prompts and compute confidence scores.

generate_candidate_responses(prompts)

This method generates multiple responses for uncertainty estimation.

generate_original_responses(prompts)

This method generates original responses for uncertainty estimation.

score(prompts, responses[, ...])

Compute confidence scores on provided LLM responses.

tune(prompts, ground_truth_answers[, ...])

Generate responses from provided prompts, grade responses with provided grader function, and tune ensemble weights.

tune_from_graded(correct_indicators[, ...])

Tunes weights and threshold parameters on a set of user-provided graded responses.

async generate_and_score(prompts, num_responses=5)#

Generate LLM responses from provided prompts and compute confidence scores.

Parameters:
  • prompts (list of str) – A list of input prompts for the model.

  • num_responses (int, default=5) – The number of sampled responses used to compute consistency.

Returns:

Instance of UQResult, containing data (prompts, responses, and confidence scores) and metadata

Return type:

UQResult

async generate_candidate_responses(prompts)#

This method generates multiple responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[List[str]]

async generate_original_responses(prompts)#

This method generates original responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[str]
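Both generation methods apply the user-supplied postprocessor, which can be any callable mapping a string to a string, to each response. A hypothetical example (not part of uqlm) that normalizes responses so consistency scorers are not tripped up by case or trailing punctuation:

```python
# Illustrative sketch: a `postprocessor` is any str -> str callable,
# applied to each generated response. This hypothetical example
# normalizes case, whitespace, and a trailing period.
def normalize_response(text: str) -> str:
    """Lowercase, strip surrounding whitespace, and drop a trailing period."""
    text = text.strip().lower()
    return text[:-1] if text.endswith(".") else text


cleaned = normalize_response("  The answer is Paris.  ")
```

Passing such a callable as postprocessor ensures that superficially different but semantically identical responses compare as equal.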

async score(prompts, responses, sampled_responses=None, logprobs_results=None)#

Compute confidence scores on provided LLM responses.

Parameters:
  • prompts (list of str) – A list of input prompts for the model.

  • responses (list of str) – A list of model responses for the prompts.

  • sampled_responses (list of list of str, default=None) – A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses. Must be provided if using black box scorers.

  • logprobs_results (list of logprobs_result, default=None) – List of lists of dictionaries, each returned by BaseChatModel.agenerate. Must be provided if using white box scorers.

Returns:

Instance of UQResult, containing data (prompts, responses, and confidence scores) and metadata

Return type:

UQResult
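The sampled_responses argument feeds the black-box consistency scorers. As a minimal sketch of the idea behind one of them (not the uqlm implementation), the 'exact_match' score for a response is the fraction of sampled responses that match it exactly:

```python
# Illustrative sketch, not the uqlm implementation: the 'exact_match'
# black-box consistency idea. The score is the fraction of sampled
# responses that exactly match the original response.
from typing import List


def exact_match_score(response: str, sampled_responses: List[str]) -> float:
    matches = sum(1 for s in sampled_responses if s.strip() == response.strip())
    return matches / len(sampled_responses)


# 4 of 5 samples agree with the original response.
score = exact_match_score("Paris", ["Paris", "Paris", "Lyon", "Paris", "Paris"])
```

The other black-box scorers replace exact string equality with softer similarity measures (NLI non-contradiction, BERTScore, cosine similarity, etc.).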

async tune(prompts, ground_truth_answers, grader_function=None, num_responses=5, weights_objective='roc_auc', thresh_bounds=(0, 1), n_trials=100, step_size=0.01, fscore_beta=1)#

Generate responses from provided prompts, grade responses with provided grader function, and tune ensemble weights.

Parameters:
  • prompts (list of str) – A list of input prompts for the model.

  • ground_truth_answers (list of str) – A list of ideal (correct) responses

  • grader_function (function(response: str, answer: str) -> bool, default=None) – A user-defined function that takes a response and a ground truth answer and returns a boolean indicator of whether the response is correct. If not provided, Vectara's HHEM is used: https://huggingface.co/vectara/hallucination_evaluation_model

  • num_responses (int, default=5) – The number of sampled responses used to compute consistency.

  • weights_objective ({'fbeta_score', 'accuracy_score', 'balanced_accuracy_score', 'roc_auc', 'log_loss'}, default='roc_auc') – Objective function for optimization of the ensemble weights. Must match thresh_objective if one of 'fbeta_score', 'accuracy_score', 'balanced_accuracy_score'. If the same as thresh_objective, joint optimization is done.

  • thresh_bounds (tuple of floats, default=(0,1)) – Bounds to search for threshold

  • thresh_objective ({'fbeta_score', 'accuracy_score', 'balanced_accuracy_score', 'roc_auc', 'log_loss'}, default='fbeta_score') – Objective function for threshold optimization via grid search.

  • n_trials (int, default=100) – Indicates how many trials to search over with optuna optimizer

  • step_size (float, default=0.01) – Indicates step size in grid search, if used

  • fscore_beta (float, default=1) – Value of beta in fbeta_score

Return type:

UQResult
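The threshold search described by thresh_bounds and step_size can be pictured as a grid search that keeps whichever threshold best separates graded correct from incorrect responses. A self-contained sketch (not the uqlm implementation; uqlm additionally tunes weights via optuna):

```python
# Illustrative sketch, not the uqlm implementation: grid search for the
# decision threshold, stepping across `thresh_bounds` in increments of
# `step_size` and keeping the threshold with the best accuracy against
# graded correct/incorrect labels (cf. `tune_from_graded`).
from typing import List, Tuple


def tune_threshold(confidences: List[float], correct: List[bool],
                   thresh_bounds: Tuple[float, float] = (0.0, 1.0),
                   step_size: float = 0.01) -> float:
    lo, hi = thresh_bounds
    n_steps = int(round((hi - lo) / step_size))
    best_thresh, best_acc = lo, -1.0
    for i in range(n_steps + 1):
        t = lo + i * step_size
        preds = [c >= t for c in confidences]
        acc = sum(p == y for p, y in zip(preds, correct)) / len(correct)
        if acc > best_acc:  # keep the first (lowest) threshold at the best accuracy
            best_thresh, best_acc = t, acc
    return best_thresh


# High-confidence answers were graded correct; low-confidence ones incorrect.
thresh = tune_threshold([0.9, 0.7, 0.4, 0.1], [True, True, False, False])
```

Here any threshold in (0.4, 0.7] classifies all four examples correctly, and the grid search returns the lowest such grid point.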

tune_from_graded(correct_indicators, weights_objective='roc_auc', thresh_bounds=(0, 1), n_trials=100, step_size=0.01, fscore_beta=1)#

Tunes weights and threshold parameters on a set of user-provided graded responses.

Parameters:
  • correct_indicators (list of bool) – A list of boolean indicators of whether self.responses are correct.

  • weights_objective ({'fbeta_score', 'accuracy_score', 'balanced_accuracy_score', 'roc_auc', 'log_loss'}, default='roc_auc') – Objective function for optimization of the ensemble weights. Must match thresh_objective if one of 'fbeta_score', 'accuracy_score', 'balanced_accuracy_score'. If the same as thresh_objective, joint optimization is done.

  • thresh_bounds (tuple of floats, default=(0,1)) – Bounds to search for threshold

  • thresh_objective ({'fbeta_score', 'accuracy_score', 'balanced_accuracy_score', 'roc_auc', 'log_loss'}, default='fbeta_score') – Objective function for threshold optimization via grid search.

  • n_trials (int, default=100) – Indicates how many candidates to search over with optuna optimizer

  • step_size (float, default=0.01) – Indicates step size in grid search, if used

  • fscore_beta (float, default=1) – Value of beta in fbeta_score

Return type:

UQResult

References

[1] Chen, Jiuhai, and Jonas Mueller. "Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness." 2023.