uqlm.scorers.black_box.BlackBoxUQ#

class uqlm.scorers.black_box.BlackBoxUQ(llm=None, scorers=None, device=None, use_best=True, nli_model_name='microsoft/deberta-large-mnli', sentence_transformer='all-MiniLM-L6-v2', postprocessor=None, system_prompt=None, max_calls_per_min=None, sampling_temperature=1.0, return_responses='all', use_n_param=False, max_length=2000, verbose=False)#

Bases: ShortFormUQ

__init__(llm=None, scorers=None, device=None, use_best=True, nli_model_name='microsoft/deberta-large-mnli', sentence_transformer='all-MiniLM-L6-v2', postprocessor=None, system_prompt=None, max_calls_per_min=None, sampling_temperature=1.0, return_responses='all', use_n_param=False, max_length=2000, verbose=False)#

Class for black box uncertainty quantification. Leverages multiple responses to the same prompt to evaluate consistency as an indicator of hallucination likelihood.

Parameters:

llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel. User is responsible for specifying temperature and other relevant parameters to the constructor of their llm object.
scorers (List[str], default=None) – Specifies which black box (consistency) scorers to include. If None, defaults to [“semantic_negentropy”, “noncontradiction”, “exact_match”, “cosine_sim”, “entailment”, “semantic_sets_confidence”]. The bleurt scorer is deprecated as of v0.2.0.
device (str or torch.device input or torch.device object, default="cpu") – Specifies the device that NLI model use for prediction. Only applies to ‘semantic_negentropy’, ‘noncontradiction’, ‘entailment’, ‘semantic_sets_confidence’ scorers. If None, detects and returns the best available PyTorch device. Prioritizes CUDA (NVIDIA GPU), then MPS (macOS), then CPU.
use_best (bool, default=True) – Specifies whether to swap the original response for the uncertainty-minimized response based on semantic entropy clusters. Only used if scorers includes ‘semantic_negentropy’, ‘noncontradiction’, ‘entailment’, or ‘semantic_sets_confidence’.
nli_model_name (str, default="microsoft/deberta-large-mnli") – Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained()
sentence_transformer (str, default="all-MiniLM-L6-v2") – Specifies which huggingface sentence transformer to use when computing cosine similarity. See https://huggingface.co/sentence-transformers?sort_models=likes#models for more information. The recommended sentence transformer is ‘all-MiniLM-L6-v2’.
postprocessor (callable, default=None) – A user-defined function that takes a string input and returns a string. Used for postprocessing outputs before black-box comparisons.
return_responses (str, default="all") – If a postprocessor is used, specifies whether to return only postprocessed responses, only raw responses, or both. Specified with ‘postprocessed’, ‘raw’, or ‘all’, respectively.
system_prompt (str, default=None) – Optional argument for user to provide custom system prompt. If prompts are list of strings and system_prompt is None, defaults to “You are a helpful assistant.”
max_calls_per_min (int, default=None) – Specifies how many api calls to make per minute to avoid a rate limit error. By default, no limit is specified.
sampling_temperature (float, default=1.0) – The ‘temperature’ parameter for llm model to generate sampled LLM responses. Must be greater than 0.
use_n_param (bool, default=False) – Specifies whether to use n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses > 1.
max_length (int, default=2000) – Specifies the maximum allowed string length. Responses longer than this value will be truncated to avoid OutOfMemoryError
verbose (bool, default=False) – Specifies whether to print the index of response currently being scored.

Methods

`__init__`([llm, scorers, device, use_best, ...])	Class for black box uncertainty quantification.
`generate_and_score`(prompts[, num_responses, ...])	Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores with specified scorers for the provided prompts.
`generate_candidate_responses`(prompts[, ...])	This method generates multiple responses for uncertainty estimation.
`generate_original_responses`(prompts[, ...])	This method generates original responses for uncertainty estimation.
`score`(responses, sampled_responses[, ...])	Compute confidence scores with specified scorers on provided LLM responses.

async generate_and_score(prompts, num_responses=5, show_progress_bars=True)#

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores with specified scorers for the provided prompts.

Return type:

UQResult

Parameters:

prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.
num_responses (int, default=5) – The number of sampled responses used to compute consistency.
show_progress_bars (bool, default=True) – If True, displays progress bars while generating and scoring responses

Returns:

UQResult containing data (prompts, responses, and scores) and metadata

Return type:

UQResult

async generate_candidate_responses(prompts, num_responses=5, progress_bar=None)#

This method generates multiple responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[List[str]]

Parameters:

prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.
num_responses (int, default=5) – The number of sampled responses used to compute consistency.
progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of sampled responses for each prompt.

Return type:

list of list of str

async generate_original_responses(prompts, top_k_logprobs=None, progress_bar=None)#

This method generates original responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[str]

Parameters:

prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.
progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of original responses for each prompt.

Return type:

list of str

score(responses, sampled_responses, show_progress_bars=True, _display_header=True)#

Compute confidence scores with specified scorers on provided LLM responses. Should only be used if responses and sampled responses are already generated. Otherwise, use generate_and_score.

Return type:

UQResult

Parameters:

responses (list of str, default=None) – A list of model responses for the prompts.
sampled_responses (list of list of str, default=None) – A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses.
show_progress_bars (bool, default=True) – If True, displays a progress bar while scoring responses

Returns:

UQResult containing data (responses and scores) and metadata

Return type:

UQResult

References

uqlm.scorers.black_box.BlackBoxUQ#

This Page