uqlm.scorers.qa.LongTextQA

class uqlm.scorers.qa.LongTextQA(llm, scorers=None, granularity='claim', aggregation='mean', response_refinement=False, claim_filtering_scorer=None, system_prompt='You are a helpful assistant.', claim_decomposition_llm=None, claim_decomposition_prompt='zhang_2025', question_generator_llm=None, sampling_temperature=1.0, max_calls_per_min=None, questioner_max_calls_per_min=None, max_length=1000, device=None, use_n_param=False)

Bases: LongFormUQ

__init__(llm, scorers=None, granularity='claim', aggregation='mean', response_refinement=False, claim_filtering_scorer=None, system_prompt='You are a helpful assistant.', claim_decomposition_llm=None, claim_decomposition_prompt='zhang_2025', question_generator_llm=None, sampling_temperature=1.0, max_calls_per_min=None, questioner_max_calls_per_min=None, max_length=1000, device=None, use_n_param=False)

Implements a generalization of the longform semantic entropy approach by Farquhar et al. (2024): https://www.nature.com/articles/s41586-024-07421-0.

Parameters:

llm (langchain BaseChatModel) – A LangChain BaseChatModel instance. The user is responsible for passing temperature and other relevant parameters to the constructor of the llm object.
scorers (subset of {"entailment", "noncontradiction", "contrasted_entailment", "bert_score", "cosine_sim"}, default=None) – Specifies which black-box (consistency) scorers to include. If None, defaults to ["entailment"].
granularity (str, default="claim") – Specifies whether to decompose and score at claim or sentence level granularity. Must be either "claim" or "sentence".
aggregation (str, default="mean") – Specifies how to aggregate claim/sentence-level scores to response-level scores. Must be one of "min" or "mean".
response_refinement (bool, default=False) – Specifies whether to refine responses with uncertainty-aware decoding. This approach removes claims with confidence scores below the response_refinement_threshold and uses the claim_decomposition_llm to reconstruct the response from the retained claims. Only available for claim-level granularity. For more details, refer to Jiang et al., 2024: https://arxiv.org/abs/2410.20783
claim_filtering_scorer (Optional[str], default=None) – Specifies which scorer to use to filter claims if response_refinement is True. If not provided, defaults to the first element of self.scorers.
claim_decomposition_llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity="claim" and claim_decomposition_llm is None, the provided llm will be used for claim decomposition.
claim_decomposition_prompt (Union[str, Callable], default="zhang_2025") – Specifies the prompt template used to decompose responses into atomic claims. Accepts one of the following string keys: "zhang_2025", "farquhar_2024", "mohri_2024", "jiang_2024", or a custom callable with signature (response: str) -> str. Only applies when granularity="claim".
question_generator_llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel to be used for generating questions from claims or sentences in the claim-QA approach. If None, defaults to claim_decomposition_llm.
device (str or torch.device input or torch.device object, default=None) – Specifies the device that the NLI model uses for prediction. Applies to 'luq', 'luq_atomic' scorers. Pass a torch.device to leverage GPU.
nli_model_name (str, default="microsoft/deberta-large-mnli") – Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().
system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for the user to provide a custom system prompt.
max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid a rate limit error. By default, no limit is specified.
sampling_temperature (float, default=1.0) – The 'temperature' parameter used by the LLM when generating sampled responses. Must be greater than 0.
use_n_param (bool, default=False) – Specifies whether to use n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses > 1.
max_length (int, default=1000) – Specifies the maximum allowed string length. Responses longer than this value are truncated to avoid an OutOfMemoryError.
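The aggregation parameter can be pictured with a small, library-independent sketch. The helper name aggregate_scores and the example scores below are hypothetical and are not part of uqlm; the sketch only illustrates how "min" and "mean" would collapse claim- or sentence-level scores into one response-level score.

```python
from statistics import mean


def aggregate_scores(claim_scores, aggregation="mean"):
    """Collapse claim/sentence-level confidence scores into one response-level score.

    Hypothetical helper for illustration only; not a uqlm function.
    """
    if aggregation == "mean":
        return mean(claim_scores)
    if aggregation == "min":
        return min(claim_scores)
    raise ValueError("aggregation must be 'min' or 'mean'")


# Made-up claim-level confidence scores for a single response.
scores = [0.5, 1.0, 0.75]
response_mean = aggregate_scores(scores, "mean")  # average over all claims
response_min = aggregate_scores(scores, "min")    # driven by the weakest claim
```

With "min", a single low-confidence claim pulls down the whole response score, whereas "mean" lets well-supported claims offset it.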

Methods

`__init__`(llm[, scorers, granularity, ...])	Implements a generalization of the longform semantic entropy approach by Farquhar et al. (2024): https://www.nature.com/articles/s41586-024-07421-0.
`generate_and_score`(prompts[, num_questions, ...])	Generate and score the responses.
`generate_candidate_responses`(prompts[, ...])	This method generates multiple responses for uncertainty estimation.
`generate_original_responses`(prompts[, ...])	This method generates original responses for uncertainty estimation.
`score`(prompts, responses[, num_questions, ...])	Decompose responses, generate questions for each claim/sentence, sample LLM responses to the questions, and score consistency on those generated answers to measure confidence.
`uncertainty_aware_decode`(claim_sets, ...[, ...])

async generate_and_score(prompts, num_questions=1, num_claim_qa_responses=5, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)

Generate and score the responses.

Return type:

UQResult

Parameters:

prompts (list of str) – A list of input prompts for the model.
num_questions (int, default=1) – The number of questions to generate for each claim/sentence.
num_claim_qa_responses (int, default=5) – The number of responses to generate for each claim-inverted question.
response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.
show_progress_bars (bool, default=True) – If True, displays progress bars while generating and scoring responses.
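A minimal usage sketch, assuming the import path implied by this page's title and a LangChain chat model supplied by the caller. The wrapper name run_long_text_qa is hypothetical; only the constructor and generate_and_score arguments are taken from the signatures documented above.

```python
import asyncio


async def run_long_text_qa(llm, prompts):
    """Generate long-form responses to `prompts` and score them with LongTextQA.

    `llm` is assumed to be a LangChain BaseChatModel instance.
    """
    # Imported lazily so the sketch can be defined without uqlm installed.
    # The module path follows this page's title and is an assumption.
    from uqlm.scorers.qa import LongTextQA

    scorer = LongTextQA(
        llm=llm,
        scorers=["entailment"],
        granularity="claim",
        aggregation="mean",
    )
    # Returns a UQResult, per the documented return type.
    return await scorer.generate_and_score(
        prompts=prompts,
        num_questions=1,
        num_claim_qa_responses=5,
    )


# Example call (requires uqlm and a configured chat model named my_llm):
# result = asyncio.run(run_long_text_qa(my_llm, ["Who was Ada Lovelace?"]))
```

Because generate_and_score is a coroutine, it has to be awaited inside an event loop; in a notebook, `await run_long_text_qa(...)` would typically replace the asyncio.run call.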

async generate_candidate_responses(prompts, num_responses=5, progress_bar=None)

This method generates multiple responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[List[str]]

Parameters:

prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.
num_responses (int, default=5) – The number of sampled responses used to compute consistency.
progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of sampled responses for each prompt.

Return type:

list of list of str

async generate_original_responses(prompts, top_k_logprobs=None, progress_bar=None)

This method generates original responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.

Return type:

List[str]

Parameters:

prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.
progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of original responses for each prompt.

Return type:

list of str

async score(prompts, responses, num_questions=1, num_claim_qa_responses=5, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)

Decompose responses, generate questions for each claim/sentence, sample LLM responses to the questions, and score consistency on those generated answers to measure confidence.

Return type:

UQResult

Parameters:

prompts (list of str) – A list of input prompts for the model.
responses (list of str) – A list of model responses for the prompts.
num_questions (int, default=1) – The number of questions to generate for each claim/sentence.
num_claim_qa_responses (int, default=5) – The number of responses to generate for each claim-inverted question.
response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.
show_progress_bars (bool, default=True) – If True, displays a progress bar while scoring responses.
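The two count parameters multiply, so the number of sampled answers grows quickly. The sketch below is plain arithmetic based on the parameter descriptions above; the helper count_sampled_answers is hypothetical, and it deliberately ignores any other LLM calls (claim decomposition, question generation), whose exact number this page does not state.

```python
def count_sampled_answers(claims_per_response, num_questions=1, num_claim_qa_responses=5):
    """Sampled answers implied by the documented parameters.

    Each claim/sentence gets `num_questions` questions, and each question gets
    `num_claim_qa_responses` sampled answers. Hypothetical helper, not a uqlm function.
    """
    return sum(claims_per_response) * num_questions * num_claim_qa_responses


# Two responses that decompose into 4 and 6 claims respectively (made-up numbers).
claims_per_response = [4, 6]
total = count_sampled_answers(claims_per_response, num_questions=2, num_claim_qa_responses=5)
```

An estimate like this can help when choosing max_calls_per_min for a rate-limited provider.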

async uncertainty_aware_decode(claim_sets, claim_scores, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)

Return type:

List[str]

Parameters:

claim_sets (List[List[str]]) – List of original responses decomposed into lists of claims
claim_scores (List[List[float]]) – List of lists of claim-level confidence scores to be used for uncertainty-aware filtering
response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.
show_progress_bars (bool, default=True) – If True, displays a progress bar while scoring responses.
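The filtering step described by response_refinement_threshold can be sketched without the library. The function filter_claims and the example data are hypothetical; the sketch covers only the thresholding and omits the LLM-based reconstruction of a response from the retained claims.

```python
def filter_claims(claim_sets, claim_scores, response_refinement_threshold=1 / 3):
    """Keep claims whose confidence score is not below the threshold.

    Hypothetical helper for illustration only; not a uqlm function.
    """
    retained = []
    for claims, scores in zip(claim_sets, claim_scores):
        retained.append(
            [claim for claim, score in zip(claims, scores)
             if score >= response_refinement_threshold]
        )
    return retained


# One response decomposed into two claims, with made-up confidence scores.
claim_sets = [["Paris is the capital of France.", "Paris has 40 million residents."]]
claim_scores = [[0.9, 0.1]]
kept = filter_claims(claim_sets, claim_scores)
```

Here only the first claim survives, since 0.1 is below the default threshold of 1/3. A claim scoring exactly at the threshold is kept, because only scores below it are dropped.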
