uqlm.scorers.graph.LongTextGraph#

class uqlm.scorers.graph.LongTextGraph(llm=None, scorers=None, aggregation='mean', response_refinement=False, claim_decomposition_llm=None, nli_llm=None, claim_filtering_scorer=None, device=None, nli_model_name='microsoft/deberta-large-mnli', system_prompt='You are a helpful assistant.', max_calls_per_min=None, sampling_temperature=1.0, use_n_param=False, max_length=2000)#

Bases: LongFormUQ

__init__(llm=None, scorers=None, aggregation='mean', response_refinement=False, claim_decomposition_llm=None, nli_llm=None, claim_filtering_scorer=None, device=None, nli_model_name='microsoft/deberta-large-mnli', system_prompt='You are a helpful assistant.', max_calls_per_min=None, sampling_temperature=1.0, use_n_param=False, max_length=2000)#

Class for Long-text Uncertainty Quantification (LUQ) scorers.

Parameters:
  • llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel. User is responsible for specifying temperature and other relevant parameters to the constructor of their llm object.
  • scorers (List[str], default=None) – Specifies which graph-based scorers to include. Must be a subset of [“degree_centrality”, “betweenness_centrality”, “closeness_centrality”, “page_rank”, “laplacian_centrality”, “harmonic_centrality”]. If None, defaults to [“closeness_centrality”].

  • granularity (str, default="claim") – Specifies whether to decompose and score at claim- or sentence-level granularity. Must be either “claim” or “sentence”.

  • aggregation (str, default="mean") – Specifies how to aggregate claim/sentence-level scores to response-level scores. Must be one of ‘min’ or ‘mean’.

  • response_refinement (bool, default=False) – Specifies whether to refine responses with uncertainty-aware decoding. This approach removes claims with confidence scores below the response_refinement_threshold and uses the claim_decomposition_llm to reconstruct the response from the retained claims. Only available for claim-level granularity. For more details, refer to Jiang et al., 2024: https://arxiv.org/abs/2410.20783

  • claim_filtering_scorer (Optional[str], default=None) – Specifies which scorer to use to filter claims if response_refinement is True. If not provided, defaults to the first element of self.scorers.

  • claim_decomposition_llm (langchain BaseChatModel, default=None) – A langchain llm BaseChatModel to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity=”claim” and claim_decomposition_llm is None, the provided llm will be used for claim decomposition.

  • nli_llm (BaseChatModel, default=None) – A LangChain chat model for LLM-based NLI inference. If provided, takes precedence over nli_model_name. Only used when mode=”unit_response”.

  • device (str or torch.device, default="cpu") – Specifies the device that the NLI model uses for prediction. If None, detects and returns the best available PyTorch device, prioritizing CUDA (NVIDIA GPU), then MPS (macOS), then CPU.

  • nli_model_name (str, default="microsoft/deberta-large-mnli") – Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().

  • system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for user to provide a custom system prompt.

  • max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid a rate limit error. By default, no limit is specified.

  • sampling_temperature (float, default=1.0) – The ‘temperature’ parameter used by the llm to generate sampled LLM responses. Must be greater than 0.

  • use_n_param (bool, default=False) – Specifies whether to use the n parameter for BaseChatModel. Not compatible with all BaseChatModel classes. If used, it speeds up the generation process substantially when num_responses > 1.

  • max_length (int, default=2000) – Specifies the maximum allowed string length. Responses longer than this value will be truncated to avoid OutOfMemoryError.
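A minimal usage sketch, assuming an OpenAI-backed LangChain chat model and the import path shown in the page heading; the model name and rate limit are illustrative choices, not uqlm requirements:

    # Illustrative setup; model choice and throttling are assumptions, not defaults.
    from langchain_openai import ChatOpenAI
    from uqlm.scorers.graph import LongTextGraph

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)  # user controls temperature

    ltg = LongTextGraph(
        llm=llm,
        scorers=["closeness_centrality"],  # the default graph scorer
        aggregation="mean",
        max_calls_per_min=60,  # throttle to avoid rate-limit errors
    )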

Methods

__init__([llm, scorers, aggregation, ...])

Class for Long-text Uncertainty Quantification (LUQ) scorers.

generate_and_score(prompts[, num_responses, ...])

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores with specified scorers for the provided prompts.

generate_candidate_responses(prompts[, ...])

This method generates multiple responses for uncertainty estimation.

generate_original_responses(prompts[, ...])

This method generates original responses for uncertainty estimation.

score(responses, sampled_responses[, ...])

Compute confidence scores with specified scorers on provided LLM responses.

uncertainty_aware_decode(claim_sets, claim_scores[, ...])

Refine responses by dropping claims with confidence scores below the threshold and reconstructing responses from the retained claims.

async generate_and_score(prompts, num_responses=5, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)#

Generate LLM responses, sampled LLM (candidate) responses, and compute confidence scores with specified scorers for the provided prompts.

Parameters:
  • prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.

  • num_responses (int, default=5) – The number of sampled responses used to compute consistency.

  • response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.

  • show_progress_bars (bool, default=True) – If True, displays progress bars while generating and scoring responses.

Returns:

UQResult containing data (prompts, responses, and scores) and metadata

Return type:

UQResult
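A hedged end-to-end sketch; the prompt is illustrative, and reading result.data and result.metadata assumes the UQResult layout described in the Returns note above:

    # Illustrative: reuses the assumed `ltg` instance from the constructor sketch.
    import asyncio

    async def main():
        result = await ltg.generate_and_score(
            prompts=["Summarize the causes of the 2008 financial crisis."],
            num_responses=5,  # more samples give more stable consistency estimates
        )
        print(result.data)      # prompts, responses, and scores (per the Returns note)
        print(result.metadata)  # run metadata

    asyncio.run(main())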

async generate_candidate_responses(prompts, num_responses=5, progress_bar=None)#

This method generates multiple responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.


Parameters:
  • prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.

  • num_responses (int, default=5) – The number of sampled responses used to compute consistency.

  • progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of sampled responses for each prompt.

Return type:

list of list of str
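A brief sketch of sampling candidate responses on their own (same assumed ltg instance as above):

    # Illustrative: generates num_responses samples per prompt at sampling_temperature.
    import asyncio

    prompts = ["Explain how TCP congestion control works."]
    sampled = asyncio.run(ltg.generate_candidate_responses(prompts, num_responses=5))
    print(len(sampled), len(sampled[0]))  # -> 1 prompt, 5 sampled responses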

async generate_original_responses(prompts, top_k_logprobs=None, progress_bar=None)#

This method generates original responses for uncertainty estimation. If specified in the child class, all responses are postprocessed using the callable function defined by the user.


Parameters:
  • prompts (List[Union[str, List[BaseMessage]]]) – List of prompts from which LLM responses will be generated. Prompts in list may be strings or lists of BaseMessage. If providing input type List[List[BaseMessage]], refer to https://python.langchain.com/docs/concepts/messages/#langchain-messages for support.

  • progress_bar (rich.progress.Progress, default=None) – A progress bar object to display progress.

Returns:

A list of original responses for each prompt.

Return type:

list of str
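The corresponding sketch for original responses, which pairs with the candidate samples above when calling score directly:

    # Illustrative: one original response per prompt, at the llm's own temperature.
    import asyncio

    responses = asyncio.run(ltg.generate_original_responses(prompts))
    print(responses[0])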

async score(responses, sampled_responses, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)#

Compute confidence scores with specified scorers on provided LLM responses. Should only be used if responses and sampled responses are already generated. Otherwise, use generate_and_score.

Parameters:
  • responses (list of str) – A list of LLM responses to be scored.

  • sampled_responses (list of list of str) – A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses.

  • response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response. Only used if response_refinement is True.

  • show_progress_bars (bool, default=True) – If True, displays a progress bar while scoring responses.

Returns:

UQResult containing data (responses and scores) and metadata

Return type:

UQResult
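A sketch of scoring pre-generated responses; all strings here are placeholders:

    # Illustrative: responses and samples could come from the two generate_* methods above.
    import asyncio

    responses = ["Paris is the capital of France. It lies on the Seine."]
    sampled_responses = [[
        "Paris, on the Seine, is France's capital.",
        "The capital of France is Paris.",
    ]]
    result = asyncio.run(ltg.score(responses, sampled_responses))
    print(result.data)  # responses and scores, per the Returns note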

async uncertainty_aware_decode(claim_sets, claim_scores, response_refinement_threshold=0.3333333333333333, show_progress_bars=True)#

Refine responses by dropping claims with confidence scores below the threshold and reconstructing responses from the retained claims using the claim_decomposition_llm.

Parameters:
  • claim_sets (List[List[str]]) – List of original responses decomposed into lists of claims.

  • claim_scores (List[List[float]]) – List of lists of claim-level confidence scores to be used for uncertainty-aware filtering.

  • response_refinement_threshold (float, default=1/3) – Threshold for uncertainty-aware filtering. Claims with confidence scores below this threshold are dropped from the refined response.

  • show_progress_bars (bool, default=True) – If True, displays a progress bar while refining responses.

Return type:

List[str]
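A sketch of calling the refinement step directly with pre-computed claim scores; the claims and scores below are fabricated for illustration:

    # Illustrative: reuses the assumed `ltg` instance from the constructor sketch.
    import asyncio

    claim_sets = [[
        "The Eiffel Tower is in Paris.",
        "It was completed in 1889.",
        "It is 500 meters tall.",  # deliberately low-confidence claim
    ]]
    claim_scores = [[0.9, 0.8, 0.1]]

    refined = asyncio.run(
        ltg.uncertainty_aware_decode(claim_sets, claim_scores, response_refinement_threshold=1 / 3)
    )
    # The 0.1-score claim falls below 1/3 and is dropped before reconstruction.
    print(refined[0])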

References

Jiang et al. (2024). Graph-based Uncertainty Metrics for Long-form Language Model Outputs. https://arxiv.org/abs/2410.20783