langfair.auto.auto.AutoEval#
- class langfair.auto.auto.AutoEval(prompts, responses=None, langchain_llm=None, suppressed_exceptions=None, use_n_param=True, metrics=None, toxicity_device='cpu', neutralize_tokens=True, max_calls_per_min=None)#
Bases: object
- __init__(prompts, responses=None, langchain_llm=None, suppressed_exceptions=None, use_n_param=True, metrics=None, toxicity_device='cpu', neutralize_tokens=True, max_calls_per_min=None)#
This class calculates all toxicity, stereotype, and counterfactual metrics supported by LangFair.
- Parameters:
prompts (list of strings or DataFrame of strings) – A list of input prompts for the model.
responses (list of strings or DataFrame of strings, default=None) – A list of generated outputs from an LLM. If not provided, responses are generated using the model.
langchain_llm (langchain BaseChatModel, default=None) – A LangChain BaseChatModel. The user is responsible for specifying temperature and other relevant parameters in the constructor of their langchain_llm object.
suppressed_exceptions (tuple or dict, default=None) – If a tuple, specifies which exceptions to handle as ‘Unable to get response’ rather than raising the exception. If a dict, maps subclasses of BaseException to exception-specific failure messages.
use_n_param (bool, default=False) – Specifies whether to use the n parameter of BaseChatModel. Not compatible with all BaseChatModel classes. When used, it substantially speeds up generation when count > 1.
metrics (dict or list of str, default=None) – Specifies which metrics to evaluate. If None, all supported metrics are computed.
toxicity_device (str or torch.device, default="cpu") – Specifies the device that toxicity classifiers use for prediction. Set to “cuda” for classifiers to leverage the GPU. Currently, ‘detoxify_unbiased’ and ‘detoxify_original’ use this parameter.
neutralize_tokens (bool, default=True) – Indicates whether to use masking when computing the BLEU and ROUGE-L counterfactual metrics. If True, counterfactual responses are masked using the CounterfactualGenerator.neutralize_tokens method before computing the aforementioned metrics.
max_calls_per_min (int, default=None) – [Deprecated] Use LangChain’s InMemoryRateLimiter instead.
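A minimal construction sketch, assuming langchain_openai is installed and an OpenAI API key is configured; the model name, prompts, and rate-limiter settings below are illustrative only, and any LangChain BaseChatModel can be substituted:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI  # assumes langchain_openai is installed

from langfair.auto.auto import AutoEval

# Rate limiting is handled on the LangChain side, since max_calls_per_min is deprecated.
rate_limiter = InMemoryRateLimiter(requests_per_second=5, check_every_n_seconds=0.1)

# Illustrative model and settings; the user specifies temperature and other parameters here.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0, rate_limiter=rate_limiter)

# Illustrative prompts; in practice these come from your own use case.
prompts = [
    "Write a short bio for a software engineer.",
    "Write a short bio for a nurse.",
]

# Responses are omitted here, so they will be generated with the supplied LLM.
auto_object = AutoEval(
    prompts=prompts,
    langchain_llm=llm,
    toxicity_device="cpu",  # set to "cuda" to run the detoxify classifiers on GPU
)
```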
Methods
- __init__(prompts[, responses, ...]) – Calculates all toxicity, stereotype, and counterfactual metrics supported by LangFair.
- evaluate([count, metrics, return_data]) – Compute all the metrics based on the provided data.
- export_results([file_name]) – Export the evaluated metrics values in a text file.
- print_results() – Print the evaluated metrics values in the desired format.
Attributes
- stereotype_data
- toxicity_data
- async evaluate(count=25, metrics=None, return_data=False)#
Compute all the metrics based on the provided data.
- Parameters:
count (int, default=25) – Specifies the number of responses to generate for each prompt. The convention is to use 25 generations per prompt when evaluating toxicity. See, for example, DecodingTrust (https://arxiv.org/abs/2306.11698) or Gehman et al., 2020 (https://aclanthology.org/2020.findings-emnlp.301/).
metrics (dict or list of str, optional) – Specifies which metrics to evaluate. If None, computes all supported metrics.
return_data (bool, default=False) – Indicates whether to include response-level scores in results dictionary returned by this method.
- Returns:
A dictionary containing values of toxicity, stereotype, and counterfactual metrics and, optionally, response-level scores.
- Return type:
dict
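A hedged usage sketch, continuing the construction example above; evaluate is a coroutine, so it is run here with asyncio (inside a notebook or other async context it can simply be awaited):

```python
import asyncio

# Generates `count` responses per prompt (when responses were not supplied)
# and computes the selected toxicity, stereotype, and counterfactual metrics.
results = asyncio.run(auto_object.evaluate(count=25, return_data=False))

# `results` is a dictionary of metric values; its exact layout depends on
# which metrics were computed.
print(results)
```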
- export_results(file_name='results.txt')#
Export the evaluated metrics values in a text file.
- Parameters:
file_name (str, default="results.txt") – Name of the .txt file to write.
- Return type:
None
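A brief sketch, continuing the example above after evaluate has been run; the file name is arbitrary:

```python
# Writes the computed metric values to a text file in the working directory.
auto_object.export_results(file_name="autoeval_results.txt")
```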
- print_results()#
Print the evaluated metrics values in the desired format.
- Return type:
None
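Continuing the same example, once evaluate has completed:

```python
# Prints the computed metric values to stdout.
auto_object.print_results()
```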