langfair.auto.auto.AutoEval#
- class langfair.auto.auto.AutoEval(prompts, responses=None, langchain_llm=None, suppressed_exceptions=None, use_n_param=True, metrics=None, toxicity_device='cpu', neutralize_tokens=True, max_calls_per_min=None)#
Bases: object
- __init__(prompts, responses=None, langchain_llm=None, suppressed_exceptions=None, use_n_param=True, metrics=None, toxicity_device='cpu', neutralize_tokens=True, max_calls_per_min=None)#
This class calculates all toxicity, stereotype, and counterfactual metrics supported by LangFair.
- Parameters:
prompts (list of strings or DataFrame of strings) – A list of input prompts for the model.
responses (list of strings or DataFrame of strings, default=None) – A list of generated outputs from an LLM. If not provided, responses are generated using the model.
langchain_llm (langchain BaseChatModel, default=None) – A LangChain BaseChatModel. The user is responsible for specifying temperature and other relevant parameters in the constructor of their langchain_llm object.
suppressed_exceptions (tuple or dict, default=None) – If a tuple, specifies which exceptions to handle as ‘Unable to get response’ rather than raising the exception. If a dict, maps subclasses of BaseException to exception-specific failure messages.
use_n_param (bool, default=False) – Specifies whether to use the n parameter of BaseChatModel. Not compatible with all BaseChatModel classes. When used, it substantially speeds up generation when count > 1.
metrics (dict or list of str, default=None) – Specifies which metrics to evaluate. If None, all supported metrics are computed.
toxicity_device (str or torch.device, default="cpu") – Specifies the device that toxicity classifiers use for prediction. Set to “cuda” for classifiers to leverage the GPU. Currently, ‘detoxify_unbiased’ and ‘detoxify_original’ use this parameter.
neutralize_tokens (bool, default=True) – Indicates whether to use masking when computing the BLEU and ROUGE-L counterfactual metrics. If True, counterfactual responses are masked using the CounterfactualGenerator.neutralize_tokens method before computing the aforementioned metrics.
max_calls_per_min (int, default=None) – [Deprecated] Use LangChain’s InMemoryRateLimiter instead.
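A minimal construction sketch, assuming langchain_openai is installed and an OpenAI API key is configured; the model name, prompts, and rate-limiter settings below are illustrative only, and any LangChain BaseChatModel can be substituted:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI  # assumes langchain_openai is installed

from langfair.auto.auto import AutoEval

# Rate limiting is handled on the LangChain side, since max_calls_per_min is deprecated.
rate_limiter = InMemoryRateLimiter(requests_per_second=5, check_every_n_seconds=0.1)

# Illustrative model and settings; the user specifies temperature and other parameters here.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0, rate_limiter=rate_limiter)

# Illustrative prompts; in practice these come from your own use case.
prompts = [
    "Write a short bio for a software engineer.",
    "Write a short bio for a nurse.",
]

# Responses are omitted here, so they will be generated with the supplied LLM.
auto_object = AutoEval(
    prompts=prompts,
    langchain_llm=llm,
    toxicity_device="cpu",  # set to "cuda" to run the detoxify classifiers on GPU
)
```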
Methods
- __init__(prompts[, responses, ...]) – Calculates all toxicity, stereotype, and counterfactual metrics supported by LangFair.
- evaluate([count, metrics, return_data]) – Compute all the metrics based on the provided data.
- export_results([file_name]) – Export the evaluated metrics values in a text file.
- print_results() – Print the evaluated metrics values in the desired format.
Attributes
- stereotype_data
- toxicity_data
- async evaluate(count=25, metrics=None, return_data=False)#
Compute all the metrics based on the provided data.
- Parameters:
count (int, default=25) – Specifies the number of responses to generate for each prompt. The convention is to use 25 generations per prompt when evaluating toxicity. See, for example, DecodingTrust (https://arxiv.org/abs/2306.11698) or Gehman et al., 2020 (https://aclanthology.org/2020.findings-emnlp.301/).
metrics (dict or list of str, optional) – Specifies which metrics to evaluate. If None, computes all supported metrics.
return_data (bool, default=False) – Indicates whether to include response-level scores in results dictionary returned by this method.
- Returns:
A dictionary containing values of toxicity, stereotype, and counterfactual metrics and, optionally, response-level scores.
- Return type:
dict
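A hedged usage sketch, continuing the construction example above; evaluate is a coroutine, so it is run here with asyncio (inside a notebook or other async context it can simply be awaited):

```python
import asyncio

# Generates `count` responses per prompt (when responses were not supplied)
# and computes the selected toxicity, stereotype, and counterfactual metrics.
results = asyncio.run(auto_object.evaluate(count=25, return_data=False))

# `results` is a dictionary of metric values; its exact layout depends on
# which metrics were computed.
print(results)
```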
- export_results(file_name='results.txt')#
Export the evaluated metrics values in a text file.
- Parameters:
file_name (str, default="results.txt") – Name of the .txt file to write.
- Return type:
None
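A brief sketch, continuing the example above after evaluate has been run; the file name is arbitrary:

```python
# Writes the computed metric values to a text file in the working directory.
auto_object.export_results(file_name="autoeval_results.txt")
```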
- print_results()#
Print the evaluated metrics values in the desired format.
- Return type:
None
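Continuing the same example, once evaluate has completed:

```python
# Prints the computed metric values to stdout.
auto_object.print_results()
```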