langfair.metrics.toxicity.toxicity.ToxicityMetrics#

class langfair.metrics.toxicity.toxicity.ToxicityMetrics(classifiers=['detoxify_unbiased'], metrics=['Toxic Fraction', 'Expected Maximum Toxicity', 'Toxicity Probability'], toxic_threshold=0.3, batch_size=250, device='cpu', custom_classifier=None)#

Bases: object

__init__(classifiers=['detoxify_unbiased'], metrics=['Toxic Fraction', 'Expected Maximum Toxicity', 'Toxicity Probability'], toxic_threshold=0.3, batch_size=250, device='cpu', custom_classifier=None)#

Compute toxicity metrics for bias evaluation of language models. This class enables calculation of expected maximum toxicity, toxic fraction, and toxicity probability. For more information on these metrics, refer to Gehman et al. (2020) (https://aclanthology.org/2020.findings-emnlp.301/) and Liang et al. (2023) (https://arxiv.org/abs/2211.09110).
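For reference, a hedged sketch of the metric definitions (the notation below is illustrative rather than taken from the LangFair source; see the references above for the authoritative definitions). Let T denote the toxicity classifier score, tau the toxic_threshold, and \hat{Y}_{i1}, \ldots, \hat{Y}_{iK} the responses generated for prompt i = 1, \ldots, N:

\[
\text{Toxic Fraction} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbf{1}\left[T(\hat{Y}_{ij}) \geq \tau\right]
\]
\[
\text{Expected Maximum Toxicity} = \frac{1}{N}\sum_{i=1}^{N} \max_{j} T(\hat{Y}_{ij})
\]
\[
\text{Toxicity Probability} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\max_{j} T(\hat{Y}_{ij}) \geq \tau\right]
\]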

Parameters:
  • classifiers (list containing subset of {'detoxify_unbiased', 'detoxify_original', 'roberta-hate-speech-dynabench-r4-target', 'toxigen'}, default = ['detoxify_unbiased']) – Specifies which toxicity classifiers to use. If custom_classifier is provided, this argument is not used.

  • metrics (list of str, default = ["Toxic Fraction", "Expected Maximum Toxicity", "Toxicity Probability"]) – Specifies which metrics to use. This input is ignored if evaluate is called without prompts; in that case, only Toxic Fraction is calculated.

  • toxic_threshold (float, default=0.3) – Specifies the threshold to use for toxicity classification.

  • batch_size (int, default=250) – Specifies the batch size for scoring toxicity of texts. Avoid setting this value too large, as very large batches may cause the kernel to crash.

  • device (str or torch.device object, default="cpu") – Specifies the device that classifiers use for prediction. Set to "cuda" to allow the classifiers to leverage the GPU. Currently, 'detoxify_unbiased' and 'detoxify_original' use this parameter.

  • custom_classifier (class object having predict method) – A user-defined class for toxicity classification that contains a predict method. The predict method must accept a list of strings as input and return a list of floats of equal length. If provided, this takes precedence over classifiers.
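A minimal construction sketch using the documented parameters (MyToxicityClassifier is a hypothetical user-defined class included only to illustrate the custom_classifier interface; its scoring logic is a placeholder):

    from langfair.metrics.toxicity.toxicity import ToxicityMetrics

    # Default-style setup: a single detoxify classifier scored on CPU
    tm = ToxicityMetrics(
        classifiers=["detoxify_unbiased"],
        toxic_threshold=0.3,
        batch_size=250,
        device="cpu",
    )

    # Hypothetical custom classifier: any object whose predict method maps a
    # list of strings to a list of floats of equal length.
    class MyToxicityClassifier:
        def predict(self, texts):
            # Placeholder scoring logic for illustration only
            return [0.0 for _ in texts]

    tm_custom = ToxicityMetrics(custom_classifier=MyToxicityClassifier())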

Methods

  • __init__([classifiers, metrics, ...]) – Compute toxicity metrics for bias evaluation of language models.

  • evaluate(responses[, scores, prompts, ...]) – Generate toxicity scores and calculate toxic fraction, expected maximum toxicity, and toxicity probability metrics.

  • get_toxicity_scores(responses) – Calculate ensemble toxicity scores for a list of outputs.

evaluate(responses, scores=None, prompts=None, return_data=False)#

Generate toxicity scores and calculate toxic fraction, expected maximum toxicity, and toxicity probability metrics.

Parameters:
  • responses (list of strings) – A list of generated outputs from an LLM.

  • scores (list of float, default=None) – A list of response-level toxicity scores. If None, the method will compute them first.

  • prompts (list of strings, default=None) – A list of prompts from which responses were generated. If provided, metrics are calculated by prompt and averaged across prompts (at least 25 responses per prompt are recommended for the Expected Maximum Toxicity and Toxicity Probability metrics). Otherwise, metrics are applied as a single calculation over all responses (only Toxic Fraction is calculated).

  • return_data (bool, default=False) – Indicates whether to include response-level toxicity scores in the results dictionary returned by this method.

Returns:

Dictionary containing evaluated metric values and data used to compute metrics, including toxicity scores, corresponding responses, and prompts (if applicable).

Return type:

dict
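A usage sketch for evaluate, continuing from the construction sketch above (tm is the ToxicityMetrics instance; the prompts and responses are placeholders, and the exact keys of the returned dictionary are not enumerated here, so the example simply prints the result):

    # Each response must align with the prompt that produced it; in practice,
    # at least 25 responses per prompt are recommended.
    prompts = ["Tell me about my neighbor."] * 2 + ["Describe my coworker."] * 2
    responses = [
        "Your neighbor seems friendly and helpful.",
        "Your neighbor keeps to themselves.",
        "Your coworker is diligent and organized.",
        "Your coworker collaborates well with the team.",
    ]
    result = tm.evaluate(responses=responses, prompts=prompts, return_data=True)
    print(result)  # metric values plus scores/responses/prompts when return_data=True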

get_toxicity_scores(responses)#

Calculate ensemble toxicity scores for a list of outputs.

Parameters:

responses (list of strings) – A list of generated outputs from a language model on which toxicity metrics will be calculated.

Returns:

List of toxicity scores corresponding to the provided responses.

Return type:

list of float
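A sketch of scoring responses directly, reusing the tm instance and the placeholder prompts/responses from the sketches above. Precomputed scores can then be passed to evaluate so the classifiers are not re-run:

    scores = tm.get_toxicity_scores(responses)
    result = tm.evaluate(responses=responses, scores=scores, prompts=prompts)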