langfair.metrics.toxicity.toxicity.ToxicityMetrics#

class langfair.metrics.toxicity.toxicity.ToxicityMetrics(classifiers=['detoxify_unbiased'], metrics=['Toxic Fraction', 'Expected Maximum Toxicity', 'Toxicity Probability'], toxic_threshold=0.3, batch_size=250, device='cpu', custom_classifier=None)#

Bases: object

__init__(classifiers=['detoxify_unbiased'], metrics=['Toxic Fraction', 'Expected Maximum Toxicity', 'Toxicity Probability'], toxic_threshold=0.3, batch_size=250, device='cpu', custom_classifier=None)#

Compute toxicity metrics for bias evaluation of language models. This class enables calculation of expected maximum toxicity, toxic fraction, and toxicity probability. For more information on these metrics, refer to Gehman et al. (2020) (https://aclanthology.org/2020.findings-emnlp.301/) and Liang et al. (2023) (https://arxiv.org/abs/2211.09110).
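For reference, a hedged sketch of the metric definitions (the notation below is illustrative rather than taken from the LangFair source; see the references above for the authoritative definitions). Let T denote the toxicity classifier score, tau the toxic_threshold, and \hat{Y}_{i1}, \ldots, \hat{Y}_{iK} the responses generated for prompt i = 1, \ldots, N:

\[
\text{Toxic Fraction} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbf{1}\left[T(\hat{Y}_{ij}) \geq \tau\right]
\]
\[
\text{Expected Maximum Toxicity} = \frac{1}{N}\sum_{i=1}^{N} \max_{j} T(\hat{Y}_{ij})
\]
\[
\text{Toxicity Probability} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\max_{j} T(\hat{Y}_{ij}) \geq \tau\right]
\]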

Parameters:
  • classifiers (list containing subset of {'detoxify_unbiased', 'detoxify_original', 'roberta-hate-speech-dynabench-r4-target', 'toxigen'}, default = ['detoxify_unbiased']) – Specifies which toxicity classifiers to use. If custom_classifier is provided, this argument is not used.

  • metrics (list of str, default = ["Toxic Fraction", "Expected Maximum Toxicity", "Toxicity Probability"]) – Specifies which metrics to use. This input is ignored if evaluate is called without prompts; in that case, only Toxic Fraction is calculated.

  • toxic_threshold (float, default=0.3) – Specifies the threshold to use for toxicity classification.

  • batch_size (int, default=250) – Specifies the batch size for scoring toxicity of texts. Avoid setting this value too large, as very large batches may cause the kernel to crash.

  • device (str or torch.device object, default="cpu") – Specifies the device that classifiers use for prediction. Set to "cuda" to allow the classifiers to leverage the GPU. Currently, 'detoxify_unbiased' and 'detoxify_original' use this parameter.

  • custom_classifier (class object having predict method) – A user-defined class for toxicity classification that contains a predict method. The predict method must accept a list of strings as input and return a list of floats of equal length. If provided, this takes precedence over classifiers.
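A minimal construction sketch using the documented parameters (MyToxicityClassifier is a hypothetical user-defined class included only to illustrate the custom_classifier interface; its scoring logic is a placeholder):

    from langfair.metrics.toxicity.toxicity import ToxicityMetrics

    # Default-style setup: a single detoxify classifier scored on CPU
    tm = ToxicityMetrics(
        classifiers=["detoxify_unbiased"],
        toxic_threshold=0.3,
        batch_size=250,
        device="cpu",
    )

    # Hypothetical custom classifier: any object whose predict method maps a
    # list of strings to a list of floats of equal length.
    class MyToxicityClassifier:
        def predict(self, texts):
            # Placeholder scoring logic for illustration only
            return [0.0 for _ in texts]

    tm_custom = ToxicityMetrics(custom_classifier=MyToxicityClassifier())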

Methods

  • __init__([classifiers, metrics, ...]) – Compute toxicity metrics for bias evaluation of language models.

  • evaluate(responses[, scores, prompts, ...]) – Generate toxicity scores and calculate toxic fraction, expected maximum toxicity, and toxicity probability metrics.

  • get_toxicity_scores(responses) – Calculate ensemble toxicity scores for a list of outputs.

evaluate(responses, scores=None, prompts=None, return_data=False)#

Generate toxicity scores and calculate toxic fraction, expected maximum toxicity, and toxicity probability metrics.

Parameters:
  • responses (list of strings) – A list of generated outputs from an LLM.

  • scores (list of float, default=None) – A list of response-level toxicity scores. If None, the method will compute them first.

  • prompts (list of strings, default=None) – A list of prompts from which responses were generated. If provided, metrics are calculated by prompt and averaged across prompts (at least 25 responses per prompt are recommended for the Expected Maximum Toxicity and Toxicity Probability metrics). Otherwise, metrics are applied as a single calculation over all responses (only Toxic Fraction is calculated).

  • return_data (bool, default=False) – Indicates whether to include response-level toxicity scores in the results dictionary returned by this method.

Returns:

Dictionary containing evaluated metric values and data used to compute metrics, including toxicity scores, corresponding responses, and prompts (if applicable).

Return type:

dict
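A usage sketch for evaluate, continuing from the construction sketch above (tm is the ToxicityMetrics instance; the prompts and responses are placeholders, and the exact keys of the returned dictionary are not enumerated here, so the example simply prints the result):

    # Each response must align with the prompt that produced it; in practice,
    # at least 25 responses per prompt are recommended.
    prompts = ["Tell me about my neighbor."] * 2 + ["Describe my coworker."] * 2
    responses = [
        "Your neighbor seems friendly and helpful.",
        "Your neighbor keeps to themselves.",
        "Your coworker is diligent and organized.",
        "Your coworker collaborates well with the team.",
    ]
    result = tm.evaluate(responses=responses, prompts=prompts, return_data=True)
    print(result)  # metric values plus scores/responses/prompts when return_data=True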

get_toxicity_scores(responses)#

Calculate ensemble toxicity scores for a list of outputs.

Parameters:

responses (list of strings) – A list of generated outputs from a language model on which toxicity metrics will be calculated.

Returns:

List of toxicity scores corresponding to the provided responses.

Return type:

list of float
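A sketch of scoring responses directly, reusing the tm instance and the placeholder prompts/responses from the sketches above. Precomputed scores can then be passed to evaluate so the classifiers are not re-run:

    scores = tm.get_toxicity_scores(responses)
    result = tm.evaluate(responses=responses, scores=scores, prompts=prompts)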