langfair.metrics.stereotype.metrics.classifier.StereotypeClassifier

class langfair.metrics.stereotype.metrics.classifier.StereotypeClassifier(metrics=['Stereotype Fraction', 'Expected Maximum Stereotype', 'Stereotype Probability'], categories=['Race', 'Gender'], threshold=0.5, batch_size=250)

Bases: object

__init__(metrics=['Stereotype Fraction', 'Expected Maximum Stereotype', 'Stereotype Probability'], categories=['Race', 'Gender'], threshold=0.5, batch_size=250)

Compute stereotype metrics for bias evaluation of language models. This class enables calculation of expected maximum stereotype, stereotype fraction, and stereotype probability. These metrics are an extension of those presented in: https://arxiv.org/pdf/2009.11462.pdf

Parameters:
  • metrics (list of str, default = ["Stereotype Fraction", "Expected Maximum Stereotype", "Stereotype Probability"]) – Specifies which metrics to use. This input is ignored if the evaluate method is called with prompts.

  • categories (list of str, default = ['Race', 'Gender']) – The classifier scores model responses on four categories: gender, race, profession, and religion.

  • threshold (float, default=0.5) – Specifies the threshold to use for stereotype classification.

  • batch_size (int, default=250) – Specifies the batch size used when scoring texts for stereotypes. Avoid very large values, which can crash the kernel.
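
To illustrate what batch_size controls, here is a minimal sketch of splitting a list of responses into fixed-size batches before scoring. The helper name iter_batches is hypothetical and the sample responses are placeholders; this is not LangFair's internal code, only a picture of the idea that smaller batches limit how much text is scored at once.

```python
def iter_batches(texts, batch_size=250):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Placeholder responses, used only to show the batch boundaries.
responses = [f"response {i}" for i in range(600)]
batch_sizes = [len(batch) for batch in iter_batches(responses, batch_size=250)]
```

With 600 responses and a batch size of 250, the final batch holds the 100 leftover items.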

Methods

  • __init__([metrics, categories, threshold, ...]) – Compute stereotype metrics for bias evaluation of language models.

  • evaluate(responses[, scores, prompts, ...]) – Generate stereotype scores and calculate classifier-based stereotype metrics.

  • get_stereotype_scores(responses) – Calculate stereotype scores for a list of outputs.

evaluate(responses, scores=None, prompts=None, return_data=False, categories=['gender', 'race'])

Generate stereotype scores and calculate classifier-based stereotype metrics.

Parameters:
  • responses (list of strings) – A list of outputs generated by an LLM.

  • scores (list of float, default=None) – A list of response-level stereotype scores. If None, the method computes them first.

  • prompts (list of strings, default=None) – A list of prompts from which responses were generated. If provided, metrics are calculated by prompt and averaged across prompts (at least 25 responses per prompt are recommended for the Expected Maximum and Probability metrics). Otherwise, metrics are applied as a single calculation over all responses (only Stereotype Fraction is calculated).

  • return_data (bool, default=False) – Specifies whether to include a dictionary containing response-level stereotype scores in the returned result.
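
The by-prompt calculation described for prompts can be sketched as follows. This is an illustrative reading of the three metric names, modeled on the expected maximum and probability metrics of the paper linked above, and not LangFair's actual implementation. The function name prompt_level_metrics, the use of >= for the threshold comparison, and the sample scores are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def prompt_level_metrics(prompts, scores, threshold=0.5):
    """Group scores by prompt, compute each metric per prompt, then average."""
    if len(prompts) != len(scores):
        raise ValueError("prompts and scores must have the same length")

    # Collect the scores of all responses generated from the same prompt.
    by_prompt = defaultdict(list)
    for prompt, score in zip(prompts, scores):
        by_prompt[prompt].append(score)
    groups = list(by_prompt.values())

    return {
        # Share of responses at or above the threshold, averaged over prompts.
        "Stereotype Fraction": mean(
            sum(s >= threshold for s in g) / len(g) for g in groups
        ),
        # Highest score per prompt, averaged over prompts.
        "Expected Maximum Stereotype": mean(max(g) for g in groups),
        # Share of prompts with at least one response at or above the threshold.
        "Stereotype Probability": mean(
            1.0 if max(g) >= threshold else 0.0 for g in groups
        ),
    }


# Hypothetical scores: two prompts with three responses each.
example = prompt_level_metrics(
    prompts=["p1", "p1", "p1", "p2", "p2", "p2"],
    scores=[0.1, 0.7, 0.2, 0.3, 0.4, 0.1],
)
```

Because the maximum and probability metrics depend on the most extreme response per prompt, they are unstable with few responses per prompt, which is consistent with the recommendation of at least 25.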

Returns:

Dictionary containing two keys: ‘metrics’, containing all metric values, and ‘data’, containing response-level stereotype scores (included when return_data is True).

Return type:

dict
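
A minimal usage sketch, assuming the langfair package is installed and the import path matches the module name at the top of this page. The wrapper run_stereotype_evaluation and its classifier argument are hypothetical conveniences; only the constructor and evaluate arguments come from the signatures documented here.

```python
def run_stereotype_evaluation(responses, prompts=None, classifier=None):
    """Evaluate responses and return the result dictionary from evaluate()."""
    if classifier is None:
        # Imported lazily so the helper can also be exercised with a stand-in object.
        from langfair.metrics.stereotype.metrics.classifier import (
            StereotypeClassifier,
        )

        classifier = StereotypeClassifier(threshold=0.5, batch_size=250)
    # return_data=True asks for response-level scores alongside the metrics.
    return classifier.evaluate(
        responses=responses, prompts=prompts, return_data=True
    )
```

Calling the helper with a list of responses should yield a dictionary whose 'metrics' entry holds the metric values and whose 'data' entry holds the response-level scores.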

get_stereotype_scores(responses)

Calculate stereotype scores for a list of outputs.

Parameters:

responses (list of strings) – A list of generated outputs from a language model on which classifier-based stereotype metrics will be calculated.

Returns:

Dictionary containing response-level stereotype scores returned by the stereotype classifier.

Return type:

dict