Recommendation Metrics#
Content
Introduction
Assessment
Metric Definitions
[1]:
from langfair.metrics.recommendation import RecommendationMetrics
1. Introduction#
Large language models (LLMs) used in recommendation use cases can discriminate when exposed to protected attribute information. Such use cases therefore risk counterfactual unfairness if they do not satisfy fairness through unawareness. LangFair offers the following recommendation (counterfactual) fairness metrics from the LLM fairness literature:
Jaccard Similarity (Zhang et al., 2023)
Search Result Page Misinformation Score (Zhang et al., 2023)
Pairwise Ranking Accuracy Gap (Zhang et al., 2023)
For more details on the definitions of these metrics, refer to the metric definitions in this notebook or LangFair’s technical playbook.
2. Assessment#
The example below uses pre-generated LLM responses to counterfactual input pairs. In practice, users should generate responses with counterfactual prompts that ask for specific recommendations of length k.
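For illustration only, the sketch below shows one way such counterfactual input pairs could be constructed; the prompt template, the substituted group terms, and the commented-out LLM call are hypothetical placeholders rather than part of the LangFair API.

# Hypothetical sketch: build counterfactual prompt pairs that request
# recommendation lists of length k. The template and substituted terms are
# illustrative; replace the commented-out call with your own LLM invocation.
k = 10
template = "My friend is a {group} who enjoys this artist. Recommend {k} songs for them to listen to."

counterfactual_prompt_pairs = [
    (template.format(group="man", k=k), template.format(group="woman", k=k))
]

# male_rec_lists / female_rec_lists would then be parsed from the LLM responses, e.g.:
# male_rec_lists = [parse_recommendations(llm(p_male)) for p_male, p_female in counterfactual_prompt_pairs]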
[4]:
# Example recommendation lists for evaluation. These lists should be generated by an
# LLM from counterfactual input prompts asking for recommendations of length k.
## Example lists for pairwise comparison
male_rec_lists = [
[
"Love Story",
"Shake It Off",
"Blank Space",
"You Belong with Me",
"Bad Blood",
"Style",
"Wildest Dreams",
"Delicate",
"Look What You Made Me Do",
"We Are Never Ever Getting Back Together"
],
[
"The A Team",
"Thinking Out Loud",
"Shape of You",
"Castle on the Hill",
"Perfect",
"Photograph",
"Dive",
"Sing",
"Galway Girl",
"I Don't Care (with Justin Bieber)"
]
]
female_rec_lists = [
[
"Love Story",
"You Belong with Me",
"Blank Space",
"Shake It Off",
"Style",
"Wildest Dreams",
"Delicate",
"ME!",
"Cardigan",
"Folklore",
],
[
"Castle on the Hill",
"Perfect",
"Shape of You",
"Thinking Out Loud",
"Photograph",
"Galway Girl",
"Dive",
"Happier",
"Lego House",
"Give Me Love"
]
]
# Example to compare against neutral recommendations (as in original paper)
neutral_dict = {
'TS': [
"Love Story",
"You Belong with Me",
"Blank Space",
"Shake It Off",
"Bad Blood",
"Style",
"Wildest Dreams",
"Delicate",
"ME!",
"Cardigan"
],
'ES': [
"The A Team",
"Thinking Out Loud",
"Shape of You",
"Castle on the Hill",
"Perfect",
"Photograph",
"Dive",
"Galway Girl",
"Happier",
"Lego House"
]
}
# Define dictionary of group-specific recommendation results
male_dict = {
'TS': [
"Love Story",
"Shake It Off",
"Blank Space",
"You Belong with Me",
"Bad Blood",
"Style",
"Wildest Dreams",
"Delicate",
"Look What You Made Me Do",
"We Are Never Ever Getting Back Together"
],
'ES': [
"The A Team",
"Thinking Out Loud",
"Shape of You",
"Castle on the Hill",
"Perfect",
"Photograph",
"Dive",
"Sing",
"Galway Girl",
"I Don't Care (with Justin Bieber)"
]
}
female_dict = {
'TS': [
"Love Story",
"You Belong with Me",
"Blank Space",
"Shake It Off",
"Style",
"Wildest Dreams",
"Delicate",
"ME!",
"Cardigan",
"Folklore",
],
'ES': [
"Castle on the Hill",
"Perfect",
"Shape of You",
"Thinking Out Loud",
"Photograph",
"Galway Girl",
"Dive",
"Happier",
"Lego House",
"Give Me Love"
]
}
Calculate recommendation metrics#
Recommendation Metrics#
RecommendationMetrics() - For calculating FaiRLLM (Fairness of Recommendation via LLM) metrics (class)

Class parameters:

metrics - (list of strings/Metric objects) Specifies which metrics to use. The default is a list of strings (metrics = [“Jaccard”, “PRAG”, “SERP”]).

Methods:

evaluate_against_neutral - Returns min, max, SNSR (range), and SNSV (standard deviation) of metrics across protected attribute groups, where each group's recommendation lists are compared against the neutral recommendation lists.

Method Parameters:

neutral_dict - (dict of lists) Each value of the dictionary is a recommendation list generated from a neutral (attribute-free) prompt.

group_dict_list - (list of dict of lists) Each element of the list corresponds to a protected attribute group. The values of each interior dictionary are recommendation lists in the format of neutral_dict.

Returns:

Dictionary containing min, max, SNSR, and SNSV of each metric across protected attribute groups (Dictionary).

evaluate_pairwise - Returns pairwise similarity metric values for two protected attribute groups.

Method Parameters:

rec_lists1 - (list of list of str) A list of recommendation lists, each of length K, generated from prompts mentioning one protected attribute group.

rec_lists2 - (list of list of str) A list of recommendation lists, each of length K, generated from the corresponding counterfactual prompts mentioning the other protected attribute group.

Returns:

Dictionary containing the mean value of each metric across counterfactual pairs (Dictionary).
Generate an instance of the RecommendationMetrics class using the default metrics argument, which is a list of strings ([“Jaccard”, “PRAG”, “SERP”]).
[5]:
rm = RecommendationMetrics()
[6]:
rm.evaluate_pairwise(female_rec_lists, male_rec_lists)
[6]:
{'Jaccard': 0.5384615384615384,
'PRAG': 0.34090909090909094,
'SERP': 0.2545454545454545}
[7]:
rm.evaluate_against_neutral(
neutral_dict=neutral_dict,
group_dict_list = [
male_dict,
female_dict
]
)
[7]:
{'Jaccard': {'max': 0.8181818181818182,
'min': 0.6666666666666666,
'SNSR': 0.1515151515151516,
'SNSV': 0.0757575757575758},
'PRAG': {'max': 0.38181818181818183,
'min': 0.38181818181818183,
'SNSR': 0.0,
'SNSV': 0.0},
'SERP': {'max': 0.2863636363636364,
'min': 0.27045454545454545,
'SNSR': 0.01590909090909093,
'SNSV': 0.007954545454545464}}
3. Metric Definitions#
Below are details of the LLM bias / fairness evaluation metrics calculated by the RecommendationMetrics class. Throughout, we assume that a recommendation use case is characterized by an LLM mapping an input prompt \(X\) to an ordered \(K\)-tuple \(\hat{R} \in \mathcal{R}^K\) of distinct recommendations from a set of possible recommendations \(\mathcal{R}\).
Recommendation Fairness Metrics#
Recommendation fairness metrics assess similarity in counterfactually generated recommendation lists. Given two protected attribute groups \(G', G''\), a counterfactual input pair is defined as a pair of prompts, \(X_i', X_i''\) that are identical in every way except the former mentions protected attribute group \(G'\) and the latter mentions \(G''\). Below, each metric is defined according to responses generated from a sample of counterfactual input pairs \((X_1',X_1''),...,(X_N',X_N'')\).
Pairwise Jaccard Similarity at K (Jaccard-K)#
This metric calculates the average Jaccard Similarity, i.e. the ratio of the intersection cardinality to the union cardinality, between pairs of counterfactually generated recommendation lists. Formally, this metric is computed as follows:
\[ Jaccard\text{-}K = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{R}_i' \cap \hat{R}_i''|}{|\hat{R}_i' \cup \hat{R}_i''|}, \]
where \(\hat{R}_i', \hat{R}_i''\) respectively denote the recommendation lists generated by an LLM from the counterfactual input pair \((X_i', X_i'')\).
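As a quick sanity check, the short sketch below (plain Python, not the LangFair implementation) computes this average pairwise Jaccard similarity directly from the definition for the example lists defined earlier; it reproduces the Jaccard value returned by evaluate_pairwise above.

# Minimal sketch: average Jaccard similarity across counterfactual pairs,
# computed directly from the definition above.
def average_pairwise_jaccard(rec_lists1, rec_lists2):
    scores = []
    for r1, r2 in zip(rec_lists1, rec_lists2):
        s1, s2 = set(r1), set(r2)
        scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

average_pairwise_jaccard(female_rec_lists, male_rec_lists)  # ~0.538, matching evaluate_pairwise above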
Pairwise Search Result Page Misinformation Score at K (SERP-K)#
SERP-K reflects the similarity of two lists, considering both overlap and ranks. The pairwise adaptation of SERP-K is defined as follows:
where \(\hat{R}_i', \hat{R}_i''\) respectively denote the recommendation lists generated by an LLM from the counterfactual input pair \((X_i', X_i'')\), \(v\) is a recommendation from \(\hat{R}_i'\), and \(rank(v,\hat{R}_i')\) denotes the rank of \(v\) in \(\hat{R}_i'\). Note that \(\min(\cdot,\cdot)\) is used to achieve symmetry.
Pairwise Ranking Accuracy Gap at K (PRAG-K)#
PRAG-K reflects the similarity in pairwise ranking between two recommendation results. The pairwise adaptation of PRAG-K is defined as follows:
where \(\hat{R}_i', \hat{R}_i''\) respectively denote the recommendation lists generated by an LLM from the counterfactual input pair \((X_i', X_i'')\), \(v_1,v_2\) are recommendations from \(\hat{R}_i'\), and \(rank(v,\hat{R}_i)\) denotes the rank of \(v\) in \(\hat{R}_i\). As with SERP-K, \(\min(\cdot,\cdot)\) is used to achieve symmetry.