{
"cells": [
{
"cell_type": "markdown",
"id": "ead0356d",
"metadata": {},
"source": [
"# Classification Metrics "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f694ef3c-96cb-472c-80c4-0409222fc4ac",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"from langfair.metrics.classification import ClassificationMetrics\n"
]
},
{
"cell_type": "markdown",
"id": "4b634110-1aa9-413d-908a-6ba61cde007e",
"metadata": {},
"source": [
"## 1. Introduction\n",
" "
]
},
{
"cell_type": "markdown",
"id": "7a159622-3c80-4efc-854e-d89aa1cf4d84",
"metadata": {},
"source": [
"Large language models (LLMs) used in classification use cases should be assessed for group fairness (if applicable). Similar to traditional person-level classification challenges in machine learning, these use cases present the risk of allocational harms. LangFair offers the following classification fairness metrics from the LLM fairness literature:\n",
"\n",
"* Predicted Prevalence Rate Disparity ([Feldman et al., 2015](https://arxiv.org/abs/1412.3756); [Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n",
"* False Negative Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n",
"* False Omission Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n",
"* False Positive Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n",
"* False Discovery Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))"
]
},
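{
"cell_type": "markdown",
"id": "metric-defs-sketch-md",
"metadata": {},
"source": [
"To make these definitions concrete, the cell below computes two of the underlying rates by hand with NumPy on a tiny toy example: the predicted prevalence rate (share of positive predictions within a group) and the false negative rate (share of actual positives predicted negative within a group), along with their between-group differences. This is an illustrative sketch only; the toy arrays and helper functions are hypothetical and independent of LangFair's implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "metric-defs-sketch-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Toy example: group indicator, binary predictions, and binary labels (hypothetical values)\n",
"g = np.array([0, 0, 0, 0, 1, 1, 1, 1])\n",
"pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])\n",
"true = np.array([1, 0, 0, 0, 1, 0, 1, 0])\n",
"\n",
"def ppr(pred):\n",
"    # Predicted prevalence rate: share of positive predictions, P(y_pred = 1)\n",
"    return pred.mean()\n",
"\n",
"def fnr(pred, true):\n",
"    # False negative rate: share of actual positives predicted negative, P(y_pred = 0 | y_true = 1)\n",
"    return ((pred == 0) & (true == 1)).sum() / (true == 1).sum()\n",
"\n",
"mask0, mask1 = (g == 0), (g == 1)\n",
"print(\"PPR disparity (difference):\", abs(ppr(pred[mask0]) - ppr(pred[mask1])))\n",
"print(\"FNR disparity (difference):\", abs(fnr(pred[mask0], true[mask0]) - fnr(pred[mask1], true[mask1])))"
]
},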
{
"cell_type": "markdown",
"id": "0a7059f0-cf44-437e-b0b9-12e33b6872ad",
"metadata": {},
"source": [
"## 2. Assessment\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "820e7afb-e66b-4716-bdbf-a53ffba4c4ae",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Simulate dataset for this example. In practice, users should replace this data with predicted classes generated by the LLM,\n",
"# corresponding ground truth values, and corresponding protected attribute group data.\n",
"sample_size = 10000\n",
"groups = np.random.binomial(n=1, p=0.5, size=sample_size)\n",
"y_pred = np.random.binomial(n=1, p=0.3, size=sample_size)\n",
"y_true = np.random.binomial(n=1, p=0.3, size=sample_size)"
]
},
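{
"cell_type": "markdown",
"id": "llm-output-mapping-md",
"metadata": {},
"source": [
"In a real assessment, `y_pred` would come from the LLM rather than from a random draw. The sketch below shows one way that step might look: it assumes a hypothetical list of raw LLM text responses (`responses`) to a yes/no classification prompt and maps them to 1/0. The response format and mapping rule are assumptions for illustration; adapt them to however your prompt asks the model to answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llm-output-mapping-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Hypothetical raw LLM responses to a yes/no classification prompt\n",
"responses = [\"Yes\", \"no\", \"Yes.\", \"No\", \"yes\"]\n",
"\n",
"# Map each response to a binary prediction (1 = positive class, 0 = negative class)\n",
"y_pred_from_llm = np.array([1 if r.strip().lower().startswith(\"yes\") else 0 for r in responses])\n",
"y_pred_from_llm"
]
},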
{
"cell_type": "markdown",
"id": "42f834ce-792c-44b2-a0ee-5c8343365697",
"metadata": {},
"source": [
"## Classification Metrics\n",
"***\n",
"`ClassificationMetrics()` - Pairwise classification fairness metrics (class)\n",
"\n",
"**Class parameters:**\n",
"- `metric_type` - (**{'all', 'assistive', 'punitive', 'representation'}, default='all'**) Specifies which metrics to use.\n",
"\n",
"**Methods:**\n",
"1. `evaluate` - Returns min, max, range, and standard deviation of metrics across protected attribute groups.\n",
"\n",
" **Method Parameters:**\n",
" - `groups` - (**array-like**) Group indicators. Must contain exactly two unique values.\n",
" - `y_pred` - (**array-like**) Binary model predictions. Positive and negative predictions must be 1 and 0, respectively.\n",
" - `y_true` - (**array-like**) Binary labels (ground truth values). Positive and negative labels must be 1 and 0, respectively.\n",
" - `ratio` - (**boolean**) Indicates whether to compute the metric as a difference or a ratio\n",
"\n",
" Returns:\n",
" - Dictionary containing fairness metric values (**Dictionary**)."
]
},
{
"cell_type": "markdown",
"id": "133aaee2",
"metadata": {},
"source": [
"Generate an instance of class `ClassificationMetrics` using default `metric_type='all'`, which includes \"assistive\", \"punitive\", and \"representation\" metrics."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "727a49a4-3067-4e7d-9de7-adfd29f4f6a8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"cm = ClassificationMetrics(metric_type='all')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "33e61ded-b56f-42f3-897a-0df80d03b626",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'FalseNegativeRateParity': 0.9683960547735326,\n",
" 'FalseOmissionRateParity': 0.9682772917805723,\n",
" 'FalsePositiveRateParity': 0.9832027144990514,\n",
" 'FalseDiscoveryRateParity': 0.9750294817188464,\n",
" 'PredictedPrevalenceRateParity': 1.010318056277584}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Metrics expressed as ratios (target value of 1)\n",
"cm.evaluate(groups=groups, y_pred=y_pred, y_true=y_true, ratio=True)"
]
},
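{
"cell_type": "markdown",
"id": "ratio-sanity-check-md",
"metadata": {},
"source": [
"As an optional sanity check (independent of LangFair's implementation), the predicted prevalence rate parity ratio above can be recomputed directly: take the share of positive predictions within each group and divide one by the other. Depending on which group LangFair treats as the reference, the hand-computed ratio may be the reciprocal of the value reported above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ratio-sanity-check-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Predicted prevalence rate (share of positive predictions) within each group\n",
"ppr_group0 = y_pred[groups == 0].mean()\n",
"ppr_group1 = y_pred[groups == 1].mean()\n",
"\n",
"# Both orderings, since the choice of reference group is an implementation detail\n",
"ppr_group0 / ppr_group1, ppr_group1 / ppr_group0"
]
},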
{
"cell_type": "code",
"execution_count": 5,
"id": "366c5853-ffb7-49c2-a9c5-37d38ba365e5",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'FalseNegativeRateParity': 0.022435421009698087,\n",
" 'FalseOmissionRateParity': 0.009568167034658404,\n",
" 'FalsePositiveRateParity': 0.005089653684952955,\n",
" 'FalseDiscoveryRateParity': 0.01776013575685509,\n",
" 'PredictedPrevalenceRateParity': 0.0030867911196673647}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Metrics expressed as differences (target value of 0)\n",
"cm.evaluate(groups=groups, y_pred=y_pred, y_true=y_true, ratio=False)"
]
},
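{
"cell_type": "markdown",
"id": "metric-type-subset-md",
"metadata": {},
"source": [
"Finally, the `metric_type` parameter can restrict the computation to a single family of metrics. The sketch below, which assumes the same constructor and `evaluate` signature documented above, uses `metric_type='representation'`; `'assistive'` or `'punitive'` can be substituted to restrict to those families instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "metric-type-subset-code",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Restrict to the 'representation' family of metrics\n",
"cm_repr = ClassificationMetrics(metric_type='representation')\n",
"cm_repr.evaluate(groups=groups, y_pred=y_pred, y_true=y_true, ratio=True)"
]
}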
],
"metadata": {
"environment": {
"kernel": "langfair",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "langfair-ZgpfWZGz-py3.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}