{ "cells": [ { "cell_type": "markdown", "id": "ead0356d", "metadata": {}, "source": [ "# Classification Metrics " ] }, { "cell_type": "code", "execution_count": 1, "id": "f694ef3c-96cb-472c-80c4-0409222fc4ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "\n", "from langfair.metrics.classification import ClassificationMetrics\n" ] }, { "cell_type": "markdown", "id": "4b634110-1aa9-413d-908a-6ba61cde007e", "metadata": {}, "source": [ "## 1. Introduction\n", " " ] }, { "cell_type": "markdown", "id": "7a159622-3c80-4efc-854e-d89aa1cf4d84", "metadata": {}, "source": [ "Large language models (LLMs) used in classification use cases should be assessed for group fairness (if applicable). Similar to traditional person-level classification challenges in machine learning, these use cases present the risk of allocational harms. LangFair offers the following classification fairness metrics from the LLM fairness literature:\n", "\n", "* Predicted Prevalence Rate Disparity ([Feldman et al., 2015](https://arxiv.org/abs/1412.3756); [Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n", "* False Negative Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n", "* False Omission Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n", "* False Positive Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))\n", "* False Discovery Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))" ] }, { "cell_type": "markdown", "id": "0a7059f0-cf44-437e-b0b9-12e33b6872ad", "metadata": {}, "source": [ "## 2. Assessment\n", " " ] }, { "cell_type": "code", "execution_count": 2, "id": "820e7afb-e66b-4716-bdbf-a53ffba4c4ae", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Simulate dataset for this example. In practice, users should replace this data with predicted classes generated by the LLM,\n", "# corresponding ground truth values, and corresponding protected attribute group data.\n", "sample_size = 10000\n", "groups = np.random.binomial(n=1, p=0.5, size=sample_size)\n", "y_pred = np.random.binomial(n=1, p=0.3, size=sample_size)\n", "y_true = np.random.binomial(n=1, p=0.3, size=sample_size)" ] }, { "cell_type": "markdown", "id": "42f834ce-792c-44b2-a0ee-5c8343365697", "metadata": {}, "source": [ "## Classification Metrics\n", "***\n", "`ClassificationMetrics()` - Pairwise classification fairness metrics (class)\n", "\n", "**Class parameters:**\n", "- `metric_type` - (**{'all', 'assistive', 'punitive', 'representation'}, default='all'**) Specifies which metrics to use.\n", "\n", "**Methods:**\n", "1. `evaluate` - Returns min, max, range, and standard deviation of metrics across protected attribute groups.\n", "\n", " **Method Parameters:**\n", " - `groups` - (**array-like**) Group indicators. Must contain exactly two unique values.\n", " - `y_pred` - (**array-like**) Binary model predictions. Positive and negative predictions must be 1 and 0, respectively.\n", " - `y_true` - (**array-like**) Binary labels (ground truth values). 
{ "cell_type": "markdown", "id": "133aaee2", "metadata": {}, "source": [ "Create an instance of the `ClassificationMetrics` class using the default `metric_type='all'`, which includes the \"assistive\", \"punitive\", and \"representation\" metrics." ] }, { "cell_type": "code", "execution_count": 3, "id": "727a49a4-3067-4e7d-9de7-adfd29f4f6a8", "metadata": { "tags": [] }, "outputs": [], "source": [ "cm = ClassificationMetrics(metric_type='all')" ] }, { "cell_type": "code", "execution_count": 4, "id": "33e61ded-b56f-42f3-897a-0df80d03b626", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'FalseNegativeRateParity': 0.9683960547735326,\n", " 'FalseOmissionRateParity': 0.9682772917805723,\n", " 'FalsePositiveRateParity': 0.9832027144990514,\n", " 'FalseDiscoveryRateParity': 0.9750294817188464,\n", " 'PredictedPrevalenceRateParity': 1.010318056277584}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Metrics expressed as ratios (target value of 1)\n", "cm.evaluate(groups=groups, y_pred=y_pred, y_true=y_true, ratio=True)" ] }, { "cell_type": "code", "execution_count": 5, "id": "366c5853-ffb7-49c2-a9c5-37d38ba365e5", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'FalseNegativeRateParity': 0.022435421009698087,\n", " 'FalseOmissionRateParity': 0.009568167034658404,\n", " 'FalsePositiveRateParity': 0.005089653684952955,\n", " 'FalseDiscoveryRateParity': 0.01776013575685509,\n", " 'PredictedPrevalenceRateParity': 0.0030867911196673647}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Metrics expressed as differences (target value of 0)\n", "cm.evaluate(groups=groups, y_pred=y_pred, y_true=y_true, ratio=False)" ] } ], "metadata": { "environment": { "kernel": "langfair", "name": "workbench-notebooks.m125", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125" }, "kernelspec": { "display_name": "langfair-ZgpfWZGz-py3.9", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }