{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🎯 LLM-as-a-Judge\n",
    "\n",
    "<div style=\"background-color: rgba(200, 200, 200, 0.1); padding: 20px; border-radius: 8px; margin-bottom: 20px; border: 1px solid rgba(127, 127, 127, 0.2); max-width: 100%; overflow-wrap: break-word;\">\n",
    "  <p style=\"font-size: 16px; line-height: 1.6\">\n",
    "   LLM-as-a-Judge scorers use one or more LLMs to evaluate the reliability of the original LLM's response. They offer high customizability through prompt engineering and the choice of judge LLM(s). Below is a list of the available scorers:\n",
    "  </p>\n",
    "\n",
    "*   Categorical LLM-as-a-Judge ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Luo et al., 2023](https://arxiv.org/pdf/2303.15621))\n",
    "*   Continuous LLM-as-a-Judge ([Xiong et al., 2024](https://arxiv.org/pdf/2306.13063))\n",
    "*   Panel of LLM Judges ([Verga et al., 2024](https://arxiv.org/abs/2404.18796))\n",
    "    \n",
    "</div>\n",
    "\n",
    "## 📊 What You'll Do in This Demo\n",
    "\n",
    "\n",
    "<div style=\"display: flex; margin-bottom: 15px; align-items: center\">\n",
    "  <div style=\"background-color: #34a853; color: white; border-radius: 50%; width: 30px; height: 30px; display: flex; justify-content: center; align-items: center; margin-right: 15px; flex-shrink: 0\"><strong>1</strong></div>\n",
    "  <div>\n",
    "    <p style=\"margin: 0; font-weight: bold\"><a href=#section1>Set up LLM and prompts.</a></p>\n",
    "    <p style=\"margin: 0; color: rgba(95, 99, 104, 0.8)\">Set up LLM instance and load example data prompts.</p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "<div style=\"display: flex; margin-bottom: 15px; align-items: center\">\n",
    "  <div style=\"background-color: #34a853; color: white; border-radius: 50%; width: 30px; height: 30px; display: flex; justify-content: center; align-items: center; margin-right: 15px; flex-shrink: 0\"><strong>2</strong></div>\n",
    "  <div>\n",
    "    <p style=\"margin: 0; font-weight: bold\"><a href=#section2>Generate LLM Responses and Confidence Scores</a></p>\n",
    "    <p style=\"margin: 0; color: rgba(95, 99, 104, 0.8)\">Generate and score LLM responses to the example questions using the <code>LLMPanel()</code> class.</p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "<div style=\"display: flex; margin-bottom: 25px; align-items: center\">\n",
    "  <div style=\"background-color: #34a853; color: white; border-radius: 50%; width: 30px; height: 30px; display: flex; justify-content: center; align-items: center; margin-right: 15px; flex-shrink: 0\"><strong>3</strong></div>\n",
    "  <div>\n",
    "    <p style=\"margin: 0; font-weight: bold\"><a href=#section3>Evaluate Hallucination Detection Performance</a></p>\n",
    "    <p style=\"margin: 0; color: rgba(95, 99, 104, 0.8)\">Compute precision, recall, and F1-score of hallucination detection.</p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "## ⚖️ Advantages & Limitations\n",
    "\n",
    "<div style=\"display: flex; gap: 20px\">\n",
    "  <div style=\"flex: 1; background-color: rgba(0, 200, 0, 0.1); padding: 15px; border-radius: 8px; border: 1px solid rgba(0, 200, 0, 0.2)\">\n",
    "    <h3 style=\"color: #2e8b57; margin-top: 0\">Pros</h3>\n",
    "    <ul style=\"margin-bottom: 0\">\n",
    "      <li><strong>Universal Compatibility:</strong> Works with any LLM.</li>\n",
    "      <li><strong>Highly Customizable:</strong> Use any LLM as a judge and tailor instruction prompts for specific use cases.</li>\n",
    "    </ul>\n",
    "  </div>\n",
    "  \n",
    "  <div style=\"flex: 1; background-color: rgba(200, 0, 0, 0.1); padding: 15px; border-radius: 8px; border: 1px solid rgba(200, 0, 0, 0.2)\">\n",
    "    <h3 style=\"color: #b22222; margin-top: 0\">Cons</h3>\n",
    "    <ul style=\"margin-bottom: 0\">\n",
    "      <li><strong>Added cost:</strong> Requires additional LLM calls for the judge LLM(s).</li>\n",
    "    </ul>\n",
    "  </div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from uqlm import LLMPanel\n",
    "from uqlm.judges import LLMJudge\n",
    "from uqlm.utils import load_example_dataset, math_postprocessor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='section1'></a>\n",
    "## 1. Set up LLM and Prompts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this demo, we will illustrate this approach using a set of math questions from the [SVAMP benchmark](https://arxiv.org/abs/2103.07191). To implement with your use case, simply **replace the example prompts with your data**.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading dataset - svamp...\n",
      "Processing dataset...\n",
      "Dataset ready!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>There are 87 oranges and 290 bananas in Philip...</td>\n",
       "      <td>145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Marco and his dad went strawberry picking. Mar...</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Edward spent $ 6 to buy 2 books each book cost...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Frank was reading through his favorite book. T...</td>\n",
       "      <td>198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>There were 78 dollars in Olivia's wallet. She ...</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            question answer\n",
       "0  There are 87 oranges and 290 bananas in Philip...    145\n",
       "1  Marco and his dad went strawberry picking. Mar...     19\n",
       "2  Edward spent $ 6 to buy 2 books each book cost...      3\n",
       "3  Frank was reading through his favorite book. T...    198\n",
       "4  There were 78 dollars in Olivia's wallet. She ...     63"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load example dataset (SVAMP)\n",
    "svamp = load_example_dataset(\"svamp\", n=75)\n",
    "svamp.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Define prompts\n",
    "MATH_INSTRUCTION = \"When you solve this math problem only return the answer with no additional text.\\n\"\n",
    "prompts = [MATH_INSTRUCTION + prompt for prompt in svamp.question]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we use `ChatVertexAI` and `AzureChatOpenAI` to instantiate our LLMs, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. Be sure to **replace with your LLM of choice.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# import sys\n",
    "# !{sys.executable} -m pip install python-dotenv\n",
    "# !{sys.executable} -m pip install langchain-openai\n",
    "\n",
    "# # User to populate .env file with API credentials\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "from langchain_openai import AzureChatOpenAI\n",
    "\n",
    "load_dotenv(find_dotenv())\n",
    "original_llm = AzureChatOpenAI(\n",
    "    deployment_name=os.getenv(\"DEPLOYMENT_NAME\"),\n",
    "    openai_api_key=os.getenv(\"API_KEY\"),\n",
    "    azure_endpoint=os.getenv(\"API_BASE\"),\n",
    "    openai_api_type=os.getenv(\"API_TYPE\"),\n",
    "    openai_api_version=os.getenv(\"API_VERSION\"),\n",
    "    temperature=1,  # User to set temperature\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# import sys\n",
    "# !{sys.executable} -m pip install langchain-google-vertexai\n",
    "from langchain_google_vertexai import ChatVertexAI\n",
    "\n",
    "gemini_pro = ChatVertexAI(model_name=\"gemini-1.5-pro\")\n",
    "gemini_flash = ChatVertexAI(model_name=\"gemini-1.5-flash\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='section2'></a>\n",
    "## 2. Generate responses and confidence scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `LLMPanel()` - Class for aggregating multiple instances of LLMJudge using average, min, max, or majority voting\n",
    "\n",
    "![Sample Image](https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/judges_graphic.png)\n",
    "\n",
    "#### 📋 Class Attributes\n",
    "\n",
    "<table style=\"border-collapse: collapse; width: 100%; border: 1px solid rgba(127, 127, 127, 0.2);\">\n",
    "  <tr>\n",
    "    <th style=\"background-color: rgba(200, 200, 200, 0.2); width: 20%; padding: 8px; text-align: left; border: 1px solid rgba(127, 127, 127, 0.2);\">Parameter</th>\n",
    "    <th style=\"background-color: rgba(200, 200, 200, 0.2); width: 25%; padding: 8px; text-align: left; border: 1px solid rgba(127, 127, 127, 0.2);\">Type & Default</th>\n",
    "    <th style=\"background-color: rgba(200, 200, 200, 0.2); width: 55%; padding: 8px; text-align: left; border: 1px solid rgba(127, 127, 127, 0.2);\">Description</th>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">judges</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">list of LLMJudge or BaseChatModel<br><code></code></td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Judges to use. If BaseChatModel, LLMJudge is instantiated using default parameters.</td>\n",
    "  </tr>    \n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">llm</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">BaseChatModel<br><code>default=None</code></td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object.</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">system_prompt</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">str or None<br><code>default=\"You are a helpful assistant.\"</code></td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Optional argument for user to provide custom system prompt for the LLM.</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">max_calls_per_min</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">int<br><code>default=None</code></td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified.</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">scoring_templates</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">int<br><code>default=None</code></td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies which off-the-shelf template to use for each judge. Four off-the-shelf templates offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), continuous score (0 to 1), and likert scale score (1-5 scale, normalized to 0/0.25/0.5/0.75/1). These templates are respectively specified as 'true_false_uncertain', 'true_false', 'continuous', and 'likert'. If specified, must be of equal length to `judges` list. Defaults to 'true_false_uncertain' template used by Chen and Mueller (2023) for each judge.</td>\n",
    "  </tr>\n",
    "</table>\n",
    "\n",
    "#### 🔍 Parameter Groups\n",
    "\n",
    "<div style=\"display: flex; gap: 20px; margin-bottom: 20px\">\n",
    "  <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 100, 200, 0.1); border-radius: 5px; border: 1px solid rgba(0, 100, 200, 0.2);\">\n",
    "    <p style=\"font-weight: bold\">🧠 LLM-Specific</p>\n",
    "    <ul>\n",
    "      <li><code>llm</code></li>\n",
    "      <li><code>system_prompt</code></li>\n",
    "    </ul>\n",
    "  </div>\n",
    "  <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 200, 0, 0.1); border-radius: 5px; border: 1px solid rgba(0, 200, 0, 0.2);\">\n",
    "    <p style=\"font-weight: bold\">📊 Confidence Scores</p>\n",
    "    <ul>\n",
    "      <li><code>judges</code></li>\n",
    "      <li><code>scoring_templates</code></li>        \n",
    "    </ul>\n",
    "  </div>\n",
    "  <div style=\"flex: 1; padding: 10px; background-color: rgba(200, 0, 200, 0.1); border-radius: 5px; border: 1px solid rgba(200, 0, 200, 0.2);\">\n",
    "    <p style=\"font-weight: bold\">⚡ Performance</p>\n",
    "    <ul>\n",
    "      <li><code>max_calls_per_min</code></li>\n",
    "    </ul>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "#### 💻 Usage Examples\n",
    "\n",
    "```python\n",
    "# Basic usage with single self-judge parameters\n",
    "panel = LLMPanel(llm=llm, judges=[llm])\n",
    "\n",
    "# Using two judges with default parameters\n",
    "panel = LLMPanel(llm=llm, judges=[llm, llm2])\n",
    "\n",
    "# Using two judges, one with continuous template\n",
    "panel = LLMPanel(\n",
    "    llm=llm, judges=[llm, llm2], scoring_templates=['true_false_uncertain', 'continuous']\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "panel = LLMPanel(llm=original_llm, judges=[original_llm, gemini_pro, gemini_flash])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🔄 Class Methods\n",
    "\n",
    "<table style=\"border-collapse: collapse; width: 100%; border: 1px solid rgba(127, 127, 127, 0.2);\">\n",
    "  <tr>\n",
    "    <th style=\"background-color: rgba(200, 200, 200, 0.2); width: 25%; padding: 8px; text-align: left; border: 1px solid rgba(127, 127, 127, 0.2);\">Method</th>\n",
    "    <th style=\"background-color: rgba(200, 200, 200, 0.2); width: 75%; padding: 8px; text-align: left; border: 1px solid rgba(127, 127, 127, 0.2);\">Description & Parameters</th>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; vertical-align: top; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">LLMPanel.generate_and_score</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">\n",
    "      <p>Generate responses to provided prompts and use panel to of judges to score responses for correctness.</p>\n",
    "      <p><strong>Parameters:</strong></p>\n",
    "      <ul>\n",
    "        <li><code>prompts</code> - (<strong>list of str</strong>) A list of input prompts for the model.</li>\n",
    "      </ul>\n",
    "      <p><strong>Returns:</strong> <code>UQResult</code> containing data (prompts, responses, sampled responses, and confidence scores) and metadata</p>\n",
    "      <div style=\"background-color: rgba(0, 200, 0, 0.1); padding: 8px; border-radius: 3px; margin-top: 10px; border: 1px solid rgba(0, 200, 0, 0.2); margin-right: 5px; box-sizing: border-box; width: 100%;\">\n",
    "        <strong>💡 Best For:</strong> Complete end-to-end uncertainty quantification when starting with prompts.\n",
    "      </div>\n",
    "    </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"font-weight: bold; vertical-align: top; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">LLMPanel.score</td>\n",
    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">\n",
    "      <p>Use panel to of judges to score provided responses for correctness. Use if responses are already generated. Otherwise, use `generate_and_score`.</p>\n",
    "      <p><strong>Parameters:</strong></p>\n",
    "      <ul>\n",
    "        <li><code>prompts</code> - (<strong>list of str</strong>) A list of input prompts for the model.</li>\n",
    "        <li><code>responses</code> - (<strong>list of str</strong>) A list of LLM responses for the prompts.</li>\n",
    "      </ul>\n",
    "      <p><strong>Returns:</strong> <code>UQResult</code> containing data (responses and confidence scores) and metadata</p>\n",
    "      <div style=\"background-color: rgba(0, 200, 0, 0.1); padding: 8px; border-radius: 3px; margin-top: 10px; border: 1px solid rgba(0, 200, 0, 0.2); margin-right: 5px; box-sizing: border-box; width: 100%;\">\n",
    "        <strong>💡 Best For:</strong> Computing uncertainty scores when responses are already generated elsewhere.\n",
    "      </div>\n",
    "    </td>\n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generating responses...\n",
      "Generating LLMJudge scores...\n",
      "Generating LLMJudge scores...\n",
      "Generating LLMJudge scores...\n"
     ]
    }
   ],
   "source": [
    "result = await panel.generate_and_score(prompts=prompts)\n",
    "\n",
    "# option 2: provide pre-generated responses with score method\n",
    "# result = await panel.score(prompts=prompts, responses=responses)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prompt</th>\n",
       "      <th>response</th>\n",
       "      <th>judge_1</th>\n",
       "      <th>judge_2</th>\n",
       "      <th>judge_3</th>\n",
       "      <th>avg</th>\n",
       "      <th>max</th>\n",
       "      <th>min</th>\n",
       "      <th>median</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>145</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>19 pounds</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>$3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>198</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>63</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              prompt   response  judge_1  \\\n",
       "0  When you solve this math problem only return t...        145      1.0   \n",
       "1  When you solve this math problem only return t...  19 pounds      1.0   \n",
       "2  When you solve this math problem only return t...         $3      1.0   \n",
       "3  When you solve this math problem only return t...        198      1.0   \n",
       "4  When you solve this math problem only return t...         63      1.0   \n",
       "\n",
       "   judge_2  judge_3  avg  max  min  median  \n",
       "0      1.0      1.0  1.0  1.0  1.0     1.0  \n",
       "1      1.0      1.0  1.0  1.0  1.0     1.0  \n",
       "2      1.0      1.0  1.0  1.0  1.0     1.0  \n",
       "3      1.0      1.0  1.0  1.0  1.0     1.0  \n",
       "4      1.0      1.0  1.0  1.0  1.0     1.0  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_df = result.to_df()\n",
    "result_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='section3'></a>\n",
    "## 3. Evaluate Hallucination Detection Performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note the `math_postprocessor` is specific to our use case (math questions). **If you are using your own prompts/questions, update the grading method accordingly**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prompt</th>\n",
       "      <th>response</th>\n",
       "      <th>judge_1</th>\n",
       "      <th>judge_2</th>\n",
       "      <th>judge_3</th>\n",
       "      <th>avg</th>\n",
       "      <th>max</th>\n",
       "      <th>min</th>\n",
       "      <th>median</th>\n",
       "      <th>answer</th>\n",
       "      <th>response_correct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>145</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>145</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>19 pounds</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>$3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>198</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>198</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>When you solve this math problem only return t...</td>\n",
       "      <td>63</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>63</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              prompt   response  judge_1  \\\n",
       "0  When you solve this math problem only return t...        145      1.0   \n",
       "1  When you solve this math problem only return t...  19 pounds      1.0   \n",
       "2  When you solve this math problem only return t...         $3      1.0   \n",
       "3  When you solve this math problem only return t...        198      1.0   \n",
       "4  When you solve this math problem only return t...         63      1.0   \n",
       "\n",
       "   judge_2  judge_3  avg  max  min  median answer  response_correct  \n",
       "0      1.0      1.0  1.0  1.0  1.0     1.0    145              True  \n",
       "1      1.0      1.0  1.0  1.0  1.0     1.0     19              True  \n",
       "2      1.0      1.0  1.0  1.0  1.0     1.0      3              True  \n",
       "3      1.0      1.0  1.0  1.0  1.0     1.0    198              True  \n",
       "4      1.0      1.0  1.0  1.0  1.0     1.0     63              True  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Populate correct answers and grade responses\n",
    "result_df[\"answer\"] = svamp.answer\n",
    "result_df[\"response_correct\"] = [math_postprocessor(r) == a for r, a in zip(result_df[\"response\"], svamp[\"answer\"])]\n",
    "result_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Judge 1 precision: 0.8620689655172413\n",
      "Judge 1 recall: 0.78125\n",
      "Judge 1 f1-score: 0.819672131147541\n",
      " \n",
      "Judge 2 precision: 0.9375\n",
      "Judge 2 recall: 0.9375\n",
      "Judge 2 f1-score: 0.9375\n",
      " \n",
      "Judge 3 precision: 0.9090909090909091\n",
      "Judge 3 recall: 0.9375\n",
      "Judge 3 f1-score: 0.9230769230769231\n",
      " \n"
     ]
    }
   ],
   "source": [
    "# evaluate precision, recall, and f1-score of Semantic Entropy's predictions of correctness\n",
    "from sklearn.metrics import precision_score, recall_score, f1_score\n",
    "\n",
    "for ind in [1, 2, 3]:\n",
    "    y_pred = [(s > 0) * 1 for s in result_df[f\"judge_{str(ind)}\"]]\n",
    "    y_true = result_df.response_correct\n",
    "    print(f\"Judge {ind} precision: {precision_score(y_true=y_true, y_pred=y_pred)}\")\n",
    "    print(f\"Judge {ind} recall: {recall_score(y_true=y_true, y_pred=y_pred)}\")\n",
    "    print(f\"Judge {ind} f1-score: {f1_score(y_true=y_true, y_pred=y_pred)}\")\n",
    "    print(\" \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='section5'></a>\n",
    "## 5. Scorer Definitions\n",
    "Under the LLM-as-a-Judge approach, either the same LLM that was used for generating the original responses or a different LLM is asked to form a judgment about a pre-generated response. Below, we define two LLM-as-a-Judge scorer templates. \n",
    "### Categorical Judge Template (`true_false_uncertain`)\n",
    "We follow the approach proposed by [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175) in which an LLM is instructed to score a question-answer concatenation as either  *incorrect*, *uncertain*, or *correct* using a carefully constructed prompt. These categories are respectively mapped to numerical scores of 0, 0.5, and 1. We denote the LLM-as-a-judge scorers as $J: \\mathcal{Y} \\xrightarrow[]{} \\{0, 0.5, 1\\}$. Formally, we can write this scorer function as follows:\n",
    "\n",
    "\\begin{equation}\n",
    "J(y_i) = \\begin{cases}\n",
    "    0 & \\text{LLM states response is incorrect} \\\\\n",
    "    0.5 & \\text{LLM states that it is uncertain} \\\\\n",
    "    1 & \\text{LLM states response is correct}.\n",
    "\\end{cases}\n",
    "\\end{equation}\n",
    "\n",
    "### Continuous Judge Template (`continuous`)\n",
    "For the continuous template, the LLM is asked to directly score a question-answer concatenation's correctness on a scale of 0 to 1. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "© 2025 CVS Health and/or one of its affiliates. All rights reserved."
   ]
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "uqlm_my_test",
   "name": "workbench-notebooks.m126",
   "type": "gcloud",
   "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126"
  },
  "kernelspec": {
   "display_name": "uqlm_my_test",
   "language": "python",
   "name": "uqlm_my_test"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}