{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 Semantic Density\n", "\n", "
\n", "

\n", "This demo illustrates semantic density, a state-of-the-art uncertainty quantification (UQ) method proposed by Qiu et al. (2024). Semantic density combines elements of black-box UQ (which generates multiple responses from the same prompt) and white-box UQ (which uses the token probabilities of those generated responses). Intuitively, it uses both signals to estimate a probability distribution over the semantic space of responses and scores each response by its density under that distribution.\n", "

\n", "
\n", " \n", "## 📊 What You'll Do in This Demo\n", "\n", "
\n", "
1
\n", "
\n", "

Set up LLM and prompts.

\n", "

Set up an LLM instance and load the example prompts.

\n", "
\n", "
\n", "\n", "
\n", "
2
\n", "
\n", "

Generate LLM Responses and Confidence Scores

\n", "

Generate and score LLM responses to the example questions using the SemanticDensity() class.

\n", "
\n", "
\n", "\n", "
\n", "
3
\n", "
\n", "

Evaluate Hallucination Detection Performance

\n", "

Visualize model accuracy at different thresholds of the semantic density score. Compute precision, recall, and F1-score of hallucination detection.

\n", "
\n", "
\n", "\n", "## ⚖️ Advantages & Limitations\n", "\n", "
\n", "
\n", "

Pros

\n", " \n", "
\n", " \n", "
\n", "

Cons

\n", " \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "from uqlm.utils import load_example_dataset, plot_model_accuracies, LLMGrader, Tuner\n", "from uqlm.scorers import SemanticDensity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Set up LLM and Prompts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo, we will illustrate this approach using a set of short answer questions from the [SimpleQA benchmark](https://openai.com/index/introducing-simpleqa/). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading dataset - simpleqa...\n", "Processing dataset...\n", "Dataset ready!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0How much money, in euros, was the surgeon held...120,000 euros
1What is the name of the former Prime Minister ...Jóhanna Sigurðardóttir
2To whom did Mehbooba Mufti Sayed contest the 2...Hasnain Masoodi
3In which year did Melbourne's Monash Gallery o...2023
4Who requested the Federal Aviation Administrat...The Coast Guard
\n", "
" ], "text/plain": [ " question answer\n", "0 How much money, in euros, was the surgeon held... 120,000 euros\n", "1 What is the name of the former Prime Minister ... Jóhanna Sigurðardóttir\n", "2 To whom did Mehbooba Mufti Sayed contest the 2... Hasnain Masoodi\n", "3 In which year did Melbourne's Monash Gallery o... 2023\n", "4 Who requested the Federal Aviation Administrat... The Coast Guard" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load example dataset (simpleqa)\n", "simpleqa = load_example_dataset(\"simpleqa\", n=200)\n", "simpleqa.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define prompts\n", "INSTRUCTION = \"You will be given a question. Return only the answer as concisely as possible without providing an explanation.\\n\"\n", "prompts = [INSTRUCTION + prompt for prompt in simpleqa.question]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import sys\n", "# !{sys.executable} -m pip install langchain-google-vertexai\n", "from langchain_google_vertexai import ChatVertexAI\n", "\n", "llm = ChatVertexAI(model_name=\"gemini-2.5-pro\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. 
Generate responses and confidence scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `SemanticDensity()` - Generate LLM responses and compute consistency-based confidence scores for each response.\n", "\n", "#### 📋 Class Attributes\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ParameterType & DefaultDescription
llmBaseChatModel
default=None
A LangChain `BaseChatModel` instance. The user is responsible for specifying temperature and other relevant parameters in the constructor of the provided `llm` object.
devicestr or torch.device
default=\"cpu\"
Specifies the device that the NLI model uses for prediction. Only applies to the 'noncontradiction' scorer. Pass a torch.device to leverage GPU.
system_promptstr or None
default=\"You are a helpful assistant.\"
Optional argument for user to provide custom system prompt for the LLM.
max_calls_per_minint
default=None
Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified.
length_normalizebool
default=True
Determines whether response probabilities are length-normalized. Recommended to set as True when longer responses are expected.
use_n_parambool
default=False
Specifies whether to use the n parameter of BaseChatModel. Not compatible with all BaseChatModel classes. If used, it substantially speeds up generation when num_responses is large.
postprocessorcallable
default=None
A user-defined function that takes a string input and returns a string. Used for postprocessing outputs.
sampling_temperaturefloat
default=1
The 'temperature' parameter used by the LLM to generate sampled responses. Must be greater than 0.
nli_model_namestr
default=\"microsoft/deberta-large-mnli\"
Specifies which NLI model to use. Must be acceptable input to AutoTokenizer.from_pretrained() and AutoModelForSequenceClassification.from_pretrained().
max_lengthint
default=2000
Specifies the maximum allowed string length for LLM responses for NLI computation. Responses longer than this value will be truncated in NLI computations to avoid OutOfMemoryError.
return_responsesstr
default=\"all\"
If a postprocessor is used, specifies whether to return only postprocessed responses, only raw responses, or both. Specified with 'postprocessed', 'raw', or 'all', respectively.
\n", "\n", "#### 🔍 Parameter Groups\n", "\n", "
\n", "
\n", "

🧠 LLM-Specific

\n", " \n", "
\n", "
\n", "

📊 Confidence Scores

\n", " \n", "
\n", "
\n", "

🖥️ Hardware

\n", " \n", "
\n", "
\n", "

⚡ Performance

\n", " \n", "
\n", "
\n", "\n", "#### 💻 Usage Examples\n", "\n", "```python\n", "# Basic usage with default parameters\n", "sd = SemanticDensity(llm=llm)\n", "\n", "# Using GPU acceleration, default scorers\n", "sd = SemanticDensity(llm=llm, device=torch.device(\"cuda\"))\n", "\n", "# High-throughput configuration with rate limiting\n", "sd = SemanticDensity(llm=llm, max_calls_per_min=200, use_n_param=True) \n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n" ] } ], "source": [ "import torch\n", "\n", "# Set the torch device\n", "if torch.cuda.is_available(): # NVIDIA GPU\n", " device = torch.device(\"cuda\")\n", "elif torch.backends.mps.is_available(): # macOS\n", " device = torch.device(\"mps\")\n", "else:\n", " device = torch.device(\"cpu\") # CPU\n", "print(f\"Using {device.type} device\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "sd = SemanticDensity(\n", " llm=llm,\n", " max_calls_per_min=250, # set value to avoid rate limit error\n", " device=device, # use if GPU available\n", " length_normalize=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🔄 Class Methods\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MethodDescription & Parameters
SemanticDensity.generate_and_score\n", "

Generate LLM responses and sampled (candidate) LLM responses for the provided prompts, then compute density scores.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (List[str]) A list of input prompts for the model.
  • \n", "
  • num_responses - (int, default=5) The number of sampled responses used to compute consistency.
  • \n", "
  • show_progress_bars - (bool, default=True) If True, displays a progress bar while generating and scoring responses.
  • \n", "
\n", "

Returns: UQResult containing data (prompts, responses, sampled responses, and density score) and metadata

\n", "
\n", " 💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.\n", "
\n", "
SemanticDensity.score\n", "

Compute density score on provided LLM responses. Should only be used if responses and sampled responses are already generated.

\n", "

Parameters:

\n", "
    \n", "
  • prompts - (List[str]) A list of prompts from which the responses were generated.
  • \n", "
  • responses - (List[str]) A list of LLM responses for the prompts.
  • \n", "
  • sampled_responses - (List[List[str]]) A list of lists of sampled LLM responses for each prompt. These will be used to compute consistency scores by comparing to the corresponding response from responses.
  • \n", "
  • logprob_results - (List[logprob_result]) A list of dictionaries, each returned by BaseChatModel.agenerate corresponding to responses.
  • \n", "
  • sampled_logprob_results - (List[List[logprob_result]], default=None) A list of lists of dictionaries, each returned by BaseChatModel.agenerate. These must correspond to sampled_responses.
  • \n", "
  • show_progress_bars - (bool, default=True) If True, displays a progress bar while scoring responses.
  • \n", "
\n", "

Returns: UQResult containing data (responses, sampled responses, and density score) and metadata

\n", "
\n", " 💡 Best For: Computing uncertainty scores when responses, sampled responses, and logprobs are already generated elsewhere.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [], "source": [ "results = await sd.generate_and_score(prompts=prompts, num_responses=5)\n", "\n", "# # alternative approach: directly score if responses already generated\n", "# results = sd.score(responses=responses, sampled_responses=sampled_responses)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
responsesampled_responsespromptsemantic_density_valuemultiple_logprob
0€120,000[€120,000, €120,000, €136,000, €120,000, €120,...You will be given a question. Return only the ...0.865526[[{'token': '€', 'logprob': -4.172499757260084...
1Jóhanna Sigurðardóttir[Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti...You will be given a question. Return only the ...0.992922[[{'token': 'J', 'logprob': -9.536738616588991...
2Hasnain Masoodi[Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas...You will be given a question. Return only the ...0.993829[[{'token': 'Has', 'logprob': -2.5987286790041...
32023[2023, 2023, 2022, 2022, 2022]You will be given a question. Return only the ...0.941380[[{'token': '2', 'logprob': -1.430510337740997...
4BP and the U.S. Coast Guard[BP, BP, BP and the U.S. Coast Guard, BP (Brit...You will be given a question. Return only the ...0.978186[[{'token': 'BP', 'logprob': -4.20799915445968...
\n", "
" ], "text/plain": [ " response \\\n", "0 €120,000 \n", "1 Jóhanna Sigurðardóttir \n", "2 Hasnain Masoodi \n", "3 2023 \n", "4 BP and the U.S. Coast Guard \n", "\n", " sampled_responses \\\n", "0 [€120,000, €120,000, €136,000, €120,000, €120,... \n", "1 [Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... \n", "2 [Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... \n", "3 [2023, 2023, 2022, 2022, 2022] \n", "4 [BP, BP, BP and the U.S. Coast Guard, BP (Brit... \n", "\n", " prompt semantic_density_value \\\n", "0 You will be given a question. Return only the ... 0.865526 \n", "1 You will be given a question. Return only the ... 0.992922 \n", "2 You will be given a question. Return only the ... 0.993829 \n", "3 You will be given a question. Return only the ... 0.941380 \n", "4 You will be given a question. Return only the ... 0.978186 \n", "\n", " multiple_logprob \n", "0 [[{'token': '€', 'logprob': -4.172499757260084... \n", "1 [[{'token': 'J', 'logprob': -9.536738616588991... \n", "2 [[{'token': 'Has', 'logprob': -2.5987286790041... \n", "3 [[{'token': '2', 'logprob': -1.430510337740997... \n", "4 [[{'token': 'BP', 'logprob': -4.20799915445968... " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = results.to_df()\n", "result_df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Evaluate Hallucination Detection Performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Here, we use UQLM's out-of-the-box LLM Grader, which can be used with [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/), but you may replace this with a grading method of your choice. Some notable alternatives are [Vectara HHEM](https://huggingface.co/vectara/hallucination_evaluation_model) and [AlignScore](https://github.com/yuh-zha/AlignScore). 
**If you are using your own prompts/questions, be sure to update the grading method accordingly**." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
responsesampled_responsespromptsemantic_density_valuemultiple_logprobanswerresponse_correct
0€120,000[€120,000, €120,000, €136,000, €120,000, €120,...You will be given a question. Return only the ...0.865526[[{'token': '€', 'logprob': -4.172499757260084...120,000 eurosTrue
1Jóhanna Sigurðardóttir[Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti...You will be given a question. Return only the ...0.992922[[{'token': 'J', 'logprob': -9.536738616588991...Jóhanna SigurðardóttirTrue
2Hasnain Masoodi[Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas...You will be given a question. Return only the ...0.993829[[{'token': 'Has', 'logprob': -2.5987286790041...Hasnain MasoodiTrue
32023[2023, 2023, 2022, 2022, 2022]You will be given a question. Return only the ...0.941380[[{'token': '2', 'logprob': -1.430510337740997...2023True
4BP and the U.S. Coast Guard[BP, BP, BP and the U.S. Coast Guard, BP (Brit...You will be given a question. Return only the ...0.978186[[{'token': 'BP', 'logprob': -4.20799915445968...The Coast GuardFalse
\n", "
" ], "text/plain": [ " response \\\n", "0 €120,000 \n", "1 Jóhanna Sigurðardóttir \n", "2 Hasnain Masoodi \n", "3 2023 \n", "4 BP and the U.S. Coast Guard \n", "\n", " sampled_responses \\\n", "0 [€120,000, €120,000, €136,000, €120,000, €120,... \n", "1 [Jóhanna Sigurðardóttir, Jóhanna Sigurðardótti... \n", "2 [Hasnain Masoodi, Hasnain Masoodi, Hasnain Mas... \n", "3 [2023, 2023, 2022, 2022, 2022] \n", "4 [BP, BP, BP and the U.S. Coast Guard, BP (Brit... \n", "\n", " prompt semantic_density_value \\\n", "0 You will be given a question. Return only the ... 0.865526 \n", "1 You will be given a question. Return only the ... 0.992922 \n", "2 You will be given a question. Return only the ... 0.993829 \n", "3 You will be given a question. Return only the ... 0.941380 \n", "4 You will be given a question. Return only the ... 0.978186 \n", "\n", " multiple_logprob answer \\\n", "0 [[{'token': '€', 'logprob': -4.172499757260084... 120,000 euros \n", "1 [[{'token': 'J', 'logprob': -9.536738616588991... Jóhanna Sigurðardóttir \n", "2 [[{'token': 'Has', 'logprob': -2.5987286790041... Hasnain Masoodi \n", "3 [[{'token': '2', 'logprob': -1.430510337740997... 2023 \n", "4 [[{'token': 'BP', 'logprob': -4.20799915445968... 
The Coast Guard \n", "\n", " response_correct \n", "0 True \n", "1 True \n", "2 True \n", "3 True \n", "4 False " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Populate correct answers and grade responses\n", "gemini_flash = ChatVertexAI(model=\"gemini-2.5-flash\")\n", "grader = LLMGrader(llm=gemini_flash)\n", "\n", "result_df[\"answer\"] = simpleqa[\"answer\"]\n", "result_df[\"response_correct\"] = await grader.grade_responses(prompts=simpleqa[\"question\"].to_list(), responses=result_df[\"response\"].to_list(), answers=simpleqa[\"answer\"].to_list())\n", "result_df.head(5)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline LLM accuracy: 0.54\n" ] } ], "source": [ "print(f\"\"\"Baseline LLM accuracy: {np.mean(result_df[\"response_correct\"])}\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Filtered LLM Accuracy Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we explore ‘filtered accuracy’ as a metric for evaluating the performance of our density scores. Filtered accuracy measures the change in LLM performance when responses with density scores below a specified threshold are excluded. By adjusting the density score threshold, we can observe how the accuracy of the LLM improves as less certain responses are filtered out.\n", "\n", "We will plot the filtered accuracy across various density score thresholds to visualize the relationship between density and LLM accuracy. This analysis helps in understanding the trade-off between response coverage (measured by sample size below) and LLM accuracy, providing insights into the reliability of the LLM’s outputs. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Semantic Density\n", "plot_model_accuracies(scores=result_df.semantic_density_value, correct_indicators=result_df.response_correct, display_percentage=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Precision, Recall, F1-Score of Hallucination Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we compute the optimal threshold for binarizing confidence scores, using F1-score as the objective. Using these thresholds, we compute precision, recall, and F1-score for our semantic density-based scorer predictions of whether responses are correct." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "Metrics semantic_density_value \n", "------------------------------------------------------------\n", "Precision 0.765 \n", "Recall 0.765 \n", "F1-score 0.765 \n", "------------------------------------------------------------\n", "F-1 optimal threshold 0.61 \n", "============================================================\n" ] } ], "source": [ "# instantiate UQLM tuner object for threshold selection\n", "split = len(result_df) // 2\n", "t = Tuner()\n", "\n", "correct_indicators = (result_df.response_correct) * 1 # Whether responses is actually correct\n", "metric_values = {\"Precision\": [], \"Recall\": [], \"F1-score\": []}\n", "optimal_thresholds = []\n", "\n", "# tune threshold on first half\n", "y_scores = result_df[\"semantic_density_value\"]\n", "y_scores_tune = y_scores[0:split]\n", "y_true_tune = correct_indicators[0:split]\n", "best_threshold = t.tune_threshold(y_scores=y_scores_tune, correct_indicators=y_true_tune, thresh_objective=\"fbeta_score\")\n", "\n", "y_pred = [(s > best_threshold) * 1 for s in y_scores] # predicts whether response is correct based on confidence score\n", 
"optimal_thresholds.append(best_threshold)\n", "\n", "# evaluate on last half\n", "y_true_eval = correct_indicators[split:]\n", "y_pred_eval = y_pred[split:]\n", "metric_values[\"Precision\"].append(precision_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", "metric_values[\"Recall\"].append(recall_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", "metric_values[\"F1-score\"].append(f1_score(y_true=y_true_eval, y_pred=y_pred_eval))\n", "\n", "# print results\n", "header = f\"{'Metrics':<25}\" + f\"{'semantic_density_value':<35}\"\n", "print(\"=\" * len(header) + \"\\n\" + header + \"\\n\" + \"-\" * len(header))\n", "for metric in metric_values.keys():\n", " print(f\"{metric:<25}\" + \"\".join([f\"{round(x_, 3):<35}\" for x_ in metric_values[metric]]))\n", "print(\"-\" * len(header))\n", "print(f\"{'F-1 optimal threshold':<25}\" + \"\".join([f\"{round(x_, 3):<35}\" for x_ in optimal_thresholds]))\n", "print(\"=\" * len(header))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 4. Scorer Definition\n", "### Semantic Density\n", "Semantic Density (SD) approximates a probability density function (PDF) in semantic space for estimating response correctness. Given a prompt $x$ with candidate response $y_*$, the objective is to construct a PDF that assigns higher density to regions in the semantic space that correspond to correct responses. We begin by sampling $M$ unique reference responses $y_i$ (for $i = 1, 2, \\dots, M$) conditioned on $x$. For any pair of responses $y_i, y_j$ with corresponding embeddings $v_i, v_j$, the semantic distance is estimated as\n", "\n", "\n", "$$ \\mathbb{E}(\\Vert v_i,v_j \\Vert^2) = p_c(y_i, y_j | x) + \\dfrac{1}{2} \\cdot p_n(y_i, y_j | x)$$ \n", "\n", "where $p_c, p_n$ denote the contradiction and neutrality scores returned by a natural language inference (NLI) model, respectively. 
This estimated distance is incorporated in the kernel function $K$ to smooth out the reference responses into a continuous distribution. The kernel function value can be obtained as\n", "\n", "\n", "$$ K(v_*, v_i) = (1 - \\mathbb{E}(\\Vert v_* - v_i \\Vert^2))\\mathbf{1}_{\\mathbb{E}(\\Vert v_* - v_i \\Vert^2) \\leq 1}$$ \n", "\n", "\n", "where $\\mathbf{1}$ is the indicator function such that $\\mathbf{1}_{\\text{condition}} = 1$ when the condition holds and $0$ otherwise. The final semantic density score is computed as\n", "\n", "\n", "$$ SD(y_* | x) = \\dfrac{1}{\\sum^M_{i=1}\\sqrt[L_i]{p(y_i|x)}}\\sum^M_{i=1}\\sqrt[L_i]{p(y_i|x)}K(v_*, v_i)$$ \n", "\n", "\n", "where $L_i$ denotes the length of $y_i$, so that $\\sqrt[L_i]{p(y_i|x)}$ is the length-normalized probability of $y_i$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "© 2025 CVS Health and/or one of its affiliates. All rights reserved." ] } ], "metadata": { "environment": { "kernel": "uqlm_my_test", "name": "workbench-notebooks.m126", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m126" }, "kernelspec": { "display_name": "uqlm_my_test", "language": "python", "name": "uqlm_my_test" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 4 }