# LLM-as-a-Judge
LLM-as-a-Judge scorers use one or more LLMs to evaluate the reliability of the original LLM's response. They offer high customizability through prompt engineering and the choice of judge LLM(s). Below is a list of the available scorers:

- Categorical LLM-as-a-Judge (Manakul et al., 2023; Chen & Mueller, 2023; Luo et al., 2023)
- Continuous LLM-as-a-Judge (Xiong et al., 2024)
- Panel of LLM Judges (Verga et al., 2024)
## What You'll Do in This Demo

1. **Set up LLM and prompts.** Set up the LLM instance and load example data prompts.
2. **Generate LLM responses and confidence scores.** Generate and score LLM responses to the example questions using the `LLMPanel()` class.
3. **Evaluate hallucination detection performance.** Compute precision, recall, and F1-score of hallucination detection.
## Advantages & Limitations

**Pros**

- **Universal Compatibility:** Works with any LLM.
- **Highly Customizable:** Use any LLM as a judge and tailor instruction prompts for specific use cases.

**Cons**

- **Added Cost:** Requires additional LLM calls for the judge LLM(s).
[1]:
import os
from uqlm import LLMPanel
from uqlm.judges import LLMJudge
from uqlm.utils import load_example_dataset, math_postprocessor
## 1. Set up LLM and Prompts
In this demo, we will illustrate this approach using a set of math questions from the SVAMP benchmark. To implement with your use case, simply replace the example prompts with your data.
[2]:
# Load example dataset (SVAMP)
svamp = load_example_dataset("svamp", n=75)
svamp.head()
Loading dataset - svamp...
Processing dataset...
Dataset ready!
[2]:
|   | question | answer |
|---|----------|--------|
| 0 | There are 87 oranges and 290 bananas in Philip... | 145 |
| 1 | Marco and his dad went strawberry picking. Mar... | 19 |
| 2 | Edward spent $ 6 to buy 2 books each book cost... | 3 |
| 3 | Frank was reading through his favorite book. T... | 198 |
| 4 | There were 78 dollars in Olivia's wallet. She ... | 63 |
[3]:
# Define prompts
MATH_INSTRUCTION = "When you solve this math problem only return the answer with no additional text.\n"
prompts = [MATH_INSTRUCTION + prompt for prompt in svamp.question]
In this example, we use `ChatVertexAI` and `AzureChatOpenAI` to instantiate our LLMs, but any LangChain Chat Model may be used. Be sure to replace these with your LLM of choice.
[4]:
# import sys
# !{sys.executable} -m pip install python-dotenv
# !{sys.executable} -m pip install langchain-openai
# # User to populate .env file with API credentials
from dotenv import load_dotenv, find_dotenv
from langchain_openai import AzureChatOpenAI
load_dotenv(find_dotenv())
original_llm = AzureChatOpenAI(
deployment_name=os.getenv("DEPLOYMENT_NAME"),
openai_api_key=os.getenv("API_KEY"),
azure_endpoint=os.getenv("API_BASE"),
openai_api_type=os.getenv("API_TYPE"),
openai_api_version=os.getenv("API_VERSION"),
temperature=1, # User to set temperature
)
[5]:
# import sys
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI
gemini_pro = ChatVertexAI(model_name="gemini-1.5-pro")
gemini_flash = ChatVertexAI(model_name="gemini-1.5-flash")
## 2. Generate Responses and Confidence Scores
### `LLMPanel()`: Class for aggregating multiple instances of `LLMJudge` using average, min, max, or majority voting
#### Class Attributes

| Parameter | Type & Default | Description |
|---|---|---|
| `judges` | list of `LLMJudge` or `BaseChatModel` | Judges to use. If `BaseChatModel`, `LLMJudge` is instantiated using default parameters. |
| `llm` | `BaseChatModel`, default=None | A LangChain LLM. |
| `system_prompt` | str or None, default="You are a helpful assistant." | Optional argument for user to provide a custom system prompt for the LLM. |
| `max_calls_per_min` | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| `scoring_templates` | list of str, default=None | Specifies which off-the-shelf template to use for each judge. Four off-the-shelf templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), continuous score (0 to 1), and Likert scale score (1-5 scale, normalized to 0/0.25/0.5/0.75/1). These templates are respectively specified as "true_false_uncertain", "true_false", "continuous", and "likert". If specified, must be of equal length to `judges`. |
#### Parameter Groups

**LLM-Specific**
- `llm`
- `system_prompt`

**Confidence Scores**
- `judges`
- `scoring_templates`

**Performance**
- `max_calls_per_min`
#### Usage Examples

```python
# Basic usage with a single self-judge
panel = LLMPanel(llm=llm, judges=[llm])

# Using two judges with default parameters
panel = LLMPanel(llm=llm, judges=[llm, llm2])

# Using two judges, one with continuous template
panel = LLMPanel(
    llm=llm, judges=[llm, llm2], scoring_templates=['true_false_uncertain', 'continuous']
)
```
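If the judge LLMs are subject to API rate limits, the `max_calls_per_min` attribute described above can also be passed when constructing the panel. A minimal sketch (the limit of 200 calls per minute is an illustrative value, not a recommendation):

```python
# Throttle API calls to the judges to avoid rate-limit errors
panel = LLMPanel(
    llm=llm,
    judges=[llm, llm2],
    max_calls_per_min=200,  # illustrative value; set according to your quota
)
```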
[7]:
panel = LLMPanel(llm=original_llm, judges=[original_llm, gemini_pro, gemini_flash])
#### Class Methods

| Method | Description & Parameters |
|---|---|
| `LLMPanel.generate_and_score` | Generate responses to provided prompts and use the panel of judges to score responses for correctness. **Parameters:** `prompts` - list of prompts to generate responses for. **Returns:** `UQResult` containing data (prompts, responses, and confidence scores) and metadata. **Best for:** complete end-to-end uncertainty quantification when starting with prompts. |
| `LLMPanel.score` | Use the panel of judges to score provided responses for correctness. Use if responses are already generated; otherwise, use `generate_and_score`. **Parameters:** `prompts` - list of prompts, `responses` - list of pre-generated responses corresponding to the prompts. **Returns:** `UQResult` containing data (responses and confidence scores) and metadata. **Best for:** computing uncertainty scores when responses are already generated elsewhere. |
[8]:
result = await panel.generate_and_score(prompts=prompts)
# option 2: provide pre-generated responses with score method
# result = await panel.score(prompts=prompts, responses=responses)
Generating responses...
Generating LLMJudge scores...
Generating LLMJudge scores...
Generating LLMJudge scores...
[9]:
result_df = result.to_df()
result_df.head()
[9]:
|   | prompt | response | judge_1 | judge_2 | judge_3 | avg | max | min | median |
|---|---|---|---|---|---|---|---|---|---|
| 0 | When you solve this math problem only return t... | 145 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | When you solve this math problem only return t... | 19 pounds | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | When you solve this math problem only return t... | $3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | When you solve this math problem only return t... | 198 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | When you solve this math problem only return t... | 63 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
## 3. Evaluate Hallucination Detection Performance
To evaluate hallucination detection performance, we "grade" the responses against an answer key. Note that the `math_postprocessor` is specific to our use case (math questions). If you are using your own prompts/questions, update the grading method accordingly.
[10]:
# Populate correct answers and grade responses
result_df["answer"] = svamp.answer
result_df["response_correct"] = [math_postprocessor(r) == a for r, a in zip(result_df["response"], svamp["answer"])]
result_df.head(5)
[10]:
|   | prompt | response | judge_1 | judge_2 | judge_3 | avg | max | min | median | answer | response_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | When you solve this math problem only return t... | 145 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 145 | True |
| 1 | When you solve this math problem only return t... | 19 pounds | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 19 | True |
| 2 | When you solve this math problem only return t... | $3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3 | True |
| 3 | When you solve this math problem only return t... | 198 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 198 | True |
| 4 | When you solve this math problem only return t... | 63 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 63 | True |
[11]:
# Evaluate precision, recall, and F1-score of each judge's predictions of response correctness
from sklearn.metrics import precision_score, recall_score, f1_score

for ind in [1, 2, 3]:
    # Treat any judge score above 0 (i.e., "uncertain" or "correct") as a prediction of correctness
    y_pred = [(s > 0) * 1 for s in result_df[f"judge_{ind}"]]
    y_true = result_df.response_correct
    print(f"Judge {ind} precision: {precision_score(y_true=y_true, y_pred=y_pred)}")
    print(f"Judge {ind} recall: {recall_score(y_true=y_true, y_pred=y_pred)}")
    print(f"Judge {ind} f1-score: {f1_score(y_true=y_true, y_pred=y_pred)}")
    print(" ")
Judge 1 precision: 0.8620689655172413
Judge 1 recall: 0.78125
Judge 1 f1-score: 0.819672131147541
Judge 2 precision: 0.9375
Judge 2 recall: 0.9375
Judge 2 f1-score: 0.9375
Judge 3 precision: 0.9090909090909091
Judge 3 recall: 0.9375
Judge 3 f1-score: 0.9230769230769231
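The panel-level aggregate columns in `result_df` (`avg`, `max`, `min`, `median`) can be evaluated the same way. A minimal sketch using the `avg` column with an illustrative threshold of 0.5:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Treat the panel's average judge score as a single detector:
# predict "correct" when the average score is at least 0.5 (illustrative threshold)
y_true = result_df.response_correct
avg_pred = [(s >= 0.5) * 1 for s in result_df["avg"]]
print(f"Panel (avg) precision: {precision_score(y_true=y_true, y_pred=avg_pred)}")
print(f"Panel (avg) recall: {recall_score(y_true=y_true, y_pred=avg_pred)}")
print(f"Panel (avg) f1-score: {f1_score(y_true=y_true, y_pred=avg_pred)}")
```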
## 4. Scorer Definitions

Under the LLM-as-a-Judge approach, either the same LLM that was used for generating the original responses or a different LLM is asked to form a judgment about a pre-generated response. Below, we define two LLM-as-a-Judge scorer templates.

### Categorical Judge Template (`true_false_uncertain`)

We follow the approach proposed by Chen & Mueller, 2023, in which an LLM is instructed to score a question-answer concatenation as either incorrect, uncertain, or correct using a carefully constructed prompt. These categories are respectively mapped to numerical scores of 0, 0.5, and 1. We denote the LLM-as-a-judge scorer as \(J: \mathcal{Y} \rightarrow \{0, 0.5, 1\}\). Formally, we can write this scorer function as follows:

\begin{equation}
J(y_i) = \begin{cases}
0 & \text{LLM states response is incorrect} \\
0.5 & \text{LLM states that it is uncertain} \\
1 & \text{LLM states response is correct}.
\end{cases}
\end{equation}
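To make the mapping concrete, the sketch below converts a categorical judge verdict into the score defined above. The helper function and category strings here are hypothetical illustrations; uqlm's built-in `true_false_uncertain` template performs this mapping internally.

```python
# Hypothetical illustration of the categorical (true_false_uncertain) mapping:
# the judge LLM's verdict about a question-answer pair becomes a score in {0, 0.5, 1}
CATEGORY_TO_SCORE = {
    "incorrect": 0.0,  # LLM states response is incorrect
    "uncertain": 0.5,  # LLM states that it is uncertain
    "correct": 1.0,    # LLM states response is correct
}

def categorical_judge_score(verdict: str) -> float:
    """Map a categorical judge verdict to the score J(y_i) defined above."""
    return CATEGORY_TO_SCORE[verdict.strip().lower()]

print(categorical_judge_score("Correct"))  # 1.0
```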
### Continuous Judge Template (`continuous`)

For the continuous template, the LLM is asked to directly score a question-answer concatenation's correctness on a scale of 0 to 1.
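A judge using this template can be configured by passing `"continuous"` in `scoring_templates`, as in the usage examples above. A minimal sketch reusing the LLMs instantiated earlier in this demo:

```python
# Panel with a single judge that scores correctness on a continuous 0-1 scale
continuous_panel = LLMPanel(
    llm=original_llm,
    judges=[gemini_pro],
    scoring_templates=["continuous"],
)
continuous_result = await continuous_panel.generate_and_score(prompts=prompts)
```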
© 2025 CVS Health and/or one of its affiliates. All rights reserved.