🎯 LLM-as-a-Judge#

LLM-as-a-Judge scorers use one or more LLMs to evaluate the reliability of the original LLM's response. They offer high customizability through prompt engineering and the choice of judge LLM(s).

📊 What You'll Do in This Demo#

1. Set up LLM and prompts: set up an LLM instance and load example data prompts.

2. Generate LLM responses and confidence scores: generate and score LLM responses to the example questions using the LLMPanel() class.

3. Evaluate hallucination detection performance: compute precision, recall, and F1-score of hallucination detection.

โš–๏ธ Advantages & Limitations#

Pros

  • Universal Compatibility: Works with any LLM.

  • Highly Customizable: Use any LLM as a judge and tailor instruction prompts for specific use cases.

Cons

  • Added cost: Requires additional LLM calls for the judge LLM(s).

[1]:
import os
from uqlm import LLMPanel
from uqlm.judges import LLMJudge
from uqlm.utils import load_example_dataset, math_postprocessor

## 1. Set up LLM and Prompts

In this demo, we will illustrate this approach using a set of math questions from the SVAMP benchmark. To implement with your use case, simply replace the example prompts with your data.

[2]:
# Load example dataset (SVAMP)
svamp = load_example_dataset("svamp", n=75)
svamp.head()
Loading dataset - svamp...
Processing dataset...
Dataset ready!
[2]:
question answer
0 There are 87 oranges and 290 bananas in Philip... 145
1 Marco and his dad went strawberry picking. Mar... 19
2 Edward spent $ 6 to buy 2 books each book cost... 3
3 Frank was reading through his favorite book. T... 198
4 There were 78 dollars in Olivia's wallet. She ... 63
[3]:
# Define prompts
MATH_INSTRUCTION = "When you solve this math problem only return the answer with no additional text.\n"
prompts = [MATH_INSTRUCTION + prompt for prompt in svamp.question]
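
As noted above, adapting this demo to another use case only requires swapping in your own prompts. A minimal illustrative sketch (the questions and variable names below are hypothetical):

# Hypothetical example of adapting the demo to your own data: any list of
# question strings can be used in place of the SVAMP questions, and the
# instruction prefix is use-case specific.
my_questions = [
    "A carton holds 12 eggs. How many eggs are in 7 cartons?",
    "Sara had 45 stickers and gave away 18. How many stickers does she have left?",
]
my_prompts = [MATH_INSTRUCTION + q for q in my_questions]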

In this example, we use ChatVertexAI and AzureChatOpenAI to instantiate our LLMs, but any LangChain Chat Model may be used. Be sure to replace with your LLM of choice.

[4]:
# import sys
# !{sys.executable} -m pip install python-dotenv
# !{sys.executable} -m pip install langchain-openai

# # User to populate .env file with API credentials
from dotenv import load_dotenv, find_dotenv
from langchain_openai import AzureChatOpenAI

load_dotenv(find_dotenv())
original_llm = AzureChatOpenAI(
    deployment_name=os.getenv("DEPLOYMENT_NAME"),
    openai_api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("API_BASE"),
    openai_api_type=os.getenv("API_TYPE"),
    openai_api_version=os.getenv("API_VERSION"),
    temperature=1,  # User to set temperature
)
[5]:
# import sys
# !{sys.executable} -m pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI

gemini_pro = ChatVertexAI(model_name="gemini-1.5-pro")
gemini_flash = ChatVertexAI(model_name="gemini-1.5-flash")
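
Any other LangChain chat model can be substituted in the same way. For illustration, here is a minimal sketch using a standard (non-Azure) OpenAI chat model; the model name and the OPENAI_API_KEY environment variable are assumptions of this example:

# Optional alternative: any LangChain BaseChatModel can serve as the original LLM or as a judge.
# This sketch assumes langchain-openai is installed and OPENAI_API_KEY is set in the environment.
from langchain_openai import ChatOpenAI

alternative_llm = ChatOpenAI(model="gpt-4o-mini", temperature=1)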

## 2. Generate responses and confidence scores

LLMPanel() - Class for aggregating multiple instances of LLMJudge using average, min, max, or majority voting#


📋 Class Attributes#

| Parameter | Type & Default | Description |
| --- | --- | --- |
| judges | list of LLMJudge or BaseChatModel | Judges to use. If BaseChatModel, LLMJudge is instantiated using default parameters. |
| llm | BaseChatModel, default=None | A LangChain BaseChatModel. User is responsible for specifying temperature and other relevant parameters in the constructor of the provided llm object. |
| system_prompt | str or None, default="You are a helpful assistant." | Optional argument for user to provide a custom system prompt for the LLM. |
| max_calls_per_min | int, default=None | Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. |
| scoring_templates | list of str, default=None | Specifies which off-the-shelf template to use for each judge. Four templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), continuous score (0 to 1), and likert scale score (1-5 scale, normalized to 0/0.25/0.5/0.75/1). These are specified as 'true_false_uncertain', 'true_false', 'continuous', and 'likert', respectively. If specified, must be of equal length to the judges list. Defaults to the 'true_false_uncertain' template used by Chen and Mueller (2023) for each judge. |
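
For reference, the likert template's 1-5 ratings map onto the unit interval exactly as listed above; a minimal illustrative sketch of that mapping (the helper name is ours, not part of the uqlm API):

# Illustrative only: the likert template's 1-5 ratings normalize to 0/0.25/0.5/0.75/1
def normalize_likert(rating: int) -> float:
    return (rating - 1) / 4

assert [normalize_likert(r) for r in range(1, 6)] == [0.0, 0.25, 0.5, 0.75, 1.0]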

๐Ÿ” Parameter Groups#

🧠 LLM-Specific

  • llm

  • system_prompt

📊 Confidence Scores

  • judges

  • scoring_templates

⚡ Performance

  • max_calls_per_min

💻 Usage Examples#

# Basic usage with a single self-judge
panel = LLMPanel(llm=llm, judges=[llm])

# Using two judges with default parameters
panel = LLMPanel(llm=llm, judges=[llm, llm2])

# Using two judges, one with continuous template
panel = LLMPanel(
    llm=llm, judges=[llm, llm2], scoring_templates=['true_false_uncertain', 'continuous']
)
[7]:
panel = LLMPanel(llm=original_llm, judges=[original_llm, gemini_pro, gemini_flash])

🔄 Class Methods#

LLMPanel.generate_and_score

Generate responses to provided prompts and use the panel of judges to score responses for correctness.

Parameters:

  • prompts - (list of str) A list of input prompts for the model.

Returns: UQResult containing data (prompts, responses, sampled responses, and confidence scores) and metadata

💡 Best For: Complete end-to-end uncertainty quantification when starting with prompts.

LLMPanel.score

Use the panel of judges to score provided responses for correctness. Use if responses are already generated; otherwise, use generate_and_score.

Parameters:

  • prompts - (list of str) A list of input prompts for the model.

  • responses - (list of str) A list of LLM responses for the prompts.

Returns: UQResult containing data (responses and confidence scores) and metadata

💡 Best For: Computing uncertainty scores when responses are already generated elsewhere.

[8]:
result = await panel.generate_and_score(prompts=prompts)

# option 2: provide pre-generated responses with score method
# result = await panel.score(prompts=prompts, responses=responses)
Generating responses...
Generating LLMJudge scores...
Generating LLMJudge scores...
Generating LLMJudge scores...
[9]:
result_df = result.to_df()
result_df.head()
[9]:
prompt response judge_1 judge_2 judge_3 avg max min median
0 When you solve this math problem only return t... 145 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 When you solve this math problem only return t... 19 pounds 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 When you solve this math problem only return t... $3 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 When you solve this math problem only return t... 198 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4 When you solve this math problem only return t... 63 1.0 1.0 1.0 1.0 1.0 1.0 1.0

## 3. Evaluate Hallucination Detection Performance

To evaluate hallucination detection performance, we 'grade' the responses against an answer key. Note the math_postprocessor is specific to our use case (math questions). If you are using your own prompts/questions, update the grading method accordingly.
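
For example, for a non-numeric question-answering use case, a simple normalized exact-match comparison could serve as the grader (an illustrative sketch only; not part of uqlm):

# Illustrative alternative grader for non-math use cases: normalized exact match.
# Use this in place of math_postprocessor when populating result_df["response_correct"].
def exact_match_grader(response: str, answer: str) -> bool:
    return response.strip().lower() == str(answer).strip().lower()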

[10]:
# Populate correct answers and grade responses
result_df["answer"] = svamp.answer
result_df["response_correct"] = [math_postprocessor(r) == a for r, a in zip(result_df["response"], svamp["answer"])]
result_df.head(5)
[10]:
prompt response judge_1 judge_2 judge_3 avg max min median answer response_correct
0 When you solve this math problem only return t... 145 1.0 1.0 1.0 1.0 1.0 1.0 1.0 145 True
1 When you solve this math problem only return t... 19 pounds 1.0 1.0 1.0 1.0 1.0 1.0 1.0 19 True
2 When you solve this math problem only return t... $3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3 True
3 When you solve this math problem only return t... 198 1.0 1.0 1.0 1.0 1.0 1.0 1.0 198 True
4 When you solve this math problem only return t... 63 1.0 1.0 1.0 1.0 1.0 1.0 1.0 63 True
[11]:
# Evaluate precision, recall, and F1-score of each judge's predictions of correctness
from sklearn.metrics import precision_score, recall_score, f1_score

for ind in [1, 2, 3]:
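    # A judge score > 0 (i.e., "uncertain" or "correct") counts as a prediction of correctness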
    y_pred = [(s > 0) * 1 for s in result_df[f"judge_{str(ind)}"]]
    y_true = result_df.response_correct
    print(f"Judge {ind} precision: {precision_score(y_true=y_true, y_pred=y_pred)}")
    print(f"Judge {ind} recall: {recall_score(y_true=y_true, y_pred=y_pred)}")
    print(f"Judge {ind} f1-score: {f1_score(y_true=y_true, y_pred=y_pred)}")
    print(" ")
Judge 1 precision: 0.8620689655172413
Judge 1 recall: 0.78125
Judge 1 f1-score: 0.819672131147541

Judge 2 precision: 0.9375
Judge 2 recall: 0.9375
Judge 2 f1-score: 0.9375

Judge 3 precision: 0.9090909090909091
Judge 3 recall: 0.9375
Judge 3 f1-score: 0.9230769230769231

## 4. Scorer Definitions

Under the LLM-as-a-Judge approach, either the same LLM that was used for generating the original responses or a different LLM is asked to form a judgment about a pre-generated response. Below, we define two LLM-as-a-Judge scorer templates.

Categorical Judge Template (true_false_uncertain)#

We follow the approach proposed by Chen & Mueller, 2023, in which an LLM is instructed to score a question-answer concatenation as either incorrect, uncertain, or correct using a carefully constructed prompt. These categories are respectively mapped to numerical scores of 0, 0.5, and 1. We denote the LLM-as-a-judge scorers as \(J: \mathcal{Y} \xrightarrow[]{} \{0, 0.5, 1\}\). Formally, we can write this scorer function as follows:

\begin{equation}
J(y_i) =
\begin{cases}
0 & \text{LLM states response is incorrect} \\
0.5 & \text{LLM states that it is uncertain} \\
1 & \text{LLM states response is correct}.
\end{cases}
\end{equation}

Continuous Judge Template (continuous)#

For the continuous template, the LLM is asked to directly score a question-answer concatenation's correctness on a scale of 0 to 1.
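
In the notation introduced above, this corresponds to a scorer \(J: \mathcal{Y} \xrightarrow[]{} [0, 1]\), with values closer to 1 indicating that the judge considers the response more likely to be correct.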

© 2025 CVS Health and/or one of its affiliates. All rights reserved.