{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎯 LLM-as-a-Judge\n", "\n", "
\n", " LLM-as-a-Judge scorers use one or more LLMs to evaluate the reliability of the original LLM's response. They offer high customizability through prompt engineering and the choice of judge LLM(s). Below is a list of the available scorers:\n", "
\n", "\n", "* Categorical LLM-as-a-Judge ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175); [Luo et al., 2023](https://arxiv.org/pdf/2303.15621))\n", "* Continuous LLM-as-a-Judge ([Xiong et al., 2024](https://arxiv.org/pdf/2306.13063))\n", "* Panel of LLM Judges ([Verga et al., 2024](https://arxiv.org/abs/2404.18796))\n", " \n", "Set up LLM instance and load example data prompts.
\n", "Generate LLM Responses and Confidence Scores
\n", "Generate and score LLM responses to the example questions using the LLMPanel()
class.
Evaluate Hallucination Detection Performance
\n", "Compute precision, recall, and F1-score of hallucination detection.
\n", "\n", " | question | \n", "answer | \n", "
---|---|---|
0 | \n", "There are 87 oranges and 290 bananas in Philip... | \n", "145 | \n", "
1 | \n", "Marco and his dad went strawberry picking. Mar... | \n", "19 | \n", "
2 | \n", "Edward spent $ 6 to buy 2 books each book cost... | \n", "3 | \n", "
3 | \n", "Frank was reading through his favorite book. T... | \n", "198 | \n", "
4 | \n", "There were 78 dollars in Olivia's wallet. She ... | \n", "63 | \n", "
Parameter | \n", "Type & Default | \n", "Description | \n", "
---|---|---|
judges | \n", "list of LLMJudge or BaseChatModel
| \n",
" Judges to use. If BaseChatModel, LLMJudge is instantiated using default parameters. | \n", "
llm | \n", "BaseChatModeldefault=None | \n",
" A langchain llm `BaseChatModel`. User is responsible for specifying temperature and other relevant parameters to the constructor of the provided `llm` object. | \n", "
system_prompt | \n", "str or Nonedefault=\"You are a helpful assistant.\" | \n",
" Optional argument for user to provide custom system prompt for the LLM. | \n", "
max_calls_per_min | \n", "intdefault=None | \n",
" Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is specified. | \n", "
scoring_templates | \n", "intdefault=None | \n",
" Specifies which off-the-shelf template to use for each judge. Four off-the-shelf templates offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), continuous score (0 to 1), and likert scale score (1-5 scale, normalized to 0/0.25/0.5/0.75/1). These templates are respectively specified as 'true_false_uncertain', 'true_false', 'continuous', and 'likert'. If specified, must be of equal length to `judges` list. Defaults to 'true_false_uncertain' template used by Chen and Mueller (2023) for each judge. | \n", "
**🧠 LLM-Specific**

* `llm`
* `system_prompt`

**📊 Confidence Scores**

* `judges`
* `scoring_templates`

**⚡ Performance**

* `max_calls_per_min`
| Method | Description & Parameters |
|---|---|
| `LLMPanel.generate_and_score` | Generate responses to provided prompts and use the panel of judges to score responses for correctness.<br>**Parameters:**<br>**Returns:**<br>💡 **Best For:** Complete end-to-end uncertainty quantification when starting with prompts. |
| `LLMPanel.score` | Use the panel of judges to score provided responses for correctness. Use if responses are already generated; otherwise, use `generate_and_score`.<br>**Parameters:**<br>**Returns:**<br>💡 **Best For:** Computing uncertainty scores when responses are already generated elsewhere. |
\n", " | prompt | \n", "response | \n", "judge_1 | \n", "judge_2 | \n", "judge_3 | \n", "avg | \n", "max | \n", "min | \n", "median | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "When you solve this math problem only return t... | \n", "145 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "
1 | \n", "When you solve this math problem only return t... | \n", "19 pounds | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "
2 | \n", "When you solve this math problem only return t... | \n", "$3 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "
3 | \n", "When you solve this math problem only return t... | \n", "198 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "
4 | \n", "When you solve this math problem only return t... | \n", "63 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "
\n", " | prompt | \n", "response | \n", "judge_1 | \n", "judge_2 | \n", "judge_3 | \n", "avg | \n", "max | \n", "min | \n", "median | \n", "answer | \n", "response_correct | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "When you solve this math problem only return t... | \n", "145 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "145 | \n", "True | \n", "
1 | \n", "When you solve this math problem only return t... | \n", "19 pounds | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "19 | \n", "True | \n", "
2 | \n", "When you solve this math problem only return t... | \n", "$3 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "3 | \n", "True | \n", "
3 | \n", "When you solve this math problem only return t... | \n", "198 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "198 | \n", "True | \n", "
4 | \n", "When you solve this math problem only return t... | \n", "63 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "63 | \n", "True | \n", "
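Finally, hallucination detection performance can be summarized with precision, recall, and F1-score. The sketch below uses scikit-learn; the substring-match grading, the use of the `avg` column, and the 0.5 threshold are illustrative choices rather than part of the library.

```python
# Sketch: precision, recall, and F1 of hallucination detection.
# Grading by substring match and thresholding the 'avg' score at 0.5 are illustrative choices.
from sklearn.metrics import precision_score, recall_score, f1_score

result_df["answer"] = df["answer"].values
result_df["response_correct"] = [
    str(ans) in str(resp) for ans, resp in zip(result_df["answer"], result_df["response"])
]

y_true = ~result_df["response_correct"]   # hallucination = incorrect response
y_pred = result_df["avg"] < 0.5           # flag low judge confidence as hallucination

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.3f}")
```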