Panel of LLM Judges#
The Panel of LLM Judges aggregates scores from multiple LLM judges using various aggregation methods to provide a more robust confidence estimate.
Overview#
The LLMPanel class coordinates multiple LLMJudge instances, allowing
you to leverage diverse LLM perspectives for improved evaluation reliability.
Aggregation Methods:
- Average (``avg``): Mean of all judge scores
- Maximum (``max``): Most optimistic judge assessment
- Minimum (``min``): Most conservative judge assessment
- Median (``median``): Middle value, robust to outliers
Definition#
Let \(J_1, J_2, ..., J_n\) be \(n\) judges and \(s_k = J_k(y_i)\) be the score from judge \(k\) for response \(y_i\).
Average:
\(\bar{s} = \frac{1}{n}\sum_{k=1}^{n} s_k\)
Maximum:
\(s_{\max} = \max_{1 \le k \le n} s_k\)
Minimum:
\(s_{\min} = \min_{1 \le k \le n} s_k\)
Median:
\(s_{\mathrm{med}} = \operatorname{median}(s_1, \ldots, s_n)\)
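As a quick worked check of these definitions, suppose three judges return the (made-up) scores 1.0, 0.5, and 1.0. Python's standard statistics module reproduces the four aggregations:

from statistics import mean, median

scores = [1.0, 0.5, 1.0]  # hypothetical scores from three judges

print(mean(scores))    # avg    -> 0.8333...
print(max(scores))     # max    -> 1.0
print(min(scores))     # min    -> 0.5
print(median(scores))  # median -> 1.0

Note how the median ignores the single dissenting judge, while the minimum is driven entirely by it.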
How It Works#
1. Configure multiple LLM judges (different models, or the same model with different prompts)
2. For each response, obtain scores from all judges in the panel
3. Aggregate the scores using your preferred method (see the sketch below)
4. Return the individual judge scores alongside the aggregated scores
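Conceptually, steps 2-4 reduce to the following loop. This is a minimal sketch assuming a hypothetical judge.score() method; it is not the actual uqlm implementation:

from statistics import mean, median

def panel_scores(judges, response):
    """Score one response with every judge and aggregate (illustrative only)."""
    scores = [judge.score(response) for judge in judges]  # one score per judge
    return {
        "judge_scores": scores,  # individual scores, kept alongside aggregates
        "avg": mean(scores),
        "max": max(scores),
        "min": min(scores),
        "median": median(scores),
    }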
Benefits of using a panel:
- Diversity: Different LLMs may catch different types of errors
- Robustness: Aggregation reduces the impact of any individual judge's mistakes
- Flexibility: Mix models of different sizes and capabilities
Parameters#
The LLMPanel class accepts:

- ``llm``: The LLM used to generate the responses that are scored (see the example below)
- ``judges``: List of LLMJudge instances or BaseChatModel instances
- ``scoring_templates``: List of scoring templates, one per judge
- ``explanations``: Whether to include each judge's reasoning in the output
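Because judges can be plain chat models, a panel can be assembled directly from BaseChatModel instances. The sketch below assumes langchain_openai's ChatOpenAI as the judge models, and that omitting ``scoring_templates`` falls back to the default ternary template for every judge:

from langchain_openai import ChatOpenAI
from uqlm import LLMPanel

# Two judges passed directly as BaseChatModel instances (illustrative model choices)
judge_a = ChatOpenAI(model="gpt-4o")
judge_b = ChatOpenAI(model="gpt-4o-mini")

panel = LLMPanel(llm=original_llm, judges=[judge_a, judge_b])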
Example#
from uqlm import LLMPanel
# Create a panel with multiple judges
panel = LLMPanel(
    llm=original_llm,  # LLM to generate responses
    judges=[gpt4, claude, gemini],  # Panel of judge LLMs
    scoring_templates=["true_false_uncertain"] * 3,  # Same template for all
    explanations=True,  # Include judge reasoning
)
# Generate responses and get panel scores
results = await panel.generate_and_score(prompts=prompts)
# Access aggregated scores
df = results.to_df()
print(df["avg"]) # Average of all judges
print(df["median"]) # Median score
print(df["min"]) # Most conservative
print(df["max"]) # Most optimistic
# Access individual judge scores
print(df["judge_1"])
print(df["judge_2"])
print(df["judge_3"])
Mixed Templates Example:
You can also give each judge its own scoring template, combining the ternary check with continuous and Likert-style ratings:
from uqlm import LLMPanel
# Use different templates for different judges
panel = LLMPanel(
    llm=original_llm,
    judges=[gpt4, claude, gemini],
    scoring_templates=["true_false_uncertain", "continuous", "likert"],
)
results = await panel.generate_and_score(prompts=prompts)
References#
Verga, P., et al. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.
See Also#
- LLMPanel - Main panel class documentation
- LLMJudge - Individual judge class
- Ternary Judge (True/False/Uncertain) - Default scoring template