Panel of LLM Judges#
The Panel of LLM Judges aggregates scores from multiple LLM judges using various aggregation methods to provide a more robust confidence estimate.
Overview#
The LLMPanel class coordinates multiple LLMJudge instances, allowing
you to leverage diverse LLM perspectives for improved evaluation reliability.
Aggregation Methods:
- Average (``avg``): Mean of all judge scores
- Maximum (``max``): Most optimistic judge assessment
- Minimum (``min``): Most conservative judge assessment
- Median (``median``): Middle value, robust to outliers
Definition#
Let \(J_1, J_2, ..., J_n\) be \(n\) judges and \(s_k = J_k(y_i)\) be the score from judge \(k\) for response \(y_i\).
Average:
\(\bar{s} = \frac{1}{n}\sum_{k=1}^{n} s_k\)
Maximum:
\(s_{\max} = \max_{1 \le k \le n} s_k\)
Minimum:
\(s_{\min} = \min_{1 \le k \le n} s_k\)
Median:
\(s_{\mathrm{med}} = \operatorname{median}(s_1, \ldots, s_n)\)
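As a quick worked check of these definitions, suppose three judges return the (made-up) scores 1.0, 0.5, and 1.0. Python's standard statistics module reproduces the four aggregations:

from statistics import mean, median

scores = [1.0, 0.5, 1.0]  # hypothetical scores from three judges

print(mean(scores))    # avg    -> 0.8333...
print(max(scores))     # max    -> 1.0
print(min(scores))     # min    -> 0.5
print(median(scores))  # median -> 1.0

Note how the median ignores the single dissenting judge, while the minimum is driven entirely by it.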
How It Works#
1. Configure multiple LLM judges (different models, or the same model with different prompts)
2. For each response, obtain scores from all judges in the panel
3. Aggregate the scores using your preferred method (see the sketch below)
4. Return the individual judge scores alongside the aggregated scores
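Conceptually, steps 2-4 reduce to the following loop. This is a minimal sketch assuming a hypothetical judge.score() method; it is not the actual uqlm implementation:

from statistics import mean, median

def panel_scores(judges, response):
    """Score one response with every judge and aggregate (illustrative only)."""
    scores = [judge.score(response) for judge in judges]  # one score per judge
    return {
        "judge_scores": scores,  # individual scores, kept alongside aggregates
        "avg": mean(scores),
        "max": max(scores),
        "min": min(scores),
        "median": median(scores),
    }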
Benefits of using a panel:
- Diversity: Different LLMs may catch different types of errors
- Robustness: Aggregation reduces the impact of any individual judge's mistakes
- Flexibility: Mix models of different sizes and capabilities
Parameters#
The LLMPanel class accepts:

- ``llm``: The LLM used to generate the responses that are scored (see the example below)
- ``judges``: List of LLMJudge instances or BaseChatModel instances
- ``scoring_templates``: List of scoring templates, one per judge
- ``explanations``: Whether to include each judge's reasoning in the output
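Because judges can be plain chat models, a panel can be assembled directly from BaseChatModel instances. The sketch below assumes langchain_openai's ChatOpenAI as the judge models, and that omitting ``scoring_templates`` falls back to the default ternary template for every judge:

from langchain_openai import ChatOpenAI
from uqlm import LLMPanel

# Two judges passed directly as BaseChatModel instances (illustrative model choices)
judge_a = ChatOpenAI(model="gpt-4o")
judge_b = ChatOpenAI(model="gpt-4o-mini")

panel = LLMPanel(llm=original_llm, judges=[judge_a, judge_b])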
Example#
from uqlm import LLMPanel
# Create a panel with multiple judges
panel = LLMPanel(
    llm=original_llm,  # LLM to generate responses
    judges=[gpt4, claude, gemini],  # Panel of judge LLMs
    scoring_templates=["true_false_uncertain"] * 3,  # Same template for all
    explanations=True,  # Include judge reasoning
)
# Generate responses and get panel scores
results = await panel.generate_and_score(prompts=prompts)
# Access aggregated scores
df = results.to_df()
print(df["avg"]) # Average of all judges
print(df["median"]) # Median score
print(df["min"]) # Most conservative
print(df["max"]) # Most optimistic
# Access individual judge scores
print(df["judge_1"])
print(df["judge_2"])
print(df["judge_3"])
Mixed Templates Example:
You can also give each judge its own scoring template, combining the ternary check with continuous and Likert-style ratings:
from uqlm import LLMPanel
# Use different templates for different judges
panel = LLMPanel(
    llm=original_llm,
    judges=[gpt4, claude, gemini],
    scoring_templates=["true_false_uncertain", "continuous", "likert"],
)
results = await panel.generate_and_score(prompts=prompts)
References#
Verga, P., et al. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.
See Also#
- LLMPanel - Main panel class documentation
- LLMJudge - Individual judge class
- Ternary Judge (True/False/Uncertain) - Default scoring template