Panel of LLM Judges#


The Panel of LLM Judges aggregates scores from multiple LLM judges using various aggregation methods to provide a more robust confidence estimate.

Overview#

The LLMPanel class coordinates multiple LLMJudge instances, allowing you to leverage diverse LLM perspectives for improved evaluation reliability.

Aggregation Methods:

  • Average (avg): Mean of all judge scores

  • Maximum (max): Most optimistic judge assessment

  • Minimum (min): Most conservative judge assessment

  • Median (median): Middle value, robust to outliers

Definition#

Let \(J_1, J_2, ..., J_n\) be \(n\) judges and \(s_k = J_k(y_i)\) be the score from judge \(k\) for response \(y_i\).

Average:

\[\text{Panel}_{avg}(y_i) = \frac{1}{n} \sum_{k=1}^n s_k\]

Maximum:

\[\text{Panel}_{max}(y_i) = \max_{k \in \{1,...,n\}} s_k\]

Minimum:

\[\text{Panel}_{min}(y_i) = \min_{k \in \{1,...,n\}} s_k\]

Median:

\[\text{Panel}_{median}(y_i) = \text{median}(s_1, s_2, ..., s_n)\]
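
As a quick worked example, suppose three judges score a response \(y_i\) as \(s = (1.0, 0.5, 1.0)\). Then:

\[\text{Panel}_{avg}(y_i) = \frac{1.0 + 0.5 + 1.0}{3} \approx 0.83, \qquad \text{Panel}_{max}(y_i) = 1.0, \qquad \text{Panel}_{min}(y_i) = 0.5, \qquad \text{Panel}_{median}(y_i) = 1.0\]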

How It Works#

  1. Configure multiple LLM judges (these can be different models, or the same model with different prompts)

  2. For each response, obtain scores from all judges in the panel

  3. Aggregate scores using your preferred method

  4. Return the individual judge scores alongside the aggregated score, as sketched below
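
The aggregation in steps 3 and 4 is simple to state in code. Below is a minimal standalone sketch of that logic in plain Python; it illustrates the four methods and is not uqlm's internal implementation:

from statistics import mean, median

def aggregate_scores(judge_scores: list[float]) -> dict[str, float]:
    """Aggregate per-judge confidence scores with the panel's four methods."""
    return {
        "avg": mean(judge_scores),       # mean of all judge scores
        "max": max(judge_scores),        # most optimistic assessment
        "min": min(judge_scores),        # most conservative assessment
        "median": median(judge_scores),  # middle value, robust to outliers
    }

# Three judges scoring one response:
print(aggregate_scores([1.0, 0.5, 1.0]))
# {'avg': 0.8333..., 'max': 1.0, 'min': 0.5, 'median': 1.0}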

Benefits of using a panel:

  • Diversity: Different LLMs may catch different types of errors

  • Robustness: Aggregation reduces impact of individual judge mistakes

  • Flexibility: Mix models of different sizes and capabilities

Parameters#

The LLMPanel class accepts:

  • judges: List of LLMJudge instances or BaseChatModel instances

  • scoring_templates: List of scoring templates, one per judge

  • explanations: Whether to include judge explanations
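
Because judges accepts LLMJudge instances as well as raw chat models, an individual judge can be configured before joining the panel. A minimal sketch, assuming LLMJudge is importable from uqlm.judges and accepts llm and scoring_template keywords (verify against your installed version's API):

from uqlm import LLMPanel
from uqlm.judges import LLMJudge  # assumed import path

# Wrap one model in LLMJudge to configure it individually (assumed keywords)
continuous_judge = LLMJudge(llm=claude, scoring_template="continuous")

# Pre-built LLMJudge instances and raw chat models can be mixed in one panel
panel = LLMPanel(llm=original_llm, judges=[gpt4, continuous_judge])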

Example#

from uqlm import LLMPanel

# Create a panel with multiple judges
panel = LLMPanel(
    llm=original_llm,  # LLM to generate responses
    judges=[gpt4, claude, gemini],  # Panel of judge LLMs
    scoring_templates=["true_false_uncertain"] * 3,  # Same template for all
    explanations=True  # Include judge reasoning
)

# Generate responses and get panel scores
results = await panel.generate_and_score(prompts=prompts)

# Access aggregated scores
df = results.to_df()
print(df["avg"])     # Average of all judges
print(df["median"])  # Median score
print(df["min"])     # Most conservative
print(df["max"])     # Most optimistic

# Access individual judge scores
print(df["judge_1"])
print(df["judge_2"])
print(df["judge_3"])

Mixed Templates Example:

from uqlm import LLMPanel

# Use different templates for different judges
panel = LLMPanel(
    llm=original_llm,
    judges=[gpt4, claude, gemini],
    scoring_templates=["true_false_uncertain", "continuous", "likert"]
)

results = await panel.generate_and_score(prompts=prompts)
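
As before, the per-judge and aggregated scores are available through results.to_df(). Mixing templates is another way to diversify the panel: each judge then evaluates responses on a different scale, which can surface different kinds of errors.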
