LLM-as-a-Judge Scorers

LLM-as-a-Judge scorers use one or more LLMs to evaluate the reliability of the original LLM’s response. They offer high customizability through prompt engineering and the choice of judge LLM(s).

Key Characteristics:

  • Universal Compatibility: Works with any LLM

  • Highly Customizable: Use any LLM as a judge and tailor instruction prompts for specific use cases

  • Self-Reflection Capable: Can use the same LLM as both generator and judge

Trade-offs:

  • Added Cost: Requires additional LLM calls for the judge LLM(s)

  • Added Latency: Judge evaluations add to the total response time

Overview:

Under the LLM-as-a-Judge approach, a judge LLM (either the same model that generated the original response or a different one) is asked to form a judgment about a pre-generated response. Several scoring templates are available to accommodate different use cases.
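
To make the mechanics concrete, here is a minimal sketch of a single judge scorer. The prompt template, the `judge_llm` callable, and the 0 / 0.5 / 1 rating scale are illustrative assumptions for this sketch, not the library's actual templates or API:

```python
from typing import Callable

# Hypothetical judge template; real scoring templates vary by use case.
JUDGE_TEMPLATE = (
    "You are a strict evaluator.\n\n"
    "Question:\n{question}\n\n"
    "Proposed answer:\n{answer}\n\n"
    "Is the proposed answer correct? Reply with a single number: "
    "1 for correct, 0.5 for uncertain, 0 for incorrect."
)


def judge_score(question: str, answer: str, judge_llm: Callable[[str], str]) -> float:
    """Ask a judge LLM to rate a pre-generated answer and parse its reply.

    `judge_llm` is any callable mapping a prompt string to a completion
    string (e.g., a thin wrapper around your provider's chat API).
    """
    reply = judge_llm(JUDGE_TEMPLATE.format(question=question, answer=answer))
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # treat unparseable replies as a failed (incorrect) judgment
```

Note that the extra `judge_llm` call per response is exactly where the added cost and latency listed above come from.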

Panel of Judges

For improved robustness, you can use the LLMPanel class to aggregate scores from multiple LLM judges with an aggregation method of your choice (average, min, max, or median).
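
Conceptually, a panel runs several judges on the same response and combines their scores. The sketch below shows only that aggregation step; the names (`panel_score`, `AGGREGATORS`) are illustrative, it reuses the hypothetical `judge_score` helper from the Overview above, and the actual LLMPanel constructor and methods may differ, so consult the class reference for the real API:

```python
import statistics
from typing import Callable, List

# Aggregation methods mirroring those listed above.
AGGREGATORS = {
    "average": statistics.mean,
    "min": min,
    "max": max,
    "median": statistics.median,
}


def panel_score(
    question: str,
    answer: str,
    judges: List[Callable[[str], str]],
    method: str = "average",
) -> float:
    """Score one answer with several judge LLMs and aggregate the results.

    `judge_score` is the single-judge helper sketched in the Overview above.
    """
    scores = [judge_score(question, answer, judge) for judge in judges]
    return AGGREGATORS[method](scores)
```

As a rule of thumb, averaging smooths out noise from any single judge, while min yields a conservative score that penalizes a response whenever any one judge doubts it.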