Binary Judge (True/False)#

true_false

The binary judge template instructs an LLM to classify a question-response pair as either correct or incorrect.

Definition#

This template simplifies the ternary approach by dropping the uncertain category, leaving only two outcomes:

\[\begin{split}J(y_i) = \begin{cases} 0 & \text{LLM states response is incorrect} \\ 1 & \text{LLM states response is correct} \end{cases}\end{split}\]

The judge function \(J: \mathcal{Y} \rightarrow \{0, 1\}\) maps responses to binary scores.
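
As a rough illustration of the mapping \(J\), the sketch below converts a judge's textual verdict into a score in \(\{0, 1\}\). The function name `binary_judge_score` and the assumption that the judge replies with the literal word "correct" or "incorrect" are hypothetical choices made for this example. They are not taken from the library, which may parse judge output differently.

```python
def binary_judge_score(verdict: str) -> int:
    """Map a judge's textual verdict to a score in {0, 1} (illustrative only)."""
    # Normalize whitespace, a trailing period, and letter case.
    normalized = verdict.strip().strip(".").lower()
    if normalized == "correct":
        return 1
    if normalized == "incorrect":
        return 0
    # The binary template has no intermediate category, so anything else
    # is treated as an unparseable verdict in this sketch.
    raise ValueError(f"Unrecognized verdict: {verdict!r}")


scores = [binary_judge_score(v) for v in ["Correct", "incorrect.", " CORRECT "]]
```

Raising on an unrecognized verdict is one possible design choice. A caller could instead retry the judge or record a missing score.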

Key Properties:

Simpler binary classification without an uncertain category
Forces the judge to make a definitive decision
Useful when you want clear-cut correct/incorrect labels

How It Works#

1. Present the judge LLM with the original question and response
2. Ask the judge to classify the response as “correct” or “incorrect”
3. Map the classification to a numerical score (1 or 0)
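
The three steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the wording of `JUDGE_INSTRUCTION`, the helper `judge_once`, and the `stub_judge` stand-in (used so the example runs without a model) are hypothetical, and the prompt text the library actually sends may differ.

```python
from typing import Callable

# Hypothetical instruction text; not the library's actual prompt.
JUDGE_INSTRUCTION = (
    "Question: {question}\n"
    "Proposed answer: {response}\n"
    "Is the proposed answer correct? Reply with exactly one word: "
    "correct or incorrect."
)


def judge_once(question: str, response: str, ask_judge: Callable[[str], str]) -> int:
    # Step 1: present the judge with the original question and response.
    prompt = JUDGE_INSTRUCTION.format(question=question, response=response)
    # Step 2: ask the judge for a "correct" / "incorrect" classification.
    verdict = ask_judge(prompt).strip().lower()
    if verdict not in ("correct", "incorrect"):
        raise ValueError(f"Unrecognized verdict: {verdict!r}")
    # Step 3: map the classification to a numerical score.
    return 1 if verdict == "correct" else 0


def stub_judge(prompt: str) -> str:
    # Stand-in for an LLM call, for demonstration only.
    return "correct" if "Proposed answer: Paris" in prompt else "incorrect"


good = judge_once("What is the capital of France?", "Paris", stub_judge)
bad = judge_once("What is the capital of France?", "Lyon", stub_judge)
```

Passing the judge as a callable keeps the sketch independent of any particular model client.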

Use this template when you prefer binary decisions without an intermediate uncertainty category.

Parameters#

When using LLMJudge, specify scoring_template="true_false". When using LLMPanel, pass "true_false" in the scoring_templates list, with one entry per judge.

Example#

from uqlm.judges import LLMJudge

# Initialize with binary template
judge = LLMJudge(
    llm=judge_llm,
    scoring_template="true_false"
)

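# Score responses. The call is awaited, so run it inside an async
# function or in an environment that supports top-level await
# (for example, a notebook).
result = await judge.judge_responses(
    prompts=prompts,
    responses=responses
)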

Using with LLMPanel:

from uqlm import LLMPanel

# Create a panel with binary scoring
panel = LLMPanel(
    llm=original_llm,
    judges=[judge_llm1, judge_llm2],
    scoring_templates=["true_false"] * 2
)

results = await panel.generate_and_score(prompts=prompts)
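
To give a feel for how binary scores from several judges combine, here is a small standalone sketch. The scores are made up, and the aggregation shown (mean, minimum, maximum) is ordinary Python written for illustration. It is not taken from the uqlm API, and the structure of the `results` object returned above is not assumed.

```python
from statistics import mean

# Hypothetical binary scores from two judges for the same three responses.
judge_scores = {
    "judge_1": [1, 0, 1],
    "judge_2": [1, 1, 0],
}

# Regroup so that each tuple holds all judges' scores for one response.
per_response = list(zip(*judge_scores.values()))

# Fraction of judges that labeled each response correct.
avg_score = [mean(scores) for scores in per_response]
# 1 only if every judge said "correct".
unanimous = [min(scores) for scores in per_response]
# 1 if at least one judge said "correct".
any_correct = [max(scores) for scores in per_response]
```

With two binary judges, the average for a response can only be 0, 0.5, or 1, so a value of 0.5 signals disagreement between the judges.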
