P(True)#
p_true
P(True) is a self-reflection method that presents an LLM with its own previous response and asks it to classify that response as “True” (correct) or “False” (incorrect). The confidence score is derived from the token probability of the “True” answer.
Definition#
Given a prompt \(x\) and the LLM’s response \(y\), the P(True) scorer:
1. Constructs a self-reflection prompt asking the LLM to evaluate whether \(y\) correctly answers \(x\)
2. Requests the LLM to respond with “True” or “False”
3. Returns the token probability for “True” as the confidence score
If the model answers “False”, the score is computed from the complement of the “False” token probability:

\(\text{score} = 1 - P(\text{“False”})\)
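As a worked example (the numbers are illustrative): if the reflection step returns “False” with token probability \(0.8\), the confidence score is \(1 - 0.8 = 0.2\). A confident “False” therefore maps to a low score, while an unconfident “False” (probability near \(0.5\)) maps to a score near \(0.5\).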
Key Properties:

- Self-reflection approach: uses the same LLM to evaluate its own response
- Requires one additional LLM generation per response
- Score range: \([0, 1]\)
How It Works#
1. Generate an original response to the prompt
2. Construct a self-reflection prompt that presents:
   - The original question/prompt
   - The LLM’s response
   - A request to classify whether the response is correct
3. Generate the classification with logprobs enabled
4. Extract the probability of “True” (or 1 minus the probability of “False”)
This scorer leverages the model’s own ability to assess the quality of its responses, providing a form of self-consistency check.
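The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not uqlm’s actual implementation: the reflection prompt template, the `p_true_score` function, and the `classify` callable (standing in for an LLM call that returns the generated token and its logprob) are all assumptions made for demonstration.

```python
import math


def p_true_score(prompt: str, response: str, classify) -> float:
    """Compute a P(True)-style confidence score for `response` to `prompt`.

    `classify` is any callable that takes a self-reflection prompt and
    returns (answer, logprob) for the generated "True"/"False" token.
    """
    # Steps 1-2: build the self-reflection prompt (illustrative template).
    reflection = (
        f"Question: {prompt}\n"
        f"Proposed answer: {response}\n"
        "Is the proposed answer correct? Reply with exactly one word, "
        "True or False."
    )
    # Step 3: generate the classification with logprobs enabled.
    answer, logprob = classify(reflection)
    token_prob = math.exp(logprob)  # convert logprob to a probability
    # Step 4: P("True"), or 1 - P("False") when the model answered False.
    return token_prob if answer == "True" else 1.0 - token_prob


# Stub in place of a real LLM: always answers "True" with probability 0.9.
def fake_classify(reflection_prompt: str):
    return "True", math.log(0.9)


score = p_true_score("What is 2 + 2?", "4", fake_classify)
print(round(score, 2))  # 0.9
```

Because the score is read off the token logprob rather than parsed from free text, the classification call must request logprobs; the stub above simply hard-codes one.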
Parameters#
When using WhiteBoxUQ, specify "p_true" in the scorers list.
Example#
from uqlm import WhiteBoxUQ

# Initialize with the p_true scorer
# (llm is assumed to be an already-initialized chat model with logprobs support)
wbuq = WhiteBoxUQ(
    llm=llm,
    scorers=["p_true"],
)

# Generate responses and compute scores
# Note: p_true generates one additional call per prompt
results = await wbuq.generate_and_score(prompts=prompts)

# Access the p_true scores
print(results.to_df()["p_true"])
References#
Kadavath, S., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
See Also#
- WhiteBoxUQ - Main class for white-box uncertainty quantification
- LLM-as-a-Judge Scorers - LLM-as-a-Judge scorers for external evaluation