# BERTScore
BERTScore leverages contextualized BERT embeddings to measure the semantic similarity between the original response and sampled candidate responses.
## Definition
Let a tokenized text sequence be denoted as \(\mathbf{t} = \{t_1,...,t_L\}\) and the corresponding contextualized word embeddings as \(\mathbf{E} = \{\mathbf{e}_1,...,\mathbf{e}_L\}\), where \(L\) is the number of tokens in the text.
The BERTScore precision, recall, and F1-scores between two tokenized texts \(\mathbf{t}, \mathbf{t}'\) are respectively defined as follows:
Precision:

\[
P_{bert}(\mathbf{t}, \mathbf{t}') = \frac{1}{L'} \sum_{t' \in \mathbf{t}'} \max_{t \in \mathbf{t}} \mathbf{e}^\top \mathbf{e}'
\]

Recall:

\[
R_{bert}(\mathbf{t}, \mathbf{t}') = \frac{1}{L} \sum_{t \in \mathbf{t}} \max_{t' \in \mathbf{t}'} \mathbf{e}^\top \mathbf{e}'
\]

F1-Score:

\[
F_{bert}(\mathbf{t}, \mathbf{t}') = 2\,\frac{P_{bert}(\mathbf{t}, \mathbf{t}')\, R_{bert}(\mathbf{t}, \mathbf{t}')}{P_{bert}(\mathbf{t}, \mathbf{t}') + R_{bert}(\mathbf{t}, \mathbf{t}')},
\]

where \(\mathbf{e}, \mathbf{e}'\) are the contextualized embeddings corresponding to tokens \(t \in \mathbf{t}\) and \(t' \in \mathbf{t}'\), and \(L, L'\) are the respective token counts.

For an original response \(\mathbf{y}\) and candidate responses \(\tilde{\mathbf{y}}_1, \dots, \tilde{\mathbf{y}}_m\) sampled from the same prompt, we compute our BERTScore-based confidence score as:

\[
\text{BERTScore}(\mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} F_{bert}(\mathbf{y}, \tilde{\mathbf{y}}_i),
\]

i.e., the average BERTScore F1 across pairings of the original response with all candidate responses.
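For concreteness, here is a minimal sketch of these formulas using Hugging Face `transformers` to obtain contextualized embeddings. The model choice (`bert-base-uncased`) and the helper names (`embed`, `bertscore_f1`) are illustrative assumptions, not uqlm's internal implementation:

```python
# A minimal sketch of the definitions above; not uqlm's internals.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Return L2-normalized contextualized token embeddings, shape (L, d)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state.squeeze(0)
    return hidden / hidden.norm(dim=-1, keepdim=True)

def bertscore_f1(text: str, other: str) -> float:
    """BERTScore F1 between two texts, per the definitions above."""
    e, e_prime = embed(text), embed(other)      # (L, d), (L', d)
    sim = e @ e_prime.T                         # pairwise token similarities (L, L')
    recall = sim.max(dim=1).values.mean()       # best match in t' for each token of t
    precision = sim.max(dim=0).values.mean()    # best match in t for each token of t'
    return float(2 * precision * recall / (precision + recall))
```

Because the embeddings are L2-normalized, the dot products above are cosine similarities, matching the similarity measure used by BERTScore.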
## How It Works

1. Generate multiple candidate responses \(\tilde{\mathbf{y}}_i\) from the same prompt.
2. For each pairing of the original response with a candidate:
   - Tokenize both responses.
   - Compute contextualized BERT embeddings for each token.
   - Calculate pairwise token similarities as dot products of the L2-normalized embeddings (i.e., cosine similarities).
   - Compute precision, recall, and F1-score.
3. Average the F1-scores across all candidates.
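Putting the steps together, the confidence score can be assembled from the illustrative `bertscore_f1` helper sketched above (the function name and example strings are hypothetical):

```python
# Average BERTScore F1 of the original response against each candidate.
def bertscore_confidence(original: str, candidates: list[str]) -> float:
    scores = [bertscore_f1(original, cand) for cand in candidates]
    return sum(scores) / len(scores)

# Semantically consistent candidates yield a high confidence score.
original = "The Eiffel Tower is in Paris."
candidates = [
    "The Eiffel Tower is located in Paris.",
    "It is in Paris, France.",
    "The Eiffel Tower stands in Paris, France.",
]
print(bertscore_confidence(original, candidates))
```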
## Parameters

When using `BlackBoxUQ`, include `"bert_score"` in the `scorers` list.
## Example

```python
from uqlm import BlackBoxUQ

# llm is assumed to be a LangChain-compatible chat model defined elsewhere

# Initialize with the bert_score scorer
bbuq = BlackBoxUQ(
    llm=llm,
    scorers=["bert_score"],
    device="cuda",  # use GPU for faster BERT inference
)

# Generate original + candidate responses and compute confidence scores
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)

# Access the bert_score scores
print(results.to_df()["bert_score"])
```
## References

- Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896.
- Zhang, T., et al. (2020). BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675.
## See Also

- BlackBoxUQ - Main class for black-box uncertainty quantification
- Normalized Cosine Similarity - Alternative similarity measure using sentence embeddings