# BERTScore
BERTScore leverages contextualized BERT embeddings to measure the semantic similarity between the original response and sampled candidate responses.
## Definition
Let a tokenized text sequence be denoted as \(\mathbf{t} = \{t_1,...,t_L\}\) and the corresponding contextualized word embeddings as \(\mathbf{E} = \{\mathbf{e}_1,...,\mathbf{e}_L\}\), where \(L\) is the number of tokens in the text.
The BERTScore precision, recall, and F1-scores between two tokenized texts \(\mathbf{t}, \mathbf{t}'\) are respectively defined as follows:
Precision:

\[
P_{bert}(\mathbf{t}, \mathbf{t}') = \frac{1}{L'} \sum_{t' \in \mathbf{t}'} \max_{t \in \mathbf{t}} \mathbf{e}^\top \mathbf{e}'
\]

Recall:

\[
R_{bert}(\mathbf{t}, \mathbf{t}') = \frac{1}{L} \sum_{t \in \mathbf{t}} \max_{t' \in \mathbf{t}'} \mathbf{e}^\top \mathbf{e}'
\]

F1-Score:

\[
F_{bert}(\mathbf{t}, \mathbf{t}') = 2\,\frac{P_{bert}(\mathbf{t}, \mathbf{t}')\, R_{bert}(\mathbf{t}, \mathbf{t}')}{P_{bert}(\mathbf{t}, \mathbf{t}') + R_{bert}(\mathbf{t}, \mathbf{t}')},
\]

where \(\mathbf{e}, \mathbf{e}'\) are the contextualized embeddings corresponding to tokens \(t \in \mathbf{t}\) and \(t' \in \mathbf{t}'\), and \(L, L'\) are the respective token counts.

For an original response \(\mathbf{y}\) and candidate responses \(\tilde{\mathbf{y}}_1, \dots, \tilde{\mathbf{y}}_m\) sampled from the same prompt, we compute our BERTScore-based confidence score as:

\[
\text{BERTScore}(\mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} F_{bert}(\mathbf{y}, \tilde{\mathbf{y}}_i),
\]

i.e., the average BERTScore F1 across pairings of the original response with all candidate responses.
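For concreteness, here is a minimal sketch of these formulas using Hugging Face `transformers` to obtain contextualized embeddings. The model choice (`bert-base-uncased`) and the helper names (`embed`, `bertscore_f1`) are illustrative assumptions, not uqlm's internal implementation:

```python
# A minimal sketch of the definitions above; not uqlm's internals.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Return L2-normalized contextualized token embeddings, shape (L, d)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state.squeeze(0)
    return hidden / hidden.norm(dim=-1, keepdim=True)

def bertscore_f1(text: str, other: str) -> float:
    """BERTScore F1 between two texts, per the definitions above."""
    e, e_prime = embed(text), embed(other)      # (L, d), (L', d)
    sim = e @ e_prime.T                         # pairwise token similarities (L, L')
    recall = sim.max(dim=1).values.mean()       # best match in t' for each token of t
    precision = sim.max(dim=0).values.mean()    # best match in t for each token of t'
    return float(2 * precision * recall / (precision + recall))
```

Because the embeddings are L2-normalized, the dot products above are cosine similarities, matching the similarity measure used by BERTScore.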
## How It Works

1. Generate multiple candidate responses \(\tilde{\mathbf{y}}_i\) from the same prompt.
2. For each pairing of the original response with a candidate:
   - Tokenize both responses.
   - Compute contextualized BERT embeddings for each token.
   - Calculate pairwise token similarities as dot products of the L2-normalized embeddings (i.e., cosine similarities).
   - Compute precision, recall, and F1-score.
3. Average the F1-scores across all candidates.
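Putting the steps together, the confidence score can be assembled from the illustrative `bertscore_f1` helper sketched above (the function name and example strings are hypothetical):

```python
# Average BERTScore F1 of the original response against each candidate.
def bertscore_confidence(original: str, candidates: list[str]) -> float:
    scores = [bertscore_f1(original, cand) for cand in candidates]
    return sum(scores) / len(scores)

# Semantically consistent candidates yield a high confidence score.
original = "The Eiffel Tower is in Paris."
candidates = [
    "The Eiffel Tower is located in Paris.",
    "It is in Paris, France.",
    "The Eiffel Tower stands in Paris, France.",
]
print(bertscore_confidence(original, candidates))
```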
## Parameters

When using `BlackBoxUQ`, include `"bert_score"` in the `scorers` list.
## Example

```python
from uqlm import BlackBoxUQ

# llm is assumed to be a LangChain-compatible chat model defined elsewhere

# Initialize with the bert_score scorer
bbuq = BlackBoxUQ(
    llm=llm,
    scorers=["bert_score"],
    device="cuda",  # use GPU for faster BERT inference
)

# Generate original + candidate responses and compute confidence scores
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)

# Access the bert_score scores
print(results.to_df()["bert_score"])
```
## References

- Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896.
- Zhang, T., et al. (2020). BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675.
## See Also

- BlackBoxUQ - Main class for black-box uncertainty quantification
- Normalized Cosine Similarity - Alternative similarity measure using sentence embeddings