Normalized Cosine Similarity#
Normalized Cosine Similarity (NCS) leverages a sentence transformer to map LLM outputs to an embedding space and measure similarity using those sentence embeddings.
Definition#
Let \(V: \mathcal{Y} \rightarrow \mathbb{R}^d\) denote the sentence transformer, where \(d\) is the dimension of the embedding space.
The average cosine similarity across pairings of the original response \(\mathbf{y}\) with all \(m\) candidate responses \(\tilde{\mathbf{y}}_1, \ldots, \tilde{\mathbf{y}}_m\) is given as follows:

\[ CS(\mathbf{y}) = \frac{1}{m} \sum_{i=1}^{m} \frac{V(\mathbf{y}) \cdot V(\tilde{\mathbf{y}}_i)}{\lVert V(\mathbf{y}) \rVert \, \lVert V(\tilde{\mathbf{y}}_i) \rVert} \]
Since cosine similarity lies in \([-1, 1]\), we normalize it to obtain confidence scores with a standardized support of \([0, 1]\):

\[ NCS(\mathbf{y}) = \frac{1 + CS(\mathbf{y})}{2} \]
Key Properties:

- Uses sentence-level embeddings rather than token-level ones
- Efficient to compute compared to token-level methods like BERTScore
- Normalized to the \([0, 1]\) range, where 1 indicates perfect semantic similarity
How It Works#
1. Generate multiple candidate responses \(\tilde{\mathbf{y}}_i\) from the same prompt
2. Encode the original response and all candidates using a sentence transformer
3. Compute cosine similarity between the original response embedding and each candidate embedding
4. Average the similarities and normalize to \([0, 1]\)
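The steps above can be sketched in plain NumPy. The function below assumes the embeddings have already been produced by a sentence transformer; it is an illustrative stand-in for the library's scorer, not the actual implementation.

```python
import numpy as np

def normalized_cosine_similarity(original_emb, candidate_embs):
    """Average cosine similarity of the original response embedding against
    each candidate embedding, rescaled from [-1, 1] to [0, 1]."""
    original = np.asarray(original_emb, dtype=float)
    candidates = np.asarray(candidate_embs, dtype=float)
    # Cosine similarity of the original against each candidate row.
    sims = candidates @ original / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(original)
    )
    # Average over candidates, then map [-1, 1] -> [0, 1].
    return (1.0 + sims.mean()) / 2.0

# Identical embeddings yield a perfect score.
e = np.array([0.6, 0.8])
print(round(normalized_cosine_similarity(e, [e, e]), 6))  # -> 1.0
```

Orthogonal embeddings score 0.5 and diametrically opposed embeddings score 0, matching the normalization above.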
The default sentence transformer is all-MiniLM-L6-v2, which provides a good balance between
speed and quality for semantic similarity tasks.
Parameters#
When using BlackBoxUQ, specify "cosine_sim" in the scorers list.
You can also specify a custom sentence transformer using the sentence_transformer parameter.
Example#
```python
from uqlm import BlackBoxUQ

# Initialize with cosine_sim scorer
bbuq = BlackBoxUQ(
    llm=llm,
    scorers=["cosine_sim"],
    sentence_transformer="all-MiniLM-L6-v2",  # default sentence transformer
)

# Generate responses and compute scores
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)

# Access the cosine_sim scores
print(results.to_df()["cosine_sim"])
```
References#
Shorinwa, O., et al. (2024). A Survey of Confidence Estimation and Calibration in Large Language Models. arXiv.
Sentence Transformers - all-MiniLM-L6-v2. HuggingFace.
See Also#
BlackBoxUQ - Main class for black-box uncertainty quantification

BERTScore - Alternative similarity measure using BERT token embeddings