Functional Equivalence Scorers

Definition

Functional equivalence scorers use an LLM to judge whether two code snippets are functionally equivalent, that is, whether they produce the same outputs for the same valid inputs. These scorers were proposed by Bouchard et al. (2026).

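Let \(y\) denote the original response and \(\tilde{\mathbf{y}} = (\tilde{y}_1, \dots, \tilde{y}_m)\) the \(m\) sampled responses. Write \(y \equiv \tilde{y}_j\) when the LLM judges the two responses functionally equivalent, and let \(\mathbb{I}[\cdot]\) be the indicator function. functional_equivalence_rate estimates the proportion of sampled responses that are functionally equivalent to the original response: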

\[FER(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}[y \equiv \tilde{y}_j]\]
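The rate above can be sketched in a few lines of Python. This is an illustration of the formula only, not the uqlm implementation: the function name and signature are assumptions, and `toy_judge` is a hypothetical stand-in for the LLM judge that compares outputs on a handful of probe inputs.

```python
from typing import Callable, Sequence


def functional_equivalence_rate(
    original: str,
    samples: Sequence[str],
    is_equivalent: Callable[[str, str], bool],
) -> float:
    """Fraction of sampled responses judged equivalent to the original."""
    if not samples:
        raise ValueError("at least one sampled response is required")
    matches = sum(1 for s in samples if is_equivalent(original, s))
    return matches / len(samples)


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in for the LLM judge: compare outputs on probe inputs."""
    # eval is acceptable here only because the snippets are hard-coded below.
    f, g = eval(a), eval(b)
    return all(f(x) == g(x) for x in range(-3, 4))


original = "lambda x: x + 1"
samples = [
    "lambda x: 1 + x",       # equivalent to the original
    "lambda x: x - 1",       # not equivalent
    "lambda x: x - (-1)",    # equivalent to the original
    "lambda x: abs(x) + 1",  # differs for negative x
]
fer = functional_equivalence_rate(original, samples, toy_judge)
print(fer)  # 2 of the 4 samples match the original, so this prints 0.5
```

A probe-based judge like this one can only check the inputs it tries, which is one reason the scorers delegate the judgment to an LLM instead.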

functional_negentropy clusters the original and sampled responses by functional equivalence, computes entropy over the cluster distribution, and normalizes it to a confidence score. Let \(\mathcal{C}\) denote the set of functional equivalence clusters, and let \(P(C)\) denote the proportion of the \(m + 1\) responses (the original plus the \(m\) samples) that fall in cluster \(C\). Functional entropy is:

\[FE(y; \tilde{\mathbf{y}}) = -\sum_{C \in \mathcal{C}} P(C) \log P(C)\]

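Functional entropy reaches its maximum, \(\log(m + 1)\), when each of the \(m + 1\) responses forms its own cluster. Dividing by this maximum and subtracting from 1 gives the normalized confidence score: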

\[NFN(y; \tilde{\mathbf{y}}) = 1 - \frac{FE(y; \tilde{\mathbf{y}})}{\log(m + 1)}\]
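As a rough sketch of the clustering and normalization steps, the snippet below groups responses greedily by comparing each one to the first member of every existing cluster, then applies the two formulas above. The greedy strategy, the function names, and `toy_judge` (a probe-based stand-in for the LLM judge) are all assumptions made for illustration; the library may cluster differently.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_equivalence(
    responses: Sequence[str],
    is_equivalent: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose representative matches."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if is_equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])  # no match found: start a new cluster
    return clusters


def functional_negentropy(
    original: str,
    samples: Sequence[str],
    is_equivalent: Callable[[str, str], bool],
) -> float:
    """1 - FE / log(m + 1), computed over the original plus m samples."""
    responses = [original, *samples]
    n = len(responses)  # n = m + 1
    if n < 2:
        raise ValueError("at least one sampled response is required")
    clusters = cluster_by_equivalence(responses, is_equivalent)
    entropy = -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
    return 1.0 - entropy / math.log(n)


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in for the LLM judge: compare outputs on probe inputs."""
    f, g = eval(a), eval(b)  # safe only because the snippets are hard-coded
    return all(f(x) == g(x) for x in range(-3, 4))


original = "lambda x: x * 2"
samples = ["lambda x: x + x", "lambda x: 2 * x", "lambda x: x ** 2"]
nfn = functional_negentropy(original, samples, toy_judge)
print(round(nfn, 4))  # clusters of size 3 and 1 out of 4 responses
```

Comparing against a single representative per cluster assumes the equivalence judgments are transitive, which an LLM judge does not guarantee.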

functional_sets_confidence counts the number of functional equivalence clusters and normalizes it to \([0, 1]\):

\[FSC(y; \tilde{\mathbf{y}}) = \frac{m + 1 - |\mathcal{C}|}{m}\]

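where \(|\mathcal{C}|\) is the number of functional equivalence clusters among the original response and the \(m\) sampled responses. The score is 1 when all responses fall in a single cluster and 0 when every response forms its own cluster.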
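To make the scaling concrete, the following sketch evaluates the formula for every possible cluster count with \(m = 5\). The function name and its arguments are illustrative assumptions; it takes the cluster count as given rather than deriving it from LLM judgments.

```python
def functional_sets_confidence(num_clusters: int, m: int) -> float:
    """(m + 1 - |C|) / m for |C| clusters among the original and m samples."""
    if m < 1:
        raise ValueError("at least one sampled response is required")
    if not 1 <= num_clusters <= m + 1:
        raise ValueError("cluster count must lie between 1 and m + 1")
    return (m + 1 - num_clusters) / m


m = 5
scores = {k: functional_sets_confidence(k, m) for k in range(1, m + 2)}
for k, score in scores.items():
    print(k, score)  # the score falls linearly from 1.0 (k = 1) to 0.0 (k = 6)
```

Unlike functional_negentropy, this score depends only on how many clusters there are, not on how the responses are distributed among them.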

Key Properties:

  • Directly targets functional agreement rather than textual similarity

  • Requires an LLM for equivalence judgments

  • Score range: \([0, 1]\)

Parameters

When using CodeGenUQ, specify "functional_equivalence_rate", "functional_negentropy", or "functional_sets_confidence" in the scorers list. You can set equivalence_llm to use a separate model for equivalence judgments.

Example

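from uqlm import CodeGenUQ

# `llm`, `equivalence_llm`, and `prompts` are assumed to be defined beforehand.
code_uq = CodeGenUQ(
    llm=llm,
    equivalence_llm=equivalence_llm,  # separate model for equivalence judgments
    scorers=[
        "functional_equivalence_rate",
        "functional_negentropy",
        "functional_sets_confidence",
    ],
    language="python",
)

# Top-level `await` works in a notebook; in a script, call this from an
# async function run with asyncio.run().
results = await code_uq.generate_and_score(prompts=prompts, num_responses=5)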

References

See Also

  • CodeGenUQ - Class for code-generation uncertainty quantification