Generalized Ensemble#

The Generalized Ensemble allows you to create custom combinations of any black-box, white-box, and LLM-as-a-Judge scorers with configurable weights.

Definition#

Given a set of \(n\) component scorers with scores \(s_1, s_2, ..., s_n\) and weights \(w_1, w_2, ..., w_n\), the ensemble score is:

\[\text{Ensemble}(y_i) = \sum_{k=1}^n w_k \cdot s_k(y_i)\]

where weights are normalized such that \(\sum_{k=1}^n w_k = 1\).
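
For example, with illustrative component scores \(s = (0.9, 0.6, 0.7)\) and normalized weights \(\mathbf{w} = (0.5, 0.3, 0.2)\), the ensemble score is

\[\text{Ensemble}(y_i) = 0.5 \cdot 0.9 + 0.3 \cdot 0.6 + 0.2 \cdot 0.7 = 0.77\]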

Weight Tuning:

Weights can be optimized using labeled data:

\[\mathbf{w}^* = \arg\max_{\mathbf{w}} \text{Objective}(\text{Ensemble}_{\mathbf{w}}, \mathbf{y}_{true})\]

where the objective can be ROC-AUC, F1-score, accuracy, or another classification metric.
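
As a standalone illustration of this objective (a minimal sketch, not uqlm's internal code; the component scores and correctness labels below are made up), the following snippet evaluates one candidate weight vector with scikit-learn's roc_auc_score:

import numpy as np
from sklearn.metrics import roc_auc_score

# Rows = responses, columns = component scorers (toy values)
component_scores = np.array([
    [0.91, 0.80, 0.75],
    [0.42, 0.55, 0.30],
    [0.88, 0.70, 0.95],
    [0.20, 0.35, 0.15],
])
y_true = np.array([1, 0, 1, 0])  # 1 = correct response, 0 = hallucinated

weights = np.array([0.5, 0.3, 0.2])
weights = weights / weights.sum()              # enforce the sum-to-one constraint
ensemble_scores = component_scores @ weights   # weighted average per response

print(roc_auc_score(y_true, ensemble_scores))  # objective value for this candidate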

Available Components#

The generalized ensemble can include any combination of:

Black-Box Scorers:

  • semantic_negentropy

  • semantic_sets_confidence

  • noncontradiction

  • entailment

  • exact_match

  • bert_score

  • cosine_sim

White-Box Scorers:

  • normalized_probability

  • min_probability

LLM-as-a-Judge:

  • Any BaseChatModel instance

  • Any LLMJudge instance

How It Works#

  1. Specify the components to include in the ensemble

  2. Optionally specify initial weights (defaults to equal weights)

  3. Generate responses and compute all component scores

  4. Combine using weighted average

  5. Optionally tune weights on labeled data

Weight Tuning Methods#

UQEnsemble supports automatic weight tuning using:

  • Optuna optimization: Bayesian optimization over weight space

  • Grid search: For threshold optimization

Supported Objectives:

  • roc_auc: Area under ROC curve (default for weights)

  • fbeta_score: F-beta score (default for threshold, uses F1 when beta=1)

  • accuracy_score: Classification accuracy

  • balanced_accuracy_score: Balanced accuracy for imbalanced data

  • log_loss: Logarithmic loss

  • average_precision: Average precision score

  • brier_score: Brier score
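
The sketch below conveys the general idea of the Optuna-based weight search (a conceptual illustration only, reusing the toy data from the earlier snippet; UQEnsemble.tune performs this search internally with the objectives listed above):

import numpy as np
import optuna
from sklearn.metrics import roc_auc_score

# Toy data: per-response scores from 3 component scorers, plus correctness labels
component_scores = np.array([[0.91, 0.80, 0.75], [0.42, 0.55, 0.30],
                             [0.88, 0.70, 0.95], [0.20, 0.35, 0.15]])
y_true = np.array([1, 0, 1, 0])

def objective(trial):
    # Suggest one raw weight per component, then normalize to sum to 1
    raw = np.array([trial.suggest_float(f"w{k}", 0.0, 1.0) for k in range(3)])
    w = raw / max(raw.sum(), 1e-12)
    return roc_auc_score(y_true, component_scores @ w)

study = optuna.create_study(direction="maximize")  # maximize ROC-AUC
study.optimize(objective, n_trials=100)
print(study.best_params)                           # best raw (unnormalized) weights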

Example#

Basic custom ensemble:

from uqlm import UQEnsemble

# Create ensemble with custom components
# (assumes llm and judge_llm are LangChain chat model instances)
ensemble = UQEnsemble(
    llm=llm,
    scorers=[
        "semantic_negentropy",
        "noncontradiction",
        "cosine_sim",
        judge_llm  # LLM-as-a-Judge component
    ],
    weights=[0.3, 0.3, 0.2, 0.2]  # Custom weights
)

# Generate and score
results = await ensemble.generate_and_score(prompts=prompts, num_responses=5)
print(results.to_df()["ensemble_scores"])

Weight tuning example:

from uqlm import UQEnsemble

# Initialize ensemble (weights will be tuned); passing llm in the scorer
# list uses it as an LLM-as-a-Judge component
ensemble = UQEnsemble(
    llm=llm,
    scorers=["semantic_negentropy", "noncontradiction", llm]
)

# Tune weights using labeled data (answers = ground-truth answers for prompts)
results = await ensemble.tune(
    prompts=prompts,
    ground_truth_answers=answers,
    num_responses=5,
    weights_objective="roc_auc",
    thresh_objective="fbeta_score",
    n_trials=100
)

# View optimized weights
ensemble.print_ensemble_weights()

# Save configuration for later use
ensemble.save_config("my_ensemble_config.json")

Loading a saved configuration:

from uqlm import UQEnsemble

# Load previously tuned ensemble
ensemble = UQEnsemble.load_config("my_ensemble_config.json", llm=llm)

# Use with new data
results = await ensemble.generate_and_score(prompts=new_prompts)

See Also#

  • UQEnsemble - Main ensemble class

  • BS Detector - Pre-configured BS Detector ensemble

  • Tuner - Weight optimization utilities