Generalized Ensemble#
The Generalized Ensemble allows you to create custom combinations of any black-box, white-box, and LLM-as-a-Judge scorers with configurable weights.
Definition#
Given a set of \(n\) component scorers with scores \(s_1, s_2, \ldots, s_n\) and weights \(w_1, w_2, \ldots, w_n\), the ensemble score is:
\[\hat{s} = \sum_{k=1}^n w_k s_k\]
where weights are normalized such that \(\sum_{k=1}^n w_k = 1\).
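The weighted combination described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not UQLM's internal implementation; the function name `ensemble_score` and the sample numbers are assumptions made for the example.

```python
from typing import Sequence


def ensemble_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of component scores, with weights normalized to sum to 1.

    Hypothetical helper for illustration; not part of the uqlm package.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so that the weights sum to 1, as the definition requires
    normalized = [w / total for w in weights]
    return sum(w * s for w, s in zip(normalized, scores))


# Three made-up component scores for one response, with unnormalized weights
print(ensemble_score([0.9, 0.6, 0.8], [3, 3, 4]))
```

Because the weights are normalized first, passing `[3, 3, 4]` gives the same result as passing `[0.3, 0.3, 0.4]`.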
Weight Tuning:
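Weights can be optimized using labeled data:
\[w^* = \arg\max_{w} \; \text{Objective}\left(\sum_{k=1}^n w_k s_k, \; y\right)\]
where \(y\) denotes the ground-truth correctness labels and the objective can be ROC-AUC, F1-score, accuracy, or other classification metrics.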
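To make the idea of tuning weights against labeled data concrete, here is a minimal sketch that searches a coarse grid of weight vectors and keeps the one with the highest ROC-AUC. It is an assumption-laden illustration: the helper names `roc_auc` and `tune_weights`, the grid resolution, and the toy data are all hypothetical, and a plain grid is swapped in for the library's optimizer to keep the example dependency-free.

```python
from itertools import product
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tune_weights(
    component_scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    steps: int = 10,
) -> Tuple[List[float], float]:
    """Grid search over weight vectors that sum to 1 (hypothetical helper)."""
    n = len(component_scores[0])
    best_w: List[float] = []
    best_auc = -1.0
    for combo in product(range(steps + 1), repeat=n):
        if sum(combo) != steps:  # keep only points on the simplex
            continue
        w = [c / steps for c in combo]
        combined = [sum(wk * sk for wk, sk in zip(w, row)) for row in component_scores]
        auc = roc_auc(labels, combined)
        if auc > best_auc:  # strict '>' keeps the first best weight vector
            best_w, best_auc = w, auc
    return best_w, best_auc


# Toy data: one row per response, one column per component scorer
rows = [[0.9, 0.2], [0.8, 0.9], [0.3, 0.85], [0.2, 0.1]]
labels = [1, 1, 0, 0]  # 1 = response graded correct, 0 = incorrect
best_w, best_auc = tune_weights(rows, labels)
print(best_w, best_auc)
```

The grid grows exponentially with the number of components, which is one reason a sampler-based optimizer is a more practical choice for larger ensembles.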
Available Components#
The generalized ensemble can include any combination of:
Black-Box Scorers:
semantic_negentropy
semantic_sets_confidence
noncontradiction
entailment
exact_match
bert_score
cosine_sim
White-Box Scorers:
normalized_probability
min_probability
LLM-as-a-Judge:
Any BaseChatModel instance
Any LLMJudge instance
How It Works#
Specify the components to include in the ensemble
Optionally specify initial weights (defaults to equal weights)
Generate responses and compute all component scores
Combine using weighted average
Optionally tune weights on labeled data
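The first four steps above can be sketched end to end with toy components. Everything here is a stand-in chosen for illustration: `exact_match_rate`, `token_overlap`, and `score_with_ensemble` are hypothetical names, and the two scorers are simplified proxies rather than the library's scorers of similar names.

```python
from typing import Callable, Dict, List, Optional


def exact_match_rate(responses: List[str]) -> float:
    """Fraction of the other sampled responses that exactly match the first one."""
    first, rest = responses[0], responses[1:]
    return sum(r == first for r in rest) / len(rest)


def token_overlap(responses: List[str]) -> float:
    """Mean Jaccard token overlap between the first response and each other one."""
    first = set(responses[0].lower().split())
    overlaps = []
    for r in responses[1:]:
        other = set(r.lower().split())
        union = first | other
        overlaps.append(len(first & other) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps)


def score_with_ensemble(
    responses: List[str],
    scorers: Dict[str, Callable[[List[str]], float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Compute every component score, then combine them with a weighted average."""
    if len(responses) < 2:
        raise ValueError("need at least two sampled responses")
    if weights is None:
        # Step 2: default to equal weights when none are given
        weights = [1.0 / len(scorers)] * len(scorers)
    # Step 3: compute all component scores
    component_scores = {name: fn(responses) for name, fn in scorers.items()}
    # Step 4: weighted average (dicts preserve insertion order, matching the weights)
    ensemble = sum(w * s for w, s in zip(weights, component_scores.values()))
    return {**component_scores, "ensemble_score": ensemble}


# Step 1: choose the components; the sampled responses are made-up data
sampled = ["Paris", "Paris", "paris", "Lyon"]
result = score_with_ensemble(
    sampled, {"exact_match_rate": exact_match_rate, "token_overlap": token_overlap}
)
print(result)
```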
Weight Tuning Methods#
UQEnsemble supports automatic tuning of the ensemble weights and the classification threshold using:
Optuna optimization: Bayesian optimization over weight space
Grid search: For threshold optimization
Supported Objectives:
roc_auc: Area under ROC curve (default for weights)
fbeta_score: F-beta score (default for threshold, uses F1 when beta=1)
accuracy_score: Classification accuracy
balanced_accuracy_score: Balanced accuracy for imbalanced data
log_loss: Logarithmic loss
average_precision: Average precision score
brier_score: Brier score
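As a rough illustration of grid search for threshold optimization with the F-beta objective, the sketch below scans evenly spaced thresholds and keeps the one with the best score. The helper names `fbeta` and `tune_threshold`, the 101-point grid, and the sample data are assumptions for this example and may differ from what the library does internally.

```python
from typing import Sequence, Tuple


def fbeta(labels: Sequence[int], preds: Sequence[int], beta: float = 1.0) -> float:
    """F-beta score from binary labels and predictions (F1 when beta=1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def tune_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    beta: float = 1.0,
    n_grid: int = 101,
) -> Tuple[float, float]:
    """Scan thresholds in [0, 1]; a response is predicted correct when score >= threshold."""
    best_t, best_f = 0.0, -1.0
    for i in range(n_grid):
        t = i / (n_grid - 1)
        preds = [1 if s >= t else 0 for s in scores]
        f = fbeta(labels, preds, beta)
        if f > best_f:  # strict '>' keeps the lowest threshold among ties
            best_t, best_f = t, f
    return best_t, best_f


# Made-up ensemble scores and correctness labels (1 = correct)
scores = [0.9, 0.8, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0]
best_t, best_f = tune_threshold(scores, labels)
print(best_t, best_f)
```

Raising `beta` above 1 weights recall more heavily, which tends to push the selected threshold lower.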
Example#
Basic custom ensemble:
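from uqlm import UQEnsemble
# Assumes `llm` and `judge_llm` are chat model instances and `prompts` is a list of prompts, all defined elsewhere
# Create ensemble with custom components
ensemble = UQEnsemble(
    llm=llm,
    scorers=[
        "semantic_negentropy",
        "noncontradiction",
        "cosine_sim",
        judge_llm,  # LLM-as-a-Judge component
    ],
    weights=[0.3, 0.3, 0.2, 0.2],  # Custom weights, one per scorer
)
# Generate and score (top-level `await` requires an async context, such as a notebook)
results = await ensemble.generate_and_score(prompts=prompts, num_responses=5)
print(results.to_df()["ensemble_scores"])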
Weight tuning example:
from uqlm import UQEnsemble
# Initialize ensemble (weights will be tuned)
ensemble = UQEnsemble(
    llm=llm,
    scorers=["semantic_negentropy", "noncontradiction", llm],
)
# Tune weights using labeled data
results = await ensemble.tune(
    prompts=prompts,
    ground_truth_answers=answers,
    num_responses=5,
    weights_objective="roc_auc",
    thresh_objective="fbeta_score",
    n_trials=100,
)
# View optimized weights
ensemble.print_ensemble_weights()
# Save configuration for later use
ensemble.save_config("my_ensemble_config.json")
Loading a saved configuration:
from uqlm import UQEnsemble
# Load previously tuned ensemble
ensemble = UQEnsemble.load_config("my_ensemble_config.json", llm=llm)
# Use with new data
results = await ensemble.generate_and_score(prompts=new_prompts)
References#
Bouchard, D. & Chauhan, M. S. (2025). Generalized Ensembles for Robust Uncertainty Quantification of LLMs. arXiv.
See Also#
UQEnsemble - Main ensemble class
BS Detector - Pre-configured BS Detector ensemble
Tuner - Weight optimization utilities