uqlm.judges.judge.LLMJudge#

class uqlm.judges.judge.LLMJudge(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#

Bases: ResponseGenerator

__init__(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#

Class for using LLM-as-a-judge to score proposed answers to questions based on correctness. Three off-the-shelf templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1). Customization is also supported for user-provided classification-based judging templates. The incorrect/uncertain/correct template is based on Chen and Mueller (2023) [1].

Parameters:
  • llm (langchain llm object) – A langchain llm object to be passed to the chain constructor. The user is responsible for specifying temperature and other relevant parameters in the constructor of their llm object.

  • max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is imposed.

  • scoring_template ({'true_false_uncertain', 'true_false', 'continuous'}, default='true_false_uncertain') – Specifies which off-the-shelf template to use, if any. Three are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1), specified as 'true_false_uncertain', 'true_false', and 'continuous', respectively.

  • system_prompt (str or None, default=None) – Optional argument for user to provide custom system prompt. If None, a default instruction system prompt will be used.

  • template_ques_ans (f-string, default=None) – Template for the self-reflection question, which combines the question and answer used to compute the LLM judge score. Use this to define the expected LLM response format; if you do, update keywords_to_scores_dict accordingly. Must be formatted so that template_ques_ans.format(question, answer) places the question and answer appropriately in the string. Defaults to a variation of Chen and Mueller (2023).

  • keywords_to_scores_dict (dict, default=None) – Keys must be scores; values must be lists of keyword strings to search for. If None, the default dictionary is used: {0.0: ["incorrect", "not correct", "not right"], 0.5: ["not sure", "not certain", "unsure", "uncertain"], 1.0: ["correct", "right"]}.
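The keyword-matching behavior implied by keywords_to_scores_dict can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not uqlm's actual code; the function name extract_score is hypothetical.

```python
# Illustrative sketch of keyword-based score extraction, assuming the behavior
# implied by keywords_to_scores_dict. Not uqlm's actual implementation;
# the name `extract_score` is hypothetical.

DEFAULT_KEYWORDS_TO_SCORES = {
    0.0: ["incorrect", "not correct", "not right"],
    0.5: ["not sure", "not certain", "unsure", "uncertain"],
    1.0: ["correct", "right"],
}

def extract_score(judge_response, keywords_to_scores=DEFAULT_KEYWORDS_TO_SCORES):
    """Return the score whose keywords appear in the judge's reply, or None."""
    text = judge_response.lower()
    # Check lower scores first so "not correct" wins over its substring "correct".
    for score in sorted(keywords_to_scores):
        if any(keyword in text for keyword in keywords_to_scores[score]):
            return score
    return None  # extraction failed; the caller may retry

print(extract_score("The answer is not correct."))  # 0.0
print(extract_score("I'm unsure about this one."))  # 0.5
print(extract_score("Yes, the answer is correct."))  # 1.0
```

Checking lower scores first matters because "correct" is a substring of both "incorrect" and "not correct"; a custom dictionary with overlapping keywords needs the same care.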

Methods

__init__(llm[, max_calls_per_min, ...])

Class for using LLM-as-a-judge to score proposed answers to questions based on correctness.

generate_responses(prompts[, system_prompt, ...])

Generates evaluation dataset from a provided set of prompts.

judge_responses(prompts, responses[, retries])

Judge responses for correctness.

async generate_responses(prompts, system_prompt='You are a helpful assistant.', count=1)#

Generates an evaluation dataset from a provided set of prompts. For each prompt, count responses are generated.

Parameters:
  • prompts (list of strings) – List of prompts from which LLM responses will be generated

  • system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for user to provide custom system prompt

  • count (int, default=1) – Specifies number of responses to generate for each prompt.

Returns:

A dictionary with two keys: 'data' and 'metadata'.

  • 'data' (dict) – A dictionary containing the prompts and responses.

    • 'prompt' (list) – A list of prompts.

    • 'response' (list) – A list of responses corresponding to the prompts.

  • 'metadata' (dict) – A dictionary containing metadata about the generation process.

    • 'temperature' (float) – The temperature parameter used in the generation process.

    • 'count' (int) – The number of responses generated for each prompt.

    • 'system_prompt' (str) – The system prompt used for generating responses.

Return type:

dict
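The documented return schema can be illustrated with a hypothetical result; every value below is invented for illustration only.

```python
# Hypothetical return value mirroring the documented schema of
# generate_responses; the prompt, response, and metadata values are invented.
result = {
    "data": {
        "prompt": ["What is 2 + 2?"],
        "response": ["2 + 2 equals 4."],
    },
    "metadata": {
        "temperature": 1.0,
        "count": 1,
        "system_prompt": "You are a helpful assistant.",
    },
}

# Typical unpacking for downstream judging: prompts and responses
# are parallel lists under the 'data' key.
prompts = result["data"]["prompt"]
responses = result["data"]["response"]
print(len(prompts) == len(responses))  # True
```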

async judge_responses(prompts, responses, retries=5)#

Judge responses for correctness.

Parameters:
  • prompts (list of str) – A list of input prompts for the model.

  • responses (list of str) – A list of model responses for the provided prompts.

  • retries (int, default=5) – Number of times to retry when score extraction fails.

Returns:

Dictionary containing Q/A concatenation prompts, judge responses, and judge scores.

Return type:

Dict
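The retries parameter suggests a retry loop around score extraction. The following plain-Python sketch shows that pattern, not uqlm's internals; judge_with_retries, ask_judge, and extract are all hypothetical names.

```python
# Sketch of the retry pattern suggested by the `retries` parameter:
# re-query the judge whenever no score can be extracted from its reply.
# `judge_with_retries`, `ask_judge`, and `extract` are hypothetical, not uqlm API.

def judge_with_retries(ask_judge, extract, retries=5):
    """Call the judge up to retries + 1 times until a score is extracted."""
    for _ in range(retries + 1):
        score = extract(ask_judge())
        if score is not None:
            return score
    return None  # every attempt failed score extraction

# Demo: a fake judge that produces one unparseable reply, then a verdict.
replies = iter(["(no verdict)", "The answer is correct."])
extract = lambda text: 1.0 if "correct" in text.lower() else None
print(judge_with_retries(lambda: next(replies), extract))  # 1.0
```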

References

[1] Chen, Jiuhai, and Jonas Mueller. "Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness." 2023.