uqlm.judges.judge.LLMJudge#
- class uqlm.judges.judge.LLMJudge(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#
Bases: ResponseGenerator
- __init__(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#
Class for using an LLM-as-a-judge to score proposed answers to questions for correctness. Three off-the-shelf templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1). User-provided classification-based judging templates are also supported. The incorrect/uncertain/correct template is based on Chen and Mueller (2023) [1].
- Parameters:
llm (langchain llm object) – A langchain llm object to be passed to the chain constructor. The user is responsible for specifying temperature and other relevant parameters in the constructor of their llm object.
max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is imposed.
scoring_template ({'true_false_uncertain', 'true_false', 'continuous'}, default='true_false_uncertain') – Specifies which off-the-shelf template to use, if any. Three off-the-shelf templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1). These templates are specified as 'true_false_uncertain', 'true_false', and 'continuous', respectively.
system_prompt (str or None, default=None) – Optional argument for the user to provide a custom system prompt. If None, a default instruction system prompt is used.
template_ques_ans (f-string, default=None) – Template for the self-reflection question, which combines the question and answer to compute the LLM judge score. Use this to define the expected LLM response format; if you do, update keywords_to_scores_dict accordingly. Must be formatted so that template_ques_ans.format(question, answer) places the question and answer appropriately in the string. Defaults to a variation of the template from Chen and Mueller (2023).
keywords_to_scores_dict (dict, default=None) – Keys must be scores; values must be lists of strings containing keywords to search for. If None, the default dictionary is used: {0.0: ["incorrect", "not correct", "not right"], 0.5: ["not sure", "not certain", "unsure", "uncertain"], 1.0: ["correct", "right"]}
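As a concrete illustration, here is a hypothetical custom template and a matching keywords_to_scores_dict. The prompt wording and keywords below are examples for this sketch, not uqlm's defaults:

```python
# Hypothetical custom judge template and matching keywords_to_scores_dict.
# The wording and keywords below are illustrative, not uqlm's defaults.

# template_ques_ans must work with positional .format(question, answer)
template_ques_ans = (
    "Question: {0}\n"
    "Proposed answer: {1}\n"
    "Reply with exactly one word: CORRECT, INCORRECT, or UNSURE."
)

# Keys are scores; values are keywords to search for in the judge's reply.
# Note that overlapping keywords (e.g. "incorrect" contains "correct")
# make match order matter when defining a custom dictionary.
keywords_to_scores_dict = {
    0.0: ["incorrect"],
    0.5: ["unsure"],
    1.0: ["correct"],
}

prompt = template_ques_ans.format("What is 2 + 2?", "4")
print(prompt)
```

Because the template constrains the judge to a one-word reply, each keyword in the dictionary maps unambiguously to one of the three scores.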
Methods
- __init__(llm[, max_calls_per_min, ...]): Class for using LLM-as-a-judge to score proposed answers to questions based on correctness.
- generate_responses(prompts[, system_prompt, ...]): Generates evaluation dataset from a provided set of prompts.
- judge_responses(prompts, responses[, retries]): Judge responses for correctness.
- async generate_responses(prompts, system_prompt='You are a helpful assistant.', count=1)#
Generates an evaluation dataset from a provided set of prompts. For each prompt, count responses are generated.
- Return type:
Dict[str, Any]
- Parameters:
prompts (list of strings) – List of prompts from which LLM responses will be generated
system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for the user to provide a custom system prompt.
count (int, default=1) – Specifies number of responses to generate for each prompt.
- Returns:
A dictionary with two keys: ‘data’ and ‘metadata’.
- 'data' : dict
A dictionary containing the prompts and responses.
- 'prompt' : list
A list of prompts.
- 'response' : list
A list of responses corresponding to the prompts.
- 'metadata' : dict
A dictionary containing metadata about the generation process.
- 'temperature' : float
The temperature parameter used in the generation process.
- 'count' : int
The number of responses generated per prompt.
- 'system_prompt' : str
The system prompt used for generating responses.
- Return type:
dict
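The returned dictionary has the shape sketched below; all values here are fabricated placeholders for illustration, not real model output:

```python
# Illustrative shape of the dictionary returned by generate_responses.
# All values below are fabricated placeholders, not real model output.
result = {
    "data": {
        "prompt": ["What is the capital of France?"],
        "response": ["Paris"],
    },
    "metadata": {
        "temperature": 1.0,  # whatever the llm object was configured with
        "count": 1,          # responses generated per prompt
        "system_prompt": "You are a helpful assistant.",
    },
}

# 'prompt' and 'response' are parallel lists, so they always have equal length.
print(sorted(result["data"]))
```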
- async judge_responses(prompts, responses, retries=5)#
Judge responses for correctness.
- Return type:
Dict[str, Any]
- Parameters:
prompts (list of str) – A list of input prompts for the model.
responses (list of str) – A list of model responses for the provided prompts.
retries (int, default=5) – Number of times to retry if score extraction fails.
- Returns:
Dictionary containing Q/A concatenation prompts, judge responses, and judge scores
- Return type:
Dict
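To illustrate the idea behind keyword-based score extraction, here is a minimal sketch, not uqlm's actual implementation, that maps a judge reply to a score using the default keyword dictionary. The longest-keyword-first matching order is an assumption of this sketch, used so that "not correct" wins over "correct":

```python
# Hypothetical sketch of classification-based score extraction, the idea
# behind keywords_to_scores_dict. Not uqlm's actual implementation.

def extract_score(judge_reply, keywords_to_scores):
    """Return the score whose keyword matches the reply, else None."""
    reply = judge_reply.lower()
    # Check longer keywords first so "not correct" matches before "correct".
    pairs = sorted(
        ((kw, score) for score, kws in keywords_to_scores.items() for kw in kws),
        key=lambda p: len(p[0]),
        reverse=True,
    )
    for kw, score in pairs:
        if kw in reply:
            return score
    return None  # extraction failed; a caller could retry the judge call

# Default dictionary from the class docstring above.
keywords = {
    0.0: ["incorrect", "not correct", "not right"],
    0.5: ["not sure", "not certain", "unsure", "uncertain"],
    1.0: ["correct", "right"],
}

print(extract_score("The answer is correct.", keywords))
print(extract_score("This is not correct.", keywords))
```

When extraction returns None, the retries parameter of judge_responses governs how many additional judge calls are attempted before giving up.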
References
[1] Chen, Jiuhai and Jonas Mueller. "Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness." (2023).