uqlm.judges.judge.LLMJudge#

class uqlm.judges.judge.LLMJudge(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#

Bases: ResponseGenerator

__init__(llm, max_calls_per_min=None, scoring_template='true_false_uncertain', system_prompt=None, template_ques_ans=None, keywords_to_scores_dict=None)#

Class for using LLM-as-a-judge to score proposed answers to questions based on correctness. Three off-the-shelf templates are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1). Customization is also supported for user-provided classification-based judging templates. The incorrect/uncertain/correct template is based on Chen and Mueller (2023) [1].

Parameters:
  • llm (langchain llm object) – A langchain llm object to be passed to the chain constructor. The user is responsible for specifying temperature and other relevant parameters in the constructor of their llm object.

  • max_calls_per_min (int, default=None) – Specifies how many API calls to make per minute to avoid rate limit errors. By default, no limit is imposed.

  • scoring_template ({'true_false_uncertain', 'true_false', 'continuous'}, default='true_false_uncertain') – Specifies which off-the-shelf template to use, if any. Three are offered: incorrect/uncertain/correct (0/0.5/1), incorrect/correct (0/1), and continuous score (0 to 1), specified as 'true_false_uncertain', 'true_false', and 'continuous', respectively.

  • system_prompt (str or None, default=None) – Optional argument for user to provide custom system prompt. If None, a default instruction system prompt will be used.

  • template_ques_ans (f-string, default=None) – Template for the self-reflection question, which combines the question and answer used to compute the LLM judge score. Use this to define the expected LLM response format; if you do, update keywords_to_scores_dict accordingly. Must be formatted so that template_ques_ans.format(question, answer) places the question and answer appropriately in the string. Defaults to a variation of Chen and Mueller (2023).

  • keywords_to_scores_dict (dict, default=None) – Keys must be scores; values must be lists of keyword strings to search for. If None, the default dictionary is used: {0.0: ["incorrect", "not correct", "not right"], 0.5: ["not sure", "not certain", "unsure", "uncertain"], 1.0: ["correct", "right"]}.
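The keyword-matching behavior implied by keywords_to_scores_dict can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not uqlm's actual code; the function name extract_score is hypothetical.

```python
# Illustrative sketch of keyword-based score extraction, assuming the behavior
# implied by keywords_to_scores_dict. Not uqlm's actual implementation;
# the name `extract_score` is hypothetical.

DEFAULT_KEYWORDS_TO_SCORES = {
    0.0: ["incorrect", "not correct", "not right"],
    0.5: ["not sure", "not certain", "unsure", "uncertain"],
    1.0: ["correct", "right"],
}

def extract_score(judge_response, keywords_to_scores=DEFAULT_KEYWORDS_TO_SCORES):
    """Return the score whose keywords appear in the judge's reply, or None."""
    text = judge_response.lower()
    # Check lower scores first so "not correct" wins over its substring "correct".
    for score in sorted(keywords_to_scores):
        if any(keyword in text for keyword in keywords_to_scores[score]):
            return score
    return None  # extraction failed; the caller may retry

print(extract_score("The answer is not correct."))  # 0.0
print(extract_score("I'm unsure about this one."))  # 0.5
print(extract_score("Yes, the answer is correct."))  # 1.0
```

Checking lower scores first matters because "correct" is a substring of both "incorrect" and "not correct"; a custom dictionary with overlapping keywords needs the same care.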

Methods

__init__(llm[, max_calls_per_min, ...])

Class for using LLM-as-a-judge to score proposed answers to questions based on correctness.

generate_responses(prompts[, system_prompt, ...])

Generates evaluation dataset from a provided set of prompts.

judge_responses(prompts, responses[, retries])

Judge responses for correctness.

async generate_responses(prompts, system_prompt='You are a helpful assistant.', count=1)#

Generates an evaluation dataset from a provided set of prompts. For each prompt, count responses are generated.

Parameters:
  • prompts (list of strings) – List of prompts from which LLM responses will be generated

  • system_prompt (str or None, default="You are a helpful assistant.") – Optional argument for user to provide custom system prompt

  • count (int, default=1) – Specifies number of responses to generate for each prompt.

Returns:

A dictionary with two keys: 'data' and 'metadata'.

  • 'data' (dict) – A dictionary containing the prompts and responses.

    • 'prompt' (list) – A list of prompts.

    • 'response' (list) – A list of responses corresponding to the prompts.

  • 'metadata' (dict) – A dictionary containing metadata about the generation process.

    • 'temperature' (float) – The temperature parameter used in the generation process.

    • 'count' (int) – The number of responses generated for each prompt.

    • 'system_prompt' (str) – The system prompt used for generating responses.

Return type:

dict
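The documented return schema can be illustrated with a hypothetical result; every value below is invented for illustration only.

```python
# Hypothetical return value mirroring the documented schema of
# generate_responses; the prompt, response, and metadata values are invented.
result = {
    "data": {
        "prompt": ["What is 2 + 2?"],
        "response": ["2 + 2 equals 4."],
    },
    "metadata": {
        "temperature": 1.0,
        "count": 1,
        "system_prompt": "You are a helpful assistant.",
    },
}

# Typical unpacking for downstream judging: prompts and responses
# are parallel lists under the 'data' key.
prompts = result["data"]["prompt"]
responses = result["data"]["response"]
print(len(prompts) == len(responses))  # True
```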

async judge_responses(prompts, responses, retries=5)#

Judge responses for correctness.

Parameters:
  • prompts (list of str) – A list of input prompts for the model.

  • responses (list of str) – A list of model responses for the provided prompts.

  • retries (int, default=5) – Number of times to retry when score extraction fails.

Returns:

Dictionary containing Q/A concatenation prompts, judge responses, and judge scores.

Return type:

Dict
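The retries parameter suggests a retry loop around score extraction. The following plain-Python sketch shows that pattern, not uqlm's internals; judge_with_retries, ask_judge, and extract are all hypothetical names.

```python
# Sketch of the retry pattern suggested by the `retries` parameter:
# re-query the judge whenever no score can be extracted from its reply.
# `judge_with_retries`, `ask_judge`, and `extract` are hypothetical, not uqlm API.

def judge_with_retries(ask_judge, extract, retries=5):
    """Call the judge up to retries + 1 times until a score is extracted."""
    for _ in range(retries + 1):
        score = extract(ask_judge())
        if score is not None:
            return score
    return None  # every attempt failed score extraction

# Demo: a fake judge that produces one unparseable reply, then a verdict.
replies = iter(["(no verdict)", "The answer is correct."])
extract = lambda text: 1.0 if "correct" in text.lower() else None
print(judge_with_retries(lambda: next(replies), extract))  # 1.0
```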

References

[1] Chen, Jiuhai, and Jonas Mueller. "Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness." 2023.