Demo of ResponseGenerator class#

Import necessary libraries for the notebook.

[1]:
# Run if python-dotenv not installed
# import sys
# !{sys.executable} -m pip install python-dotenv

import os
import time

import openai
import pandas as pd
from dotenv import load_dotenv

from langfair.generator import ResponseGenerator
[2]:
# User to populate .env file with API credentials
repo_path = '/'.join(os.getcwd().split('/')[:-2])
load_dotenv(os.path.join(repo_path, '.env'))

API_KEY = os.getenv('API_KEY')
API_BASE = os.getenv('API_BASE')
API_TYPE = os.getenv('API_TYPE')
API_VERSION = os.getenv('API_VERSION')
MODEL_VERSION = os.getenv('MODEL_VERSION')
DEPLOYMENT_NAME = os.getenv('DEPLOYMENT_NAME')
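
The .env file should define the variables read above. A minimal sketch with placeholder values (replace with your own Azure OpenAI credentials; the API version shown is only illustrative) might look like:

API_KEY=<your-api-key>
API_BASE=https://<your-resource>.openai.azure.com/
API_TYPE=azure
API_VERSION=2023-07-01-preview
MODEL_VERSION=<model-version>
DEPLOYMENT_NAME=<your-deployment-name>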

Read in prompts from which responses will be generated.

[ ]:
# THIS IS AN EXAMPLE SET OF PROMPTS. USER TO REPLACE WITH THEIR OWN PROMPTS
from langfair.utils.dataloader import load_realtoxicity

prompts = load_realtoxicity(n=1000)
print(f"\nExample prompt\n{'-'*14}\n'{prompts[0]}'")

ResponseGenerator() - Class for generating evaluation data from a provided set of prompts.

Class parameters:

  • langchain_llm - (langchain Runnable, default=None) A LangChain LLM object that gets passed to the llm argument of LLMChain.

  • suppressed_exceptions - (tuple, default=None) Specifies which exceptions to handle as ‘Unable to get response’ rather than raising the exception.

  • max_calls_per_min - (Deprecated as of 0.2.0) Use LangChain’s InMemoryRateLimiter instead.

Below we use LangFair’s ResponseGenerator class to generate LLM responses. To instantiate the ResponseGenerator class, pass a LangChain LLM object as an argument. Note that although this notebook uses AzureChatOpenAI, this can be replaced with a LangChain LLM of your choice.

[4]:
# # Run if langchain-openai not installed
# import sys
# !{sys.executable} -m pip install langchain-openai

# Example with AzureChatOpenAI. REPLACE WITH YOUR LLM OF CHOICE.
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    azure_endpoint=API_BASE,
    openai_api_type=API_TYPE,
    openai_api_version=API_VERSION,
    temperature=1 # User to set temperature
)
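
Any other LangChain chat model can be swapped in here. As a minimal sketch (assuming a standard OpenAI API key in your .env; the model name is illustrative), a non-Azure OpenAI chat model could be instantiated instead:

# Hypothetical alternative to AzureChatOpenAI; model name is illustrative
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # illustrative model name
    api_key=API_KEY,      # assumes a standard OpenAI API key in your .env
    temperature=1         # User to set temperature
)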
[5]:
# Create langfair ResponseGenerator object
rg = ResponseGenerator(
    langchain_llm=llm,
    suppressed_exceptions=(openai.BadRequestError, ValueError), # this suppresses content filtering errors
)

Estimate token costs before generation#

estimate_token_cost() - Estimates the token cost for a given list of prompts and (optionally) example responses. This method is only compatible with GPT models.

Method Parameters:

  • prompts - (list of strings) A list of prompts.

  • example_responses - (list of strings, optional) A list of example responses. If provided, the function will estimate the response tokens based on these examples.

  • tiktoken_model_name - (str, optional) The name of the OpenAI model to use for token counting.

  • response_sample_size - (int, default=30) The number of responses to generate for cost estimation if example_responses is not provided.

  • system_prompt - (str, default=”You are a helpful assistant.”) The system prompt to use.

  • count - (int, default=25) The number of generations per prompt used when estimating cost.

Returns:

  • A dictionary containing the estimated token costs, including prompt token cost, completion token cost, and total token cost.

[6]:
for model_name in ["gpt-3.5-turbo-16k-0613", "gpt-4-32k-0613"]:
    estimated_cost = await rg.estimate_token_cost(tiktoken_model_name=model_name, prompts=prompts, count=1)
    print(f"Estimated cost for {model_name}: $", round(estimated_cost['Estimated Total Token Cost (USD)'],2))
Token costs were last updated on 10/21/2024 and may have changed since then.
Estimating cost based on 1 generations per prompt...
Generating sample of responses for cost estimation...
Generating 1 responses per prompt...
Responses successfully generated!
Estimated cost for gpt-3.5-turbo-16k-0613: $ 0.6
Token costs were last updated on 10/21/2024 and may have changed since then.
Estimating cost based on 1 generations per prompt...
Generating sample of responses for cost estimation...
Generating 1 responses per prompt...
Responses successfully generated!
Estimated cost for gpt-4-32k-0613: $ 9.16

Note that using GPT-4 is considerably more expensive than GPT-3.5.

Evaluating Response Time: Asynchronous Generation with ResponseGenerator vs Synchronous Generation with openai.chat.completions.create#

Generate responses asynchronously with ResponseGenerator

generate_responses() - Generates an evaluation dataset from a provided set of prompts. For each prompt, count responses are generated.

Method Parameters:

  • prompts - (list of strings) A list of prompts.

  • system_prompt - (str or None, default=”You are a helpful assistant.”) Specifies the system prompt used when generating LLM responses.

  • count - (int, default=25) Specifies number of responses to generate for each prompt.

Returns:

A dictionary with two keys: data and metadata.

  • data (dict) A dictionary containing the prompts and responses.

  • metadata (dict) A dictionary containing metadata about the generation process, including non-completion rate, temperature, and count.

[7]:
# Generate 1 response per prompt for 200 prompts
start = time.time()
async_responses = await rg.generate_responses(
    prompts=prompts[0:200],
    count=1
)
stop = time.time()
print(f"Time elapsed for asynchronous generation: {stop - start}")
Generating 1 responses per prompt...
Responses successfully generated!
Time elapsed for asynchronous generation: 17.648038625717163
[8]:
pd.DataFrame(async_responses['data'])
[8]:
prompt response
0 Corruption involving the contractors is the ch... report released by an independent watchdog gro...
1 5][6][7] The term was coined by the author Maki It seems like there is some missing context in...
2 As for the internals, the people with the most... typically the executives and upper management ...
3 On March 2, 2017, roughly 100 of our I'm sorry, but your message appears to be inco...
4 Syrian Democratic Forces’ spokesman, Talal Sil... I'm sorry, but I couldn't find any recent info...
... ... ...
195 The poet responsible for those words might not... Brandis, but indeed he should be. The poet res...
196 “Previously we had a hacked together experienc... combination of different software tools to man...
197 In my case, I started life as a first-born son, That's an interesting starting point! Being a ...
198 Channel is part of the government While I am an AI assistant and cannot provide ...
199 Like Van Gogh, he wanted to make passionate im... their emotions. Van Gogh was known for his exp...

200 rows × 2 columns

[9]:
async_responses['metadata']
[9]:
{'non_completion_rate': 0.005,
 'system_prompt': 'You are a helpful assistant.',
 'temperature': 1.0,
 'count': 1}
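
If you wish to persist the generated prompt-response pairs for later evaluation, a minimal sketch (the file name is illustrative) is:

# Save prompt/response pairs for downstream evaluation (file name is illustrative)
pd.DataFrame(async_responses['data']).to_csv('async_responses.csv', index=False)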

Generate responses synchronously for comparison#

[10]:
def openai_api_call(prompt, system_prompt="You are a helpful assistant.", model="exai-gpt-35-turbo-16k"):
    try:
        completion = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ]
        )
        return completion.choices[0].message.content
    except openai.BadRequestError:
        return "Unable to get response"
[10]:
openai.api_key = API_KEY
openai.azure_endpoint = API_BASE
openai.model_version = MODEL_VERSION
openai.api_version = API_VERSION
openai.api_type = API_TYPE

start = time.time()
sync_responses = [openai_api_call(prompt) for prompt in prompts[0:200]]
stop = time.time()
print(f"Time elapsed for synchronous generation: {stop - start}")
Time elapsed for synchronous generation: 370.58987402915955

Note that asynchronous generation with ResponseGenerator is significantly faster than synchronous generation.

Handling RateLimitError with ResponseGenerator#

Passing too many requests asynchronously will trigger a RateLimitError. For our ‘exai-gpt-35-turbo-16k’ deployment, 1000 prompts at 25 generations per prompt exceeds the rate limit when submitted asynchronously.

[9]:
responses = await rg.generate_responses(prompts=prompts)
langfair: Generating 25 responses per prompt...
---------------------------------------------------------------------------
RateLimitError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 responses = await rg.generate_responses(prompts=prompts_df.head(1000).prompt)

File ~/PUBLIC/langfair/langfair/generator/generator.py:231, in ResponseGenerator.generate_responses(self, prompts, system_prompt, count)
    229 # set up langchain and generate asynchronously
    230 chain = self._setup_langchain(system_message=system_prompt)
--> 231 generations, duplicated_prompts = await self._generate_in_batches(
    232     chain=chain, prompts=prompts
    233 )
    234 responses = []
    235 for response in generations:

File ~/PUBLIC/langfair/langfair/generator/generator.py:342, in ResponseGenerator._generate_in_batches(self, chain, prompts, system_prompts)
    338 # generate responses for current batch
    339 tasks, duplicated_batch_prompts = self._task_creator(
    340     chain, prompt_batch, system_prompts
    341 )
--> 342 responses_batch = await asyncio.gather(*tasks)
    344 # extend lists to include current batch
    345 duplicated_prompts.extend(duplicated_batch_prompts)

File ~/PUBLIC/langfair/langfair/generator/generator.py:364, in ResponseGenerator._async_api_call(chain, prompt, system_text, count)
    362 """Generates responses asynchronously using an LLMChain object"""
    363 try:
--> 364     result = await chain.agenerate(
    365         [{"text": prompt, "system_text": system_text}]
    366     )
    367     return [result.generations[0][i].text for i in range(count)]
    368 except (
    369     openai.APIConnectionError,
    370     openai.NotFoundError,
   (...)
    374     openai.RateLimitError,
    375 ):

File /opt/conda/envs/langfair/lib/python3.9/site-packages/langchain/chains/llm.py:165, in LLMChain.agenerate(self, input_list, run_manager)
    163 callbacks = run_manager.get_child() if run_manager else None
    164 if isinstance(self.llm, BaseLanguageModel):
--> 165     return await self.llm.agenerate_prompt(
    166         prompts,
    167         stop,
    168         callbacks=callbacks,
    169         **self.llm_kwargs,
    170     )
    171 else:
    172     results = await self.llm.bind(stop=stop, **self.llm_kwargs).abatch(
    173         cast(List, prompts), {"callbacks": callbacks}
    174     )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py:570, in BaseChatModel.agenerate_prompt(self, prompts, stop, callbacks, **kwargs)
    562 async def agenerate_prompt(
    563     self,
    564     prompts: List[PromptValue],
   (...)
    567     **kwargs: Any,
    568 ) -> LLMResult:
    569     prompt_messages = [p.to_messages() for p in prompts]
--> 570     return await self.agenerate(
    571         prompt_messages, stop=stop, callbacks=callbacks, **kwargs
    572     )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py:530, in BaseChatModel.agenerate(self, messages, stop, callbacks, tags, metadata, run_name, run_id, **kwargs)
    517     if run_managers:
    518         await asyncio.gather(
    519             *[
    520                 run_manager.on_llm_end(
   (...)
    528             ]
    529         )
--> 530     raise exceptions[0]
    531 flattened_outputs = [
    532     LLMResult(generations=[res.generations], llm_output=res.llm_output)  # type: ignore[list-item, union-attr]
    533     for res in results
    534 ]
    535 llm_output = self._combine_llm_outputs([res.llm_output for res in results])  # type: ignore[union-attr]

File /opt/conda/envs/langfair/lib/python3.9/site-packages/langchain_core/language_models/chat_models.py:715, in BaseChatModel._agenerate_with_cache(self, messages, stop, run_manager, **kwargs)
    713 else:
    714     if inspect.signature(self._agenerate).parameters.get("run_manager"):
--> 715         result = await self._agenerate(
    716             messages, stop=stop, run_manager=run_manager, **kwargs
    717         )
    718     else:
    719         result = await self._agenerate(messages, stop=stop, **kwargs)

File /opt/conda/envs/langfair/lib/python3.9/site-packages/langchain_openai/chat_models/base.py:623, in BaseChatOpenAI._agenerate(self, messages, stop, run_manager, **kwargs)
    621 message_dicts, params = self._create_message_dicts(messages, stop)
    622 params = {**params, **kwargs}
--> 623 response = await self.async_client.create(messages=message_dicts, **params)
    624 return self._create_chat_result(response)

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/resources/chat/completions.py:1633, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, presence_penalty, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, extra_headers, extra_query, extra_body, timeout)
   1593 @required_args(["messages", "model"], ["messages", "model", "stream"])
   1594 async def create(
   1595     self,
   (...)
   1630     timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
   1631 ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]:
   1632     validate_response_format(response_format)
-> 1633     return await self._post(
   1634         "/chat/completions",
   1635         body=await async_maybe_transform(
   1636             {
   1637                 "messages": messages,
   1638                 "model": model,
   1639                 "audio": audio,
   1640                 "frequency_penalty": frequency_penalty,
   1641                 "function_call": function_call,
   1642                 "functions": functions,
   1643                 "logit_bias": logit_bias,
   1644                 "logprobs": logprobs,
   1645                 "max_completion_tokens": max_completion_tokens,
   1646                 "max_tokens": max_tokens,
   1647                 "metadata": metadata,
   1648                 "modalities": modalities,
   1649                 "n": n,
   1650                 "parallel_tool_calls": parallel_tool_calls,
   1651                 "presence_penalty": presence_penalty,
   1652                 "response_format": response_format,
   1653                 "seed": seed,
   1654                 "service_tier": service_tier,
   1655                 "stop": stop,
   1656                 "store": store,
   1657                 "stream": stream,
   1658                 "stream_options": stream_options,
   1659                 "temperature": temperature,
   1660                 "tool_choice": tool_choice,
   1661                 "tools": tools,
   1662                 "top_logprobs": top_logprobs,
   1663                 "top_p": top_p,
   1664                 "user": user,
   1665             },
   1666             completion_create_params.CompletionCreateParams,
   1667         ),
   1668         options=make_request_options(
   1669             extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
   1670         ),
   1671         cast_to=ChatCompletion,
   1672         stream=stream or False,
   1673         stream_cls=AsyncStream[ChatCompletionChunk],
   1674     )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1838, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls)
   1824 async def post(
   1825     self,
   1826     path: str,
   (...)
   1833     stream_cls: type[_AsyncStreamT] | None = None,
   1834 ) -> ResponseT | _AsyncStreamT:
   1835     opts = FinalRequestOptions.construct(
   1836         method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options
   1837     )
-> 1838     return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1532, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls, remaining_retries)
   1529 else:
   1530     retries_taken = 0
-> 1532 return await self._request(
   1533     cast_to=cast_to,
   1534     options=options,
   1535     stream=stream,
   1536     stream_cls=stream_cls,
   1537     retries_taken=retries_taken,
   1538 )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1618, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken)
   1616 if remaining_retries > 0 and self._should_retry(err.response):
   1617     await err.response.aclose()
-> 1618     return await self._retry_request(
   1619         input_options,
   1620         cast_to,
   1621         retries_taken=retries_taken,
   1622         response_headers=err.response.headers,
   1623         stream=stream,
   1624         stream_cls=stream_cls,
   1625     )
   1627 # If the response is streamed then we need to explicitly read the response
   1628 # to completion before attempting to access the response text.
   1629 if not err.response.is_closed:

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1665, in AsyncAPIClient._retry_request(self, options, cast_to, retries_taken, response_headers, stream, stream_cls)
   1661 log.info("Retrying request to %s in %f seconds", options.url, timeout)
   1663 await anyio.sleep(timeout)
-> 1665 return await self._request(
   1666     options=options,
   1667     cast_to=cast_to,
   1668     retries_taken=retries_taken + 1,
   1669     stream=stream,
   1670     stream_cls=stream_cls,
   1671 )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1618, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken)
   1616 if remaining_retries > 0 and self._should_retry(err.response):
   1617     await err.response.aclose()
-> 1618     return await self._retry_request(
   1619         input_options,
   1620         cast_to,
   1621         retries_taken=retries_taken,
   1622         response_headers=err.response.headers,
   1623         stream=stream,
   1624         stream_cls=stream_cls,
   1625     )
   1627 # If the response is streamed then we need to explicitly read the response
   1628 # to completion before attempting to access the response text.
   1629 if not err.response.is_closed:

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1665, in AsyncAPIClient._retry_request(self, options, cast_to, retries_taken, response_headers, stream, stream_cls)
   1661 log.info("Retrying request to %s in %f seconds", options.url, timeout)
   1663 await anyio.sleep(timeout)
-> 1665 return await self._request(
   1666     options=options,
   1667     cast_to=cast_to,
   1668     retries_taken=retries_taken + 1,
   1669     stream=stream,
   1670     stream_cls=stream_cls,
   1671 )

File /opt/conda/envs/langfair/lib/python3.9/site-packages/openai/_base_client.py:1633, in AsyncAPIClient._request(self, cast_to, options, stream, stream_cls, retries_taken)
   1630         await err.response.aread()
   1632     log.debug("Re-raising status error")
-> 1633     raise self._make_status_error_from_response(err.response) from None
   1635 return await self._process_response(
   1636     cast_to=cast_to,
   1637     options=options,
   (...)
   1641     retries_taken=retries_taken,
   1642 )

RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 36 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}

To handle this error, we can use LangChain’s InMemoryRateLimiter to throttle the rate at which requests are sent.

[10]:
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=5,    # maximum sustained request rate
    check_every_n_seconds=5,  # how often to check whether a request can be sent
    max_bucket_size=500,      # maximum burst size
)

llm = AzureChatOpenAI(
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    azure_endpoint=API_BASE,
    openai_api_type=API_TYPE,
    openai_api_version=API_VERSION,
    temperature=1, # User to set temperature
    rate_limiter=rate_limiter
)

rg_limited = ResponseGenerator(langchain_llm=llm)
[11]:
responses = await rg_limited.generate_responses(prompts=prompts)
Generating 25 responses per prompt...
Responses successfully generated!
[13]:
pd.DataFrame(responses['data'])
[13]:
prompt response
0 You are required to keep generation given the ... Hosaka in his novel "Generation X." It refers ...
1 You are required to keep generation given the ... The term was coined by the author Maki to desc...
2 You are required to keep generation given the ... The term was coined by the author Maki [genera...
3 You are required to keep generation given the ... The term was coined by the author Maki for the...
4 You are required to keep generation given the ... The term was coined by the author Maki to desc...
... ... ...
24995 You are required to keep generation given the ... discuss the importance of preserving and hand ...
24996 You are required to keep generation given the ... discuss the influence of music on Dave's caree...
24997 You are required to keep generation given the ... discuss the evolution of music in the 1970s. T...
24998 You are required to keep generation given the ... Close delve into the captivating world of musi...
24999 You are required to keep generation given the ... explore the vast influence of legendary musici...

25000 rows × 2 columns