I’ve been building ChatGPT-powered features since its release and have been absolutely amazed by what it can do. Even though I’ve been interacting with GPT every day for a while now, its outputs often still feel like magic. For months, crafting inputs (prompts) for GPT was a fun new way to program this tool, but handling its output was a mess. In my prompting, I experimented with various ways to steer its output to give it structure, and I used basic string functions to slice and dice the results, but it never worked 100% of the time. I evaluated a number of tools that tried to fix this problem, but none of them quite fit our use case.

All of this changed dramatically with the advent of function calling. For those who have been living in a cave, OpenAI fine-tuned GPT-3.5-turbo and GPT-4 so that these models can accept a list of functions, defined using JSON Schema, as input along with a prompt. The model can then select a function and call it as its output. That is, GPT will return an output that answers your prompt, but in the form of a function call with the function name and inputs for its parameters. A program accepting that output can then choose to call that function with those inputs and return the result to GPT for further processing. This seemingly simple functionality was the basis for an entire ecosystem of ChatGPT “agents” that existed prior to OpenAI rolling out function calling; now a user can build a custom agent using just the OpenAI libraries.
I suspect, however, that, like me, a large majority of people using function calling are using it to get structured output out of GPT. For many, allowing an LLM to execute code in your program raises stomach-turning security issues, but for receiving structured output, function calling is extremely useful. In this blog post I will demonstrate how to use function calling to receive structured output, and show that one small improvement, adding an explanation parameter, appears to improve the outputs of function calls and significantly helps with debugging the prompts that generate them.
If you’re trying to solve a problem with ChatGPT, at some point you’ve googled the phrase “prompt engineering.” And, if you got lucky, you came across Lilian Weng’s post on the topic, which saved you from having to read a bunch of verbose academic papers. Lilian does a fantastic job of explaining the technical side of prompt engineering and summarizing the results of academic papers; her work can immediately improve your prompting. However, I want to diverge for a second to describe a way of thinking about prompting LLMs that has recently helped me.
Imagine you’ve hired a high school or college graduate who excelled in every language or liberal arts class. They possess an elite ability to apply language reasoning skills to any task that involves language. However, for your specific task, they lack any prior context on how to perform it. Thus, for them to succeed, you can either teach them all the context you’ve learned about the task, or you can give them a set of criteria to follow, so that all they need to do is apply the criteria and use their language reasoning skills.
Since you’ve hired this person with no background or context on what you do, one thing you might do to verify their results is ask them to provide a chain of reasoning, or rationale, for how they arrived at their answer. For the new employee, being forced to write down a chain of reasoning will likely help them arrive at an accurate answer. For you, their manager, it will help you understand whether they’ve arrived at an accurate answer and whether there are any gaps or issues in the criteria you provided them. Furthermore, for any third party you might show their work to, an explanation of how the answer was found reduces confusion about the answer itself.
Of course, in our example the employee is ChatGPT, and this style of requesting a chain of reasoning is now a common technique for improving results. Chain-of-thought (CoT) prompting was introduced by Wei et al. 2022, as well as Kojima et al. 2022 and Zhou et al. 2022, and was later made very popular by the ReAct paper of Yao et al. 2023. In its most basic form, simply asking ChatGPT to think step-by-step before providing an answer fits the bill, though these papers introduce a much more nuanced perspective.
Prior to function calling, I was using chain-of-thought prompting not only to obtain more accurate answers, but also to debug edge cases in my prompts. When you give ChatGPT criteria to follow to create an answer, there will often be an edge case you have not considered, and ChatGPT’s explanation will reveal the gap in your criteria. This explanation arrived in the text blob that came back from ChatGPT: if you stored the raw prompt and raw answer and examined them whenever an answer didn’t make much sense, the explanation would often elucidate the gap in your criteria. However, once you start using function calling, you likely will not be getting a text blob back.
Here are a few examples from my work with ChatGPT:
- Return the top-k most similar questions to a given question,
- Create a synthesized answer between two answers to the same question,
- Find a specific column type in tabular data,
- Answer a new question based on similar previous questions’ answers,
- Answer a question using related documentation to find the answer.
In all of these examples, once you move to function calling, ChatGPT starts filling a parameter with its answer to the request. However, the explanation of how it arrived at that answer, which was a regular feature of the old text blob, is lost. As a result, I started adding an explanation parameter to every single function call and modifying the prompt so ChatGPT knew to dump an explanation into this parameter. Not only did this subjectively seem to improve the answers overall (in the same way CoT is supposed to), it also gave me a way to debug my prompts when answers went haywire.
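In its most generic form, the pattern is just “take whatever function definition you were already sending and tack on a required explanation parameter.” Here is a minimal sketch of that; `add_explanation_param` is a hypothetical helper of mine, not anything from the OpenAI SDK:

```python
import copy


def add_explanation_param(function_spec: dict) -> dict:
    """Return a copy of a function definition (JSON Schema) with a required
    'explanation' parameter added alongside the existing parameters."""
    spec = copy.deepcopy(function_spec)
    params = spec.setdefault("parameters", {"type": "object", "properties": {}})
    params.setdefault("properties", {})["explanation"] = {
        "type": "string",
        "description": "An explanation of how the answer was created",
    }
    required = params.setdefault("required", [])
    if "explanation" not in required:
        required.append("explanation")
    return spec
```

Let’s look at a more concrete example to better understand what I mean.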
We will look at an example of answering a new question based on similar previous questions’ answers. Note that “similarity” is measured via an approximate nearest neighbors search over previous questions that have been embedded as vectors; a rough sketch of that retrieval step is below.
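The retrieval step is not the focus of this post, but for completeness, here is roughly what it might look like. This sketch uses the 2023-era (pre-1.0) `openai` Python SDK and brute-force cosine similarity in place of a real approximate-nearest-neighbor index; the function name, the embedding model, and the default of k=5 are illustrative assumptions rather than the exact setup used here.

```python
import numpy as np
import openai


def top_k_similar_questions(new_question: str, old_questions: list[str], k: int = 5) -> list[str]:
    """Embed the new question and the old questions, then return the k old
    questions most similar to the new one by cosine similarity (brute force,
    not a true approximate-nearest-neighbor index)."""
    resp = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=[new_question] + old_questions,
    )
    vectors = np.array([item["embedding"] for item in resp["data"]])
    query, corpus = vectors[0], vectors[1:]
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    top_indices = np.argsort(scores)[::-1][:k]
    return [old_questions[i] for i in top_indices]
```

With the five most similar questions and their responses retrieved, let’s take a look at the function call we are sending to GPT: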
{"name": "get_answer_and_explanation",
"description": "Returns the answer to the new question and an explanation of how the answer was created",
"parameters": {
"type": "object",
"properties": {
"answer": {
"type": "string",
"description": "The answer to the new question",
},
"explanation": {
"type": "string",
"description": "An explanation of how the answer was created",
},
},
"required": ["answer", "explanation"],
},
}
As you can see, this is pretty straightforward. We just add a second parameter to the function, called “explanation”, and give it a succinct description. GPT will create an answer to our new question, fill in the answer parameter, and then explain how it arrived at that answer in the explanation parameter.
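Here is roughly what the round trip looks like in code. This is a minimal sketch using the 2023-era (pre-1.0) `openai` Python SDK; the model name, the placeholder prompt strings, and forcing `function_call` to this specific function are assumptions for illustration, not a prescription.

```python
import json

import openai

# The function definition shown above.
get_answer_and_explanation = {
    "name": "get_answer_and_explanation",
    "description": "Returns the answer to the new question and an explanation of how the answer was created",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {"type": "string", "description": "The answer to the new question"},
            "explanation": {"type": "string", "description": "An explanation of how the answer was created"},
        },
        "required": ["answer", "explanation"],
    },
}

system_prompt = "..."  # the full system prompt, assembled in the next section
new_question = "Does your company have cyber insurance?"

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": new_question},
    ],
    functions=[get_answer_and_explanation],
    # Force GPT to respond by calling this function rather than with free text.
    function_call={"name": "get_answer_and_explanation"},
)

# The structured output comes back as a JSON string of function arguments.
arguments = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
print(arguments["answer"])
print(arguments["explanation"])
```

Next, let’s see how we connect this function call back to the prompt.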
Recall that the goal is to steer ChatGPT by providing a list of criteria it can follow, so that it can apply its incredible reasoning abilities without being given a mass of in-context examples (my main reason for avoiding the latter is token rate limits). Here is the beginning of the system prompt:
You are a helpful assistant that answers questions about a company based on
similar questions the company has responded to in the past. Below are 5
questions the company has responded to in the past. After those 5 questions,
there will be a new question for you to answer.
This provides the most basic steering on the task ChatGPT is to perform. Next, also in the system prompt, we provide the list of criteria that will help it understand how to apply its reasoning abilities and adjudicate edge cases.
Please think step-by-step when answering the new question and use the criteria
listed below as guidelines for answering the new question:
a.) Use the old 5 questions and their responses as context to answer the
new question.
b.) The old 5 questions are sorted by similarity to your new question.
However, the most similar questions may not have the best responses
to answer your new question. Analyze the old questions and their
answers and then pick one of the following strategies for answering
the new question based on your analysis:
b1.) Two or more old questions are similar to the new question and have
good responses. Synthesize these responses into an answer for the
new question.
b2.) Two or more old questions are similar to the new question,
but their responses disagree or contradict each other.
Pick the response that answers the new question the best.
b3.) One old question has the best response for answering the new question.
Discard the other old questions and use that response to answer
the new question.
b4.) The old questions are sufficiently dissimilar to the new question,
answer the new question with “cannot be answered”
c.) Additional criteria on how to evaluate a response:
c1.) A good response is usually sentence(s) of English. Note it can contain
grammatical and spelling mistakes.
c2.) A good response could be considered a response to its question.
c3.) The voice of a good response should be written as if it is a
human at a company responding to the question. Sometimes it’s as
simple as “Yes” or “No” and other times it includes additional
information.
Now, I didn’t come up with these 10 criteria in a vacuum. The list was built up over successive requests to ChatGPT and analysis of its responses. And, as I’ve already argued at length, the explanation parameter was the single best method for understanding what ChatGPT was doing and thereby enhancing this list of criteria. Finally, also in the system prompt, we provide the formatting instructions that help ChatGPT correctly call the function:
Your answer should be the answer to the new question and an
explanation of how you arrived at that answer that includes
your step-by-step reasoning using the criteria above.
Use this format:
Answer: <your-answer>
Explanation: <your-explanation>
You can see here that the end of the system prompt is what connects the whole prompt back up to the function call. Now GPT can easily see how its response to the prompt can match the function’s input parameters.
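For completeness, here is one way the full system prompt might be stitched together from the pieces quoted above and the retrieved question/response pairs. This is a sketch under assumptions: exactly where the old questions sit relative to the criteria, and whether they go in the system or user message, are design choices; this version follows the “Below are 5 questions” wording and keeps everything except the new question in the system prompt.

```python
# Abbreviated stand-ins for the three prompt pieces quoted above.
TASK = "You are a helpful assistant that answers questions about a company ..."
CRITERIA = "Please think step-by-step when answering the new question ..."
FORMAT = "Your answer should be the answer to the new question and an explanation ..."


def build_system_prompt(old_qa_pairs: list[tuple[str, str]]) -> str:
    """Assemble the system prompt: task description, the retrieved
    question/response pairs, the criteria list, and the format instructions."""
    qa_block = "\n\n".join(
        f"Question {i + 1}: {q}\nResponse {i + 1}: {r}"
        for i, (q, r) in enumerate(old_qa_pairs)
    )
    return "\n\n".join([TASK, qa_block, CRITERIA, FORMAT])
```

The new question itself then goes into the user message, as in the earlier sketch.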
Given what we know about the function call and the prompt, what do some of these explanations look like? Here is one of the simpler examples, where GPT-4’s explanation provides much more detail on how it created the answer:
"answer": "Yes, our company does have cyber insurance.","explanation": "The new question is 'does your company have cyber insurance?'
The most similar old questions are 'does your company have cyber insurance?'
and 'how much coverage do you have in cyber insurance?' The responses to
these old questions confirm that the company does have cyber insurance.
Therefore, the strategy b3 is used: One old question has the best response
for answering the new question. The response to the old question 'does your
company have cyber insurance?' is used to answer the new question."
I find it so cool when GPT-4 highlights which strategy it’s chosen to create an answer:
“explanation”: “The new question is asking about the amount of coverage
the company has in cyber insurance. Looking at the old questions, question
2 and question 3 are the most similar to the new question and both provide
a clear response about the amount of coverage the company has in cyber
insurance. The company maintains a Cyber Liability with a combined limit
of $5,000,000 each occurrence and $10,000,000 annual aggregate. Therefore,
I used strategy b1, synthesizing these responses into an answer for the new
question.”
Here is another where it synthesizes across many questions:
“explanation”: “The answer is synthesized from multiple previous questions
and their responses. Questions 2, 3, 6, 8, 9, 11, 14, 17, 18, 19, 20, and 21
all provide information about the various tools and infrastructure pieces
that the company monitors. These responses collectively provide a
comprehensive answer to the new question.”
How about an older one where the explanation helped me to improve the list of criteria?
“answer”: “Cannot be answered.”,
“explanation”: “I apologize, but I cannot provide an answer based on the
given responses. The two answers provided are contradictory, and it is
unclear which one accurately reflects the types of hygiene maintenance
required for the company. Therefore, I cannot answer the question with
certainty.”
Without this explanation, I would have had no idea why GPT-4 was choosing “Cannot be answered” as the answer, and I likely would not have been able to debug what was happening. This situation is typical with function calling and is why I think the `explanation` parameter is so helpful. It is easy to see how an explanation like this would guide one to engineer the prompt differently. I hope this blog has shown how this small change to one’s function calls can improve quality of life with respect to ChatGPT outputs.
Finally, I should also address the paper "The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning" by Ye & Durrett 2022. Here, the authors provided few-shot prompts to GPT-3.5 that included explanations and found that when GPT-3.5 hallucinated an answer, it would often hallucinate an explanation that entailed that answer (it was less likely to produce an explanation that did not entail the answer). Any regular user of ChatGPT should know that it is possible for GPT to hallucinate both its answer and its explanation, so it is always good to check the full prompt and response and make sure this is not happening. That said, it is reasonable to presume that GPT-4 has gotten better at this sort of thing. Further, my instinct from using GPT models is that few-shot examples often cause GPT to try to fit its responses to those examples, even at the cost of hallucinating. That is, GPT would rather produce an explanation that closely models the few-shot explanations but is a hallucination than one that does not model them but is factually accurate. I do not have any research to back up this instinct, so take it with a grain of salt.

At any rate, one key difference between the paper and what I’ve described above is that there are no few-shot examples on the explanation parameter; we only attempt to constrain it via the criteria list. My subjective take from repeated use of this setup is that GPT-4’s explanations appear to always entail the answer and almost always cite which criteria it chose to create the answer, which raises my confidence in their factual accuracy. But any GPT user should always be checking answers and explanations regularly for hallucinations.