Improving Prompt Consistency with Structured Generations

Context: Evaluation Sensitivity to Format Changes

It has become increasingly clear that LLM benchmark performance depends closely, and somewhat surprisingly, on the format of the prompt itself, even though a number of methods have been introduced over the years to reduce prompt-related variance. For example, when we evaluate models in a few-shot setting, we provide format examples to the model to force a specific pattern in the output; when we compare the log-likelihood of plausible answers instead of allowing free-form generation, we attempt to constrain the answer space.

The Leaderboards and Evals team demonstrated this by comparing 8 different prompt formats on a well-known task, MMLU (using 4 subsets of the task). These prompt variations were given to 5 different models (chosen because they were SOTA at the time for their size, and covered a variety of tokenizers and languages). Scores were computed using a log-probability evaluation, where the most probable answer is taken as the model's prediction, a classic approach for multi-choice tasks.
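As a rough illustration of this kind of log-probability evaluation, the sketch below picks the highest-scoring choice for each question and computes accuracy. The function name and all numbers are made up for the example; they are not values from the experiments.

```python
def log_prob_accuracy(choice_log_probs, gold_indices):
    """Accuracy when the prediction is the choice with the highest log-probability.

    choice_log_probs: one list of log-probabilities per question (one value per choice).
    gold_indices: index of the correct choice for each question.
    """
    correct = 0
    for scores, gold in zip(choice_log_probs, gold_indices):
        # Index of the most probable choice for this question
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == gold)
    return correct / len(gold_indices)


# Hypothetical log-probabilities for three four-choice questions (illustration only)
example_scores = [
    [-1.2, -1.5, -0.8, -2.1],
    [-0.4, -1.9, -2.2, -3.0],
    [-2.0, -0.9, -1.1, -1.6],
]
example_gold = [2, 0, 2]

accuracy = log_prob_accuracy(example_scores, example_gold)
print(accuracy)
```

Because only the listed choices are scored, the model can never answer outside the answer space, which is exactly the constraint described above.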

Exploring Prompt Variations

Let's look at the different formats in more detail, by using the first question of the global_facts subset of MMLU.

Question: “As of 2016, about what percentage of adults aged 18 years or older were overweight?”

Choices: [ "10%", "20%", "40%", "80%" ]

Correct choice: “40%”

Without choices in the prompt:

As of 2016, about what percentage of adults aged 18 years or older were overweight?

Q: As of 2016, about what percentage of adults aged 18 years or older were overweight? A:

Question: As of 2016, about what percentage of adults aged 18 years or older were overweight? Answer:

With choices in the prompt:

Question: As of 2016, about what percentage of adults aged 18 years or older were overweight? Choices: 10% 20% 40% 80% Answer:

Question: As of 2016, about what percentage of adults aged 18 years or older were overweight? Choices: A. 10% B. 20% C. 40% D. 80% Answer:

Question: As of 2016, about what percentage of adults aged 18 years or older were overweight? Choices: (A) 10% (B) 20% (C) 40% (D) 80% Answer:
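To make the differences between these variants concrete, here is a small sketch that builds the six prompts shown above from a question and its choices. The helper name `build_prompts` is hypothetical, and the exact whitespace and line breaks of the original prompts are an assumption (single spaces are used here).

```python
def build_prompts(question, choices):
    """Return the six prompt variants listed above, in the same order."""
    letters = "ABCD"
    bare = " ".join(choices)
    dotted = " ".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
    parenthesized = " ".join(f"({letter}) {choice}" for letter, choice in zip(letters, choices))
    return [
        # Without choices in the prompt
        question,
        f"Q: {question} A:",
        f"Question: {question} Answer:",
        # With choices in the prompt
        f"Question: {question} Choices: {bare} Answer:",
        f"Question: {question} Choices: {dotted} Answer:",
        f"Question: {question} Choices: {parenthesized} Answer:",
    ]


question = "As of 2016, about what percentage of adults aged 18 years or older were overweight?"
choices = ["10%", "20%", "40%", "80%"]
prompts = build_prompts(question, choices)
for prompt in prompts:
    print(prompt)
```

Every variant carries the same question, and the last three carry the same choices; only the surrounding tags and labels change.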

Log probs of 10%, 20%, 40%, 80%:

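import numpy as np

# Illustrative log-probabilities for the answers "10%", "20%", "40%", "80%"
log_probs = np.array([-1.2, -1.5, -0.8, -2.1])
answers = ["10%", "20%", "40%", "80%"]
# The prediction is the answer with the highest log-probability
print(answers[int(np.argmax(log_probs))])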

Log probs of 10%, 20%, 40%, 80% vs A, B, C, D:

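import numpy as np

# Illustrative log-probabilities for the four answer texts
log_probs = np.array([-1.2, -1.5, -0.8, -2.1])
choices = np.array([0, 1, 2, 3])
# Log-ratio computed directly in log space (no exp/log round trip).
# `choices` is the identity index here, so this prints all zeros.
print(log_probs - log_probs[choices])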

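Log probs of 10%, 20%, 40%, 80% vs (A), (B), (C), (D):

import numpy as np

# Illustrative log-probabilities for the four answer texts
log_probs = np.array([-1.2, -1.5, -0.8, -2.1])
choices = np.array([0, 1, 2, 3])
# Log-ratio computed directly in log space (no exp/log round trip).
# `choices` is the identity index here, so this prints all zeros.
print(log_probs - log_probs[choices])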

Prompts either contain just the question, or some tags to indicate that we are in a question/answer format, and possibly the choices in the prompt. In all cases, evaluations compare the log-likelihood of the possible choices only. All these formats appear in the evaluation literature, and should contain virtually the same amount of information in each row. However, just below, you can see the wide variation in performance across these theoretically superficial changes!

Each model sees its performance vary by around 10 points. The most extreme example is Qwen1.5-7B, which drops to an accuracy of 22.9% with the 7th prompt variation (mostly due to a tokenizer issue), even though it reaches an accuracy of up to 51.2% with another prompt carrying essentially the same information.

Impact on Ranking

In isolation, a change in score is not necessarily a big deal so long as the ranking is consistent. However, as we can see in the next plot, ranking is impacted by these changes:

No model is consistently ranked across prompts, even though the only difference between the prompts is their format, not the information itself. This means that if the authors of Gemma-7b wanted to show that their model was superior to Mistral-7B-v0.1, they could do so simply by choosing the right prompt.
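One way to check for this effect in your own evaluations is to rank the models separately for each prompt format and test whether the rankings agree. The sketch below does this; the function names, model names, and accuracies are placeholders, not results from the experiments.

```python
def rank_models(scores_by_prompt):
    """For each prompt format, return model names sorted from best to worst score."""
    return {
        prompt: sorted(scores, key=scores.get, reverse=True)
        for prompt, scores in scores_by_prompt.items()
    }


def ranking_is_stable(scores_by_prompt):
    """True if every prompt format produces the same model ranking."""
    rankings = list(rank_models(scores_by_prompt).values())
    return all(ranking == rankings[0] for ranking in rankings)


# Placeholder accuracies for two hypothetical models (illustration only)
example = {
    "format_1": {"model_a": 0.52, "model_b": 0.48},
    "format_2": {"model_a": 0.45, "model_b": 0.50},
}

print(rank_models(example))
print(ranking_is_stable(example))
```

In this toy case the two formats disagree on the winner, so a leaderboard built on either format alone would tell a different story.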

Additional Sources of Variance

However, prompt format is not the only source of variance in model scores. In extended experiments, we evaluated the same models, with the same prompt formats, using the exact same few-shot samples shuffled into different orders before the prompt (A/B/C/D/E vs C/D/A/B/E, for example). The following figure shows the difference in model scores between these two few-shot orderings: we observe a difference of up to 3 points in performance for the same model/prompt combination!

Focusing on Output, Not Input

While FormatSpread, a method that reports the spread of scores across plausible prompt formats, is a great attempt to make leaderboards more fair and honest, what we really want as practical users of LLMs is prompt consistency. That is, we would like to find some way to reduce this variance among prompts.

At .txt, we focus on improving and better understanding structured generation, which is when the output of a model is constrained to follow a specific structure. Our library, Outlines, allows us to structure the output of an LLM by defining a regular expression or a context-free grammar (we give examples below).

Initial Exploration: GSM8K 1-8 Shot Prompting

To test this further, we wanted to explore the behavior of two very similar but strong models in the 7B parameter range: Mistral-7B-v0.1 and Zephyr-7B-beta. We chose this pair so that we could study not only the variance in individual outcomes, but also the changes in relative ranking. We use the GSM8K task, which is a set of grade school math word problems.

Here is the basic format of a GSM8K 1-shot prompt with the implied structure highlighted.

In order to consistently generate correctly structured answers we create a regular expression that matches the structure we see inherent in the original prompt format. The following regex is used in Outlines to define the structure for generation:

We can see in the regex that we allow the model to reason for anywhere from 200 to 700 characters, after which it must declare “The answer is” and then reply with a number of up to 10 digits (that cannot start with 0).
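For readers who want to experiment, here is one pattern consistent with that description, written with Python's standard `re` module. It is a reconstruction from the prose only and is not necessarily the exact regex used in the experiments; the names `ANSWER_PATTERN` and `is_well_formed` are hypothetical.

```python
import re

# Assumed pattern: 200 to 700 characters of free-form reasoning (any characters,
# including newlines), then the literal "The answer is ", then a number of
# 1 to 10 digits whose first digit is not 0.
ANSWER_PATTERN = re.compile(r"[\s\S]{200,700}The answer is [1-9][0-9]{0,9}")


def is_well_formed(generation: str) -> bool:
    """Check whether a whole generation follows the assumed structure."""
    return ANSWER_PATTERN.fullmatch(generation) is not None


# A 300-character stand-in for the model's reasoning
reasoning = "Step: add the two quantities. " * 10

print(is_well_formed(reasoning + "The answer is 42"))    # long enough, valid number
print(is_well_formed("Too short. The answer is 42"))     # reasoning under 200 characters
print(is_well_formed(reasoning + "The answer is 042"))   # number starts with 0
```

With structured generation the pattern is enforced while the model is sampling, so every output matches by construction; the check above only verifies a finished string after the fact.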

Preliminary Results

For our first experiment, we continued exploring the GSM8K dataset and iterated over 1- through 8-shot prompting. The results, shown below, were very compelling.

There are two major features we see in this figure: variance in performance across the n-shot setups was greatly reduced, and there were no instances where the ranking swapped (Mistral consistently leads over Zephyr). It’s also worth pointing out that 1-shot structured performance is substantially better than 1-shot unstructured performance, and on par with 5-shot.

Diving Deeper: GPQA n-Shot and Shot Order Variations

For the next experiment we wanted to vary both the number of shots and the order of the shots. Order was controlled by setting the seed used for shuffling the examples. As mentioned previously, only the first n shots are shuffled to keep the information consistent between prompts. This means that all 1-shot prompts are the same across seeds. Here’s an example of the shot order for 4-shot:

| seed | 4-shot order |
|---|---|
| 42 | 2-1-3-0 |
| 1337 | 1-0-3-2 |
| 1981 | 3-2-0-1 |
| 1992 | 0-3-1-2 |
| 12345 | 1-0-2-3 |
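A seed-controlled shuffle of this kind can be sketched as follows. This uses Python's standard `random` module and a hypothetical helper `shot_order`; the shuffling routine actually used in the experiments is not specified here, so this sketch is not expected to reproduce the exact orders listed above.

```python
import random


def shot_order(n_shots: int, seed: int) -> list[int]:
    """Shuffle the indices of the first n_shots examples with a fixed seed."""
    order = list(range(n_shots))
    # A dedicated Random instance keeps the shuffle reproducible and isolated
    random.Random(seed).shuffle(order)
    return order


for seed in (42, 1337, 1981, 1992, 12345):
    print(seed, shot_order(4, seed))
```

Because only the first n examples are shuffled, every seed yields the same set of shots for a given n, and the 1-shot order is always `[0]`.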

Exploring GPQA

Additionally, to explore how transferable these results were, we changed the task to Graduate-Level Google-Proof Q&A Benchmark (GPQA). GPQA is a hard knowledge multi-choice evaluation task. Below is the prompt format and highlighted structure.

For this next experiment we specifically use the ‘diamond’ subset, which consists of curated, cleaned-up, high-quality questions. Of the 198 questions in this dataset, we reserve 8 for n-shot prompting (though we only ever use the first 5) and evaluate on the remaining 190 questions.

Results

Visualized below we can see a grid representing the accuracy achieved for all the possible combinations for shot seed and n, for the two models, both without (left) and with (right) structured generation.

One thing which immediately stands out is that the structured output tends to score higher than the unstructured output across the board. We see the mean of each grid for structured and unstructured below:

Mean of results across prompt seed and n-shot

| model | unstructured | structured |
|---|---|---|
| Mistral-7B-v0.1 | 0.2360 | 0.2935 |
| Zephyr-7b-beta | 0.2387 | 0.3048 |

Additionally, across all the values in the grid we also find reduced variance when comparing the structured with unstructured generation.

Standard deviation in results across prompt seed and n-shot

| model | unstructured | structured |
|---|---|---|
| Mistral-7B-v0.1 | 0.0213 | 0.0202 |
| Zephyr-7b-beta | 0.0273 | 0.0180 |
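To read the two tables side by side, the short sketch below computes the change from unstructured to structured generation for each model. The numbers are copied from the tables above; the dictionary layout and the `delta` helper are only for illustration.

```python
# (unstructured, structured) pairs copied from the two tables above
results = {
    "Mistral-7B-v0.1": {"mean": (0.2360, 0.2935), "std": (0.0213, 0.0202)},
    "Zephyr-7b-beta": {"mean": (0.2387, 0.3048), "std": (0.0273, 0.0180)},
}


def delta(pair):
    """Structured minus unstructured, rounded to four decimal places."""
    unstructured, structured = pair
    return round(structured - unstructured, 4)


for model, stats in results.items():
    print(model, "mean change:", delta(stats["mean"]), "std change:", delta(stats["std"]))
```

Mean accuracy rises by about 5.8 points for Mistral and 6.6 points for Zephyr, while the standard deviation falls by about 0.1 and 0.9 points respectively, so the variance reduction is much more pronounced for Zephyr.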

Impact on Ranking

While increased expected performance and decreased variance are great properties to have, what we really want to understand is the impact on ranking. In the next plot we examine these grids in terms of which of the two models would be declared a winner:

A: Zephyr-7b-beta

B: Mistral-7B-v0.1

“-”: tie

As we can see from these images, there is a major improvement in the consistency of calling a winner when structured generation is applied. These results are consistent with our GSM8K findings across the various n-shot settings.
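The winner grid can be derived from two accuracy grids with a few lines of code. The sketch below uses the same "A", "B", "-" labels as the legend; the function names and the grid values are placeholders for illustration, not the measured accuracies.

```python
def call_winner(score_a, score_b, tol=1e-9):
    """Return "A" or "B" for the higher score, or "-" for a tie."""
    if abs(score_a - score_b) <= tol:
        return "-"
    return "A" if score_a > score_b else "B"


def winner_grid(grid_a, grid_b):
    """Compare two accuracy grids (rows: seeds, columns: n-shot) cell by cell."""
    return [
        [call_winner(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(grid_a, grid_b)
    ]


def consistency(grid):
    """Fraction of cells that agree with the most frequent outcome."""
    flat = [cell for row in grid for cell in row]
    most_common = max(set(flat), key=flat.count)
    return flat.count(most_common) / len(flat)


# Placeholder accuracies for models A and B (illustration only)
grid_a = [[0.30, 0.31, 0.29], [0.30, 0.28, 0.32]]
grid_b = [[0.29, 0.30, 0.29], [0.31, 0.27, 0.30]]

winners = winner_grid(grid_a, grid_b)
print(winners)
print(consistency(winners))
```

A consistency close to 1.0 means the same model wins regardless of shot count and shot order, which is the property structured generation appears to improve.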

Conclusion and Future Work

While these results are incredibly promising, we still need to test them across more models and more tasks. What we’ve seen so far is that structured generation could prove to be an essential part of evaluation. Simultaneously increasing the expected score and decreasing the variance across prompt changes is a very promising result that deserves further research.

Source: https://huggingface.co/blog/evaluation-structured-outputs
