Aug 1, 2023

Controlling Output Format at the Prompt

This article describes how to control output formatting through prompts when using ChatGPT's API for advanced natural language processing. In particular, it focuses on how well the model interprets output-format specifications given in prompts, comparing natural language, TypeScript type expressions, and Zod schema expressions.

Motivation

With the advent of ChatGPT's API, it has become easy to create applications using advanced NLP. However, when trying to create something that can actually be used at the product level, it is necessary to control the non-deterministic behavior of the LLM, which is quite difficult.

One of the challenges is getting the LLM's output into a form that subsequent programs can parse. More concretely, since JSON is often the output format of choice: what prompt best makes the LLM output JSON with the correct format and structure with high probability?

The simple answer is to tell the LLM to output JSON with the correct format and structure. This is sufficient as long as the structure of the data to be output is simple, but as it becomes more complex, the natural language specification tends to be long and ambiguous. Therefore, the use of a widely used data structure notation such as TypeScript types or Zod schema definitions comes to mind.

When "choosing an appropriate notation for conveying the expected output structure in a prompt," the question is whether the model receiving the prompt correctly interprets that notation. A further question is how easily the LLM's attention can be directed to the constraints expressed within the notation.

Now, in our (SparkleAI) products, we have often adopted TypeScript type definitions, a notation engineers are accustomed to, without any particular verification, and this has not caused notable problems. However, some constraints are difficult to express in types (for example, limits on the number of characters), so it is worth evaluating whether a schema definition such as Zod is more appropriate in this respect.
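As a concrete illustration of that limitation, here is a minimal sketch using the QuestionSheet shape that appears later in this article: a TypeScript type pins down the structure, but length and item-count limits can only ride along as comments.

```typescript
// Sketch: the expected output structure as a TypeScript type.
// The structure is machine-checkable, but the character/count limits
// below survive only as comments -- the type system cannot express them.
type Question = {
  question: string; // intended limit: at most 30 characters
  answer: string;   // intended limit: at most 30 characters
};
type QuestionSheet = {
  questions: Question[]; // intended minimum: at least 7 items
};

// The compiler accepts this value even though it violates both limits.
const sample: QuestionSheet = {
  questions: [
    {
      question: "Summarize the main argument of the passage in your own words",
      answer: "N/A",
    },
  ],
};
```

A Zod schema, by contrast, can state these constraints directly as `z.string().max(30)` and `z.array(...).min(7)`.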

In this blog, I will investigate and summarize the rate at which ChatGPT succeeds in producing correctly formatted output when the output format is specified in natural language, TypeScript, or Zod.

Subject of evaluation

Natural language

  • Purpose: communication
  • Notes: natural language is ambiguous, so the output may not come back in the specified type

TypeScript type expression

  • Purpose: type checking of program source code
  • Notes: many programs are written in TS, so it is likely to be interpreted correctly

Zod schema expression

  • Purpose: schema validation at program runtime
  • Notes: more specific than type expressions, but does that specificity affect the result?

Prompt for evaluation

When the number of characters or the number of items to output is specified, ChatGPT often does not follow the constraint, so prompts are usually engineered to work around such constraints rather than state them directly. This time, however, since we also want to confirm how character-count and item-count constraints expressed in Zod affect the results, we deliberately use a question-generation prompt with character-count constraints as the subject matter.

We will use the following prompt, swapping the output format definition between natural language, a TypeScript type expression, and a Zod schema expression.

# Instruction
For the following text, create comprehension questions for middle school Japanese students.
Output the result as a JSON object that satisfies the QuestionSheet defined below in zod.
Do not include line breaks or tab quotes in the JSON string.
```
const Question = z.object({
  question: z.string().max(30),
  answer: z.string().max(30),
});
const QuestionSheet = z.object({
  questions: z.array(Question).min(7)
});
```

# Text
```
{document}
```

# Output
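The block above is the Zod variant. For reference, the natural-language (NL) variant replaces the fenced definition with a prose description; the wording below is a hypothetical sketch of such a specification, not the exact text used in the experiment.

# Instruction
For the following text, create comprehension questions for middle school Japanese students.
Output the result as a JSON object with a "questions" key containing an array of at least 7 objects.
Each object must have a "question" string and an "answer" string, each at most 30 characters.
Do not include line breaks or tab quotes in the JSON string.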

Evaluation Method

For each of the three output-format specifications (NL, TS, and ZOD), we give the question-generation prompt 30 times for each of 10 different documents (300 runs in total per specification) and tabulate the following metrics over the results.

  • parse: percentage of outputs that can be parsed as JSON after simple preprocessing
  • schema: percentage of parsed outputs whose JSON structure matches the specified structure
  • count: among structurally correct outputs, percentage that meet the item-count requirement
  • length: among structurally correct outputs, percentage of strings that meet the length requirement
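The four metrics can be sketched as checks applied to a single raw model output. This is a hypothetical implementation: the article does not specify the exact preprocessing, so dropping code-fence lines and trimming whitespace stand in for "simple preprocessing."

```typescript
// Hypothetical per-output checks behind the four metrics.
type Question = { question: string; answer: string };
type QuestionSheet = { questions: Question[] };

// parse: can the raw output be turned into JSON after preprocessing?
function tryParse(raw: string): unknown {
  const cleaned = raw
    .split("\n")
    .filter((line) => !line.trim().startsWith("`")) // drop fence lines (assumption)
    .join("\n")
    .trim();
  try {
    return JSON.parse(cleaned);
  } catch {
    return null;
  }
}

// schema: does the parsed value match the expected structure?
function matchesSchema(v: unknown): v is QuestionSheet {
  if (typeof v !== "object" || v === null) return false;
  const qs = (v as { questions?: unknown }).questions;
  return (
    Array.isArray(qs) &&
    qs.every(
      (q) => typeof q?.question === "string" && typeof q?.answer === "string"
    )
  );
}

// count: does the sheet contain at least 7 questions (the min(7) constraint)?
function meetsCount(sheet: QuestionSheet): boolean {
  return sheet.questions.length >= 7;
}

// length: how many strings respect the 30-character limit?
function lengthStats(sheet: QuestionSheet): { ok: number; total: number } {
  const strings = sheet.questions.flatMap((q) => [q.question, q.answer]);
  return {
    ok: strings.filter((s) => s.length <= 30).length,
    total: strings.length,
  };
}
```

Aggregating these per-output results over the 300 runs per specification yields the figures reported below.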

Assuming that differences between prompts would show up more clearly in settings with higher generation diversity, we fixed the temperature at 1.2, a realistic upper limit used in tasks that call for diverse results.

Result

NL

  • parse: 0.84 (251/300)
  • schema: 0.57 (143/251)
  • count: 0.8 (114/143)
  • length: 0.8 (1556/1950)

TS

  • parse: 0.95 (284/300)
  • schema: 1.0 (283/284)
  • count: 0.73 (206/283)
  • length: 0.76 (2861/3788)

ZOD

  • parse: 0.94 (278/296)
  • schema: 1.0 (278/278)
  • count: 0.82 (228/278)
  • length: 0.79 (2935/3736)

The large difference between NL and TS/ZOD in the parse and schema metrics indicates that specifying the structure of the output in an artificial language is more controllable than using natural language.

Beyond that, there is no significant difference between TypeScript and Zod in the parse and schema metrics. On the other hand, TypeScript's count and length scores are worse than those of Zod and natural language. It may be that the way count and character constraints are incorporated alongside TypeScript type expressions draws less of the model's attention than the schema specification does.

From the above, schema expressions perform best numerically as the easiest way to produce the targeted structure. However, besides Zod there are other schema libraries such as Yup, io-ts, and joi, whose notations are similar but slightly different, which raises the concern of whether each can be controlled as intended. Schema notations are also less universal than TypeScript, so note that their descriptions may change in the future.

Conclusion

This article compares and contrasts the use of natural language, TypeScript, and Zod to control the output format of ChatGPT. Its main focus is on how well these methods allow ChatGPT to generate accurate data structures, and how well they can control ChatGPT's attention.

As a result, we found that using a programming language to specify the structure of the output is more accurate than using natural language. Specifically, the rate at which the output is successfully parsed and the schema matches our specified structure is higher for both TypeScript and Zod than for natural language, and there is no significant difference between the two.

However, we found that TypeScript was inferior to Zod and natural language when it came to constraining the number of items and the length of strings. This may be due to the fact that TypeScript's type expression makes it difficult to include constraints on count and length, making it harder to attract attention compared to the Schema specification.

From this point of view, schema expressions are appropriate in that they most reliably produce the targeted structure. However, schema expressions are less universal than TypeScript, and their notations may change in the future, so care should be taken in this regard.

Based on the above, we believe that TypeScript type expressions are the most useful for specifying JSON output and controlling its structure. However, since TypeScript type expressions are slightly inferior to schema expressions at constraining count and length, we recommend reinforcing these constraints with plain-text instructions in the prompt.
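That recommendation might look like the following prompt sketch (hypothetical wording), pairing a TypeScript type for structure with plain-text instructions for the count and length constraints:

# Instruction
For the following text, create comprehension questions for middle school Japanese students.
Output the result as a JSON object that satisfies the QuestionSheet type defined below in TypeScript.
Create at least 7 questions, and keep each question and answer within 30 characters.
Do not include line breaks or tab quotes in the JSON string.
```
type Question = {
  question: string;
  answer: string;
};
type QuestionSheet = {
  questions: Question[];
};
```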