AI
2023-08-01

Controlling Output Format at the Prompt - Comparing Natural Language/TypeScript/Zod

This article examines how to control output formatting via prompts when using ChatGPT's API for advanced natural language processing. In particular, it focuses on whether the model interprets each notation correctly, comparing natural language, TypeScript type expressions, and zod schema expressions.

Motivation

With the advent of ChatGPT's API, it has become easy to create applications that use advanced NLP. However, to build something usable at the product level, you have to control the non-deterministic behavior of the LLM, which turns out to be quite difficult.

One of these challenges is getting the LLM's output into a form that downstream programs can parse. More concretely, since JSON is the usual output format of choice, the question is: what prompt gives the highest probability that the LLM outputs JSON with the correct format and the specified structure?

The simple answer is to describe the desired format and structure to the LLM in natural language. This is sufficient as long as the structure of the output data is simple, but as the data grows more complex, natural-language specifications tend to become lengthy or ambiguous. An alternative is to express the structure in a widely used data-structure notation, such as TypeScript types or zod schema definitions.
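For illustration, here is the same simple structure expressed both ways (a sketch of our own; the actual prompt used in the experiments appears later in this article):

```
// In natural language (inside a prompt), the structure might be described as:
//   "Output a JSON object with a key `questions`: an array of objects,
//    each having string fields `question` and `answer`."

// The same structure as a TypeScript type expression:
type Question = {
  question: string;
  answer: string;
};

type QuestionSheet = {
  questions: Question[];
};
```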

In "choosing an appropriate notation for conveying the expected output structure in a prompt," the questions are whether the model receiving the prompt will interpret the notation correctly, and whether the notation makes it easy to direct the LLM's attention to the individual constraints within the definition.

Now, in our (SparkleAI) product we usually adopt TypeScript type definitions, simply as the method engineers are most used to, without any particular verification, and this has caused no notable problems. However, some constraints are difficult to express in a type (e.g., limits on string length), so an evaluation is needed of whether schema definitions such as zod are more appropriate in this respect.
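To make that difference concrete, here is an illustrative sketch of our own (not code from the experiments): a character-count limit cannot be expressed in a plain TypeScript type, while zod states it directly.

```
import { z } from "zod";

// A 30-character limit has no direct counterpart in a plain TypeScript type;
// at best it can be hinted at in a comment, which nothing enforces:
type Question = {
  question: string; // at most 30 characters (not enforced by the type)
  answer: string;   // at most 30 characters (not enforced by the type)
};

// zod states the same constraints declaratively as part of the schema:
const QuestionSchema = z.object({
  question: z.string().max(30),
  answer: z.string().max(30),
});
```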

In this post, I investigate and summarize how often ChatGPT succeeds in producing well-formed output when the output format is specified in natural language, TypeScript, and zod.

Subject of evaluation

| Notation | Original purpose | Notes on the method |
| --- | --- | --- |
| Natural language | Communication | Ambiguous; the model may not return the specified type |
| TypeScript type expression | Type checking of program source | Many programs are written in TS, so it is likely to be interpreted correctly |
| zod schema expression | Schema validation at program runtime | More specific than type expressions, but does that affect the result? |

Prompt for evaluation

When a prompt specifies the number of characters or the number of items to output, ChatGPT often fails to follow the constraint. A common prompt-engineering workaround is therefore to generate once and fix up character counts and the like in a later pass. Here, however, we also want to see how count and length constraints such as zod's affect the results, so we use a question-generation prompt that includes character-count constraints as the test subject.

We use the following prompt, swapping the output format definition between natural language, a TypeScript type expression, and a zod schema expression.


# Instructions
For the passage given below, create junior high school Japanese-language questions that test reading comprehension,
and output them as a JSON object satisfying the QuestionSheet defined in zod below.
JSON strings must not contain line breaks or double quotes.
```
const Question = z.object({
  question: z.string().max(30),
  answer: z.string().max(30),
});
const QuestionSheet = z.object({
  questions: z.array(Question).min(7),
});
```

# Passage
```
{document}
```

# Output

Evaluation Method

We gave the model the question-generation prompt with the output format specified in natural language (NL), TypeScript (TS), and zod (ZOD), ran 30 iterations on each of 10 different documents (300 iterations per notation), and tabulated the following metrics.

  • parse: percentage of outputs that can be parsed as JSON after simple preprocessing
  • schema: percentage of parsed outputs whose structure matches the specified one
  • count: percentage of structurally correct outputs that satisfy the item-count constraint
  • length: percentage of strings that satisfy the length constraint, among structurally correct outputs

Assuming that differences between prompts would show up more clearly in settings with more diverse generation, we fixed the temperature at 1.2, a realistic upper limit used for tasks that require diverse results.
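As a sketch of how the four metrics might be tabulated per output, the following uses zod on the evaluation side as well. The preprocessing regex and all function names here are our own illustration, not the code used in the experiments:

```
import { z } from "zod";

// Structure-only schema (count/length constraints are checked separately),
// matching the QuestionSheet shape used in the prompt.
const LooseQuestionSheet = z.object({
  questions: z.array(
    z.object({ question: z.string(), answer: z.string() })
  ),
});

// "Simple preprocessing": extract the outermost {...} block and parse it.
function tryParse(raw: string): unknown | null {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null;
  }
}

// Classify one model output against the four metrics.
function evaluate(raw: string) {
  const parsed = tryParse(raw);
  if (parsed === null) return { parse: false };          // -> "parse"

  const result = LooseQuestionSheet.safeParse(parsed);
  if (!result.success) return { parse: true, schema: false }; // -> "schema"

  const strings = result.data.questions.flatMap((q) => [q.question, q.answer]);
  return {
    parse: true,
    schema: true,
    count: result.data.questions.length >= 7,                // -> "count"
    lengthOk: strings.filter((s) => s.length <= 30).length,  // -> "length"
    lengthTotal: strings.length,
  };
}
```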

Result

| | parse | schema | count | length |
| --- | --- | --- | --- | --- |
| NL | 0.84 (251/300) | 0.57 (143/251) | 0.80 (114/143) | 0.80 (1556/1950) |
| TS | 0.95 (284/300) | 1.00 (283/284) | 0.73 (206/283) | 0.76 (2861/3788) |
| ZOD | 0.94 (278/296) | 1.00 (278/278) | 0.82 (228/278) | 0.79 (2935/3736) |

The large gap between NL and both TS and ZOD on the parse and schema metrics indicates that an artificial language is still more controllable than natural language for specifying the structure of the output.

Beyond that, there is no significant difference between TypeScript and zod on parse and schema. On the other hand, TypeScript's count and length scores are worse than those of both zod and natural language, suggesting that constraints on item counts or string lengths embedded in a TypeScript type expression may attract less of the model's attention than zod's explicit schema specification.

From the above, schema expressions have the best numerical performance as the easiest way to produce the targeted structure. However, besides zod there are other schema libraries such as Yup, io-ts, and joi, whose notations are similar but differ in detail, and we cannot be sure the model can be controlled as intended with each of them. Also, since schema libraries are less universal than TypeScript, we should keep in mind that their notation may change in the future.

Conclusion

This article compared natural language, TypeScript, and zod as ways to control ChatGPT's output format, focusing on how reliably each notation yields the intended data structure and how well it directs the model's attention.

As a result, we found that specifying the output structure in a programming-language notation is more reliable than using natural language. Specifically, both TypeScript and zod beat natural language on the parse success rate and the schema match rate, with no significant difference between the two.

However, TypeScript was inferior to both zod and natural language at constraining item counts and string lengths. This may be because such constraints are difficult to express in TypeScript's type notation and therefore attract less attention than zod's explicit specifications.

From this point of view, schema expressions are the most reliable at producing the targeted structure. However, schema expressions are less universal than TypeScript and their notation is subject to change, so care is needed in this respect.

Based on the above, we consider TypeScript type expressions the most practical for specifying JSON output and controlling its structure. Since they are slightly weaker than schema expressions at constraining item counts and string lengths, we recommend reinforcing those constraints with plain-text instructions in the prompt.
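As a concrete illustration of this recommendation, a prompt can combine a TypeScript type for the structure with plain-text instructions for the constraints the type cannot express. The sketch below is our own illustration of that pattern, not the exact prompt from the experiments:

```
// A minimal sketch of the recommended pattern: a TypeScript type for the
// structure, plus plain-text instructions for count/length constraints.
// The wording and the buildPrompt helper are illustrative, not the
// experiment's actual prompt.
function buildPrompt(document: string): string {
  return `# Instructions
For the passage below, create reading comprehension questions and output
them as a JSON object matching the QuestionSheet type:

\`\`\`
type Question = {
  question: string;
  answer: string;
};
type QuestionSheet = {
  questions: Question[];
};
\`\`\`

Constraints (not expressible in the type):
- Output at least 7 questions.
- Keep each question and answer within 30 characters.
- Do not include line breaks or double quotes inside JSON strings.

# Passage
\`\`\`
${document}
\`\`\`

# Output
`;
}
```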