Everything GPT-2: 4. Data Preparation

If you think your data is clean, you haven’t looked at it hard enough.

This article is part of a series on GPT-2. It’s best if you start at the beginning; the links are located at the bottom of the page.

In the next tutorial, you will fine-tune (train) GPT-2 on any topic you want using a single large text file or a folder containing many text files. In the example, I will work with a large selection of Pulitzer Prize-winning novels. You can select any text you like as long as there is a lot of it and it is very clean. I would encourage you to make your own text file(s) in an area that interests you.

Why should you use a lot of text?
Overfitting is more likely with GPT-2 since the model is so large (the 774M variant is ~6 GB), though it is less of a problem than in typical modeling situations. Generally, if a model overfits, it won’t work well in the real world. If GPT-2 overfits, it will memorize sections of (or all of) the training text; the more it overfits, the more it memorizes. If your model memorizes a bit of text, it is not ruined, it just does not generate as much original text. For some examples, see Gwern’s blog.

How much is a lot?
It depends on the variant of GPT-2 you are using. For the 774M size, I used at least 20 MB of training text, trained up to 80,000 steps, and never had a problem.

How clean should the training text be?
As clean as your patience will allow. GPT-2 learns the formatting faster than the meaning of the text, so the formatting is guaranteed to come through. The less clean your input, the more you will have to massage the output to get it formatted correctly.
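Since the formatting of the input comes through so strongly, a mechanical cleaning pass pays off. A minimal sketch (the function name and the specific substitutions are my own choices, not from a particular library) that normalizes curly quotes, collapses stray whitespace, and drops lone page numbers common in scanned novels:

```python
import re

def clean_text(raw: str) -> str:
    """A minimal cleaning pass: normalize quotes, dashes, and whitespace."""
    # Replace curly quotes and em-dashes with plain ASCII equivalents
    text = (raw.replace("\u201c", '"').replace("\u201d", '"')
               .replace("\u2018", "'").replace("\u2019", "'")
               .replace("\u2014", "-"))
    # Collapse runs of spaces/tabs, but keep paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop stray page numbers that sit on their own line
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    return text.strip()
```

This is only a starting point; skim your actual file and extend the pass for whatever debris your particular source contains.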

What other formatting considerations should I be aware of?
The context window for GPT-2 is limited to 1024 tokens, meaning the text you submit to it plus the text it generates must total fewer than 1024 tokens.
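Tokens are not words, so it helps to have a quick way to check whether a chunk of text will fit in the window. A crude heuristic, assuming roughly 1.3 BPE tokens per English word (the exact count requires GPT-2’s actual tokenizer):

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate for English text: GPT-2's BPE tends to
    produce around 1.3 tokens per whitespace-separated word.
    Use the real tokenizer when you need an exact count."""
    return int(len(text.split()) * 1.3)

# A prompt plus its expected generation should stay under the window
CONTEXT_WINDOW = 1024
prompt = "Tell me a joke:"
fits = approx_token_count(prompt) < CONTEXT_WINDOW
```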
GPT-2 is pretty intelligent about formatting so if you train it on thousands of instances of something and then prompt it correctly, it will just work. For example, if you trained it on:
“Tell me a joke: This is the joke. |Punchline| This is the punch line.”
and then prompted it with:
“Tell me a joke:”
It would return:
“|Punchline| This is the punch line <|endoftext|>”
<|endoftext|> is a special token used to signal a shift in context to GPT-2.
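The joke pattern above can be sketched as a small formatting helper. This is just one way to build such a training file; the `|Punchline|` delimiter comes from the example, while the function and variable names are my own, and `<|endoftext|>` is GPT-2’s real context-separator token:

```python
END_OF_TEXT = "<|endoftext|>"

def format_joke(setup: str, punchline: str) -> str:
    """Format one training example with a consistent prompt and delimiter."""
    return f"Tell me a joke: {setup} |Punchline| {punchline} {END_OF_TEXT}"

jokes = [
    ("This is the joke.", "This is the punch line."),
    ("Another joke.", "Another punch line."),
]

# One example per line; thousands of these teach GPT-2 the pattern.
training_text = "\n".join(format_joke(s, p) for s, p in jokes)
```

The key is consistency: the same prompt text and the same delimiter on every example, so the model has thousands of identical-looking instances to learn from.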

How do I use a text sample that has a lot of examples but is not very large (think Reddit post titles)?
Make the formatting very clean and monitor the output for instances of GPT-2 repeating examples from the training text.
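Monitoring for repeated training examples can be automated with a simple verbatim check. A minimal sketch (the function name and the exact-match criterion are my own; a fuzzier n-gram overlap check would catch near-copies too):

```python
def memorized_fraction(generated: list[str], training: set[str]) -> float:
    """Fraction of generated samples that appear verbatim in the training set."""
    if not generated:
        return 0.0
    hits = sum(1 for sample in generated if sample.strip() in training)
    return hits / len(generated)
```

If the fraction climbs as training proceeds, the model is memorizing and you are past the useful number of steps for your data set.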

How concerned should I be about overfitting?
A little less than normal, as long as your data set is large enough. Each of the blogs in Everything GPT-2: 3. Tools speaks to this a bit. Think of the training curve as a parabola opening downward: training steps on the X-axis, increasing to the right, and goodness of fit on the Y-axis, with fit improving as you move up. Your best fit is somewhere between 1 step and 800,000 steps, but the curve is gentle; being a little over- or under-trained won’t make much difference.

Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production

All resources for articles in the series are centralized in this Google Drive folder.

(Aside) Why so Socratic?
Efficiency. By using the Socratic method, I don’t have to carefully craft transition sentences to link ideas. Since these articles are mostly fact-driven, they are not truly Socratic; however, even in a less-than-ideal setting it is instructive to think about how questions shape answers.

(shameless self-promotion) Have you found this valuable?
If no, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.