Everything GPT-2: 5. Fine-Tuning

Edward Girling
2 min read · Oct 27, 2020

Specialize GPT-2 for enhanced performance on any text

This article is part of a series on GPT-2. It’s best to start at the beginning. Links to the other articles are at the bottom of the page.

What is fine-tuning?
GPT-2 was trained on 40 gigabytes of text ranging across many subjects. It is very good at generating text, but it can be improved by training it further on text specific to its application. This process is called fine-tuning, and it is a form of transfer learning.

Before running either tutorial, see this article for setup. The best way to go through this article is interactively:

  1. Fine-tune with GPT-2 Simple: We will use this to train the 774M variant because this package is a little more memory-efficient.
  2. Fine-tune with Transformers’ Trainer utility: This tutorial is provided because it is a better way to train GPT-2 if you have access to enough memory. The Trainer utility is faster, and the model is already converted to PyTorch 1.x. I tried running the 774M variant with 16 gigabytes of memory and still ran out of memory, so the tutorial uses the 355M variant.
(Image: Google Colab CUDA out-of-memory error)
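For readers who want a feel for the first option before opening the notebook, here is a minimal sketch of a GPT-2 Simple fine-tuning run. It is not the tutorial's exact code: the dataset path, run name, and step counts are placeholder values I picked for illustration, and the helper names `finetune_kwargs` and `run_finetune` are hypothetical. The `gpt_2_simple` calls themselves (`download_gpt2`, `start_tf_sess`, `finetune`, `generate`) are the package's public functions.

```python
from pathlib import Path

MODEL_NAME = "774M"      # the variant this article trains with GPT-2 Simple
DATASET = "corpus.txt"   # hypothetical path to your prepared training text
RUN_NAME = "run1"        # hypothetical name for the checkpoint folder


def finetune_kwargs(steps=1000):
    """Collect keyword arguments for gpt2.finetune (values are illustrative)."""
    return dict(
        dataset=DATASET,
        model_name=MODEL_NAME,
        steps=steps,             # number of training steps to run
        restore_from="fresh",    # start from the released weights
        run_name=RUN_NAME,
        print_every=10,          # log the loss every 10 steps
        sample_every=200,        # print a generated sample every 200 steps
        save_every=500,          # write a checkpoint every 500 steps
    )


def run_finetune(steps=1000):
    """Download the model if needed, fine-tune it, then print one sample."""
    # Imported here so the settings above can be read without TensorFlow installed.
    import gpt_2_simple as gpt2

    if not (Path("models") / MODEL_NAME).is_dir():
        gpt2.download_gpt2(model_name=MODEL_NAME)

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, **finetune_kwargs(steps))
    gpt2.generate(sess, run_name=RUN_NAME)
```

Calling `run_finetune()` on a GPU runtime starts training. Expect to adjust `steps` and the sampling intervals to the size of your own dataset.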

Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production

All resources for articles in the series are centralized in this Google Drive folder.

(Aside) The 355 million parameter and smaller versions of GPT-2 are cool, but not super impressive. Once you get to the 774 million parameter version, results become reliable enough to be useful, but you need a ton of memory to fine-tune it. You are basically forced into distributed training across multiple GPUs. If you have any experience or ideas, DM me. The memory bottleneck is quite frustrating since, as you will see in the optimizations article, you can run even the large models with text generation times under 1 second.
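To put a rough number on that memory bottleneck, here is a back-of-the-envelope estimate. It rests on assumptions that are mine, not measurements from the tutorials: 32-bit floats throughout, and an Adam-style optimizer that keeps two extra values per parameter. It counts only weights, gradients, and optimizer state. Activations come on top and grow with batch size and sequence length, so treat the result as a lower bound.

```python
BYTES_FP32 = 4  # assumption: everything is stored as 32-bit floats


def training_memory_gb(n_params, bytes_per_value=BYTES_FP32):
    """Rough lower bound on training memory in GB, excluding activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value          # one gradient per weight
    adam_moments = 2 * n_params * bytes_per_value   # Adam keeps two moments per weight
    return (weights + gradients + adam_moments) / 1e9


for name, n_params in [("355M", 355e6), ("774M", 774e6)]:
    print(f"{name}: ~{training_memory_gb(n_params):.1f} GB before activations")
```

Under these assumptions the 774M variant needs about 12.4 GB before any activations are stored, compared with about 5.7 GB for the 355M variant. That is at least consistent with the 774M run going out of memory on a 16 GB card.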

Have you found this valuable?
If no, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.



Edward Girling

Mathematician, enjoys his knowledge distilled. If you find my insight deep, my jokes laughable, or my resources useful, connect with me on Twitter @Rowlando_13