GPT-2 is good; now you will make it fast.
At this point you have a fine-tuned 774M variant of GPT-2, but it has two problems: it is formatted for TensorFlow 1.x, which has been deprecated, and generating text with it is slow.
To fix the formatting problem, we will use a conversion script from Hugging Face. The result will be a fine-tuned model formatted for PyTorch 1.x (PyTorch versioning started at 0.x). I could have converted it to TensorFlow 2.x instead, but ONNX has better support for PyTorch. If you trained your model with the Transformers library, then your fine-tuned model is already formatted for PyTorch 1.x.
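For reference, a TF 1.x checkpoint can be converted with the Transformers command-line tool. The paths below are placeholders for wherever your checkpoint lives; check the documentation of your installed Transformers version for the exact flag names:

```shell
# Hypothetical paths; requires both transformers and tensorflow installed.
# Converts a TF 1.x GPT-2 checkpoint into a PyTorch state dict.
transformers-cli convert --model_type gpt2 \
    --tf_checkpoint ./checkpoint/run1/model.ckpt \
    --config ./checkpoint/run1/hparams.json \
    --pytorch_dump_output ./pytorch/pytorch_model.bin
```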
To make the model faster, the script will apply the numerical optimizations (small speedup) and quantization (huge speedup) provided by ONNX. The numerical optimizations replace parts of the model with numerically efficient equivalents: fusing self-attention and layer normalization into single operations, specializing for the target GPU or CPU architecture, compiling down to C or lower-level languages, and a host of other methods. There is no unified theory for optimizations of this sort; pieces are swapped out in a mix-and-match way with the single goal of making the model faster without losing too much accuracy. The details are interesting but not necessary here. Quantization replaces 32-bit floating point numbers with 8-bit integers and reparametrizes the model to support integer multiplication. This reduces the model size by 75 percent and cuts generation time by 50 to 70 percent.
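To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization: each float is stored as a signed integer plus one shared scale factor per tensor. This illustrates the principle, not ONNX's actual implementation (the example weights are made up):

```python
# Sketch of symmetric 8-bit quantization: floats become int8 codes
# sharing one scale factor, so each value needs 1 byte instead of 4.

def quantize(values, num_bits=8):
    """Map floats to signed integers plus a shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    ints = [round(v / scale) for v in values]
    return ints, scale

def dequantize(ints, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in ints]

weights = [0.91, -0.42, 0.07, -1.30]          # made-up example weights
q, s = quantize(weights)
approx = dequantize(q, s)

# The round-trip error per value is at most half the scale factor:
# this is the 75 percent size saving paid for with a small rounding error.
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

Rounding to the nearest integer bounds the per-value error by `scale / 2`, which is why the accuracy loss stays small as long as the weights do not contain extreme outliers.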
Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production
All resources for articles in the series are centralized in this Google Drive folder.
(Aside) How and why does quantization work?
It has two pieces: data compression, by approximating floating point numbers with integers and scaling factors, and more efficient calculation, by using 8-bit integer arithmetic in place of 32-bit floating point arithmetic.
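The second piece, integer arithmetic, works because the scale factors can be pulled outside the sum: a dot product runs entirely on the small integer codes, with a single float rescale at the end. A self-contained sketch with made-up values (again illustrating the principle, not ONNX's kernels):

```python
# Sketch: a dot product computed in integer arithmetic, with one
# float rescale at the end instead of a float multiply per term.

def quantize(values, num_bits=8):
    """Symmetric quantization: int codes plus a shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

x = [0.5, -0.25, 1.0]     # made-up activations
w = [0.8, 0.1, -0.6]      # made-up weights
qx, sx = quantize(x)
qw, sw = quantize(w)

# Integer multiply-accumulate over the codes...
int_acc = sum(a * b for a, b in zip(qx, qw))
# ...then a single rescale, since (qx*sx) . (qw*sw) = (qx . qw) * sx * sw.
approx_dot = int_acc * sx * sw

exact_dot = sum(a * b for a, b in zip(x, w))
```

Because integer multiply-accumulate is much cheaper than floating point on most CPUs, this is where the wall-clock savings come from.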
It works because neural networks are generally very resilient to noise. At their core, they are mechanisms for extracting a signal from a very noisy source of information. The loss in precision due to quantization error is very similar to the error resulting from having noisier data. Check out MathWorks for more info!