Everything GPT-2: 1. Architecture Overview

Edward Girling
5 min read · Oct 16, 2020

Prepare for brain melt in 3, 2, 1 …

This article is part of a series on GPT-2. It’s best if you start at the beginning. The links are located at the bottom of the page.

This article is intended to inform your intuition rather than to go through every point in depth.

Different resources use “GPT-2” to mean different things. Some mean the whole system that takes in some words and gives you back some more words. For this article and the next, I will use GPT-2 to mean the parts that take in some words and generate a single word piece (token). My reasoning will become clearer as you learn more. In the most technical resources, GPT-2 refers only to the pieces that are unique to GPT-2 compared with similar Natural Language Processing (NLP) models.

Overview of feeding in text and generating a single token:

  1. Tokenization — Take some words, break them up into their common pieces, and replace those common pieces with numbers. In this example the dashes are placeholders to improve readability:
    →The cats played with the yarn.
    → The | cat | s | play | ed | with | the | y | arn |.|
    →- -1- -|- -2- | 3 |- -4- | 5- |- - 6- - |- -1-| 7 |- 8- -|9|
    A single number is a token. Tokenization is necessary because computers only work with numbers, and it also represents words efficiently: see how “the” gets represented by “1” twice, and how “arn” could be part of barn or yarn. A short tokenizer sketch appears after this list.
  2. Embedding with time signal — Take that string of numbers and convert each number to a vector. This captures the position of words relative to one another and allows words to take value from other words associated with them, e.g.
    “The boy ran through the woods, and he surely had not stolen the cherry pie for which they were chasing him.”
    In this sentence “he” should clearly tie a lot of its meaning to “boy.” The same is true for “the” and “boy”: the definite article carries a lot of meaning in the larger context. A toy embedding sketch appears after this list.
  3. Decoder Block — This is where the magic happens. The pieces are self-attention blocks, feedforward neural nets, and normalization. Self-attention blocks identify which words to focus on. In the sentence, “Jimmy played with the burning bush, and then went around to the next bush,” the words “Jimmy,” “played,” “burning,” and “bush” capture a high proportion of the meaning. The idea that certain words and phrases capture more meaning, and thus should be given more “attention,” is the intuition behind self-attention blocks (a small sketch follows the list). Feedforward neural nets are networks made up of an input layer that accepts information, hidden layers that capture the hidden correlations between data points, and an output layer that transmits information. Between each self-attention block and feedforward neural net there is a normalization layer. Normalization is an empirically developed technique: exactly how it makes learning more effective during the training phase is not agreed upon, but it does so nonetheless.
  4. Many more Decoder Blocks
  5. Linear Layer — Before tokenization, a vocabulary size is decided upon and a vocabulary is set up. The vocabulary is just a list of all the possible tokens (numbers) that can be produced and which letter or group of letters each token stands for. The linear layer takes the output of the last decoder block and converts it to a vector whose dimensions are vocabulary size by 1. In short, it takes a lot of inputs and produces a list where each spot represents a token. The higher the number in a spot, the better the chance that that token is the best pick.
  6. Softmax — Converts the output of the linear layer to a probability distribution. The output of the linear layer gives you information about which tokens are the best picks, but it is hard to use: the values range from huge negative numbers to very large positive ones, and each value only has meaning in relation to all the others. To make them easier to use, apply the softmax function, which converts the vector to a probability distribution. Each number then represents the probability that its token is the correct one.
  7. Pick a token — Choose a method to pick the next token from the probability distribution of tokens, and use that method to pick the token. There are various methods to do so, including greedy decoding, temperature sampling, top-k sampling, and nucleus sampling. A sketch covering steps 5 through 7 follows the list.
  8. Convert the token to a word piece using the vocabulary.
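If you want to poke at step 1 yourself, here is a minimal sketch using the Hugging Face transformers package (my choice of tool, not something the architecture requires). The real GPT-2 tokenizer is a byte-pair-encoding vocabulary learned from data, so it will split the example sentence differently than my hand-made split above.

```python
# Minimal tokenization sketch; assumes the `transformers` package is installed.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("The cats played with the yarn.")
print(ids)                                    # a list of integers -- the tokens
print(tokenizer.convert_ids_to_tokens(ids))   # the word pieces behind those integers
print(tokenizer.decode(ids))                  # and back to the original string
```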
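Here is a toy NumPy sketch of step 2: look up a vector for each token and add a vector for its position. The matrices are random stand-ins for weights the real model learns, the token ids are made up, and the sizes match the smallest GPT-2 model.

```python
# Toy embedding-with-position sketch; random matrices stand in for learned weights.
import numpy as np

vocab_size, context_len, d_model = 50257, 1024, 768   # smallest GPT-2 model's sizes
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(vocab_size, d_model))
position_embedding = rng.normal(size=(context_len, d_model))

token_ids = np.array([1, 2, 3, 4])                    # hypothetical tokens from step 1
positions = np.arange(len(token_ids))                 # 0, 1, 2, 3 -- where each token sits
x = token_embedding[token_ids] + position_embedding[positions]
print(x.shape)                                        # (4, 768): one vector per input token
```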
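And a toy NumPy sketch of the self-attention inside step 3: single-head scaled dot-product attention with the causal mask GPT-2 uses, so each token only looks at itself and earlier tokens. The real decoder block is multi-headed and adds the feedforward net and normalization described above.

```python
# Single-head causal self-attention sketch (the real model is multi-headed).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # how strongly each token attends to each other token
    # Causal mask: a token may not look at tokens that come after it.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -1e9
    weights = softmax(scores)                 # each row is a probability distribution
    return weights @ v                        # weighted mix of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, 8-dimensional toy embeddings
w_q, w_k, w_v = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(x, w_q, w_k, w_v).shape) # (4, 8)
```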
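Steps 5 through 7 fit in one sketch: project the last decoder output to one score per vocabulary entry, softmax those scores into probabilities, and pick a token. Again, random values stand in for learned weights; greedy, temperature, and top-k picking are shown, and nucleus (top-p) sampling works the same way except it keeps the smallest set of tokens whose probabilities add up to p.

```python
# Linear head + softmax + sampling sketch; random values stand in for learned weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 768, 50257

# 5. Linear layer: one score (logit) per vocabulary entry for the last position.
h_last = rng.normal(size=(d_model,))             # stand-in for the last decoder block's output
w_head = rng.normal(size=(d_model, vocab_size))  # GPT-2 ties this matrix to the token embeddings
logits = h_last @ w_head                         # shape (vocab_size,)

# 6. Softmax: squash the unbounded logits into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# 7. Pick a token -- three common methods:
greedy = int(np.argmax(probs))                   # greedy: always take the most likely token

temperature = 0.8                                # temperature: <1 sharpens, >1 flattens the distribution
temp_probs = np.exp((logits - logits.max()) / temperature)
temp_probs /= temp_probs.sum()
temp_pick = int(rng.choice(vocab_size, p=temp_probs))

k = 40                                           # top-k: sample only among the k most likely tokens
top_idx = np.argsort(probs)[-k:]
top_probs = probs[top_idx] / probs[top_idx].sum()
topk_pick = int(top_idx[rng.choice(k, p=top_probs)])

print(greedy, temp_pick, topk_pick)
```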

After one pass, GPT-2 has taken some words and generated a single word piece. To generate more text, take that word piece, append it to the end of the initial text, and run that text through again.
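That loop looks something like this minimal greedy-decoding sketch, assuming the Hugging Face transformers and PyTorch packages (it downloads the pretrained GPT-2 small weights on first run). The tokenizer and model calls cover steps 1 through 6 internally; the Python loop is the “pick a token, append it, run it through again” part.

```python
# Minimal autoregressive generation loop; assumes `transformers` and `torch` are installed.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cats played with", return_tensors="pt")

# One token per pass: run the model, take the logits for the last position,
# pick a token, append it, and feed the longer sequence back in.
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits          # shape (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()              # greedy pick, for simplicity
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```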

Congratulations on finishing! Get ready for the next one, it will sting a bit.

Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production

All resources for articles in the series are centralized in this Google Drive folder.

(Aside) Do you think the point of this article (forming an intuition for the subject) isn’t worth the effort when the next article will go through everything precisely?
Intuition is an often forgotten tool because it can be hard to form and its processes can be unscientific. Some people just get a concept while others are still working through it analytically. The value of forming intuition is two-fold: it eases the learning process and greatly increases your chances of retention. A simple example is worked out here.

Have you found this valuable?
If no, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.


Edward Girling

Mathematician, enjoys his knowledge distilled. Find my insight deep, my jokes laughable, my resources useful, and connect with me on Twitter @Rowlando_13