Everything GPT-2: 7. Production

Deploy GPT-2 as an Autoscaling Web API

Prerequisites:
  1. AWS — Free account with an administrator user set up for programmatic access. When you create the account, you get a root user. Use the root user to set up an administrator with programmatic access, and use that administrator for Cortex and your own work. Also set up at least one empty S3 bucket.
  2. Docker — Installed.
  3. Cortex — Installed per their website, with the basic Cortex tutorials completed. If you are running Windows, check out their Windows installation guide for running Cortex under WSL (Windows Subsystem for Linux) version 2. Note: if you install Ubuntu 20.04 (recommended), you will have to run sudo apt-get install python3-pip and then pip3 install cortex from the WSL Bash window, because Ubuntu 20.04 does not ship with pip3.
  4. AWS Command Line Interface Version 2 — Installed. If using Windows, installed on WSL.
  5. Git Command Line — Installed.
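On a fresh Ubuntu 20.04 WSL install, the Cortex setup from the prerequisites boils down to:

```shell
# Ubuntu 20.04 does not ship with pip3, so install it first.
sudo apt-get update
sudo apt-get install python3-pip
# Then install the Cortex CLI.
pip3 install cortex
```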

My process is to review the code on my Windows machine using Atom, copy the code to the WSL partition, and run it with Cortex. If your primary machine is a Mac or Linux machine, you won’t have to do the extra file copying.

Tutorial Steps:

  1. Configure your AWS credentials with the command-line utility. Do this on the OS that will run Cortex. I tried configuring my credentials several other ways and could not get them to work; configuring with the command-line utility was very straightforward. Simply run aws configure and enter your settings, with Default output format set to json. Make sure that the S3 bucket you will use is in the same region as your default region, otherwise you will incur extra fees from inter-region transfers.
  2. Upload your model and tokenizer to AWS S3 with this Colab Notebook.
  3. Clone the GPT-2 Everything repo to your primary coding machine.
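Step 1 is an interactive prompt; a session looks roughly like this (the region shown is just an example; pick the one your S3 bucket lives in):

```shell
$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json
```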

4. Open the files in your favorite editor and dig in. Here are the highlights:

a. Rowlando13_API Specs.pdf:
Since your final product will be an API, it is best to write the specs before you try to build it. Here are the specs for mine:
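The actual spec is in the PDF; purely as an illustration (the field names below are invented placeholders, not the real spec), a text-generation API of this shape accepts a POST body along these lines and returns generated text:

```json
{
  "text": "GPT-2 is",
  "temperature": 0.8,
  "max_length": 50
}
```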

b. cstm_generate.py: The core code for your API. It accepts an ONNX model and parameters for text, temperature, etc., and recreates most of the features of the Transformers library’s generate utility. The code is well commented if you want to look through it.
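To give a feel for what such a generate utility does, here is a toy sketch of the core temperature-sampling loop (this is not the code from cstm_generate.py; the model callable stands in for the ONNX session, and the function names are mine):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to probabilities.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, token_ids, max_new_tokens=10, temperature=1.0, eos_id=None):
    # Autoregressive loop, the same shape as Transformers' generate():
    # feed the growing token sequence back into the model each step.
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = model(token_ids)          # one logit per vocab entry
        probs = softmax(logits, temperature)
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        token_ids.append(next_id)
        if next_id == eos_id:              # stop early on end-of-sequence
            break
    return token_ids
```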

c. predictor_1.py: Takes cstm_generate.py and makes the API described in the specs. You should update the AWS credentials and S3 bucket names.
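The shape of a Cortex Python predictor is a class with an __init__(self, config) that does one-time setup and a predict(self, payload) that is called per request. A stripped-down sketch of that interface (the real predictor_1.py downloads the ONNX model from S3 in __init__; here a stand-in generator callable is injected through config so the structure is visible):

```python
class PythonPredictor:
    def __init__(self, config):
        # One-time setup. The real predictor downloads the ONNX model
        # from S3 here and builds the generator from cstm_generate.py;
        # this sketch takes a ready-made callable from the config dict.
        self.generator = config["generator"]

    def predict(self, payload):
        # Called once per request; payload is the parsed JSON body.
        text = payload["text"]
        temperature = float(payload.get("temperature", 1.0))
        return {"generated_text": self.generator(text, temperature)}
```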

d. cortex.yaml: Settings for your specific API.
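For reference, a Cortex API configuration has roughly this shape (field names vary across Cortex versions, and the API name and resources here are placeholders):

```yaml
- name: text-generator
  predictor:
    type: python
    path: predictor_1.py
  compute:
    cpu: 1
    mem: 4G
```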

e. cluster2.yaml: Settings for your Kubernetes cluster.
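Likewise, the cluster configuration sets the AWS region and the autoscaling bounds. A minimal sketch (values are placeholders, and the exact keys depend on your Cortex version):

```yaml
cluster_name: cortex
region: us-east-1
instance_type: m5.large
min_instances: 1
max_instances: 3
```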

5. Move model to Linux.

Edit predictor_1.py to add your AWS credentials and S3 bucket names. Copy the predictor files and the JSON files for testing into your home directory on your Linux/Mac/WSL system. On WSL, your Windows files are accessible at their normal locations, except that every \ becomes / and C: becomes /mnt/c/, e.g. C:\Users\edwar => /mnt/c/Users/edwar.
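The path translation rule can be written as a tiny helper (illustrative only; win_to_wsl is my name, not something from the repo):

```python
def win_to_wsl(path):
    # Translate a Windows drive-letter path to its WSL mount point:
    # C:\Users\edwar -> /mnt/c/Users/edwar
    drive, sep, rest = path.partition(":\\")
    if not sep:
        return path  # not a drive-letter path; leave unchanged
    return "/mnt/" + drive.lower() + "/" + rest.replace("\\", "/")
```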

Bash commands on WSL will look like:
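For example (the paths are placeholders for wherever you cloned the repo):

```shell
cp /mnt/c/Users/<you>/<repo>/cstm_generate.py ~/
cp /mnt/c/Users/<you>/<repo>/predictor_1.py ~/
cp /mnt/c/Users/<you>/<repo>/cortex.yaml ~/
cp /mnt/c/Users/<you>/<repo>/*.json ~/
```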

6. Deploy (Locally)
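With the Cortex CLI installed, a local deploy is run from the directory holding cortex.yaml (exact flags vary with your Cortex version; check cortex deploy --help):

```shell
cortex deploy
```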

Watch the logs by running:
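Assuming the API is named text-generator in cortex.yaml (a placeholder name):

```shell
cortex logs text-generator
```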

Watch the logs until you see:

Exit the logs with Ctrl+C.

7. Test the local deployment with a curl request.
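cortex get <api-name> prints the endpoint of a deployed API. With a placeholder endpoint and one of the repo's test JSON files, the request looks like:

```shell
curl <local-endpoint> -X POST -H "Content-Type: application/json" -d @<test-file>.json
```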

You should get something back like:

Delete your API.
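Again assuming the placeholder API name:

```shell
cortex delete text-generator
```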

8. Deploy to AWS

Deploy the cluster and wait for it to come online. It should take about 15 minutes.
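The cluster is created from cluster2.yaml with the Cortex CLI (flag syntax differs between Cortex versions; check cortex cluster up --help):

```shell
cortex cluster up --config cluster2.yaml
```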

Deploy the model, and watch the logs again.
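In Cortex releases that use named environments, deploying to the cluster instead of locally is a matter of selecting the aws environment (again, text-generator is a placeholder name):

```shell
cortex deploy --env aws
cortex logs text-generator --env aws
```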

9. Test the cloud deployment with a curl request.

You may run into a service-unavailable message depending on the generation settings you use: the maximum request time is 29 seconds, and any request that takes longer returns service unavailable.
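The request is the same as the local test, only against the cluster endpoint reported by cortex get (placeholder shown):

```shell
curl <cluster-endpoint> -X POST -H "Content-Type: application/json" -d @<test-file>.json
```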

Congratulations! You have just deployed a custom state-of-the-art NLP model to the cloud in an auto-scaling framework. You can access it at its endpoint with a simple POST request. I know it has been a long journey, but thanks for sticking around. To take the cluster down:
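The matching teardown command is (flags vary by Cortex version; check cortex cluster down --help):

```shell
cortex cluster down
```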

Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production

All resources for articles in the series are centralized in this Google Drive folder.

(Aside) If you are relatively new to Machine Learning, you probably have not experienced the pain that is development operations (dev ops). The models are super big; one typo can mean you have to restart the process. To alleviate this pain, I do as much testing locally with as small a model as possible, then deploy a small model to the cloud, then deploy a large model to the cloud. For this last article, I used the onnx-quant-124M-base model located in the resource folder. It is just under 200 MB, compared to the ~800 MB quantized 774M model.

Have you found this valuable?
If no, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.

Mathematician, enjoys his knowledge distilled. Find my insight deep, my jokes laughable, my resources useful; connect with me on Twitter @Rowlando_13