Everything GPT-2: 7. Production
Deploy GPT-2 as an Autoscaling Web API
Prerequisites:
- AWS — Free account with an administrative user set up for programmatic access. When you set up the account, you will create a root user. Use the root user to create an administrator with programmatic access, then use that administrator for Cortex and your own work. Set up at least one empty S3 bucket (see the example command after this list).
- Docker — Installed.
- Cortex — Installed per their website, with the basic Cortex tutorials completed. If you are running Windows, check out their Windows installation guide for running Cortex using WSL (Windows Subsystem for Linux) version 2. Note: If you install Ubuntu 20.04 (recommended), then you will have to run
sudo apt-get install python3-pip
pip3 install cortex
from the WSL Bash window, because Ubuntu 20.04 does not ship with pip3.
- AWS Command Line Interface Version 2 — Installed. If using Windows, installed on WSL.
- Git Command Line — Installed.
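If you have not created the S3 bucket mentioned above yet, one way to do it is with the AWS CLI. The bucket name here is a placeholder, and us-east-1 is just an example region; use the region you plan to run Cortex in:
aws s3 mb s3://your-gpt2-bucket --region us-east-1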
My process is to review the code on my Windows machine using Atom, copy the code to the WSL partition, and run it with Cortex. If your primary machine is a Mac or Linux box, you won't have to do the extra file copying.
Tutorial Steps:
1. Configure your AWS credentials with the command line utility. Do this on the OS that will be running Cortex. I tried configuring my credentials multiple ways and could not get them to work; configuring with the command line utility was very straightforward. Simply run
aws configure
and set your configuration. Set your Default output format to json. A sample session is shown below.
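For reference, the configuration session looks like this (the values shown are placeholders; us-east-1 is just an example region):
AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: json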
Make sure that the S3 bucket you will use is in the same region as your default region; otherwise you will incur extra fees from inter-region transfers.
2. Upload your model and tokenizer to AWS S3 with this Colab Notebook.
3. Clone the Everything GPT-2 repo to your primary coding machine:
git clone https://github.com/Rowlando13/EverythingGPT-2.git
4. Open the files in your favorite editor and dig in. Here are the highlights:
a. Rowlando13_API Specs.pdf:
Since your final product will be an API, it is best to write the specs before you try to build it. The PDF contains the specs for mine.
b. cstm_generate.py: The core code for your API. It accepts an ONNX model and parameters for the input text, temperature, etc., and recreates most of the features of the Transformers library's generate utility. The code is well commented if you want to look through it; a sketch of the core sampling loop appears after this list.
c. predictor_1.py: Takes cstm_generate.py and makes the API described in the specs. You should update the AWS credentials and S3 bucket names. (A sketch of the predictor interface also appears after this list.)
d. cortex.yaml: Settings for your specific API.
e. cluster2.yaml: Settings for your Kubernetes cluster.
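To give you a feel for what cstm_generate.py is doing, here is a minimal sketch of temperature sampling against an ONNX model. This is not the repo's code: the input name "input_ids", the output layout, and the file name are assumptions that depend on how the model was exported.
import numpy as np
import onnxruntime

def sample_next_token(logits, temperature=0.9):
    # Temperature scaling: values below 1 sharpen the distribution
    # toward the likeliest tokens, values above 1 flatten it.
    logits = logits.astype(np.float64) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def generate(session, input_ids, max_new_tokens=20, temperature=0.9):
    # input_ids: token ids for the prompt, from the GPT-2 tokenizer.
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        feed = {"input_ids": np.array([ids], dtype=np.int64)}  # assumed input name
        # Assumed output shape (batch, sequence, vocab); sample from the
        # distribution over the last position only.
        logits = session.run(None, feed)[0][0, -1]
        ids.append(sample_next_token(logits, temperature))
    return ids

# session = onnxruntime.InferenceSession("gpt2_quantized.onnx")  # hypothetical file name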
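predictor_1.py plugs into Cortex's Python predictor interface, which at the time of writing was a class named PythonPredictor: __init__ receives the config from cortex.yaml, and predict receives the parsed JSON payload of each request. A stripped-down sketch, where the bucket and key names are placeholders rather than the repo's actual values:
import boto3
import onnxruntime

class PythonPredictor:
    def __init__(self, config):
        # Runs once at startup: pull the model down from S3 and load it,
        # so every request reuses the same session.
        s3 = boto3.client("s3")
        # Bucket and key are placeholders; substitute your own.
        s3.download_file("your-gpt2-bucket", "gpt2_quantized.onnx", "/tmp/model.onnx")
        self.session = onnxruntime.InferenceSession("/tmp/model.onnx")

    def predict(self, payload):
        # payload is the parsed JSON body of the POST request,
        # e.g. {"text": "...", "temperature": 0.9}.
        # Tokenize the text, run the generation loop, decode the ids,
        # and return a JSON-serializable dict.
        ...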
5. Move the model files to Linux.
Edit predictor_1.py to add your AWS credentials and S3 bucket names. Copy the predictor files and the JSON files for testing into your home directory on Linux/Mac/WSL. On WSL, your Windows files are accessible at their normal locations, except that you change all the \ to / and C: to /mnt/c/, e.g., C:\Users\edwar => /mnt/c/Users/edwar .
Bash commands on WSL will look like:
cp -r /mnt/c/your_source_directories/EverythingGPT-2/cortex your_destination_directories
cp -r /mnt/c/your_source_directories/EverythingGPT-2/api_testing_samples your_destination_directories
6. Deploy (Locally)
cd ~/EverythingGPT-2/cortex
cortex deploy
Watch the logs by running:
cortex logs gpt2-774m-pulitzer
Watch the logs until you see:
2020-12-15 21:11:37.039702:cortex:pid-242:INFO:Application startup complete.
2020-12-15 21:11:37.040991:cortex:pid-242:INFO:Uvicorn running on unix socket /run/uvicorn/proc-0.sock (Press CTRL+C to quit)
Exit the logs.
7. Test the local deployment with a curl request.
cd ..
cd api_testing_samples
curl http://localhost:8890 -X POST -H "Content-Type: application/json" -d @sample1.json
You should get something back like:
{"text": {"0": " first alternate completion to you sentence.", "1": " second alternate completion to your sentence."}, "truncated": false}
Delete your API.
cortex delete gpt2-774m-pulitzer
8. Deploy to AWS
Deploy the cluster and wait for it to come online. It should take about 15 minutes.
cd ..
cd cortex
cortex cluster up --config cluster1.yaml
Deploy the model, and watch the logs again.
cortex env default aws
cortex deploy
cortex logs gpt2-774m-pulitzer
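To find the endpoint to curl in the next step, you can ask Cortex for the API's status; assuming the CLI behaves as it did at the time of writing, the output includes the endpoint URL:
cortex get gpt2-774m-pulitzer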
9. Test the cloud deployment with a curl request.
cd ..
cd api_testing_samples
curl yourapiendpoint -X POST -H "Content-Type: application/json" -d @sample1.json
Depending on the parameters you send, you may run into a service-unavailable message: the maximum request time is 29 seconds, and requests that take longer than that will return service unavailable.
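If you would rather hit the API from Python than curl, a minimal client looks like this (the endpoint is a placeholder; use the URL Cortex printed for your API):
import json
import requests

endpoint = "https://your-api-endpoint"  # placeholder: use the URL from `cortex get`
with open("sample1.json") as f:
    payload = json.load(f)

response = requests.post(endpoint, json=payload, timeout=29)  # match the 29 s limit
print(response.json())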
Congratulations! You have just deployed a custom state-of-the-art NLP model to the cloud in an auto-scaling framework. You can access it at its endpoint with a simple POST request. I know it has been a long journey, but thanks for sticking around. To take the cluster down:
cortex cluster down
Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production
All resources for articles in the series are centralized in this Google Drive folder.
(Aside) If you are relatively new to Machine Learning, you probably have not experienced the pain that is development operations (DevOps). The models are super big; one typo can mean you have to restart the whole process. To alleviate this pain, I do as much testing as possible locally with as small a model as possible, then deploy a small model to the cloud, then deploy the large model to the cloud. For this last article, I used the onnx-quant-124M-base model located in the resource folder. It is just under 200 MB, compared to the ~800 MB quantized 774M model.
Have you found this valuable?
If not, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.