The same thing goes by many names…

The Problem

Important things often go by several names, or even by codes. Think of US states: Virginia might appear as Virginia, VA, or the State of Virginia. You can’t write a regex to solve this because the two-letter abbreviations are irregular. You could hard-code a big dictionary of all the states and use pandas.Series.map, but that dictionary is really data. In an ideal world, it would be kept separate from your code.
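To make the hard-coded approach concrete, here is a minimal sketch. The dictionary and the sample values are made up for illustration. Notice that the lookup lives inside the code and that any spelling missing from it silently becomes NaN.

```python
import pandas as pd

# Hard-coded lookup: every spelling we expect, mapped to one abbreviation.
# (Illustrative only; a real version would need all 50 states.)
STATE_ABBREVIATIONS = {
    "Virginia": "VA",
    "VA": "VA",
    "State of Virginia": "VA",
}

names = pd.Series(["Virginia", "State of Virginia", "VA", "Commonwealth of Virginia"])

# Series.map returns NaN for any value that is not a key in the dictionary.
abbreviations = names.map(STATE_ABBREVIATIONS)
print(abbreviations.tolist())  # ['VA', 'VA', 'VA', nan]
```

Every new spelling means editing the source file, which is the maintenance problem the rest of this post tries to avoid.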

The Solution

Store the states and their various names in a CSV, read the CSV in, transform it into a dictionary, and use the dictionary to rename the entries. Fortunately, I wrote a Python script that makes this quick and easy.
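The idea in miniature looks something like the sketch below. It uses plain pandas rather than The Renamer itself, and the CSV contents and column names (abbreviation, name) are assumptions chosen for the example, not the layout the script expects.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the two column names are illustrative.
csv_text = """abbreviation,name
VA,Virginia
NY,New York State
TX,Texas
"""

lookup_table = pd.read_csv(io.StringIO(csv_text))

# Turn the two columns into a {name: abbreviation} dictionary.
name_to_abbreviation = dict(zip(lookup_table["name"], lookup_table["abbreviation"]))

# Use the dictionary to rename the entries in the data.
data = pd.DataFrame({"location": ["Virginia", "New York State", "Texas"]})
data["location"] = data["location"].map(name_to_abbreviation)
print(data["location"].tolist())  # ['VA', 'NY', 'TX']
```

The names now live in a file that anyone can edit without touching the code.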

  1. Download “The Renamer” folder and unzip it.
  2. Upload it to the top level of your own Google Drive. All the data is included in the folder, so every notebook cell will work without path changes.

Example Problem

You have some vaccination data for the United States for 2021. You want to make a scatter plot with dates on the x-axis and percent vaccinated on the y-axis, and you would like to plot all 50 states. To keep the legend from getting too large, you want to use the two-letter abbreviations for the states. However, the data uses the full state names, and it calls New York “New York State”.

Example Data Source

The data comes from Our World In Data; it is their us_state_vaccinations dataset. In case they remove the data, I downloaded a copy and added it to the Google Drive folder on September 7th, 2021.

Load and Explore the data

Load the vaccination data.
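A minimal sketch of the loading step follows. So that it runs anywhere, it reads a tiny in-memory stand-in instead of the real file. The column names are assumed to match the Our World In Data export and should be checked against the actual CSV, and the numbers are made up.

```python
import io
import pandas as pd

# Tiny stand-in for the downloaded file. Column names are an assumption
# about the real export; the values are invented for illustration.
sample_csv = """date,location,people_vaccinated_per_hundred
2021-01-12,Virginia,2.1
2021-01-12,New York State,3.0
2021-01-12,Bureau of Prisons,
"""

# With the real data, pass the file path instead of the StringIO object.
vaccinations = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# A quick look at the column types and the distinct location names.
print(vaccinations.dtypes)
print(vaccinations["location"].unique())
```

Listing the unique locations is a quick way to spot names such as New York State that will need renaming.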

Make a state_ids.csv File

Before we can use The Renamer, we need to build an ids file, here state_ids.csv. The rules for making it are:

  1. In row two, enter destination, single, and many.
  2. In column one, starting in row three, enter all the state abbreviations.
  3. In column two, starting in row three, enter all the full names of the states, with New York as New York State.
  4. In column three, starting in row three, enter all forms of the state name.

Build Dictionaries for Transform

The renamer function uses a dictionary to make the changes, so you first have to build it. Note, however, that the build_ids function returns one dictionary for each ‘one’ or ‘many’ column. In our case, there are two such columns, so it returns two dictionaries. Our example only needs the first dictionary; the second is there to show that the other functionality works.
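To illustrate the one-dictionary-per-column idea, here is a rough sketch in plain pandas. It is not the script's build_ids function: build_lookups is a hypothetical helper, the header sits on the first row for simplicity, and each cell is assumed to hold a single name.

```python
import io
import pandas as pd

# Simplified stand-in for the ids file (header on the first row here).
ids_csv = """destination,single,many
VA,Virginia,State of Virginia
NY,New York State,New York
"""

ids = pd.read_csv(io.StringIO(ids_csv))


def build_lookups(table, destination="destination"):
    """Hypothetical helper: one {name: destination} dict per name column."""
    lookups = {}
    for column in table.columns:
        if column == destination:
            continue
        # Skip rows where this column has no name to map from.
        pairs = table[[column, destination]].dropna()
        lookups[column] = dict(zip(pairs[column], pairs[destination]))
    return lookups


lookups = build_lookups(ids)
print(lookups["single"])  # {'Virginia': 'VA', 'New York State': 'NY'}
print(lookups["many"])    # {'State of Virginia': 'VA', 'New York': 'NY'}
```

Two name columns give two dictionaries, and either one can be handed to a rename step on its own.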

Leave Some Data Alone or Not?

Our data set has some non-state data like the Bureau of Prisons. We might like to keep it or remove all the non-state data. The renamer has functionality built in for both cases.

Option 1 Leave Non-state Data Alone
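Leaving unmatched rows alone can be sketched in plain pandas as below. This shows the idea only, not The Renamer's own interface, and the dictionary and sample values are assumptions.

```python
import pandas as pd

# Illustrative lookup; in practice it would come from the ids file.
full_to_abbreviation = {"Virginia": "VA", "New York State": "NY"}
locations = pd.Series(["Virginia", "New York State", "Bureau of Prisons"])

# map gives NaN for names missing from the dictionary;
# fillna then puts the original value back in those positions.
renamed = locations.map(full_to_abbreviation).fillna(locations)
print(renamed.tolist())  # ['VA', 'NY', 'Bureau of Prisons']
```

States get their abbreviations, and everything else passes through unchanged.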

Option 2 Replace Non-state Data and Drop It
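Dropping the unmatched rows can be sketched the same way. Again, this is plain pandas rather than The Renamer's interface, with made-up values and an assumed column name for the percentages.

```python
import pandas as pd

# Illustrative lookup and data; the numbers are invented.
full_to_abbreviation = {"Virginia": "VA", "New York State": "NY"}
data = pd.DataFrame({
    "location": ["Virginia", "Bureau of Prisons", "New York State"],
    "people_vaccinated_per_hundred": [2.1, 1.5, 3.0],
})

# Names missing from the dictionary are replaced with NaN...
data["location"] = data["location"].map(full_to_abbreviation)

# ...so dropping rows with a missing location removes the non-state data.
states_only = data.dropna(subset=["location"]).reset_index(drop=True)
print(states_only["location"].tolist())  # ['VA', 'NY']
```

Only rows whose location matched a state survive.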

Plot the Data

The plots are very large since there are a lot of categories. They are best viewed in the Colab Notebook.
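For reference, a scatter plot of this kind can be sketched with matplotlib as below. The column names and numbers are assumptions made for the example, and only two states are shown to keep it short.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample that has already been renamed to abbreviations.
data = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-12", "2021-02-12", "2021-01-12", "2021-02-12"]),
    "location": ["VA", "VA", "NY", "NY"],
    "people_vaccinated_per_hundred": [2.1, 9.8, 3.0, 10.5],
})

fig, ax = plt.subplots(figsize=(8, 5))

# One scatter series per state, so each state gets its own legend entry.
for abbreviation, group in data.groupby("location"):
    ax.scatter(
        group["date"],
        group["people_vaccinated_per_hundred"],
        label=abbreviation,
        s=10,
    )

ax.set_xlabel("Date")
ax.set_ylabel("Percent vaccinated")
# Several legend columns help when there are many states.
ax.legend(title="State", ncol=2, fontsize="small")
fig.autofmt_xdate()
```

In a notebook, the figure renders at the end of the cell. With all 50 states, the short abbreviations are what keep the legend manageable.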

Why use it?

Don’t reinvent the wheel. It is also fast; take a look at the tests. I often work with CSV files of 30,000–100,000 rows by 10–80 columns, and renaming usually takes only a few seconds.

Can I install with pip?

No. Including the tests, it is only a couple hundred lines, so you are better off copying the .py file to your local machine. I don’t want to clutter PyPI with another package nobody uses. If you want it as a pip package, open an issue on the repo or vote for an existing one. If it gets enough attention, I will package it.

Mathematician, enjoys his knowledge distilled. Find my insight deep, my jokes laughable, my resources useful, connect with me on twitter @Rowlando_13