The same thing goes by many names…
Important things often go by several names, or even by codes. Think of the states in the US: Virginia might appear as Virginia, VA, or the State of Virginia. You can’t write a regex to solve this, because the two-letter abbreviations are irregular. You could put a big dictionary of all the states in your code and use pandas.Series.map, but that dictionary is really data. In an ideal world, it would be kept separate from your code.
A cleaner approach: store the states and their various names in a csv, read the csv in, transform it into a dictionary, and use the dictionary to rename the entries. Fortunately, I wrote a Python script that makes this quick and easy.
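For comparison, here is a minimal plain-pandas sketch of that idea, without The Renamer. The lookup table is inlined as a string so the snippet runs on its own; in practice it would live in a separate csv file. The column names `name` and `abbreviation` are my own choice for illustration.

```python
import io

import pandas as pd

# Hypothetical lookup table; normally this would be a csv file on disk.
LOOKUP_CSV = """name,abbreviation
Virginia,VA
State of Virginia,VA
VA,VA
New York,NY
New York State,NY
"""

# Read the csv and turn its two columns into a {name: abbreviation} dictionary.
lookup = pd.read_csv(io.StringIO(LOOKUP_CSV))
name_to_abbr = dict(zip(lookup["name"], lookup["abbreviation"]))

# Rename every entry of a Series with pandas.Series.map.
states = pd.Series(["Virginia", "State of Virginia", "New York State", "VA"])
renamed = states.map(name_to_abbr)
print(renamed.tolist())  # ['VA', 'VA', 'NY', 'VA']
```

The dictionary is now built from data rather than typed into the code, so adding a new spelling only means adding a row to the csv.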
The Documentation at Read the Docs
To follow the tutorial with a Google Colab notebook:
- Go to this Google Drive.
- Download “The Renamer” folder and unzip it.
- Upload it to the top level of your own Google Drive. All the data is included in the folder, so all notebook cells will work without path alterations.
This is the direct link to the Colab Notebook, but many of the cells will not run, because you don’t have the data.
The code snippets included in this Medium article are just highlights from the Colab Notebook.
You have some vaccination data for the United States for 2021. You want to make a scatter plot with dates on the x-axis and percent vaccinated on the y-axis, and you would like to plot all 50 states. To keep the legend from getting too large, you want to use the two-letter abbreviations for the states. However, the data uses the full names of the states, and calls New York “New York State”.
Example Data Source
The data comes from Our World In Data. It is the us_state_vaccinations data. In case they remove their data, I downloaded it on September 7th, 2021 and included it in the Google Drive.
Load and Explore the data
Load the vaccination data.
The data is a day-by-day count of various vaccination metrics, for states and for some other entities. We only want the date, location, and people_fully_vaccinated_per_hundred columns. Additionally, for the date, let’s just take the 12th of each month, since the data starts on the 12th of January.
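A minimal sketch of that filtering step in pandas is shown below. The rows are made-up sample values, inlined so the snippet runs without the real file; only the three column names come from the data set, and `other_metric` is a placeholder for the columns we drop.

```python
import io

import pandas as pd

# Made-up sample rows for illustration; the real file has many more rows and columns.
SAMPLE_CSV = """date,location,other_metric,people_fully_vaccinated_per_hundred
2021-01-12,Virginia,100,0.5
2021-01-13,Virginia,150,0.6
2021-02-12,Virginia,900,4.1
2021-01-12,New York State,200,0.7
2021-02-11,New York State,800,3.9
2021-02-12,New York State,950,4.3
"""

# Parse the date column so the day of the month is easy to test.
vax = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["date"])

# Keep only the three columns we need, and only the 12th of each month.
columns = ["date", "location", "people_fully_vaccinated_per_hundred"]
monthly = vax.loc[vax["date"].dt.day == 12, columns].reset_index(drop=True)
print(monthly)
```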
Make a state_ids.csv File
Before we can use The Renamer, we need to build an ids.csv file. The rules for making it are:
For our example, the steps are:
- In row one, enter the names of the columns.
- In row two, enter destination, single, and many.
- In column one, starting in row three, enter all the state abbreviations.
- In column two, starting in row three, enter all the full names of the states, with New York as New York State.
- In column three, starting in row three, add all forms of the state name.
The tedious part of typing out all the states is done for you. The screenshot only captures the top few rows. It also displays the indices on the side (not part of the csv), since I read the file in as a pandas.DataFrame.
Build Dictionaries for Transform
The renamer function uses a dictionary to make the changes, so you first have to build it. The build_ids function returns a dictionary for each ‘one’ or ‘many’ column. In our case, there are two such columns, so it returns two dictionaries. Our example only needs the first dictionary; the second shows that the other functionality works.
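To give a feel for what such dictionaries might contain, here is a plain-pandas sketch. It is not the build_ids function itself. The file layout follows the steps listed earlier, but the column names (`abbreviation`, `full_name`, `all_names`) and the `|` separator inside the third column are assumptions made for this illustration, not The Renamer’s documented format.

```python
import io

import pandas as pd

# Hypothetical two-state ids file: row one holds column names, row two holds roles.
# The "|" separator in the third column is an assumption for this sketch.
IDS_CSV = """abbreviation,full_name,all_names
destination,single,many
VA,Virginia,Virginia|VA|State of Virginia
NY,New York State,New York|New York State|NY
"""

raw = pd.read_csv(io.StringIO(IDS_CSV))
roles = raw.iloc[0]  # row two of the file: the role of each column
ids = raw.iloc[1:]   # row three onward: the actual names

# Find each column by its role instead of hard-coding its name.
dest_col = roles[roles == "destination"].index[0]
single_col = roles[roles == "single"].index[0]
many_col = roles[roles == "many"].index[0]

# One dictionary per source column, each mapping a name to its destination.
single_dict = dict(zip(ids[single_col], ids[dest_col]))

many_dict = {}
for dest, names in zip(ids[dest_col], ids[many_col]):
    for name in names.split("|"):
        many_dict[name] = dest

print(single_dict)  # {'Virginia': 'VA', 'New York State': 'NY'}
print(many_dict["State of Virginia"])  # VA
```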
Leave Some Data Alone or not?
Our data set has some non-state data like the Bureau of Prisons. We might like to keep it or remove all the non-state data. The renamer has functionality built in for both cases.
Option 1 Leave Non-state Data Alone
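As a rough plain-pandas equivalent of this option (not The Renamer’s own call), names found in the dictionary are replaced and everything else passes through unchanged. The two-entry dictionary is an illustrative subset.

```python
import pandas as pd

# Illustrative subset of a name-to-abbreviation dictionary.
abbreviations = {"Virginia": "VA", "New York State": "NY"}
locations = pd.Series(["Virginia", "Bureau of Prisons", "New York State"])

# Names missing from the dictionary map to NaN; fillna restores the original value.
kept = locations.map(abbreviations).fillna(locations)
print(kept.tolist())  # ['VA', 'Bureau of Prisons', 'NY']
```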
Option 2 Replace Non-state Data and Drop It
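A rough plain-pandas equivalent of this option (again, not The Renamer’s own call) lets unmatched names become missing values and then drops those rows. The numbers are made-up sample values.

```python
import pandas as pd

# Illustrative subset of a name-to-abbreviation dictionary.
abbreviations = {"Virginia": "VA", "New York State": "NY"}

# Made-up sample rows, including one non-state entry.
df = pd.DataFrame(
    {
        "location": ["Virginia", "Bureau of Prisons", "New York State"],
        "people_fully_vaccinated_per_hundred": [4.1, 3.0, 4.3],
    }
)

# Names missing from the dictionary become NaN, which marks the rows to drop.
df["location"] = df["location"].map(abbreviations)
states_only = df.dropna(subset=["location"]).reset_index(drop=True)
print(states_only["location"].tolist())  # ['VA', 'NY']
```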
Plot the Data
The plots are very large since there are a lot of categories. They are best viewed in the Colab Notebook.
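For readers who are not following along in the notebook, a scatter plot of this kind can be sketched with matplotlib as below. The two states and their values are made-up sample data, and the styling choices are mine, not necessarily those used in the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values for two states, already renamed to abbreviations.
data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-12", "2021-02-12", "2021-01-12", "2021-02-12"]),
        "location": ["VA", "VA", "NY", "NY"],
        "people_fully_vaccinated_per_hundred": [0.5, 4.1, 0.7, 4.3],
    }
)

fig, ax = plt.subplots(figsize=(8, 5))

# One scatter series per state, so each state gets its own legend entry.
for state, group in data.groupby("location"):
    ax.scatter(group["date"], group["people_fully_vaccinated_per_hundred"], label=state)

ax.set_xlabel("Date")
ax.set_ylabel("People fully vaccinated per hundred")
ax.legend(title="State", ncol=2, fontsize="small")  # short labels keep the legend compact
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

With all 50 states, raising `ncol` in the legend call is one way to keep the legend from running off the figure.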
Why use it?
Don’t reinvent the wheel. It is fast. Take a look at the tests. I often work with csv files that are 30,000–100,000 rows x 10–80 columns. It usually only takes a few seconds.
Screen shot of tests on my local machine for 13,000 rows read in once and processed 3 times.
Can I install with pip?
No. Including the tests, it is only a couple hundred lines. You are better off copying the .py file to your local machine. I don’t want to clutter PyPI with another package nobody uses. If you want it as a pip package, open an issue on the repo, or vote for the issue if one already exists. If it gets enough attention, I will package it.
Thanks for going through the tutorial. I hope The Renamer is useful to you.