Renaming Headers

Ian Currie

Data Cleaning With pandas and NumPy Ian Currie 07:01

00:00 In this lesson, you’ll be renaming some of the headers of the Olympic dataset. As they stand currently, the names aren’t very good. One of the first things you’ll probably want to do, even before slicing the data and having a look at how it’s structured, is to tidy up the column labels, because right now they’re being automatically assigned by pandas.

00:27 As you can see here, the labels given are just numbers, likewise with the rows. So you want to first make this the header, and then you’ll want to rename these because at the moment they are a bit messy.

00:44 The way you can approach this is by using the .rename() method. There are a couple of ways to use this. You can use a dictionary as a map for renaming. That is, on one hand, you have your keys, which are the existing column names.

00:57 And then as the value for the entries of the dictionary, you’ll have the names that you want them to have. This will also be a good opportunity to learn about ignoring errors.
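The dictionary mapping described above can be sketched as follows. This is a minimal example on a made-up two-column DataFrame, not the real Olympic data: the sample values and the variable names df, new_names, and renamed are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature frame; the labels imitate the messy Olympic headers.
df = pd.DataFrame({"Unnamed: 0": ["Afghanistan", "Algeria"], "? Summer": [13, 12]})

# Keys are the existing column labels, values are the names you want instead.
new_names = {"Unnamed: 0": "country", "? Summer": "summer_olympics"}

# .rename() returns a new DataFrame and leaves df as it was.
renamed = df.rename(columns=new_names)
print(list(renamed.columns))
```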

01:07 For example, if your column names change for whatever reason, you can choose whether or not to suppress the resulting errors. Generally, you’ll want the errors to be raised so that you know something has changed.
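To see the difference between suppressing and raising errors, here is a small sketch that uses the errors argument of .rename(). The DataFrame and the deliberately misspelled key "? Sumer" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": ["Afghanistan"], "? Summer": [13]})

# "? Sumer" is a deliberate typo, so this key matches no column.
mapping = {"Unnamed: 0": "country", "? Sumer": "summer_olympics"}

# Default behavior (errors="ignore"): the unmatched key is silently skipped.
quiet = df.rename(columns=mapping)
print(list(quiet.columns))

# errors="raise": the same mapping now fails with a KeyError.
caught = None
try:
    df.rename(columns=mapping, errors="raise")
except KeyError as exc:
    caught = exc
    print("KeyError:", exc)
```

With the default, the typo goes unnoticed and "? Summer" simply keeps its old label, which is why raising is usually the safer choice in a cleaning script.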

01:19 You’re also going to take this opportunity to understand what the inplace argument means and why you generally shouldn’t use it. It’s present in a lot of methods in pandas, and essentially it means you change the DataFrame in place instead of creating a copy of it, which is what a lot of methods do by default. They take the original DataFrame, transform it with something like the .rename() method, and then return a new DataFrame.

01:45 The inplace argument means that the method changes the actual DataFrame and doesn’t return a new one. It just returns None.
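A short sketch of the two behaviors, assuming a made-up one-column DataFrame. The names before, renamed, and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": ["Afghanistan"]})

# Default: df is untouched and a new DataFrame comes back.
renamed = df.rename(columns={"Unnamed: 0": "country"})
before = list(df.columns)
print(before, list(renamed.columns))

# inplace=True: df itself is modified and the call returns None.
result = df.rename(columns={"Unnamed: 0": "country"}, inplace=True)
print(result, list(df.columns))
```

Because the in-place call returns None, writing df = df.rename(..., inplace=True) would replace your DataFrame with None, which is one reason reassigning the returned DataFrame tends to be easier to reason about.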

01:55 Back in VS Code, you’ve got your code that will read the Olympics CSV and return a DataFrame with that data. A handy thing about looking at your code through this interactive window, and using Jupyter and IPython in general, is that you can take your DataFrame, pick a method, in this case .rename(), and instead of calling it, append a question mark (?). Now you can press Shift + Enter or Control + Enter, or this play button (▷).

02:27 It will output the documentation of the .rename() method. As you can see, first it’s telling you that the output of this exceeds the size limit, and you can output the full data in a text editor. So if you click this,

02:42 you’ll get a messy output because a lot of these are ANSI control codes, which tell the console to color things and highlight them. But in a plain text file, this obviously does not work, but here you can see it says it takes a mapper, which is the dictionary, and it takes a bunch of different arguments.

03:04 And here it gives you a bit about the function: dict values must be unique. Labels not contained in a dict / Series will be left as-is. Extra labels listed don't throw an error. Things like this.
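The ? suffix only works in IPython and Jupyter. If you’re in a plain Python session, a rough equivalent is the built-in help() function or the standard library’s inspect module. The sketch below is one possible approach, not something shown in the lesson.

```python
import inspect

import pandas as pd

# List the parameter names that DataFrame.rename() accepts.
params = list(inspect.signature(pd.DataFrame.rename).parameters)
print(params)

# help(pd.DataFrame.rename) would print the full docstring for the method.
```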

03:15 The question mark is just a very useful shortcut to the documentation if you want to consult something quickly and not have to go to the website. The read_csv() function has an argument you can add in called header.

03:29 You can set that to 1, which means the row at index 1. Because counting starts at 0, that’s the second row of the file. Run this again with Shift + Enter, and you can see that this row has been designated as the header.
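Here is a minimal sketch of what header=1 does, using an in-memory string as a stand-in for the CSV file. The three-column sample is made up: its first line holds only numeric labels and its second line holds the real header, which is the layout the lesson describes.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file: line 0 is numeric labels, line 1 is the header.
raw = io.StringIO(
    "0,1,2\n"
    ",? Summer,01 !\n"
    "Afghanistan,13,0\n"
    "Algeria,12,5\n"
)

# header=1 uses the line at index 1 as column labels and skips the line before it.
olympics = pd.read_csv(raw, header=1)
print(list(olympics.columns))
```

The empty first field in the header line is why pandas labels that column Unnamed: 0.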

03:44 With the header being designated, you’ll want to look at renaming these headers into something more readable. For this, you can use the aptly named .rename() method.

03:55 And the argument you’ll use with .rename() is the columns keyword argument. Now to this, you can pass in a dictionary which represents a one-to-one mapping of the existing headers and the ones that you want.

04:09 So the first one is called Unnamed: 0. Copy this and have it as your first key. You’ll need to wrap it in quotation marks (""). Then have a look at what the data is here. These look like countries, so how about calling this column country?

04:31 Save that and run it. Then, down in the interactive window, press Up (↑) to get the last command and Control + Enter to run it. As you can see, the first column has now been renamed.

04:45 So how about you take this one, ? Winter, and change it to winter_olympics? Grab the whole label, "? Winter", and call it "winter_olympics".

05:03 Save and press Control + Enter to run. Down here, there’s a SyntaxError because there’s no comma separating the key-value pairs. After adding the comma, press Shift + Enter again to run. Down here, this has run. Press Up (↑) and Control + Enter to run this again, and now you can see that the column has been renamed to winter_olympics.

05:26 Now comes a bit of a tedious job of just renaming all the headers.

05:43 That should be it. Control + Enter to run.

05:48 Down here seems to be a SyntaxError. Nope, we’re fine.

05:55 Okay. And as you can see now, all the columns have been renamed to something a bit more user-friendly, although there seems to be one that has been missed, the last one.
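Rather than spotting a missed column by eye, you could compare the DataFrame’s labels against the keys of your mapping. This is a sketch on a made-up, empty DataFrame. The label "Combined total" is a hypothetical leftover column, not one confirmed by the lesson.

```python
import pandas as pd

# Hypothetical labels; "Combined total" plays the role of the forgotten column.
df = pd.DataFrame(columns=["Unnamed: 0", "? Summer", "Combined total"])
mapping = {"Unnamed: 0": "country", "? Summer": "summer_olympics"}

# Any column without an entry in the mapping would keep its original label.
missed = [col for col in df.columns if col not in mapping]
print(missed)
```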

06:11 There are many conventions, but one of the conventions is just to use typical Python snake case here for all the column names, which just makes it a bit more intuitive to get the names and reference things. This script is looking good. It cleans up all the headers, puts the header in the right place, and renames them all into something a bit more usable. Running this,

06:36 and then looking at the data,

06:40 it would seem to be all good. So this dataset could now be considered clean enough to start some data analysis on it. In the next lesson, you’re going to be reviewing .loc[], sometimes called LOC or the location indexer, for slicing and dicing and exploring your data.
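As a side note on the snake case convention mentioned earlier, .rename() also accepts a function instead of a dictionary. The helper to_snake_case() below is a hypothetical example, and so are the column labels. It’s only a sketch for headers that already contain readable words.

```python
import re

import pandas as pd


def to_snake_case(label: str) -> str:
    """Lowercase a label and join its alphanumeric runs with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "_".join(word.lower() for word in words)


# Hypothetical labels that already contain meaningful words.
df = pd.DataFrame(columns=["Combined total", "Total Games", "Gold Medals (Summer)"])
df = df.rename(columns=to_snake_case)
print(list(df.columns))
```

A function like this can’t guess a meaningful name for labels such as 01 !, so for the Olympic dataset the explicit dictionary is still needed.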

Mauricio Mejía Castro on Nov. 20, 2022

Here is the renaming dictionary if you don’t want to type it out. :)

columns={
    'Unnamed: 0': 'country',
    '? Summer': 'summer_olympics',
    '01 !': 'summer_golds',
    '02 !': 'summer_silvers',
    '03 !': 'summer_bronzes',
    'Total': 'summer_total',
    '? Winter': 'winter_olympics',
    '01 !.1': 'winter_golds',
    '02 !.1': 'winter_silvers',
    '03 !.1': 'winter_bronzes',
    'Total.1': 'winter_total',
    '? Games': 'total_games',
    '01 !.2': 'total_golds',
    '02 !.2': 'total_silvers',
    '03 !.2': 'total_bronzes',
}

dotnet on April 23, 2023

Another very good course. To make the headers more readable and the code overall "more pythonic", I would suggest the following solution. What is your opinion?

thanks a lot.

import pandas as pd

def read_file(file: str = 'data_sets/olympics.csv') -> pd.DataFrame:
    # The row at index 1 of the CSV holds the real column labels
    olympics = pd.read_csv(file, header=1)

    columns_old = olympics.columns
    columns_new = ['country', 'summer', 's_gold', 's_silver',
                   's_bronze', 's_total', 'winter', 'w_gold', 'w_silver',
                   'w_bronze', 'w_total', 'total_games', 'total_gold',
                   'total_silver', 'total_bronze', 'combined_total']

    # Pair each old label with its new name by position
    ren_dict = dict(zip(columns_old, columns_new))
    return olympics.rename(columns=ren_dict)
