Creating Columns With Arithmetic Operations and NumPy
00:00
You can apply basic arithmetic operations such as addition, subtraction, multiplication, and division to pandas Series and DataFrame objects in pretty much the same way as you would with NumPy arrays. So, for example, with a NumPy array, you could take a column and multiply the whole column by 2, and you could do the same thing with a DataFrame.
00:22
Let’s take the 'js-score' column, and if we simply multiply it by 2, we get a Series object where all of the entries in the js-score column are multiplied by 2.
00:35
We can bring that 2 * to the front, and that’ll give us the same thing. We can also do division. So, say, divide it by 4.
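As a minimal sketch of the scalar operations described so far, the snippet below uses a small stand-in DataFrame. The candidate labels and score values are made up for illustration and are not the data shown in the video; only the column labels come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in data; only the column labels match the lesson.
df = pd.DataFrame(
    {
        "py-score": [80.0, 60.0, 90.0],
        "django-score": [70.0, 50.0, 100.0],
        "js-score": [60.0, 40.0, 80.0],
    },
    index=["Ann", "Bo", "Cy"],
)

doubled = df["js-score"] * 2       # every entry multiplied by 2
also_doubled = 2 * df["js-score"]  # scalar on the left gives the same Series
quartered = df["js-score"] / 4     # every entry divided by 4

print(doubled)
print(quartered)
```

Each result is a new Series; the original column in df is left unchanged.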
00:48
And we can also add a couple of columns. So, for example, let’s add the 'js-score' column and the 'py-score' column.
00:56
So here, we’re taking two columns, and these are going to be two pandas Series objects. They have the same index, so pandas knows how to match up the index labels and add the corresponding elements.
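To illustrate the label matching mentioned here, this sketch adds two Series whose index labels are the same but listed in a different order. That goes slightly beyond the video, where both columns come from one DataFrame and already share an index. The labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical scores, not the lesson's data.
js = pd.Series([60.0, 40.0, 80.0], index=["Ann", "Bo", "Cy"], name="js-score")
# Same labels, deliberately listed in a different order.
py = pd.Series([90.0, 80.0, 60.0], index=["Cy", "Ann", "Bo"], name="py-score")

# Addition pairs values by index label, not by position.
combined = js + py

print(combined["Ann"])  # js value for Ann plus py value for Ann
print(combined["Cy"])
```

Looking up results by label rather than by position avoids relying on the order of the resulting index.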
01:10 Now, we can use these basic arithmetic operations on columns to create new columns, say, by taking some sort of linear combination of the columns.
01:22
So, for example, let’s suppose that I take the 'js-score' and the 'py-score', and I also want to add up the 'django-score'. Maybe here, the idea is that we want to find some sort of weighted average, and maybe the py-score is going to be worth, say, 40% of the average.
01:41
And then maybe the other two scores are each going to be worth 0.3, so we can bring in those numbers and these multiplication operations. This is going to give us a new Series. And maybe what we want to do is save the Series as a column in our DataFrame, and that would give us a total score for our job candidates.
02:02
Let’s create a new column, call it 'total', and we’ll create it using this arithmetic operation. So let’s run that, and then let’s take a look at our DataFrame, and so now we’ve got this sort of total score based on all of the columns in the DataFrame relating to the score for each of the candidates.
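The weighted total described here could be written roughly as follows. The weights 0.4, 0.3, and 0.3 and the column labels are from the lesson; the candidate labels and score values are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in data; only the column labels match the lesson.
df = pd.DataFrame(
    {
        "py-score": [80.0, 60.0, 90.0],
        "django-score": [70.0, 50.0, 100.0],
        "js-score": [60.0, 40.0, 80.0],
    },
    index=["Ann", "Bo", "Cy"],
)

# Linear combination of three columns, saved as a new 'total' column.
df["total"] = (
    0.4 * df["py-score"] + 0.3 * df["django-score"] + 0.3 * df["js-score"]
)

print(df)
```

Assigning to a label that does not exist yet adds the column to the DataFrame.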
02:23
Now, in addition to using just the basic arithmetic operations, you can also apply most NumPy and SciPy routines to pandas Series and DataFrame objects. So, let me show you another way that we could have done this.
02:37
I’m going to create a pandas Series object, and I’m going to call it wgts (weights). This is going to be basically keeping track of the weights of the individual tests, and that’ll give us another way to compute this total column.
02:52
So again, we have the data, the data is going to be, say, 0.4, 0.3, and 0.3, and the index is going to be… Well, we want the 0.4 to be for the 'py-score'.
03:05
And then for the other scores, we want those to be the ones for 0.3.
03:14 All right, let’s take a look at that.
03:17
Then, the index of this Series object is exactly the same as the column labels that we want to work with. So what we can easily do is pull out of the DataFrame the columns that we want. And these columns are the ones from the index of wgts, right?
03:39
The index of wgts is going to be 'py-score', 'django-score', and 'js-score'. So just for you to see that, we get those score columns.
03:48
And then if we just multiply this by the wgts Series, pandas knows that what we want to do here is take the py-score value in the wgts Series object and multiply the py-score column by it, and similarly for the django-score and the js-score. And so this creates a DataFrame.
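A sketch of the weights Series and the column-wise multiplication might look like this. The name wgts, its values, and its index labels follow the lesson; the DataFrame contents, including the extra 'age' column, are hypothetical and only there to show that df[wgts.index] picks out just the score columns.

```python
import pandas as pd

# Hypothetical stand-in data; 'age' is an extra non-score column.
df = pd.DataFrame(
    {
        "age": [41, 28, 33],
        "py-score": [80.0, 60.0, 90.0],
        "django-score": [70.0, 50.0, 100.0],
        "js-score": [60.0, 40.0, 80.0],
    },
    index=["Ann", "Bo", "Cy"],
)

# One weight per score column, labeled with that column's name.
wgts = pd.Series(
    data=[0.4, 0.3, 0.3],
    index=["py-score", "django-score", "js-score"],
)

# Select only the columns named in the index of wgts.
scores = df[wgts.index]

# Each column is multiplied by the weight carrying the same label.
weighted = scores * wgts

print(weighted)
```

The multiplication matches the index of the Series against the column labels of the DataFrame, so every row is scaled by the same three weights.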
04:09
Then what we want to do is use the sum() function in NumPy. Maybe we should import numpy first, so let’s go import numpy as np. In here, what we want to do is we want to take the np.sum() function.
04:27
And by default, this is going to sum along each individual column. So what we’re going to get here are three values for the py-score, the django-score, and the js-score.
04:38 So in other words, we fix a column, and this is going to add up along the rows once we fix a column. So let me just show you that. We got that. Let me move this over here so that we’re not getting this exact same line.
04:53
Let’s just run that here. Now, if you instead want to sum along the rows (in other words, you fix a row and you sum the entries of that row), you’ll want to pass in a value of 1 for the axis parameter.
05:07
So you’re basically saying “Sum along the columns,” right? We want to fix a row, sum it along the columns. That gives us, then, the total score in another way. And if we compare that over here, we’ve got 50.6 for Xavier, and we’ve got 67, and so on, and that’s exactly what we are getting over here.
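The two summing directions can be compared on a small example. The data below is hypothetical. Note that axis=0 is written out explicitly here rather than left to the default, as an assumption-free way to get per-column sums, since what np.sum() does with a DataFrame when no axis is given may depend on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in scores; only the column labels match the lesson.
scores = pd.DataFrame(
    {
        "py-score": [80.0, 60.0, 90.0],
        "django-score": [70.0, 50.0, 100.0],
        "js-score": [60.0, 40.0, 80.0],
    },
    index=["Ann", "Bo", "Cy"],
)

# Fix a column and add down its rows: one value per column.
per_column = np.sum(scores, axis=0)

# Fix a row and add across its columns: one value per row.
per_row = np.sum(scores, axis=1)

print(per_column)
print(per_row)
```

The first result is indexed by column label and the second by row label, which is why axis=1 is the one that produces a per-candidate total.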
05:30
So, this would give us another way to define, or to create, that total column in the DataFrame by combining the fact that we can multiply Series objects with DataFrame objects and use any of the NumPy basic routines on DataFrames.
05:49 That gives us the exact same thing as we had before.
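Putting the pieces together, a rough end-to-end sketch could check that the weights-based total agrees with the explicit arithmetic version. The data is hypothetical; wgts and the 'total' label follow the lesson, while the name explicit is introduced here only for the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; only the column labels match the lesson.
df = pd.DataFrame(
    {
        "py-score": [80.0, 60.0, 90.0],
        "django-score": [70.0, 50.0, 100.0],
        "js-score": [60.0, 40.0, 80.0],
    },
    index=["Ann", "Bo", "Cy"],
)

wgts = pd.Series(
    data=[0.4, 0.3, 0.3],
    index=["py-score", "django-score", "js-score"],
)

# Weight each score column, then sum across the columns of every row.
df["total"] = np.sum(df[wgts.index] * wgts, axis=1)

# The same quantity written out as explicit column arithmetic.
explicit = 0.4 * df["py-score"] + 0.3 * df["django-score"] + 0.3 * df["js-score"]

# Compare with a tolerance, since floating-point sums can differ slightly.
print(np.allclose(df["total"], explicit))
```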
05:56 All right! So, these are a few of the many things that you can do in pandas by combining basic arithmetic operations and some of the built-in NumPy routines on pandas Series and pandas DataFrames to use them to possibly create new columns in your DataFrame. All right, up next, we’ll take a look at sorting a pandas DataFrame.
Anonymous on Sept. 30, 2021
A minimal example works fine for me:
df = pd.DataFrame({'c1': [5,8,0], 'c2': [1,2,3]})
df['quotient'] = df['c1']/df['c2']
I get
>>> df
   c1  c2  quotient
0   5   1       5.0
1   8   2       4.0
2   0   3       0.0
Maybe you did something more complicated. Anyway, here’s a great explanation of the dreaded SettingWithCopyWarning (and it is just a warning, not an error): [www.dataquest.io/blog/settingwithcopywarning/]
hwoarang09 on April 28, 2022
i want to know the difference between the two
df['total'] = np.sum(df[wgts.index] * wgts, axis=1)
df['total2'] = (df*wgts).sum(axis=1)
should i use np.sum???? what is the possible error of second code?? thanks!
BadgerPaul on Aug. 29, 2021
I am trying to perform an arithmetic operation similar to that in the Creating Columns With Arithmetic Operations and NumPy lesson (at minute 2:25). In my case, I am dividing one column of my DataFrame by another column in the same DataFrame:
per_capita["Deaths per 100,000 Pop"] = per_capita["Total Deaths"] / per_capita["Population"]
While the requested calculations are made and a new column is created, I receive the following error statement:
“C:\Users\paulm\anaconda3\envs\pandas_playground\lib\site-packages\pandas\core\frame.py:3607: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: (pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) self._set_item(key, value)”
I am not experienced enough to understand what I did in error or what the error statement is requesting that I do to correct that error. I am using Pandas and Jupyter Notebook. Thoughts? Advice?