Cleaning Your Data
For more examples of what you can do with data cleanup, check out Pythonic Data Cleaning With pandas and NumPy.
00:00
You can tell a lot from the data returned by the .info() method. Take another look. For each column, the table lists the number of non-null values.
00:11
Most of the columns have 126,314 non-null values. The DataFrame has 126,314 rows, so those columns have no null or missing values. However, the notes column has only 5,424 non-null values, which means the notes column is missing a lot of values.
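You can see the same pattern on a small scale. Here's a minimal sketch using a hypothetical stand-in for the lesson's nba data, where only the notes column contains missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's nba DataFrame:
# only the notes column contains missing values.
nba = pd.DataFrame({
    "pts": [103, 94, 0, 120],
    "opp_pts": [98, 100, 2, 110],
    "game_result": ["W", "L", "L", "W"],
    "notes": [np.nan, np.nan, "forfeit", np.nan],
})

nba.info()  # the per-column non-null counts reveal the missing data
missing_notes = nba["notes"].isna().sum()  # count missing notes directly
print(missing_notes)
```

In the real dataset, the gap between the row count and the notes column's non-null count tells the same story, just at a scale of 126,314 rows.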
00:35 This is a problem because missing data forces many analysis tasks to essentially guess at the absent values. That's why removing or resolving missing data is important.
00:46
One way to manage missing data is to remove it completely. You can simply omit any row that has a column with missing values. The nba DataFrame only has missing values in the notes column, but the .dropna() method will search all columns in the DataFrame for nulls.
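A quick sketch of that row-dropping behavior, again with a small hypothetical stand-in for the nba data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: notes is mostly missing, as in the lesson's data.
nba = pd.DataFrame({
    "pts": [103, 94, 0],
    "notes": [np.nan, np.nan, "forfeit"],
})

# .dropna() returns a new DataFrame that keeps only rows with no nulls
clean_rows = nba.dropna()
print(len(clean_rows))  # only the row with a notes value survives
```

Note that .dropna() doesn't modify nba itself; it returns a new DataFrame, which is why the lesson assigns the result to a separate name.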
01:08
The cleaned data has 5,424 rows, so more than 90% of the rows were dropped. For the nba DataFrame, the notes data is not that relevant, at least for this course, so it makes more sense to drop the column itself. Remember, the previous lesson discussed the axis keyword argument. The .dropna() method accepts the same argument and values. By default, it is 0, which means "drop rows with missing values." However, if you explicitly set it to 1, .dropna() will search for columns with missing values and drop those instead.
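Here's the axis=1 variant on the same hypothetical stand-in data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: notes is the only column with missing values.
nba = pd.DataFrame({
    "pts": [103, 94, 0],
    "opp_pts": [98, 100, 2],
    "notes": [np.nan, np.nan, "forfeit"],
})

# axis=1 tells .dropna() to search columns (not rows) for nulls
clean_nba = nba.dropna(axis=1)
print(list(clean_nba.columns))  # notes is gone; the other columns remain
```

You can also write axis="columns" instead of axis=1, which reads a little more clearly.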
01:48
And now you can see that clean_nba only has 23 columns.
01:54
Also, the notes column has been removed. If the notes column did have relevant information, you could keep it and replace just the missing values with a default value.
02:06
The .fillna() method on a Series takes the value to insert in place of the missing values.
02:12
Also, you used the inplace argument to modify the calling DataFrame instead of returning a modified copy. Look at .info() again.
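The lesson calls .fillna() with inplace=True on the notes Series. A minimal sketch on hypothetical stand-in data, using the equivalent assignment form (which avoids the chained-assignment warnings that inplace=True on a selected column can trigger in newer pandas versions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: most notes values are missing.
nba = pd.DataFrame({"notes": [np.nan, "forfeit", np.nan]})

# Replace the missing notes with a default value.
# The lesson's form is nba["notes"].fillna("no notes at all", inplace=True);
# assigning the result back achieves the same thing.
nba["notes"] = nba["notes"].fillna("no notes at all")

print(nba["notes"].describe())  # top value is now "no notes at all"
```

After the replacement, the notes column has no nulls, so .info() reports a full non-null count for it.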
02:24
There are now 126,314 non-null values in the notes column. And if you .describe() the 'notes' column, the value occurring the most is 'no notes at all'.
02:36
This makes sense, as more than 90% of the values were missing before being replaced. Run .describe() on nba again, and look at the minimum value of the pts column.
02:49
Apparently, there was at least one game in which a team scored no points. Did they even try? Let's take a deeper look. If you look at the notes column, it says that this game was forfeited. Since there's only one game with 0 points, it might not have much of an impact, but you can now proceed to either remove or handle this row.
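A sketch of that investigation on hypothetical stand-in data: select the zero-point rows, inspect their notes, and then drop them if you decide they're anomalies.

```python
import pandas as pd

# Hypothetical stand-in: one game has 0 points and a forfeit note.
nba = pd.DataFrame({
    "pts": [103, 0, 94],
    "notes": ["no notes at all", "forfeit", "no notes at all"],
})

# Inspect the suspicious rows before deciding what to do with them
zero_pt_games = nba[nba["pts"] == 0]
print(zero_pt_games["notes"])  # reveals the game was forfeited

# One option: remove the anomalous row entirely
nba_no_forfeits = nba[nba["pts"] != 0]
```

Inspecting before deleting matters here: the notes column is what tells you the zero isn't a data-entry error but a real (if unusual) game outcome.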
03:10
Several times, you've seen the game_result column that records a win or loss for that game. But does every win match up with the rest of the data in the row?
03:21 If a game result is a win, then the points scored for that row should be more than the opposition points scored. And for a loss, the opposition points should be greater.
03:31 You can test these conditions using a query.
03:35
The first condition checks that the number of points scored by a team is greater than the number of opposition points. The second condition checks that the game was not a win. When joined with the AND operator (&), the query should yield no results. And it does, but there's an easier way to check: simply get the .empty attribute.
03:57
A query returns a DataFrame, and a DataFrame with no rows is empty. To check the opposite, reverse the first comparison to look for games in which the opponents scored more points, and make sure the game was not a loss. Again, this should be empty.
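Both consistency checks can be sketched like this, again on hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in with internally consistent results.
nba = pd.DataFrame({
    "pts": [103, 94, 88],
    "opp_pts": [98, 100, 95],
    "game_result": ["W", "L", "L"],
})

# Rows where a team outscored its opponent but was NOT recorded as a win
wins_mismatch = nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != "W")]
print(wins_mismatch.empty)  # True: the data is consistent for wins

# Reverse the comparison to validate losses the same way
losses_mismatch = nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != "L")]
print(losses_mismatch.empty)  # True: consistent for losses, too
```

Note the parentheses around each comparison: & binds more tightly than the comparison operators, so pandas boolean queries need them.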
04:14 You have only seen part of what you can do with data cleanup. For more examples, check out this post on Real Python. In the next lesson, you’ll learn how to combine data from multiple sources.