Cleaning Your Data
You can tell a lot from the data returned by the .info() method. Take another look. For each column, the table lists the number of non-null values. Most of the columns have 126,314 non-null values, and the DataFrame has 126,314 rows, so those columns have no missing values. However, the notes column has only 5,424 non-null values, which means there are a lot of missing values in that column.
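As a quick illustration, here is how .info() reports non-null counts. The real nba dataset has 126,314 rows, so this uses a tiny stand-in DataFrame with made-up values; only the column names are meant to mirror the course:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the course's nba DataFrame.
# Only the "notes" column contains missing values.
nba = pd.DataFrame({
    "pts": [100, 93, 0, 88],
    "notes": [np.nan, np.nan, "forfeit", np.nan],
})

# .info() lists, per column, how many values are non-null.
# Here "pts" shows 4 non-null values but "notes" shows only 1.
nba.info()
```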
00:35 This is a problem because missing data results in what is essentially guessing for some data tasks. Thus, removing or resolving missing data is important.
One way to manage missing data is to remove it completely. You can simply omit any row that has a column with missing values. The DataFrame only has missing values in the notes column, but the .dropna() method will search all columns in the DataFrame for nulls.
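A minimal sketch of that row-dropping behavior, using a small made-up DataFrame in place of the real nba data:

```python
import numpy as np
import pandas as pd

nba = pd.DataFrame({
    "pts": [100, 93, 0, 88],
    "opp_pts": [98, 101, 2, 80],
    "notes": [np.nan, np.nan, "forfeit", np.nan],
})

# .dropna() removes every row that has a null in ANY column,
# so only rows with a non-null "notes" value survive.
clean_nba = nba.dropna()
print(len(clean_nba))  # only the one row with a note remains
```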
The cleaned data has only 5,424 rows, so more than 90% of the rows were dropped. For this DataFrame, the notes data is not that relevant, at least for this course, so it makes more sense to drop the column itself. Remember, in the previous lesson there was a discussion about the axis keyword argument. The .dropna() method accepts the same argument and values. By default, it is 0, which means “drop rows with missing values.” However, if you explicitly set it to 1, .dropna() will drop columns with missing values instead.
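The column-dropping variant can be sketched like this, again on an illustrative stand-in DataFrame:

```python
import numpy as np
import pandas as pd

nba = pd.DataFrame({
    "pts": [100, 93, 88],
    "opp_pts": [98, 101, 80],
    "notes": [np.nan, "forfeit", np.nan],
})

# axis=1 tells .dropna() to drop COLUMNS containing nulls
# instead of rows, so the sparse "notes" column disappears
# while all three rows are kept.
clean_nba = nba.dropna(axis=1)
print(clean_nba.columns.tolist())  # ['pts', 'opp_pts']
```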
And now you can see that clean_nba only has 23 columns. The notes column has been removed. If the notes column did have relevant information, you could keep it and replace just the missing values with a default value. The .fillna() method on a Series takes the value to insert.
Also, you used the inplace argument to modify the calling DataFrame instead of returning a modified copy. Look at the non-null counts again: there are now 126,314 non-null values in the notes column. And if you count the values in the 'notes' column, the one occurring the most is 'no notes at all'.
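A sketch of that fill-and-count step on a small made-up DataFrame. This uses the DataFrame-level .fillna() with a column-to-value dict; on newer pandas versions with copy-on-write, assigning the result back (nba["notes"] = nba["notes"].fillna(...)) is the safer pattern than inplace=True on a column:

```python
import numpy as np
import pandas as pd

nba = pd.DataFrame({
    "pts": [100, 93, 88],
    "notes": [np.nan, "forfeit", np.nan],
})

# Replace missing notes with a default value. inplace=True modifies
# nba directly instead of returning a modified copy.
nba.fillna(value={"notes": "no notes at all"}, inplace=True)

# .value_counts() shows the default value is now the most common one.
print(nba["notes"].value_counts())
```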
This makes sense, as more than 90% of the values were missing before being replaced. Run nba again and look at the minimum value of the points column. Apparently, there was at least one game in which a team scored no points. Did they even try? Let’s take a deeper look. If you look at the notes column, it says that this game was forfeited. Since there’s only one game with 0 points, it might not have much of an impact, but you can now proceed to either remove or handle this row.
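Pulling out that suspicious row can be sketched with boolean indexing. The column names and values here are illustrative, not the real dataset's:

```python
import pandas as pd

nba = pd.DataFrame({
    "pts": [100, 0, 88],
    "opp_pts": [98, 2, 80],
    "notes": ["no notes at all", "forfeit", "no notes at all"],
})

# Boolean indexing selects any game where a team scored no points;
# the notes column then explains what happened.
zero_point_games = nba[nba["pts"] == 0]
print(zero_point_games[["pts", "notes"]])
```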
Several times, you’ve seen the game_result column, which records a win or loss for that game. But does every win match up with the rest of the data in the row?
03:21 If a game result is a win, then the points scored for that row should be more than the opposition points scored. And for a loss, the opposition points should be greater.
03:31 You can test these conditions using a query.
The first condition checks that the number of points scored by a team is greater than the number of opposition points. The second condition checks that the game was not a win. When joined with the AND operator (&), the query should yield no results. And it does, but there’s an easier way to see that: simply check the result’s .empty attribute. A query returns a DataFrame, and a DataFrame with no rows is empty. To check the opposite, reverse the first comparison to check for games in which the opponents had more points and make sure the game was not a loss. Again, this should be empty.
04:14 You have only seen part of what you can do with data cleanup. For more examples, check out this post on Real Python. In the next lesson, you’ll learn how to combine data from multiple sources.