Survey Your Data
00:03 The next plot will give you a general overview of a specific column of your dataset. First, you’ll have a look at the distribution of a property with a histogram.
00:12 Then you’ll get to know some tools to examine the outliers. Distributions and Histograms. A DataFrame is not the only class in pandas with a .plot() method.
00:24 As so often happens in pandas, the Series object provides similar functionality. You can get each column of a DataFrame as a Series object.
00:34 Here’s an example using the "Median" column of the DataFrame you created from the college major data.
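As a minimal sketch of that step, the snippet below selects one column and confirms it is a Series. The small DataFrame is a hypothetical stand-in for the college major data: the "Major" column name and all values are made up for illustration, and only the "Median" name comes from the course.

```python
import pandas as pd

# Hypothetical stand-in for the college major DataFrame (made-up values).
df = pd.DataFrame(
    {
        "Major": ["Major A", "Major B", "Major C"],
        "Median": [36000, 52000, 75000],
    }
)

# Selecting a single column with square brackets returns a Series.
median_column = df["Median"]
print(type(median_column))
```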
00:45 Now that you have a Series object, you can create a plot for it. A histogram is a good way to visualize how values are distributed across a dataset. Histograms group values into bins and display a count of the data points whose values are in any given bin.
01:01 Let’s create a histogram for the "Median" column. As you can see, it’s just a case of calling .plot() on the median_column Series and passing the string "hist" to the kind parameter.
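The call described here might look like the sketch below. The salary values are made up so that the snippet runs on its own, and the Agg backend is only an assumption to let it run without a display. The np.histogram() call is an extra, not part of the course: it shows the same kind of binning as plain numbers.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import numpy as np
import pandas as pd

# Made-up, right-skewed salaries standing in for the real "Median" column.
median_column = pd.Series(
    [22000, 28000, 30000, 31000, 33000, 35000, 36000, 38000, 40000,
     42000, 45000, 50000, 54000, 60000, 65000, 75000, 110000],
    name="Median",
)

# kind="hist" tells .plot() to draw a histogram (ten bins by default).
ax = median_column.plot(kind="hist")

# The same grouping as numbers: a count per bin, plus the bin edges.
counts, edges = np.histogram(median_column, bins=10)
print(counts)
print(edges)
```

With these sample values, the last bin holds a single data point, which is the kind of lonely bar that hints at an outlier.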
01:14 You can see the histogram here and it shows the data grouped into ten bins ranging from $20,000 to $120,000, with each bin having a width of $10,000. The histogram has a different shape than the normal distribution, which has a symmetric bell shape with a peak in the middle.
01:33 The histogram of the median data, however, peaks on the left below $40,000. For more information about histograms, check out Real Python’s Python Histogram Plotting: NumPy, Matplotlib, Pandas and Seaborn course.
01:48 The tail stretches far to the right and suggests there are indeed fields whose majors can expect significantly higher earnings. That brings us to outliers.
01:57 Have you spotted that lonely small bin on the right edge of the distribution? It seems that one data point has its own category. The majors in this field get an excellent salary compared not only to the average, but also to the runner-up. Although this isn’t its main purpose, a histogram can help you to detect such an outlier.
02:16 Let’s investigate a couple of things about this outlier: which major it represents and how big its edge is. In contrast to the first overview, you only want to compare a few data points, but you want to see more details about them. For this, a bar plot is an excellent tool.
02:33 First, select the five majors with the highest median earnings. You’ll need two steps. Firstly, to sort the DataFrame by the "Median" column, use the .sort_values() method and provide the name of the column you want to sort by as well as the direction, which in this case is ascending=False.
02:58 Secondly, to get the top five rows of the sorted result, use .head(). Chaining these two methods together creates a top_5 DataFrame.
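The two chained steps can be sketched as follows. The DataFrame is again a hypothetical stand-in with made-up values, and the "Major" column name is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the college major DataFrame (made-up values).
df = pd.DataFrame(
    {
        "Major": ["Major A", "Major B", "Major C", "Major D",
                  "Major E", "Major F", "Major G"],
        "Median": [36000, 75000, 52000, 110000, 41000, 70000, 65000],
    }
)

# Step 1: sort by "Median", highest first.
# Step 2: .head() keeps the first five rows of the sorted result.
top_5 = df.sort_values(by="Median", ascending=False).head()
print(top_5)
```

.head() returns five rows by default, so no argument is needed here.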
03:09 Now we have a smaller DataFrame containing only the top five most lucrative majors. As a next step, you can create a bar plot that shows only the majors with these top five median salaries, as seen here.
03:33 You’ll notice that in this plot, the labels are difficult to read because of their angle—they’re vertical. And because of their size, they’re nearly as long as the graph itself.
03:42 You can use the rot argument to rotate the labels and fontsize to alter their size to suit your screen. You may need to experiment here.
03:53 Here, we see the plot with five bars and it shows that the median salary of petroleum engineering majors is more than $20,000 higher than the rest. The earnings for the second- through fourth-place majors are relatively close to each other.
04:07 If you have a data point with a much higher or lower value than the rest, then you’ll probably want to investigate a bit further. For example, you can look at the columns that contain related data.
04:17 Let’s investigate all majors whose median salary is above $60,000. First, you need to filter these majors with the mask df[df["Median"] > 60000].
04:39 You can then create another bar plot showing all three earnings columns.
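The filter and the grouped bar plot might look like the sketch below. The column names "Major", "P25th", and "P75th" are assumptions about how the three earnings columns are labeled, and every value is made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import pandas as pd

# Hypothetical stand-in for the college major DataFrame (made-up values).
df = pd.DataFrame(
    {
        "Major": ["Major A", "Major B", "Major C", "Major D", "Major E"],
        "P25th": [25000, 55000, 40000, 95000, 50000],
        "Median": [36000, 75000, 52000, 110000, 65000],
        "P75th": [45000, 90000, 60000, 125000, 80000],
    }
)

# The boolean mask keeps only rows whose "Median" exceeds 60,000.
top_medians = df[df["Median"] > 60000].sort_values("Median")

# Passing a list to y draws one bar per column for every major.
ax = top_medians.plot(x="Major", y=["P25th", "Median", "P75th"], kind="bar")
```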
04:55 You should see a bar plot with three bars per major, like this. The 25th and 75th percentiles confirm what you’ve already seen: petroleum engineering majors were by far the best paid recent graduates.
05:08 Why should you be so interested in outliers in this dataset? If you’re a college student pondering which major to pick, you have at least one pretty obvious reason.
05:18 But outliers are also very interesting from an analysis point of view. They can indicate not only industries with an abundance of money, but also invalid data.
05:29 Invalid data can be caused by any number of errors or oversights, including a sensor outage, an error during the manual data entry, or a five-year-old participating in a focus group meant for kids age ten and above.
05:43 Investigating outliers is an important step in data cleaning, and Real Python has a course on exactly this subject.
05:52 Even if the data is correct, you may decide that it’s just so different from the rest that it produces more noise than benefit. Let’s assume you analyze the sales data of a small publisher.
06:02 You group the revenues by region and compare them to the same month as the previous year. Then, out of the blue, the publisher lands a national bestseller.
06:11 This pleasant event makes your report kind of pointless. With the bestseller’s data included, sales are going up everywhere. Performing the same analysis without the outlier would provide more valuable information for the business, allowing you to see that in New York your sales numbers have improved significantly, but in Miami they actually got worse.
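To make the publisher scenario concrete, here is a small sketch with entirely made-up numbers. The column names, the period labels, and the yearly_change() helper are all hypothetical. It compares revenue per region with and without the bestseller.

```python
import pandas as pd

# Made-up sales figures: one backlist title plus the bestseller outlier.
sales = pd.DataFrame(
    {
        "region": ["New York", "New York", "New York",
                   "Miami", "Miami", "Miami"],
        "period": ["last_year", "this_year", "this_year",
                   "last_year", "this_year", "this_year"],
        "title": ["Backlist", "Backlist", "Bestseller",
                  "Backlist", "Backlist", "Bestseller"],
        "revenue": [100, 130, 500, 100, 80, 500],
    }
)


def yearly_change(frame):
    """Return this year's revenue minus last year's, per region."""
    by_region = frame.pivot_table(
        index="region", columns="period", values="revenue", aggfunc="sum"
    )
    return by_region["this_year"] - by_region["last_year"]


with_outlier = yearly_change(sales)
without_outlier = yearly_change(sales[sales["title"] != "Bestseller"])

print(with_outlier)     # every region appears to grow
print(without_outlier)  # the regional differences become visible
```

With the bestseller included, both regions show growth. Once it is filtered out, New York is still up while Miami is down.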
06:31 Having made an initial survey of the data, the next step is to see if any columns of the dataset are connected, and that’s what you’ll see in the next video.