Survey Your Data
Now that you have a Series object, you can create a plot for it. A histogram is a good way to visualize how values are distributed across a dataset. Histograms group values into bins and display a count of the data points whose values are in any given bin.
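As a rough sketch of the binning idea, the snippet below counts values per bin and then draws the matching histogram. The `median_column` name and the salary values are made up for illustration and are not the course's dataset. The plotting call assumes Matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical median salaries, invented for this example only.
median_column = pd.Series(
    [22000, 28000, 31000, 33000, 35000, 36000, 38000, 40000,
     42000, 45000, 50000, 54000, 60000, 65000, 75000, 110000],
    name="Median",
)

# Count how many values fall into each of ten equal-width bins.
counts, bin_edges = np.histogram(median_column, bins=10)
print(counts)

# Draw a ten-bin histogram of the same values.
ax = median_column.plot(kind="hist", bins=10)
```

Ten bins need eleven edges, and every value lands in exactly one bin, so the counts add up to the length of the Series.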
01:14 You can see the histogram here and it shows the data grouped into ten bins ranging from $20,000 to $120,000, with each bin having a width of $10,000. The histogram has a different shape than the normal distribution, which has a symmetric bell shape with a peak in the middle.
01:33 The histogram of the median data, however, peaks on the left below $40,000. For more information about histograms, check out Real Python’s Python Histogram Plotting: NumPy, Matplotlib, Pandas and Seaborn course.
01:57 Have you spotted that lonely small bin on the right edge of the distribution? It seems that one data point has its own category. The majors in this field get an excellent salary compared not only to the average, but also to the runner-up. Although this isn’t its main purpose, a histogram can help you to detect such an outlier.
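One quick way to check such a suspicion numerically is to look at the two largest values and the gap between them. This is only a sketch: the Series below holds invented numbers, not the course's data, and `median_column` is an assumed name.

```python
import pandas as pd

# Hypothetical median salaries, invented for this example only.
median_column = pd.Series(
    [110000, 75000, 73000, 40000, 36000, 33000], name="Median"
)

# The two largest values: the suspected outlier and the runner-up.
top_two = median_column.nlargest(2)
gap = int(top_two.iloc[0] - top_two.iloc[1])

print(top_two)
print(f"Gap between the top value and the runner-up: ${gap:,}")
```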
02:16 Let’s investigate a couple of things about this outlier: which major it represents and how big its edge is. Unlike in the first overview, you only want to compare a few data points, but you want to see more details about them. For this, a bar plot is an excellent tool.
First, select the five majors with the highest median earnings. You’ll need two steps. Firstly, to sort the median column, use the .sort_values() method and provide the name of the column you want to sort by as well as the direction, in this case,
03:09 Now we have a smaller DataFrame containing only the top five most lucrative majors. As a next step, you can create a bar plot that shows only the majors with these top five median salaries, as seen here.
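The steps described above could look roughly like the following. Several things here are assumptions rather than the course's exact code: the column names `"Major"` and `"Median"`, descending order via `ascending=False` (since the goal is the highest earnings), `.head(5)` for picking the top five, and all of the sample values, which are invented. The plotting call assumes Matplotlib is installed.

```python
import pandas as pd

# Hypothetical data; only "Petroleum Engineering" is named in the text.
df = pd.DataFrame({
    "Major": ["Major A", "Petroleum Engineering", "Major B",
              "Major C", "Major D", "Major E", "Major F"],
    "Median": [40000, 110000, 75000, 73000, 36000, 70000, 65000],
})

# Step 1: sort by the median column, highest values first.
# Step 2: keep only the first five rows of the sorted result.
top_5 = df.sort_values(by="Median", ascending=False).head(5)
print(top_5)

# One bar per major, with the major names on the x-axis.
ax = top_5.plot(x="Major", y="Median", kind="bar", rot=45)
```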
03:53 Here, we see the plot with five bars and it shows that the median salary of petroleum engineering majors is more than $20,000 higher than the rest. The earnings for the second- through fourth-place majors are relatively close to each other.
04:07 If you have a data point with a much higher or lower value than the rest, then you’ll probably want to investigate a bit further. For example, you can look at the columns that contain related data.
04:55 You should see a bar plot with three bars per major, like this. The 25th and 75th percentiles confirm what you’ve already seen: petroleum engineering majors were by far the best-paid recent graduates.
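A plot with three bars per major can be sketched by passing a list of columns as `y`. The column names `"P25th"`, `"Median"`, and `"P75th"` are assumptions about how the dataset labels its percentile columns, and the values are invented for illustration. Matplotlib is assumed to be installed.

```python
import pandas as pd

# Hypothetical top-five DataFrame with related percentile columns.
top_5 = pd.DataFrame({
    "Major": ["Petroleum Engineering", "Major B", "Major C",
              "Major E", "Major F"],
    "P25th": [95000, 55000, 50000, 43000, 50000],
    "Median": [110000, 75000, 73000, 70000, 65000],
    "P75th": [125000, 90000, 105000, 80000, 75000],
})

# Passing several columns to y draws one bar per column for each major.
ax = top_5.plot(x="Major", y=["P25th", "Median", "P75th"], kind="bar")
```

Grouping the bars this way makes it easy to see whether a major leads only on the median or across the whole range of earnings.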
05:29 Invalid data can be caused by any number of errors or oversights, including a sensor outage, an error during manual data entry, or a five-year-old participating in a focus group meant for kids aged ten and above.
06:11 This pleasant event makes your report kind of pointless. With the bestseller’s data included, sales are going up everywhere. Performing the same analysis without the outlier would provide more valuable information for the business, allowing you to see that in New York your sales numbers have improved significantly, but in Miami they actually got worse.