Exploring plt.scatter() Further
00:00 Exploring further. plt.scatter() offers even more flexibility in customizing scatter plots. In this section, you’ll explore how to mask data using NumPy arrays and scatter plots through an example where you’ll generate random data points and then separate them into two distinct regions within the same scatter plot.
00:23 A commuter who’s keen on collecting data has collated the arrival times for buses at her local bus stop over a six-month period. The timetable arrival times are at 15 minutes and 45 minutes past the hour, but she noticed that the true arrival times follow a normal distribution around these times.
00:42 This plot shows the relative likelihood of a bus arriving at each minute within an hour. This probability distribution can be represented using NumPy and np.linspace() as seen here.
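The on-screen code is not part of this transcript, so the following is only a minimal sketch of how such a distribution might be built. The means of 15 and 45 minutes come from the description above; the standard deviations and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# The means come from the timetable; the standard deviations are
# hypothetical, since the transcript does not state them.
means = (15, 45)
std_devs = (5, 7)  # assumed spreads, in minutes

x = np.linspace(0, 59, 60)  # one sample per minute within the hour

# Unnormalized bell curves centered on each timetabled arrival
first = np.exp(-0.5 * ((x - means[0]) / std_devs[0]) ** 2)
second = np.exp(-0.5 * ((x - means[1]) / std_devs[1]) ** 2)

# Sum the two curves, then scale so the most likely minute has value 1
y = first + second
y = y / y.max()
```

Passing x and y to plt.plot() would draw the two-peaked curve described here.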
02:13 You’ve created two normal distributions centered on 15 and 45 minutes past the hour and summed them. You then set the most likely arrival time to a value of 1 by dividing by the maximum value. Onscreen, you can see the output from this code showing the two normal distributions centered on 15 and 45 minutes.
02:36 You can now simulate bus arrival times using this distribution. To do this, you can create random times and random relative probabilities using the built-in random module.
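One way this simulation might look is sketched below. The variable names, axis labels, and the choice of random.randint() and random.random() are assumptions, as the transcript only says that the built-in random module is used.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

n_buses = 40  # number of simulated arrivals

# A random minute within the hour and a random relative likelihood
# in [0, 1) for each simulated bus (names are illustrative)
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

plt.scatter(x=bus_times, y=bus_likelihood)
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
plt.show()
```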
03:54 Here, you’ve simulated 40 bus arrivals, which have then been plotted on the scatter plot. Your plot will look different since the data you’re generating is random.
04:04 However, not all of these points are likely to be close to the reality that the commuter observed. You can plot the distribution she obtained from the data with the simulated bus arrivals.
04:28 The distributions representing the bus arrival times are created.
04:58 They’re then added together and normalized by dividing by the maximum. Next, the curve is added to the plot with the following line. This code gives the output seen onscreen. To keep the simulation realistic, you need to make sure that the random bus arrivals match the data and the distribution obtained from those data.
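The steps from 04:28 onward can be sketched roughly as follows. This assumes the curve is drawn with plt.plot() on the same axes as the scatter plot. The standard deviations and variable names are illustrative, not taken from the video.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

# Two bell curves around 15 and 45 minutes (assumed spreads of 5 and 7),
# summed and normalized by dividing by the maximum
x = np.linspace(0, 59, 60)
y = np.exp(-0.5 * ((x - 15) / 5) ** 2) + np.exp(-0.5 * ((x - 45) / 7) ** 2)
y = y / y.max()

# Simulated arrivals: random minute and random relative likelihood
n_buses = 40
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

plt.scatter(x=bus_times, y=bus_likelihood)
plt.plot(x, y)  # overlay the distribution curve on the scatter points
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
plt.show()
```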
05:32 You can filter the randomly generated points by keeping only the ones that fall within the probability distribution. This can be achieved by creating a mask for the scatter plot.
05:58 The variables in_region and out_region are NumPy arrays containing Boolean values based on whether the randomly generated likelihoods fall above or below the distribution y. You then plot two separate scatter plots: one with the points that fall within the distribution and another for the points that fall outside the distribution.
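A minimal sketch of the masking step is shown below. The names in_region, out_region, and y come from the transcript. The remaining names, the standard deviations, and the specific colors are assumptions. Because each simulated time is an integer minute, it can be used directly as an index into y to look up the curve's height at that minute.

```python
import random

import matplotlib.pyplot as plt
import numpy as np

# Normalized two-peak distribution (assumed spreads of 5 and 7 minutes)
x = np.linspace(0, 59, 60)
y = np.exp(-0.5 * ((x - 15) / 5) ** 2) + np.exp(-0.5 * ((x - 45) / 7) ** 2)
y = y / y.max()

n_buses = 40
bus_times = np.asarray([random.randint(0, 59) for _ in range(n_buses)])
bus_likelihood = np.asarray([random.random() for _ in range(n_buses)])

# Boolean masks: True where a point lies below (or on/above) the curve
in_region = bus_likelihood < y[bus_times]
out_region = bus_likelihood >= y[bus_times]

# Indexing with a Boolean mask keeps only the elements marked True
plt.scatter(x=bus_times[in_region], y=bus_likelihood[in_region], color="green")
plt.scatter(
    x=bus_times[out_region], y=bus_likelihood[out_region], color="red", marker="x"
)
plt.plot(x, y)
plt.xlabel("Minutes past the hour")
plt.ylabel("Relative probability of bus arrivals")
plt.show()
```

Since the two masks are complementary, every simulated point lands in exactly one of the two scatter plots.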
06:39 The data points that fall above the distribution are not representative of the real data. You’ve segmented the data points from the original scatter plot based on whether they fall within the distribution and used a different color and marker to identify the two sets of data. Onscreen, you can see the result of this code, with the points that fall above the distribution plotted in a different color with an X marker. Once again, note that your result will look different, as the points are randomly generated and will change each time the code is run.
07:20 In the next section of the course, there’s a review of the key input parameters that you’ve covered so far.