Predicting the Age of Sea Snails
In this lesson, you’ll get ready to predict the age of sea snails with Python using the abalone dataset from the UCI Machine Learning Repository! To do that, you’ll want to ensure that you’ve installed Python with Anaconda. You’ll also want to install Seaborn to plot a histogram of the sea snails’ rings.
To learn more about the pandas DataFrame, check out The pandas DataFrame: Make Working With Data Delightful.
00:00 Now it’s time to learn about predicting the age of sea snails. The abalone dataset contains publicly available biological data with measurements of several thousand abalones.
00:13 To follow along with the coding part of this course, you’ll be using the abalone dataset to build a k-nearest neighbors model.
00:22 So what are abalones? Well, here’s an example if you’d like to see one. Abalones are a family of sea snails that look a bit like mussels. They’re found around the world, but mostly in cold waters.
00:35 The abalones in this dataset were collected near Tasmania, Australia.
00:42 Eventually, you’ll be training a kNN model to predict abalone age. Biologists can calculate an abalone’s age by counting the inner rings on its shell. However, this involves cutting through the shell, staining it, and counting the rings under a microscope, which is tedious.
01:00 Your goal is to create a model that takes an abalone’s physical measurements and estimates its age. If successful, such a model would help biologists, saving them time and effort.
01:14 You’ll now begin analyzing the abalone dataset by importing it and checking a few of its descriptive statistics. If you haven’t already done so, be sure to install Python with Anaconda so you can follow along with the code.
01:27 The Anaconda distribution of Python comes with all sorts of useful libraries you can use to work with the data. Specifically, you’ll use the pandas library to get started in this lesson.
01:41 First, import the pandas library and alias it as pd, which is standard. The abalone dataset is publicly available through this URL, so you can use pandas’ read_csv() function to download the data and structure it as a pandas DataFrame.
01:57 Also note that these data do not have a header with column names, so set header to None for now so that the data are read in properly.
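As a rough sketch of this step, the snippet below reads header-less CSV text with read_csv() and header=None. The in-memory sample_csv is an assumption standing in for the downloaded file: just a few rows in the dataset's nine-column layout, not the full data.

```python
import io

import pandas as pd

# Stand-in for the downloaded file (assumption): a few rows in the
# dataset's nine-column, header-less CSV layout.
sample_csv = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10\n"
    "I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\n"
)

# header=None tells pandas that the first line is data, not column names,
# so the columns get integer labels 0 through 8.
abalone = pd.read_csv(sample_csv, header=None)
print(abalone.head())
```

When following along with the real data, you would pass the dataset's URL as the first argument instead of sample_csv.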
02:09 You can now view the first few rows of the abalone dataset by executing abalone.head(). Each row in this dataset represents an individual abalone, while each column is a different measurement.
02:24 Right now, there are no column names, but those can be found on the UCI Machine Learning Repository. Here’s a list of those names, and you can go ahead and assign those to the columns property of the abalone DataFrame.
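Here is a minimal sketch of assigning the names. Only "Sex" and "Rings" are spelled out in this lesson. The other names are an assumption based on the attribute list on the UCI page, so check them against the repository. The sample rows again stand in for the real download.

```python
import io

import pandas as pd

# Stand-in rows (assumption) in the dataset's nine-column layout.
sample_csv = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)
abalone = pd.read_csv(sample_csv, header=None)

# Names assumed from the UCI attribute list; verify before relying on them.
abalone.columns = [
    "Sex",
    "Length",
    "Diameter",
    "Height",
    "Whole weight",
    "Shucked weight",
    "Viscera weight",
    "Shell weight",
    "Rings",
]
print(abalone.head())
```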
02:41 Since the goal of this exercise is to make age predictions based off of the physical measurements of the abalones, you should remove the "Sex" column from the dataset.
02:52 You can use the .drop() method to do this.
02:57 Just be sure to specify axis=1 to tell pandas that you want to drop a specific column instead of a row.
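The following sketch shows the drop on a tiny hand-built DataFrame. The three columns and their values are illustrative stand-ins rather than the full dataset.

```python
import pandas as pd

# Tiny stand-in DataFrame (assumption) with a "Sex" column to remove.
abalone = pd.DataFrame(
    {
        "Sex": ["M", "F", "I"],
        "Length": [0.455, 0.53, 0.33],
        "Rings": [15, 9, 7],
    }
)

# axis=1 means "look for this label among the columns, not the rows".
# .drop() returns a new DataFrame, so reassign the result.
abalone = abalone.drop("Sex", axis=1)
print(abalone.columns.tolist())
```

Without axis=1, pandas would look for a row labeled "Sex" and raise a KeyError here.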
03:08 Take a look at the top part of the abalone DataFrame once again to verify that the columns have been named appropriately and that "Sex" is no longer included.
03:21 And you can also check abalone.info() to see that there are just over four thousand rows, or abalones, in this dataset. You can also check if there are any missing values, which there are not, and you should also see that all the columns are numeric, either floats or integers.
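As a sketch of these checks, the snippet below calls .info() and then confirms the two observations programmatically. The small DataFrame is an illustrative stand-in, and the .isna() and is_numeric_dtype() checks are an optional addition, not something the lesson runs.

```python
import pandas as pd

# Stand-in DataFrame (assumption) after "Sex" has been dropped.
abalone = pd.DataFrame(
    {
        "Length": [0.455, 0.35, 0.53, 0.44, 0.33],
        "Height": [0.095, 0.09, 0.135, 0.125, 0.08],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# Prints row count, non-null counts, and the dtype of each column.
abalone.info()

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = abalone.isna().sum()
print(missing_per_column)

# Confirm that every column holds floats or integers.
all_numeric = all(
    pd.api.types.is_numeric_dtype(dtype) for dtype in abalone.dtypes
)
print(all_numeric)
```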
03:43 Now that you have the data loaded in and have a general feel for what’s included, let’s learn more about the target that you’re going to try to predict.
03:51 In this case, that’s the "Rings" column. You can use the .describe() method to understand general summary statistics here. These abalone have ten rings on average, and most seem to have between eight and eleven rings. However, the minimum is one, and the maximum is twenty-nine rings in this dataset.
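A minimal sketch of the summary step follows. The five ring counts are illustrative stand-ins, so the printed statistics will not match the figures quoted for the full dataset.

```python
import pandas as pd

# Stand-in ring counts (assumption), not the full dataset.
abalone = pd.DataFrame({"Rings": [15, 7, 9, 10, 7]})

# .describe() returns count, mean, std, min, quartiles, and max.
ring_summary = abalone["Rings"].describe()
print(ring_summary)
```

The 25% and 75% entries are the quartiles, which is where a statement like "most have between eight and eleven rings" comes from on the real data.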
04:13 A histogram will also give you a good sense of the rings. You need a plotting library to do this, and one option is Seaborn. To use it, import seaborn, and it’s common to alias this library as sns.
04:28 Now you can call sns.histplot() to create a histogram. You want to plot the "Rings" column, and let’s go ahead and specify fifteen bins for this histogram.
04:42 The decision to use fifteen bins is made based on trial and error. If you specify too few bins, you may miss out on certain trends. However, if you set the bin number too high, your histogram won’t look nice and smooth.
04:56 This histogram of the abalone rings shows a nice peak right around ten rings, and you can also see the great majority of abalones have between five and fifteen rings, though there are some with fewer and more.
05:14 Since you’re going to be building a machine learning model to predict rings, it can also be helpful to explore correlations between your input variables and the target output.
05:23 Here you’re hoping for variables that have strong correlation with the target because that would mean the physical measurements and abalone age are related, and your modeling efforts have some chance of succeeding. To compute correlations, apply the .corr() method to your DataFrame.
05:42 This correlation_matrix variable now contains correlations between every column of the abalone DataFrame and every other column, but perhaps the most important correlations are those for the output "Rings" variable.
06:00 Let’s look specifically at the correlations between "Rings" and every other variable. The closer a value is to 1, the stronger the positive correlation.
06:09 Of course, "Rings" is perfectly correlated with itself at a value of 1. Based on these values, you can conclude that there’s at least some correlation between the physical abalone measurements and their age, but it’s not particularly high.
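The correlation step can be sketched as follows. The five-row DataFrame is an illustrative stand-in, so the values it prints say nothing about the real dataset. Only the mechanics carry over.

```python
import pandas as pd

# Stand-in DataFrame (assumption) with numeric columns only.
abalone = pd.DataFrame(
    {
        "Length": [0.455, 0.35, 0.53, 0.44, 0.33],
        "Height": [0.095, 0.09, 0.135, 0.125, 0.08],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# Pairwise Pearson correlations between every pair of columns.
correlation_matrix = abalone.corr()

# Select the column that relates each variable to the target.
rings_correlations = correlation_matrix["Rings"]
print(rings_correlations)
```

Note that .corr() measures linear relationships by default, which is why low values here do not rule out a model like kNN picking up on other structure.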
06:25 If you saw very high correlations, you could expect a fairly straightforward modeling process, perhaps using something like linear regression. In this case, you can try k-nearest neighbors and see what happens. There are, of course, many other ways that you could explore these data using pandas.
06:44 Try a few others out as well and see what else you can find.
06:50 In this lesson, you learned about the abalone dataset, which is publicly available data about a type of sea snail. The eventual goal of this project will be to predict an abalone’s age, or rings, from its physical measurements, so you’ll be building a kNN model to do just that.
07:08 You went on to explore the dataset using pandas. Specifically, you plotted the distribution of the rings in this dataset and noted that most abalone have about ten rings.
07:20 You also calculated the correlations between the physical measurements and the rings and saw that these quantities are at least somewhat related.
07:31 In the next lesson, find out how kNN actually works through a step-by-step approach to this highly intuitive algorithm.