Resource mentioned in this lesson: Element Spectral Lines
“Serena” Quasar’s sparcl ID: 719e7ea6-8b79-11ef-be93-525400f334e1
Resource mentioned in this lesson: Element Spectral Lines
“Serena” Quasar’s sparcl ID: 719e7ea6-8b79-11ef-be93-525400f334e1
00:00 In the previous lesson, I showed you how to fetch a quasar spectral data from the Sparcl database. In this lesson, I’ll show you how to use pandas to read spectral line data from a web page.
00:11 The last piece of data you need for the dashboard is the emission wavelengths of various elements found in quasars. One of the fun parts of data science is using different sources of data and then making them work together.
00:22 The emission data in question is available on a web page, so I’m going to show you how to go grab it. As you might guess, grabbing stuff from the web tends to result in some messiness, so once it’s been fetched, you’ll need to clean it up a bit. When you’re happy with it, this like our other data, gets written to a CSV file that can then be used in the dashboard.
00:43 Although my preferred DataFrame library is Polars, pandas does have a handy call for reading data from a web page, so you’ll need to install pandas and the LXML parser if you wish to follow along.
00:57 This is the web page that contains the data you want. It’s from the Sloan Digital Sky Survey, which has lots of cool stuff. The good news is pandas knows how to turn an HTML table into a DataFrame.
01:10 The bad news is this page contains a single table with three different kinds of data in it and inline headers for each of those three sections. All you want is the emission line data, so once you’ve got a DataFrame, you’ll need to do some chopping to get rid of the extra stuff.
01:25 Let’s head to the REPL and chop up the web.
01:29 Like with Polars, the convention is to alias pandas when importing it. Conveniently, getting data from a web table is a single pandas call,
01:47 and here’s the response. A few things to note here. First, pandas requires an index column. In this case, it’s created a counting integer. That’s the first column. Don’t worry about that.
01:59 When this gets converted to Polars, Polars knows how to ignore it. Second, you can see that the subheader sections of the table have been imported as rows.
02:08 You’ll need to get rid of these and the associated absorption and skyline data that isn’t interesting to us. Also, look at lines 41 and 35. They both say H for hydrogen.
02:21 If you go back and look at the original table on the web, you’ll note that one of these is an actual H, and the other is an H alpha. The table uses images for the Greek letters, which pandas ignores.
02:33 This would be the point where I would grumble about the lack of using Unicode on the page. But hey, free data. You get what you paid for. For our simple purposes, treating all hydrogen lines as a group is fine.
02:45 If you needed to be more specific, you’d have to modify the affected rows by hand or find a different data source. The last little problem is at the bottom right hand corner there.
02:55 See the square bracket? Yep. pandas returned a list of things, so you’ll need to dereference what’s inside of it to convert the actual data to Polars. Speaking of, let’s convert this to Polars and clean it up.
03:10 By the way, if you’re a hardcore pandas person and who doesn’t like the chubby little bamboo munchers, everything in this course could be done with pandas instead, I just prefer Polars myself.
03:20 I find the data frame manipulation calls easier to read and understand, but that’s a personal preference. I’m also from the great white north, so I might just have an affinity for the fiercer white polar bear.
03:32
Now that I’ve got it imported, I can use pl.from_pandas()
function to convert a pandas DataFrame to a Polars one.
03:41 Don’t forget the need for zero indexing table because the return from the previous call was a list.
03:48 And here it is in Polars; the index column is gone as Polars doesn’t use them. Let’s think about what needs to be done. First, some data has to go. Second, the row headers really aren’t data and should be chopped.
04:02 And third, all of the columns are strings, which could be better. Let’s start by chopping the first three rows containing the table headers.
04:12 Note that this command didn’t change the existing DataFrame. It just shows you the cut result. I always do this first to see whether or not it did what I wanted, and if it did, then I’ll do it again, replacing the original object.
04:27 I’ve saved you the effort and counted the number of unwanted rows at the end. It’s 14. Let’s double check.
04:36 Yep, that’s the stuff. By default, Polars prints the first five and last five rows of a DataFrame when it’s evaluated in the REPL. You can change that.
04:49
The Polars config object controls how Polars behaves. tbl_rows
specifies how many rows to show in the summary. Setting it to minus one means to not limit the number of rows.
05:02 If I look again at the last 14 lines, I see all of them. Great. I’m sure I don’t want this stuff. Let’s chop it.
05:15
I’ve overwritten the DataFrame with a new one containing everything but the last 14 rows of the previous one. I could print this out, but because of the setting for config, the screen would scroll, so instead, head()
returns the first five, and what’s a head without a tail?
05:34 This is the last five. Let me reset the config.
05:42
In Polars, setting the value to None
like this means to use the default behavior. Now let’s do something about the data types. The select()
method allows you to choose columns from a DataFrame and create a new DataFrame.
05:55 Along the way, you can perform calculations on the columns. In our case, the calculation is going to convert it from a string to a float.
06:04
Starting the select()
call, I want my first column to be the element which this table has called species
.
06:13
Each argument to select()
becomes a column in the new data frame, the col()
operator is a selector, meaning in this case to choose column three.
06:23
That’s the fourth. There’s zero indexed. Column three is the species
column. In the old one, it’s going to become the first column in our new DataFrame.
06:34
The second column is the wavelength data. I’m using col()
again to access a column of data, this time, the first one. On top of that, I’m chaining it with a call to cast()
.
06:45
Cast()
converts data. I’m asking it to convert to a floating-point number.
06:56
Same goes for galaxy_weight
,
07:02
and for quasar_weight
. I couldn’t get a straight answer as to just what these weight columns are, but if you look at the data closely, some elements have multiple rows with different weight values.
07:13 Remember, this data comes from observations and experiments, so I suspect this has something to do with how they were sourced. And there’s the end result.
07:23 Not bad. With just a few lines of code, you now have a clean set of data to use in the dashboard. You might ask why I keep doing this inline in the REPL.
07:32 Well, it’s an old habit. I want to see the results and make sure I didn’t muck them up. Then I can just hit up arrow in the REPL and modify the line to store the result away.
07:42
This is especially important because I keep overriding the DataFrame. If I didn’t check it first, and I made a mistake in the select()
overriding the DataFrame, I’d have to go back and start all over again.
08:06 and now I’ve saved it to a CSV. Alright, you’ve got all the data. Time to start adding features to the dashboard.
Become a Member to join the conversation.