Coding HR Diagrams

Exploring Astrophysics in Python With pandas and Matplotlib Christopher Trudeau 05:19

00:00 And here it is, my HR diagram script. Like with NumPy and Matplotlib, the common practice for pandas is to import it and alias it, this time as pd. I’m starting to think these data science folks like their abbreviations.

00:17 I want to display the stars’ size and luminosity values as ratios against the sun. So I need these two constants to do my calculations. And remember, this graph is going to have a dot for each star, and I want to color the dot based on the star’s color.

00:33 This function is a mapper between the temperature and color.

00:37 In this case, its color isn’t the color from the chart I showed you before, but a Matplotlib color that closely resembles the color in the chart. Astronomers aren’t so poetic as to name their star colors “palegoldenrod”.

00:50 They stick with a more traditional yellow or yellowish.
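The script's mapper is not reproduced in this transcript, so what follows is only a rough sketch of what a temperature-to-color function could look like. The name star_color() comes from later in the lesson; the temperature cutoffs and the particular Matplotlib color names are assumptions chosen for illustration.

```python
def star_color(temp):
    """Map a surface temperature in kelvin to a Matplotlib color name."""
    # Hypothetical cutoffs, hottest first; the lesson's real values are not shown.
    bands = [
        (30_000, "blue"),
        (10_000, "lightskyblue"),
        (7_500, "white"),
        (6_000, "palegoldenrod"),
        (5_200, "yellow"),
        (3_700, "orange"),
    ]
    for cutoff, color in bands:
        if temp >= cutoff:
            return color
    # Anything cooler than the last cutoff
    return "red"
```

Ordering the bands from hottest to coolest lets the function return on the first cutoff the temperature meets, which keeps the mapping in one easy-to-edit table.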

00:58 The HR diagram is based on some data which has been downloaded from the SIMBAD star catalog. It’s a big CSV file with over 650 stars that I’m going to graph.

01:09 This read_csv() function reads that data in. The first argument is the name of the data file, hr_simbad.txt. And technically the data file isn’t a CSV as it doesn’t use commas to separate the data, but pipe symbols.

01:23 So the delimiter argument to read_csv() tells pandas to use that as the delimiter instead. There are more columns in the data file than I care about, so to keep memory efficient, I’m including the usecols argument to tell pandas that I only want the temperature and diameter columns in my DataFrame.
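As a small illustration of those two arguments, here is a sketch that reads pipe-separated text from an in-memory string instead of hr_simbad.txt. The sample rows and the extra name column are made up; only the temp and diameter column names come from the lesson.

```python
import io

import pandas as pd

# Made-up rows standing in for hr_simbad.txt. The "name" column is a
# hypothetical extra column that usecols leaves out of the DataFrame.
sample = io.StringIO(
    "name|temp|diameter\n"
    "Star A|5778|1391400 km\n"
    "Star B|9940|2380000 km\n"
)

# delimiter="|" switches the separator from commas to pipes,
# and usecols keeps only the listed columns.
df = pd.read_csv(sample, delimiter="|", usecols=["temp", "diameter"])
print(df)
```

With a real file, the first argument would be the file name rather than the StringIO object.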

01:43 A quick side note: if you go off to SIMBAD to download data yourself, you’ll find you won’t be able to use this code as I’ve written it. The SIMBAD files have some ugly column names. You can work with those in pandas, but it’s more painful.

01:55 And not only that, but the version I got included a bunch of dashes to make a pretty table-like thing. These are things you can deal with in pandas if you want to.

02:03 But to keep the code simple, I edited the data file instead.

02:07 If you’re doing a one-off like I am, there’s nothing wrong with editing the data file. If you’re going to be constantly grabbing new data from the same source, you’re better off having the cleaning step done as part of your program.

02:20 That said, the output of the read_csv() function is a pandas DataFrame, which you’ll recall is the object that contains the rows and columns. For us that’ll be two columns, temp and diameter, and 660-some rows of stars.

02:36 Now onto the calculations. The next few lines are all about calculating the data to be graphed. And in each case, I’m using the assign() method on the DataFrame to create a new column.

02:48 The assign() method takes a keyword-value pair where the keyword is the name of the new column, and the value is its contents. Like with NumPy, you can do math on an entire column.

03:00 Here I’m processing the diameter column, which I read in from read_csv(). Unfortunately, it doesn’t just have a number. It specifies a number and a kilometers unit, which pandas reads as a string.

03:13 So I’m splitting that so that I can get the number and the kilometers into a tuple. Then just using the number part from that tuple and converting it into a float.

03:23 So that’s a three-step process, but the end result is the numeric radius data in kilometers out of the text-based info in the column read from the CSV, and of course the last bit there is dividing by two.

03:35 And that’s how I’m turning my diameter into a radius.
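The three steps plus the division can be sketched with the pandas string accessor. This is one possible way to write it, not necessarily the line used in the script: the column name radius_km and the single space between the number and the unit are assumptions.

```python
import pandas as pd

# Made-up diameters in the "number unit" text form described in the lesson
df = pd.DataFrame({"diameter": ["1391400 km", "2380000 km"]})

df = df.assign(
    # split on whitespace, keep the first piece, convert to float, halve it
    radius_km=df["diameter"].str.split().str[0].astype(float) / 2
)
print(df["radius_km"].tolist())
```

Each .str call works on the whole column at once, so no explicit loop over the rows is needed.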

03:39 Once I’ve got a column with the star’s radius in kilometers, I can calculate the radius as a ratio of the sun. This line does some simpler math, taking the previously determined radius in kilometers and dividing it by the sun’s radius to get the ratio.

03:54 The next step is to calculate the luminosity, which is a ratio of the star’s luminosity against the sun’s, both of which are related to temperature.
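The transcript does not show the formula, so the following is a sketch under the assumption that luminosity is derived from the Stefan-Boltzmann relation, where the luminosity ratio equals the radius ratio squared times the temperature ratio to the fourth power. The constant names, their values, and the column names radius_ratio and luminosity are illustrative assumptions.

```python
import pandas as pd

# Assumed names and approximate values for the two solar constants
SUN_RADIUS_KM = 695_700
SUN_TEMP_K = 5_778

# Made-up stars: one sun-like, one with double the radius and temperature
df = pd.DataFrame(
    {"temp": [5778, 11556], "radius_km": [695_700.0, 1_391_400.0]}
)

# Radius as a ratio to the sun's radius
df = df.assign(radius_ratio=df["radius_km"] / SUN_RADIUS_KM)

# Stefan-Boltzmann scaling: L / L_sun = (R / R_sun)^2 * (T / T_sun)^4
df = df.assign(
    luminosity=df["radius_ratio"] ** 2 * (df["temp"] / SUN_TEMP_K) ** 4
)
print(df)
```

A star identical to the sun comes out with a ratio of 1 in both new columns, which is a handy sanity check on the constants.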

04:03 Once I’ve got the star’s temperature, I use the star_color() function I showed you earlier to populate the color column.

04:10 The apply() method on a column returns a new column where each value is passed into the function given to apply() to result in new data.

04:18 This is an alternative to a for loop, where you’d be calling the function for each item in the loop. But because pandas is written in a low-level language, doing it this way is a lot faster than processing it in pure Python.
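To make the comparison concrete, here is a sketch that builds the color column with apply() and then produces the same values with an explicit loop. The star_color() body is a simplified stand-in with made-up cutoffs, not the function from the script.

```python
import pandas as pd


def star_color(temp):
    # Simplified stand-in for the lesson's mapper, with made-up cutoffs
    if temp >= 7_500:
        return "lightskyblue"
    if temp >= 5_200:
        return "yellow"
    return "red"


df = pd.DataFrame({"temp": [9940, 5778, 3500]})

# apply() calls star_color once per value and returns a new column
df = df.assign(color=df["temp"].apply(star_color))

# The same result written as an explicit loop
looped = [star_color(t) for t in df["temp"]]
print(df["color"].tolist() == looped)
```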

04:30 All right, those four assign() calls were a little data heavy. So let me recap. I’ve created four new columns in the DataFrame. The first is the star’s radius in kilometers.

04:40 The second is the star’s radius as a ratio to the sun. The third is its luminosity as a ratio to the sun’s luminosity. And the fourth is a string containing a Matplotlib color for coloring the dots in on our scatter plot.

04:55 Just a quick aside, you can actually pass multiple keyword-value arguments to the df.assign() method, creating more than one column at a time.

05:04 I couldn’t do that in this case though, because each column needed the previous column to exist in order to do its calculation. But for other data where the columns are independent of each other, you can save some time by only calling assign() once.
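For the independent-column case, a single call might look like the sketch below. The data and the column names temp_thousands and diameter_km are hypothetical; the point is only that both new columns are computed from columns that already exist.

```python
import pandas as pd

# Made-up data with two existing columns
df = pd.DataFrame({"temp": [5778, 3500], "radius_km": [695_700.0, 300_000.0]})

# Two new columns in one call; neither depends on the other
df = df.assign(
    temp_thousands=df["temp"] / 1000,
    diameter_km=df["radius_km"] * 2,
)
print(df.columns.tolist())
```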
