Using pandas to Count Conjunctions
00:00 The script I’m building is going to calculate the angle of separation between Mercury and six other planets. That’s six because you can’t do the angular separation between Mercury and Earth, since Earth isn’t in Earth’s sky, and the other one, well, Pluto is Mickey’s dog.
00:18 It’s not a planet, at least not until the IAU changes its mind again. I’ll be doing this comparison calculation a bunch of times, so I’m going to have a table of data where each row is a day containing the six angular separations, and each column is the angular separation of a specific planet from Mercury.
00:39 When dealing with tabular data, pandas is probably the go-to library. It has a concept of rows and columns, which it stores in an object known as a DataFrame.
00:50 You can access rows, columns, cells, or groups of cells in a DataFrame, as well as do operations on entire rows or entire columns. The coding style of pandas, and in fact most libraries like it, can take a little getting used to.
01:06 For example, you can do math on all the things in a column in a single line of code, and so it can sometimes be hard to remember when you’re dealing with more than one piece of data at a time or dealing with something singular.
01:18 At least I find it a little hard.
01:21 And just how do you construct a DataFrame? Well, you can create the objects by hand, which is what I’m going to be doing in this lesson. Or you can read in a CSV file, which is handy.
01:31 The library is very spreadsheet-like. So by reading in a CSV, you can actually export from an actual spreadsheet and import it into your program.
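Here is a minimal sketch of both construction routes. The column names and numbers are invented for illustration, and the CSV is read from an in-memory string rather than a real file so the example stands on its own.

```python
import io

import pandas as pd

# Build a DataFrame by hand from a dict of columns (values are invented).
by_hand = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "Venus": [12.4, 11.9, 11.3],
        "Mars": [6.8, 6.1, 5.5],
    }
)

# Read the same kind of table from CSV text. A file path works the same way;
# StringIO only keeps this example self-contained.
csv_text = (
    "date,Venus,Mars\n"
    "2024-01-01,12.4,6.8\n"
    "2024-01-02,11.9,6.1\n"
    "2024-01-03,11.3,5.5\n"
)
from_csv = pd.read_csv(io.StringIO(csv_text))

print(by_hand)
print(from_csv)
```

Either way you end up with the same kind of object: three rows and three named columns.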
01:39 Each row in the DataFrame has an index. This can be an auto-generated value like a counter, or you can set it explicitly. When dealing with data that has a date or timestamp,
01:51 it’s common practice to make that the index value. This is called time series data. A great reason for doing this is pandas has tools that allow you to interpolate between rows when the index is a date or time.
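As a rough illustration of a date index and interpolation, the sketch below builds a tiny time series with a gap and fills it. The data is made up, and using interpolate() with method="time" is one option among several that pandas offers.

```python
import pandas as pd

# Three days of invented separations; the middle day is missing.
df = pd.DataFrame(
    {"Venus": [12.0, None, 10.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)

# With a date index, method="time" weights the fill by the time elapsed
# between rows rather than by row position.
filled = df.interpolate(method="time")

print(filled)
```

The missing value sits exactly halfway between its neighbors in time, so it is filled with the midpoint of 12.0 and 10.0.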
02:05 There are several different ways of getting at rows and columns in a DataFrame, and the .loc attribute allows you to use square bracket notation to get at part of the data by referencing the index or the name of the column.
02:18 Note, this isn’t like a list. The index in this case may not be a counter. It could be that date or time, which is the timestamp that I was talking about.
02:28 To go along with the .loc attribute is the .iloc attribute, which does use Python-style indexing. So if you want to slice using an index number like you’re used to in your code, you do that with .iloc instead.
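To make the difference concrete, here is a small sketch that reaches the same cell both ways. The table contents are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Venus": [12.4, 11.9, 11.3], "Mars": [6.8, 6.1, 5.5]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)

# .loc looks things up by label: an index value and a column name.
by_label = df.loc[pd.Timestamp("2024-01-02"), "Mars"]

# .iloc looks things up by position, like indexing a list.
by_position = df.iloc[1, 1]

# Both expressions reach the same cell.
print(by_label, by_position)

# Slicing with .iloc follows Python's rules: the first two rows, all columns.
first_two = df.iloc[0:2, :]
print(first_two)
```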
02:43 Using .loc and .iloc can get a little complicated because they both allow you to write filters as well. A filter lets you show a subset of the data or operate on a subset of the data.
02:53 I will be writing a few of these in the course. I’ll do my best to explain them, but feel free to treat them like black magic. If you want to learn this spell, there are pandas-specific courses that I’ll point you at later on.
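For a taste of what such a filter looks like, here is a small sketch using a Boolean mask with .loc. The data and the threshold of 7 are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {"Venus": [12.4, 6.9, 11.3], "Mars": [6.8, 6.1, 15.5]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)

# Comparing a column to a number gives one True/False per row.
venus_is_close = df["Venus"] <= 7

# Passing that mask to .loc keeps only the rows marked True.
subset = df.loc[venus_is_close]
print(subset)

# A mask can be combined with a column name to operate on part of the data.
mars_on_those_days = df.loc[venus_is_close, "Mars"]
print(mars_on_those_days.mean())
```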
03:06 Okay, so back to my table filled with angular separations, I’m going to have a row for each date and columns for the angular separations between Mercury and each of the six planets.
03:18 I then want to determine if there is a conjunction. I can do that by counting how many columns in a row contain values that are small enough to consider them conjunctions.
03:29 I’m going to do this using one of those tricky little black magic bits that I just talked about. Pandas are very tricky. It’s probably why they wear masks.
03:38 So the .iloc attribute on a DataFrame allows you to access a row and/or column using Python’s numeric slicing format. The code here is looking at the first row, that’s the zero, and a slice of the columns,
03:54 skipping the first column. My first column will have the date in it, and that shouldn’t be included in our calculation about conjunctions.
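The on-screen code isn’t part of this transcript, so the following is a sketch of the kind of expression being described, with an invented table whose first column holds the date.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02"],
        "Venus": [12.4, 11.9],
        "Mars": [6.8, 6.1],
        "Jupiter": [3.2, 4.0],
    }
)

# Row 0, and every column from position 1 onward (skipping "date").
first_row = df.iloc[0, 1:]

print(first_row)
```

The result is a single row’s worth of data, labeled by planet name, with the date left out.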
04:03 To find how many conjunctions there are, I use the .le() method, which stands for less than or equal. The return from this call is a new set of data with a Boolean for each planet, indicating whether its angular separation is less than or equal to seven.
04:18 Seven degrees is a bit wide for a conjunction, but it’s small enough to keep the calculations, in our case, quick.
04:25 The output of the .le() call is a group of Booleans, one for each planet. True indicates the angular separation is seven or below, and False means it isn’t.
04:37 Then I’m going to double down on this trickiness and use the return from that and call sum().
04:44 When you sum Booleans, they get cast to integers, one for True, zero for False. So summing Booleans is equivalent to counting the True values. When you chain the sum() call to the .le() call, you get a count of conjunctions, that is, separations that are less than or equal to seven.
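A small sketch of that chain on one row of data follows. It uses a standalone Series with invented values in place of a row pulled from the lesson’s DataFrame.

```python
import pandas as pd

# One day's angular separations from Mercury, in degrees (invented values).
row = pd.Series({"Venus": 12.4, "Mars": 6.8, "Jupiter": 3.2})

# .le(7) gives one Boolean per planet: is the separation <= 7?
close = row.le(7)
print(close)

# Summing Booleans counts the True values.
count = close.sum()
print(count)

# The same thing as a single chained expression.
chained = row.le(7).sum()
```

Here Mars and Jupiter are within seven degrees and Venus is not, so the count is 2.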
05:01 In both of these calls, I’m going to use the axis argument. The axis argument works in a whole bunch of pandas calls and changes the behavior of the function based on whether to operate on rows or columns.
05:13 So the .le() is being done across the columns and the sum() is being done across the rows. You want these in place to make sure you aren’t summing the column itself, but summing across the columns.
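Since the two lines from the lesson aren’t reproduced in this transcript, the sketch below only illustrates how the axis argument changes what sum() counts. The table is invented, and the axis values shown are an assumption about what the lesson’s code needs, not a copy of it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "Venus": [12.4, 6.5, 11.3],
        "Mars": [6.8, 6.1, 15.5],
        "Jupiter": [3.2, 4.0, 9.9],
    }
)

# Every row, skipping the date column, compared against 7 degrees.
close = df.iloc[:, 1:].le(7)

# axis="columns" adds up along each row: one count per day.
per_day = close.sum(axis="columns")
print(per_day.tolist())  # [2, 3, 0]

# axis="index" adds up down each column: one count per planet.
per_planet = close.sum(axis="index")
print(per_planet.tolist())  # [1, 2, 2]
```

The per-day version is the one that answers "how many conjunctions are there on this date?", which is why getting the axis right matters.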
05:23 Look, I know these two lines are messy, and if you’ve never done any pandas before, there is a lot to absorb at once.
05:30 If you’re not quite absorbing it, don’t worry. This is a fairly advanced bit of pandas and if you’re interested, it will make a lot more sense if you go take an intro course.
05:40 If you’re not interested, guess what? This is the equivalent of Googling and copying and pasting.
05:45 Pandas is a third-party library, which means you’ll need to pip install it. Like with all third-party libraries, you should use a virtual environment when you do so.
05:53 Once you’ve got your DataFrame filled with information, you’re going to want to print it out to screen. If you call print() on a pandas DataFrame, it shows you some of the data, but the results can be a bit chunky.
06:07 How does something that only eats bamboo get so roly-poly? Great. Now I’m fat shaming a bear. Anyhow, pandas does allow for formatting of a DataFrame using Styler objects, but these only work within a Jupyter Notebook.
06:21 They won’t apply in the terminal. Enter the tabulate third-party library. It builds tables for your terminal. It supports more than just pandas. So if you need to print out tables, it’s a useful tool all around.
06:34 Sing along with me: tabulate is a third-party library, so you need to pip install it. Don’t forget to use a virtual environment. I’ll use tabulate to show off our planetary information and look for some conjunctions.