
Starting With Polars DataFrames

This course uses third-party libraries. You should use a virtual environment to manage installations. For more information, see: Python Virtual Environments: A Primer

The code in this course was tested using:

  • Python 3.13.2
  • Polars 1.22.0

00:00 In the previous lesson, I gave an overview of the course. In this lesson, I’ll introduce you to DataFrames. Polars is a third-party library, which means it needs to be installed.

00:12 Best practice when installing libraries is to use a virtual environment. If you aren’t familiar with virtual environments, you should probably read the linked tutorial on the screen first.

00:23 Once you’ve got a virtual environment, you can install Polars using the typical pip install command.
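The setup described here could look like the following on macOS or Linux. The environment name `venv` is just a common convention, and the activation command differs on Windows:

```shell
# Create and activate a virtual environment, then install Polars
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
python -m pip install polars
```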

00:30 A DataFrame is a two-dimensional data structure that is used in many data science libraries, so much so that there are now libraries to do universal conversion between different libraries’ DataFrames.

00:42 That’s beyond the scope of this course, but many of the concepts you’ll learn here translate to other data science libraries as well. When I say two-dimensional, think rows and columns, similar to what you’d have in a spreadsheet, but in Python’s memory instead. Under the covers, Polars has another concept called a Series, which is what it uses to define the columns; those columns then get grouped together to form a DataFrame.

01:07 You could use a lot of Polars without ever encountering a Series, but if you want to tack a new column onto an existing DataFrame, using a Series is one of the easier ways. Let’s head off to the REPL to play with some DataFrames.

01:23 To get started, you need to import Polars.

01:28 In a similar vein to NumPy and pandas, the common thing to do with Polars is to import and alias it rather than importing the individual classes that you need.

01:38 Let’s define a DataFrame. To do that, I need some data to put inside it.

01:54 I’m using a dictionary with multiple keys. Each key is the header name of a column, and its list holds that column’s values, one per row. The data here is about the four tallest buildings in the world,

02:09 and this is their corresponding heights in meters,

02:14 and here are the number of floors they have. Interestingly, these numbers don’t necessarily map directly, as floors can be of varying heights, and as for “tallest,” people argue about whether you measure to the top of a spire or just the top floor.

02:30 With the data ready, I simply instantiate the DataFrame class, passing this dictionary in, and there you go. I have a DataFrame named buildings.

02:44 When you evaluate a DataFrame in the REPL, you get a table of the data inside. Just above the table is the word shape. This indicates the size of the table: (4, 3) means four rows and three columns.

02:57 This is important with larger datasets where not all of it gets printed to the screen. Note that the keys from our dict are the names of our columns, and Polars has figured out what kind of data is inside the columns.

03:10 The str, f64, and i64 labels are the data types: string, float, and integer, respectively. Polars has its own data types, which are compatible with NumPy data types.

03:23 The 64 here indicates how many bits are used to store the value. This means Polars numbers behave differently from Python numbers at extreme values. Python has no upper integer limit, whereas with a 64-bit integer, the largest value you can store is a little over nine quintillion.

03:43 Also, if you try to mix and match data types inside of a column, Polars will give you an error unless you explicitly allow it when you construct the DataFrame. The .schema attribute of a DataFrame shows the same information that is at the top of the table you just viewed, although this time it uses the full Polars class names for the data types.

04:05 The .describe() method shows you summary information about the contents of the DataFrame. The count is the number of items in each column. Since there are ways of having empty items, this doesn’t have to equal the number of rows.

04:19 The null count counts how many null values there are. The mean, std, min, max, and percentile rows are statistics: the mean is the average, std is the standard deviation, min and max are the smallest and largest values, and 25%, 50%, and 75% are the corresponding percentiles.

04:34 For some data this isn’t helpful; for other data, this information might be useful. For example, by looking at the 50th and 75th percentile values here for the height, you can see that they’re the same.

04:46 This tells you your data isn’t a normal bell-curve distribution. You can also access individual rows of a DataFrame using square brackets, like you do with list items.

04:58 The first row, the last row,

05:04 and slices. To add another column, you call the .with_columns() method. That method takes a Series.

05:21 A Series is constructed similarly to one of the key-list pairs in the dictionary.

05:27 Evaluating a series in the REPL shows its name, the data type, and the values as a column. Also note the shape. A single value in the tuple here indicates four rows, but no columns.

05:43 Calling .with_columns() adds our column to the DataFrame. Be careful here. Operations on DataFrames return a new DataFrame. They don’t modify the existing one.

05:53 That’s why I’ve overwritten our buildings variable.

05:58 And there you go. You’ve got one more column. Interesting that the world’s tallest building doesn’t have much below it. Parking must be a bit of a challenge.

06:07 By the way, if you’re coding along, I’ll be using this same data in the next lesson, so if you keep your REPL session open, it’ll save you some typing.

06:16 Now that you’ve got data in a DataFrame, it’s time to do something with it.
