Make Toy Data Structures With pandas' Testing Module

Idiomatic pandas: Tricks & Features You May Not Know Joe Tatusko 03:07

Ever find yourself setting up fake data to test certain functions you’ve written? Let pandas do that for you! pandas’ testing module provides a number of convenient functions for building quasi-realistic Series and DataFrames. After watching this video, you’ll know how to quickly create a simple pandas DataFrame and how to find out which functions the testing module provides for creating fake data.

00:00 In this video, you’re going to learn how to make toy data structures with Pandas’ testing module. When you’re working with DataFrames, you may find yourself spending quite a bit of time setting up fake data sets to test out different functions.

00:14 If you just need to make a couple sets of somewhat realistic data, however, Pandas has a testing module that can do this very quickly. So, let’s see how this works. Go ahead and open up a terminal.

00:26 I can bring that up.

00:30 And start your interpreter. Run import pandas.util.testing as tm, and then you can set the module-level default numbers of rows and columns by assigning to tm.N and tm.K.

00:50 Let’s just do 15 and 3. And just a little more housekeeping, if you want to see the same data that I am, you can import numpy as np and set the random.seed() to 444. So, for our first example, you can call .makeTimeDataFrame() and set the frequency equal to month, and let’s just print out the head.

01:18 So, all this is going to do is generate a quick DataFrame here that has a DatetimeIndex. The rest of the columns are just populated with some random continuous data, but you can see how quick that was to produce.
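For readers whose pandas version lacks the testing helpers, a frame of the same general shape can be approximated with public API only. This is a minimal sketch, not the implementation behind makeTimeDataFrame(): the names n_rows, n_cols, and time_df are made up for illustration, the column labels are an assumption, and the random values will not match the ones shown in the video.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 15, 3            # the values the video assigns to tm.N and tm.K
rng = np.random.default_rng(444)  # seeded generator; numbers differ from the video's

# Month-end DatetimeIndex; an offset object avoids version-specific alias strings
index = pd.date_range("2000-01-01", periods=n_rows, freq=pd.offsets.MonthEnd())

# Random continuous data, one column per letter (labels are an assumption)
time_df = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    index=index,
    columns=list("ABC"),
)

print(time_df.head())
```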

01:32 If you don’t want to use a DatetimeIndex DataFrame, you can call something like makeDataFrame(). Okay.

01:42 And let’s just pull the head off that. And this has more of an ID, where it’s just a random string here to identify each row. And you still have the continuous data in your columns over here. While having two types of DataFrames like this is probably more than enough for most cases, Pandas actually includes quite a few of these.
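A frame with random string row labels can also be sketched without the testing module. The code below is an approximation under stated assumptions: the label length of 10 characters, the alphabet, the column labels, and the names labels and string_df are all illustrative choices, not details taken from makeDataFrame().

```python
import string

import numpy as np
import pandas as pd

n_rows, n_cols = 15, 3
rng = np.random.default_rng(444)

# Characters to draw row labels from (an arbitrary choice for this sketch)
alphabet = list(string.ascii_letters + string.digits)

# One random 10-character label per row, standing in for the string IDs
labels = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]

string_df = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    index=labels,
    columns=list("ABC"),
)

print(string_df.head())
```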

02:01 If you want to see all of them you could use the dir() function. So let’s just make a little list comprehension here.

02:12 We’ll pull all the attributes off that with that dir() function, just something like

02:21 .startswith('make'), because these are all like, make DataFrames. Okay. And, of course, I forgot the in in the list comprehension!

02:36 So you can see, there’s quite a few in here. A couple to note—there’s DataFrames that’ll have missing values, different mixed data. And some other kinds, like categories, ones you have a little bit more control over, like a range index.
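The dir() filter from the last few steps can be written as a small reusable function. This is a sketch: names_with_prefix is a hypothetical helper, not part of pandas. Because pandas.util.testing is missing from recent pandas releases (see the comments below), the code falls back to a standard-library module so that it still runs.

```python
import importlib


def names_with_prefix(module, prefix):
    """Return the attribute names of `module` that start with `prefix`."""
    return [name for name in dir(module) if name.startswith(prefix)]


# The video filters pandas.util.testing for names starting with "make".
# If that module cannot be imported, demonstrate on the stdlib instead.
try:
    target = importlib.import_module("pandas.util.testing")
    prefix = "make"
except ImportError:
    import string as target
    prefix = "ascii"

print(names_with_prefix(target, prefix))
```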

02:51 So, depending on how specific you need your data to be to test whatever you’re looking at, there’s a good chance Pandas has a way to create that data for you. And that’s it!

03:00 You should now know how to make some fake data pretty quickly with Pandas. Thanks for watching.

Sciencificity on April 4, 2020

Hello! Thanks for teaching me about creating fake data. I am not sure if this is a result of a newer version of pandas (I have version 1.0.0) but now the N and K can no longer be set, as done in the video. The _K, _N are attributes in the testing module and set to 4 and 30 respectively. You can change _N by entering nper = 15 in the method call, but I can’t see how to change _K. If you know of a way, let me know, please! Thanks!

Brad Solomon RP Team on April 7, 2020

Hi @Sciencificity, what error are you seeing?

It looks like pandas.util.testing was deprecated in Pandas 1.0 (pandas.pydata.org/docs/whatsnew/v1.0.0.html#deprecations), though you can still set those attributes:

>>> import pandas.util.testing as tm
__main__:1: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
>>> tm.N, tm.K = 15, 3
>>> tm.makeTimeDataFrame(freq='M').head()
                   A         B         C         D
2000-01-31  1.446057  0.660831 -1.395632  0.576446
2000-02-29  0.336925 -0.705131 -0.438653  0.336438
2000-03-31  0.534070  0.433786 -1.367734  0.292544
2000-04-30 -0.508290 -0.130769  0.079307 -0.815311
2000-05-31  1.277667  0.878491  1.372388 -1.640210

This is from pandas 1.0.3 on a python:3-slim-jessie Docker container.

Note also that the “replacement” module (pandas.testing) only exposes assert_extension_array_equal, assert_frame_equal, assert_series_equal, and assert_index_equal.
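To illustrate the difference, the functions in pandas.testing compare existing objects rather than generate data. A minimal sketch using assert_frame_equal, with made-up frames (expected, actual, actual_changed) as the inputs:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
actual = expected.copy()

# Identical frames: the call returns None and raises nothing
assert_frame_equal(actual, expected)

# Change one value and the same call raises AssertionError
actual_changed = expected.copy()
actual_changed.loc[0, "A"] = 99.0
try:
    assert_frame_equal(actual_changed, expected)
    frames_match = True
except AssertionError:
    frames_match = False

print(frames_match)
```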

Brad Solomon RP Team on April 7, 2020

@Sciencificity oops, my mistake, I see what you’re saying now. It looks like setting N and K directly won’t have any effect because the new attributes (which the Pandas developers don’t seem to want to be part of the public API) are _N and _K: github.com/pandas-dev/pandas/blob/master/pandas/_testing.py.

Sciencificity on April 14, 2020

Yip, thanks for the feedback Brad. It’s a pity (would be cool to generate fake data up to a number of cols and rows you want), but it’s not a train smash ;). As an aside, if looking for fake data this website is very cool: www.mockaroo.com/ (I have generated dummy data via their site before for testing :)).

Ranga on April 6, 2024

The util module is not available in the current Pandas version (2.2).

The testing module in Pandas >= 2 does not have any methods for creating data frames.

The solution available with no additional library is to use pd.DataFrame() on NumPy randomly generated arrays.

This must be included as a caution to this lesson, indicating which Pandas version is appropriate or providing an alternate method for other Pandas versions.
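The NumPy-based approach described in this comment might look like the sketch below. make_toy_frame is a hypothetical helper, not a pandas function. Its defaults of 30 rows and 4 columns mirror the _N and _K values quoted earlier in this thread, and the col_0, col_1, ... column names and the missing_frac option are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def make_toy_frame(n_rows=30, n_cols=4, seed=None, missing_frac=0.0):
    """Build a DataFrame of standard-normal floats (hypothetical helper).

    missing_frac is the approximate share of cells replaced with NaN.
    """
    rng = np.random.default_rng(seed)
    values = rng.standard_normal((n_rows, n_cols))
    if missing_frac > 0:
        # Blank out each cell independently with probability missing_frac
        mask = rng.random((n_rows, n_cols)) < missing_frac
        values[mask] = np.nan
    columns = [f"col_{i}" for i in range(n_cols)]
    return pd.DataFrame(values, columns=columns)


toy = make_toy_frame(n_rows=15, n_cols=3, seed=444)
print(toy.head())
```

Passing the same seed twice gives the same frame, which covers the reproducibility that np.random.seed(444) provides in the video.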
