Make Toy Data Structures With pandas' Testing Module
Ever find yourself setting up fake data to test certain functions you’ve written?
Let pandas do that for you!
pandas’ testing module provides a number of convenient functions for building quasi-realistic Series and DataFrames.
After watching this video, you’ll know how to quickly create a simple pandas DataFrame and how to find out which functions the testing module provides to create fake data.
00:00 In this video, you’re going to learn how to make toy data structures with Pandas’ testing module. When you’re working with DataFrames, you may find yourself spending quite a bit of time setting up fake data sets to test out different functions.
00:14
If you just need to make a couple sets of somewhat realistic data, however, Pandas has a testing
module that can do this very quickly. So, let’s see how this works. Go ahead and open up a terminal.
00:30
And start your interpreter. import pandas.util.testing as tm, and you can set a module-level default row and columns just by calling that and saying tm.N and tm.K.
00:50
Let’s just do 15 and 3. And just a little more housekeeping, if you want to see the same data that I am, you can import numpy as np and set the random.seed() to 444. So, for our first example, you can call .makeTimeDataFrame() and set the frequency equal to month, and let’s just print out the head.
01:18
So, all this is going to do is generate a quick DataFrame here that has a DatetimeIndex. The rest of the columns are just populated with some random continuous data, but you can see how quick that was to produce.
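A rough sketch of the kind of frame described here, built only from the public pandas and NumPy APIs rather than the testing module. The helper name make_time_frame and its parameters are hypothetical (they are not part of pandas), and the values will not match the ones shown in the video:

```python
import string

import numpy as np
import pandas as pd


def make_time_frame(n_rows=15, n_cols=3, start="2000-01-31", seed=444):
    """Build a toy DataFrame with a month-end DatetimeIndex (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded so repeated runs give the same data
    index = pd.date_range(start=start, periods=n_rows, freq=pd.offsets.MonthEnd())
    columns = list(string.ascii_uppercase[:n_cols])  # "A", "B", "C", ...
    data = rng.standard_normal((n_rows, n_cols))  # random continuous values
    return pd.DataFrame(data, index=index, columns=columns)


print(make_time_frame().head())
```

Passing the row and column counts as arguments takes the place of the module-level tm.N and tm.K defaults used in the video.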
01:32
If you don’t want to use a DatetimeIndex DataFrame, you can call something like makeDataFrame(). Okay.
01:42 And let’s just pull the head off that. And this has more of an ID, where it’s just a random string here to identify each row. And you still have the continuous data in your columns over here. While having two types of DataFrames like this is probably more than enough for most cases, Pandas actually includes quite a few of these.
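As a loose illustration of a frame with random string row IDs, here is a sketch using only public pandas and NumPy calls. The helper make_string_indexed_frame, its parameters, and the 10-character ID length are assumptions made for this example, not pandas functions or settings:

```python
import string

import numpy as np
import pandas as pd


def make_string_indexed_frame(n_rows=15, n_cols=3, id_length=10, seed=444):
    """Build a toy DataFrame whose index is random strings (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    alphabet = list(string.ascii_letters + string.digits)
    # Draw one row of characters per ID, then join each row into a single string.
    chars = rng.choice(alphabet, size=(n_rows, id_length))
    ids = ["".join(row) for row in chars]
    data = rng.standard_normal((n_rows, n_cols))
    columns = list(string.ascii_uppercase[:n_cols])
    return pd.DataFrame(data, index=pd.Index(ids), columns=columns)


print(make_string_indexed_frame().head())
```

The IDs are drawn at random, so uniqueness is very likely but not guaranteed.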
02:01
If you want to see all of them you could use the dir() function. So let’s just make a little list comprehension here.
02:12
We’ll pull all the attributes off that with that dir() function, just something like
02:21
.startswith('make'), because these are all like, make DataFrames. Okay. And, of course, I forgot the in in the list comprehension!
02:36 So you can see, there’s quite a few in here. A couple to note—there’s DataFrames that’ll have missing values, different mixed data. And some other kinds, like categories, ones you have a little bit more control over, like a range index.
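The dir() and .startswith('make') filter works on any module. Because pandas.util.testing may not be importable on newer pandas versions, this sketch runs the same list comprehension against the standard library's os module as a stand-in. The helper name names_with_prefix is hypothetical:

```python
import os


def names_with_prefix(module, prefix="make"):
    """Return the attribute names of `module` that start with `prefix`."""
    return [name for name in dir(module) if name.startswith(prefix)]


# On a pandas version that still ships the testing helpers, you would pass
# that module (tm in the video) instead of os.
print(names_with_prefix(os))
```

The exact names printed depend on the platform and Python version.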
02:51 So, depending on how specific you need your data to be to test whatever you’re looking at, there’s a good chance Pandas has a way to create that data for you. And that’s it!
03:00 You should now know how to make some fake data pretty quickly with Pandas. Thanks for watching.
Brad Solomon RP Team on April 7, 2020
Hi @Sciencificity, what error are you seeing?
It looks like pandas.util.testing was deprecated in Pandas 1.0 (pandas.pydata.org/docs/whatsnew/v1.0.0.html#deprecations), though you can still set those attributes:
>>> import pandas.util.testing as tm
__main__:1: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
>>> tm.N, tm.K = 15, 3
>>> tm.makeTimeDataFrame(freq='M').head()
                   A         B         C         D
2000-01-31  1.446057  0.660831 -1.395632  0.576446
2000-02-29  0.336925 -0.705131 -0.438653  0.336438
2000-03-31  0.534070  0.433786 -1.367734  0.292544
2000-04-30 -0.508290 -0.130769  0.079307 -0.815311
2000-05-31  1.277667  0.878491  1.372388 -1.640210
This is from pandas 1.0.3 on a python:3-slim-jessie Docker container.
Note also that the “replacement” module (pandas.testing) only exposes assert_extension_array_equal, assert_frame_equal, assert_series_equal, and assert_index_equal.
Brad Solomon RP Team on April 7, 2020
@Sciencificity oops, my mistake, I see what you’re saying now. It looks like setting N and K directly won’t have effect because the new attributes (which the Pandas developers don’t seem to want to be part of the public API) are _N and _K: github.com/pandas-dev/pandas/blob/master/pandas/_testing.py.
Sciencificity on April 14, 2020
Yip, thanks for the feedback Brad. It’s a pity (would be cool to generate fake data up to a number of cols and rows you want), but it’s not a train smash ;). As an aside, if looking for fake data this website is very cool: www.mockaroo.com/ (I have generated dummy data via their site before for testing :)).
Ranga on April 6, 2024
The pandas.util.testing module is not available in the current Pandas version (2.2).
The testing module in Pandas >= 2 does not have any functions for creating DataFrames.
The solution available with no additional library is to use pd.DataFrame() on NumPy randomly generated arrays.
This must be included as a caution to this lesson, indicating which Pandas version is appropriate or providing an alternate method for other Pandas versions.
Sciencificity on April 4, 2020
Hello! Thanks for teaching me about creating fake data. I am not sure if this is a result of a newer version of pandas (I have version 1.0.0) but now the N and K can no longer be set, as done in the video. The _K, _N are attributes in the testing module and set to 4 and 30 respectively. You can change _N by entering nper = 15 in the method call, but I can’t see how to change _K. If you know of a way, let me know, please! Thanks!