Pure Python Histograms
When you’re preparing to plot a histogram, it’s simplest to not think in terms of bins but rather to report how many times each value appears (a frequency table). A Python dictionary is well-suited for this task:
>>> # Need not be sorted, necessarily
>>> a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
>>> def count_elements(seq) -> dict:
...     """Tally elements from `seq`."""
...     hist = {}
...     for i in seq:
...         hist[i] = hist.get(i, 0) + 1
...     return hist
...
>>> counted = count_elements(a)
>>> counted
{0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}
count_elements() returns a dictionary with unique elements from the sequence as keys and their frequencies (counts) as values. Within the loop over seq, hist[i] = hist.get(i, 0) + 1 says, “For each element of the sequence, increment its corresponding value in hist by 1.”
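To see what .get(i, 0) is doing, here is a minimal sketch of the same tally with the lookup written out as an explicit if/else. It is only an illustration of the logic, not a replacement for the version above:

```python
def count_elements(seq) -> dict:
    """Tally elements from `seq`, spelling out what .get(i, 0) handles."""
    hist = {}
    for i in seq:
        if i in hist:
            # Seen before: bump the existing count.
            hist[i] += 1
        else:
            # First sighting: the case .get(i, 0) covers with its default of 0.
            hist[i] = 1
    return hist


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
print(count_elements(a))     # {0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}
print({}.get("missing", 0))  # 0 (the default is returned for an absent key)
```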
In fact, this is precisely what is done by the collections.Counter class from Python’s standard library, which subclasses a Python dictionary and overrides its .update() method:
>>> from collections import Counter
>>> recounted = Counter(a)
>>> recounted
Counter({1: 3, 7: 2, 0: 1, 2: 1, 3: 1, 23: 1})
You can confirm that your handmade function does virtually the same thing as collections.Counter by testing for equality between the two:
>>> recounted.items() == counted.items()
True
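Beyond equality, a Counter offers a few conveniences that a plain dictionary does not. The short sketch below uses only documented Counter behavior (most_common() and the zero default for missing keys). The name top_two is made up for this example:

```python
from collections import Counter

a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
recounted = Counter(a)

# most_common(n) lists (value, count) pairs from highest count to lowest.
top_two = recounted.most_common(2)
print(top_two)  # [(1, 3), (7, 2)]

# Converting back to a plain dict keeps the same key/count pairs.
as_dict = dict(recounted)
print(as_dict == {0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1})  # True

# Unlike a plain dict, a Counter reports 0 for values it never saw.
print(recounted[42])  # 0
```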
Technical Detail: The tallying helper that collections.Counter uses internally defaults to a more highly optimized C function if it’s available. Within the Python function count_elements(), one micro-optimization you could make is to declare get = hist.get before the for loop. This would bind a method to a variable for faster calls within the loop.
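As a sketch, the micro-optimization could look like the following. The function name count_elements_fast is made up for this example. Any speedup is likely small and depends on the Python version, so measure with timeit before relying on it:

```python
def count_elements_fast(seq) -> dict:
    """Tally elements from `seq`, looking up hist.get only once."""
    hist = {}
    get = hist.get  # bind the method to a local name before the loop
    for i in seq:
        # Calling the local name skips the attribute lookup on every pass.
        hist[i] = get(i, 0) + 1
    return hist


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
print(count_elements_fast(a))  # {0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}
```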
00:00 To expand on our definition of a histogram: it takes a series of data and groups it into bins, counting the number of times data points fall into each bin range.
00:09 We can simplify this by making a histogram that just reports back how many times a value occurs in a dataset. Python dictionaries are perfect for this. So in your text editor, go ahead and create a list.
00:23 These numbers don’t need to be sorted for what you’ll be doing, but I’m going to keep them sorted anyway.
00:32 Next, define a function called count_elements() that will take a seq (sequence) and return a dictionary. So go ahead, make an empty dictionary, and then for i in seq: you’re going to take the hist dictionary at that i index and call .get(). So if the value is there, you’ll get the value.
00:58 Otherwise, you’ll just get a 0. And add 1 to it, then return the hist dictionary. All right! Try this out. Just say counted = count_elements(), pass in that a list, and go ahead and just print out counted.
01:26 Let’s see what we get. All right! You can see that 0 appears once, 1 appears three times, 2, 3, and then 7 appears twice, and 23 appears once.
01:40 In fact, if you’re familiar with collections,
01:47 the Counter class does the exact same thing. If you wanted to say something like recounted = Counter(a), and actually I’ll just print this out below.
02:04 And run that. You’ll see that you get this Counter object here, which has the same values. Note that they’re not in the same order. This is sorted from the most occurring to the least occurring.
02:15 But if you wanted to just make sure, you could say something like recounted.items() == counted.items(), and you should get True when you run this.
02:29 Yep! So they’re functionally equivalent. This is great, but these outputs aren’t anything to really look at. Histograms are supposed to be visual charts to look at your data, and these are just printed dictionaries.
02:41 Let’s go ahead and define another function to give a more visual output. I’m going to get rid of all of this Counter stuff.
02:54 And now go and define an ascii_histogram() that will also take in a sequence, and return None. Everything from this function will be printed out into the terminal.
03:11 The first thing you’ll do is use that count_elements() function
03:22 And go ahead and save this as something like counted, which is pretty similar to this, so I’m going to delete that. And now that dictionary, you’re going to loop through. So for k in ... and this time you’ll want to sort that.
03:41 And you can go ahead, use some f-string formatting, and print out k. And then in here, notice I’m using double quotes ("), because I’ve already got single quotes (') out here.
03:53 Now you’ll print out a plus sign ("+") as many times as that value k appears in counted. Now to clean this up some more, up here just go ahead and import random, because we’re going to need a slightly bigger dataset.
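Put together, the ascii_histogram() function described so far might look like the sketch below. The exact spacing inside the f-string is an assumption, since the transcript does not spell it out. count_elements() is repeated so the snippet runs on its own:

```python
def count_elements(seq) -> dict:
    """Tally elements from `seq`."""
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist


def ascii_histogram(seq) -> None:
    """Print one row per distinct value: the value, then one '+' per occurrence."""
    counted = count_elements(seq)
    for k in sorted(counted):
        # Double quotes inside, because single quotes delimit the f-string.
        print(f'{k} {"+" * counted[k]}')


ascii_histogram((0, 1, 1, 1, 2, 3, 7, 7, 23))
# 0 +
# 1 +++
# 2 +
# 3 +
# 7 ++
# 23 +
```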
04:10 And set your random seed value to 1, and this will ensure that you get the same values that I do. Okay. Make a list called vals, and set this equal to [1, 3, 4, 6, 8, 9, 10].
04:28 And then do a little list comprehension here for the frequencies. You’ll take that random library, generate a random integer between 5 and 15 for each value in vals.
04:45 And I can use an underscore (_) here because we don’t really care about which value this is when calculating these random integers. Now create a data list, and then loop with for f, v in zip(), zipping the frequencies and the vals together.
05:09 And then you’re just going to extend the list.
05:22 Alrighty! And now you should be able to just run that ascii_histogram() function and pass in data. Open up the terminal, see what you get!
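The data-building steps from the last few minutes could be sketched as follows. The variable name freq and the [v] * f expression passed to .extend() are assumptions, since the transcript only says to zip the frequencies with the values and extend the list:

```python
import random

random.seed(1)  # fixed seed so repeated runs produce the same numbers

vals = [1, 3, 4, 6, 8, 9, 10]
# One random frequency between 5 and 15 (inclusive) per value; "_" is unused.
freq = [random.randint(5, 15) for _ in vals]

data = []
for f, v in zip(freq, vals):
    # Append the value v to data f times.
    data.extend([v] * f)

print(len(data) == sum(freq))  # True
```

Passing data to ascii_histogram() then prints one row of plus signs per value in vals.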
05:40 Alrighty, let me just open this up here. And you can see, you kind of have a little bit of a histogram going on here. And if you use the same seed number I did, your values should show up identically. But looking at this real quick, you can see that 3 appears the most times, and then 4 and 8 are much less. That’s pretty cool!
05:59 Now you can make histograms pretty much from scratch just using the Python standard library. You probably wouldn’t want to share this in a presentation, however, and you might want some more control over your bins. Right now, this is just a frequency table. So in the next video, you’re going to learn how to use NumPy to create bins and group your data like a more traditional histogram. Thanks for watching.
Dan Bader RP Team on Sept. 6, 2019
@chrismarkella: Yep! Those are called type hints, check out this article:
chrismarkella on Sept. 6, 2019
Great. Thank you!
Pygator on Sept. 16, 2019
Dictionaries don’t have indices, they have keys. Otherwise, good material. It’s a beginner’s mistake, because lists and tuples are indexed with [] subscript notation too.
chrismarkella on Sept. 6, 2019
Great presentation. Do you have any tutorial about the arrow syntax with the return type? I tried it with a dummy function but it didn’t force the type.