Avoiding Bias
00:00 In the previous lesson, I showed you the code for rounding down. In this lesson, I am going to talk about an important factor in your rounding choice: rounding bias.
00:10 Think about a big set of numbers that you want to apply rounding to. Whatever rounding you do, you do not really want to change the overall shape of your data.
00:20 Say you had some data that had a Gaussian distribution, that’s a good old bell curve. When applying rounding to the values in your data, you would not want the result to have shifted the curve at all.
00:32 Mathematically, this is known as symmetry around zero. A function is symmetric around zero if summing the results of the function applied to x and then again to negative x produces zero. Let’s go to the REPL and explore some symmetry.
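As a rough sketch of that definition, the check below tests whether f(x) + f(-x) comes out to zero for a handful of values. It uses the standard library's math.trunc(), math.ceil(), and math.floor() as whole-number stand-ins for truncating, rounding up, and rounding down. The helper name is_symmetric_around_zero() and the sample values are made up for illustration and are not from the lesson.

```python
import math


def is_symmetric_around_zero(func, samples):
    """Return True if func(x) + func(-x) == 0 for every sample x."""
    return all(func(x) + func(-x) == 0 for x in samples)


# Hypothetical sample values, chosen to be exactly representable as floats.
samples = [0.25, 1.5, 2.75, 10.125]

trunc_symmetric = is_symmetric_around_zero(math.trunc, samples)
ceil_symmetric = is_symmetric_around_zero(math.ceil, samples)
floor_symmetric = is_symmetric_around_zero(math.floor, samples)

print("trunc:", trunc_symmetric)
print("ceil: ", ceil_symmetric)
print("floor:", floor_symmetric)
```

With these samples, only the truncation check passes. Rounding up and rounding down each leave a leftover when you add the result for x to the result for negative x.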
00:55 I'm going to use this list of numbers to illustrate the point about symmetry. The list isn’t large, so the results will not be perfect, but the difference will be big enough to be clear.
01:06 Remember that bell curve I was talking about? Well, the center point on that curve is the mean, colloquially known as the average. If an algorithm is symmetric around zero, the mean of the processed data should be the same as the mean of the original. In other words, the average should not move.
01:26 The statistics package in Python has a function that calculates the mean. Let’s use that on our data first.
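The REPL session is not captured in this transcript, so here is a minimal sketch of what that step might look like. The list below is a hypothetical set of six values and is not the list used in the video, so the mean it prints will not match the one on screen.

```python
import statistics

# Hypothetical data: six values, NOT the list used in the video.
numbers = [1.23, -2.67, 0.48, -1.72, 3.36, -0.85]

# statistics.mean() returns the arithmetic mean of the data.
original_mean = statistics.mean(numbers)
print(original_mean)
```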
01:35 Alright, that’s our mean. Let’s start by truncating. First, let’s see the truncation of all the values in numbers.
01:52 This is a list comprehension. If you have not seen those before, it’s a quick way of calculating a new list. It’s kind of like a shortened version of a for loop.
02:01 This one is looping through all the values of numbers and calling truncate() on each of them, producing a new list, which is truncated. Because I truncated to one decimal place, I am dropping the hundredths place.
02:15 Now that you have seen that, let’s calculate the mean on that data.
02:25 And the mean of the truncated values is pretty close to the original. It’s only off by about 1%, and that’s because truncate() is actually symmetric around zero.
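Here is a sketch of the truncation step and its mean. The body of truncate() is an assumption, since the transcript only names the function, and the numbers list is hypothetical, so the exact percentages from the video will not be reproduced. What it should show is the same pattern: the truncated mean stays close to the original.

```python
import statistics


def truncate(n, decimals=0):
    """Drop every digit past `decimals` places (assumed definition)."""
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier


# Hypothetical data: six values, NOT the list used in the video.
numbers = [1.23, -2.67, 0.48, -1.72, 3.36, -0.85]

# The list comprehension calls truncate() on each value in numbers.
truncated = [truncate(n, 1) for n in numbers]

original_mean = statistics.mean(numbers)
truncated_mean = statistics.mean(truncated)

print(truncated)
print(f"original mean:  {original_mean:+.4f}")
print(f"truncated mean: {truncated_mean:+.4f}")
```

Positive values get pulled down toward zero and negative values get pulled up toward zero, so the shifts largely cancel out when the data has a mix of both.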
02:35 Now let’s do the same with the round_up() function. Here are the rounded values.
02:56 Wow, that’s significant. It’s pulled the mean up by almost 30%. The shape of the data has changed. Let’s do the same for rounding down.
03:24 And again, a big difference. Even with only six data points, you can see how the shape of the data is being changed, shifting the mean up or down depending on which algorithm you use.
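The comparison can be sketched like this. The round_up() and round_down() bodies are assumptions built on math.ceil() and math.floor(), since the transcript does not show the code from the earlier lessons, and the numbers list is again hypothetical. Expect the direction of the shift to match the lesson, not the exact percentages.

```python
import math
import statistics


def round_up(n, decimals=0):
    """Round toward positive infinity (assumed definition)."""
    multiplier = 10 ** decimals
    return math.ceil(n * multiplier) / multiplier


def round_down(n, decimals=0):
    """Round toward negative infinity (assumed definition)."""
    multiplier = 10 ** decimals
    return math.floor(n * multiplier) / multiplier


# Hypothetical data: six values, NOT the list used in the video.
numbers = [1.23, -2.67, 0.48, -1.72, 3.36, -0.85]

original_mean = statistics.mean(numbers)
up_mean = statistics.mean([round_up(n, 1) for n in numbers])
down_mean = statistics.mean([round_down(n, 1) for n in numbers])

print(f"round_down mean: {down_mean:+.4f}")
print(f"original mean:   {original_mean:+.4f}")
print(f"round_up mean:   {up_mean:+.4f}")
```

Every value moves the same way under each function, so nothing cancels: the original mean ends up sandwiched between the round-down mean and the round-up mean.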
03:37 Great. The truncate() function, which you’ll remember cost us $99 in the stock experiment, is symmetric around zero. But of course, it isn’t really rounding.
03:48 Round up trends towards positive infinity. That means as you use it, your mean is going to shift in the positive direction. Round down trends towards negative infinity.
04:00 It goes in the other direction. Truncate might be bad at rounding, but it does not change the shape of the data. So how do you get around this? Well, part of the problem is in how the algorithms deal with one specific case.
04:15 Consider the values between 1.2 and 1.3. 1.23 is to the left of center. Using round up, it moves to the right, even though it is to the left of center. Using round down, it’s fine because it is to the left of center.
04:30 Your grade school algorithm says this should go to the left, which kind of makes sense. Likewise for 1.28, rounding up shifts to the right, no problem, but rounding down shifts to the left, introducing the bias in the other direction.
04:47 Your grade school algorithm says it should go to the right. So far, the grade school algorithm is fairly unbiased, but there’s one problem. What do you do with 1.25?
05:00 Your grade school algorithm says it should go to the right. Well, that’s a bias because this is right in the middle and it’s a tie. Pretty much the rest of this course is about how to deal with that tie and how that tie affects the bias.
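To make the tie concrete, here is a minimal sketch of the grade school rule applied to the three values from this walkthrough, plus the negative tie. The function name round_half_up() and its body are assumptions for illustration, not code shown in this lesson. The tie value 1.25 is used because it is exactly representable as a float, which keeps floating-point noise out of the picture.

```python
import math


def round_half_up(n, decimals=0):
    """Grade school rounding: a tie always goes to the right (assumed implementation)."""
    multiplier = 10 ** decimals
    return math.floor(n * multiplier + 0.5) / multiplier


for value in (1.23, 1.28, 1.25, -1.25):
    print(value, "->", round_half_up(value, 1))
```

1.23 goes left to 1.2 and 1.28 goes right to 1.3, as you would expect. The ties are where it gets lopsided: 1.25 goes to 1.3 and -1.25 goes to -1.2, so both move in the positive direction.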