New Statistics Functions
00:00 In the previous lesson, I showed you how to do asynchronous coding with the new aiter() and anext() functions. In this lesson, I’ll cover the new additions to the statistics module.
00:11 The statistics module was added to Python in version 3.4. This release has added three new functions: correlation(), covariance(), and linear_regression(). Note that although this module has lots of useful stuff in it, if you’re really deep into the stats and math, you probably want to use one of the popular third-party libraries to do your thing instead.
00:34 All three of the new functions help you evaluate the dependencies between two sets of data. To help illustrate this idea, assume you have two lists: one with the number of words in a series of articles and the second with the corresponding views on those same articles.
00:51 Let me import the module
00:56 and then run the new covariance() function.
01:04 covariance() indicates how strongly two variables change together. A positive result means that as the first variable gets bigger, the second variable tends to get bigger too. A negative result means the second variable tends to get smaller when the first gets bigger.
01:21 The magnitude of a covariance depends on the magnitude of the data feeding into it, so it’s a bit hard to interpret, which leads us to…
01:31 correlation! Correlation is a normalized covariance. It ranges between -1 and 1. The closer to 1, the more positive a correlation. The closer to -1, the more negative a correlation. The closer to 0, the less correlation. For the data here, the resulting correlation is 0.45.
01:54 This means that there is some relationship between the two values but not an extreme one. Note that correlation does not indicate causation. Although these two values have some correspondence, they both might be caused by a third unknown factor. For example, maybe an author who writes longer articles is more popular.
02:14 The correlation between the size of the article and the views would then have nothing to do with the length, but would be due to the popularity of the author. Removing the popular author’s data might cause the correlation to plummet.
02:28 If you have a correlation between data, you might want to use it to estimate values not in the original sets. Linear regression helps you do this.
02:43 Linear regression creates a line of best fit through the data, and using that line, you can plot values between the data points. This function returns an object with two values in it: the slope of the best fit line and the intercept point on the graph.
02:59 Using these two values, you can predict other data. I’m going to run the function again, this time storing the result in lr.
03:10 You can access the slope and intercept using dot notation. And now, I can estimate what views there would be for a 10,000-word article based on this linear regression.
03:27 I take the 10,000 words, multiply that by the slope, and then add the intercept. I get a result of 3,528 and a bit. So based on the data given and the linear regression, an article with 10,000 words would probably get about 3,500 views.
03:47 That’s enough math for one day. Next up, list manipulation with zip() and its new footgun safety mechanism.