Here are resources for more information about the dataset and about timing your own code:
Introducing the Dataset and Benchmarking
00:00 In this lesson, I’ll tell you a little bit about the dataset that this tutorial uses and some of the benchmarking tools that were used to get the runtimes you’ll see throughout these tutorials. If you’re not interested in following along with the data or implementing these algorithms yourself and you just want to learn some of the theory, then feel free to skip ahead to the next lesson.
00:22 The dataset I’ll be using for this tutorial is a subset of the IMDb, the Internet Movie Database. This subset includes millions of actor names. The whole database is available for non-commercial use at this address, and it looks something like this when you get there.
The data that I’ll be using is from this first file here, and it’s actually the first column of this tab-separated values kind of spreadsheet. So if you want to try to download and separate that data into a file called names.txt and then a file called sorted_names.txt on your own, you’re welcome to. But otherwise, you can use this download_imdb.py script from the sample code that you can download below this video. That’ll be a lot easier, but it could also be a great exercise to try to do some of this data wrangling on your own. Be careful however you do it, though, because these files will take up a large amount of space on your computer, and they’ll be difficult to open with a traditional spreadsheet viewer like Excel because they’re just so darn big.
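If you do want to try the wrangling yourself, a minimal sketch might look like this. The input file name and the column index are assumptions here, not part of the course’s actual download_imdb.py script, so adjust them to match the file you download:

```python
import csv
from pathlib import Path

# Assumptions: the downloaded TSV is called "name.basics.tsv" and the
# names live in column 0, as described in the lesson. Adjust as needed.
RAW_FILE = Path("name.basics.tsv")
NAME_COLUMN = 0

def extract_names(raw_path, column=NAME_COLUMN):
    """Pull one column out of a tab-separated file, skipping the header row."""
    with open(raw_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header
        return [row[column] for row in reader if row]

def write_name_files(names, plain="names.txt", sorted_name="sorted_names.txt"):
    """Write the names as-is and again in sorted order, one name per line."""
    Path(plain).write_text("\n".join(names) + "\n", encoding="utf-8")
    Path(sorted_name).write_text("\n".join(sorted(names)) + "\n", encoding="utf-8")

if __name__ == "__main__":
    write_name_files(extract_names(RAW_FILE))
```

The csv module with a tab delimiter handles quoting edge cases that a plain line split would miss, and reading everything into a list is fine here since the goal is to sort it anyway.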
Here’s what the download_imdb script looks like in practice when you run it. Be warned that it’ll probably take a while to download all this data unless you have a lightning-fast connection. So after a couple of minutes, there we go, and if I use ls to show the contents of my directory, you can see that the names.txt and sorted_names.txt files are there.
Now, once you have those names.txt and sorted_names.txt files, there are many different ways to measure the performance of your code, meaning the binary search and linear search algorithms that I’ll be showing you how to implement throughout the rest of this series.
02:24 You can analyze the performance of an algorithm based on its time complexity or its space complexity, and you could do control-flow analysis and many other kinds of analysis, but I’ll mostly be talking about runtime over the course of this series.
There are a lot of Python libraries for timing your code, including the built-in time module, the timeit module, and various other libraries, but the runtimes that I’ll show you were generated by a custom script using the time.perf_counter_ns() (nanoseconds) function under the hood.
This is a really precise timer for measuring short increments of time that’s built into the time module. If you’re interested in doing some timing of your own code in your own work, then below this course I’ll provide some links to resources with more information on how to time your code. So, bottom line: I’ll be showing you a bunch of runtimes throughout the rest of this series to illustrate the performance of these different algorithms, and this is how they were generated, but if you want to measure your own code in some other way, that’s totally fine.
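The custom script itself isn’t shown in this lesson, but a minimal timing sketch along the same lines, built on time.perf_counter_ns, might look like this. The contains function is just a placeholder to have something to time, not part of the course’s code:

```python
import time

def time_ns(func, *args, repeat=10):
    """Run func several times and return the best (lowest) runtime in nanoseconds.

    Taking the minimum over several runs is a common way to reduce noise
    from other processes running on the machine.
    """
    best = None
    for _ in range(repeat):
        start = time.perf_counter_ns()
        func(*args)
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best

def contains(items, target):
    """Placeholder linear search, here only so there is something to time."""
    return target in items

names = ["Fred Astaire", "Alan Turing", "Grace Hopper"]
print(f"linear search took {time_ns(contains, names, 'Grace Hopper')} ns")
```

perf_counter_ns returns an integer count of nanoseconds, which avoids the floating-point rounding you’d get from time.perf_counter when measuring very short runs.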