Introducing the Dataset and Benchmarking
00:00 In this lesson, I’ll tell you a little bit about the dataset that this tutorial uses and some of the benchmarking tools used to generate the runtimes you’ll see throughout these tutorials. If you’re not interested in following along with the data or implementing these algorithms yourself and you just want to learn some of the theory, then feel free to skip ahead to the next lesson.
00:22 The dataset I’ll be using for this tutorial is a subset of the IMDb, the Internet Movie Database. This subset includes millions of actor names. The whole database is available for non-commercial use at this address, and it looks something like this when you get there.
00:39
The data that I’ll be using is from this first file here, and it’s actually the first column of this tab-separated values kind of spreadsheet. So if you want to try to download and separate that data into a file called names.txt and then a file called sorted_names.txt on your own, you’re welcome to.
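If you do want to try that wrangling step yourself, here’s a minimal sketch of one way to do it. The exact filename isn’t stated in this transcript, so "name.basics.tsv" below is a hypothetical placeholder for the un-gzipped IMDb file; the function pulls the first tab-separated column of each data row into names.txt, then writes a sorted copy to sorted_names.txt.

```python
def extract_names(
    tsv_path="name.basics.tsv",       # hypothetical local filename
    names_path="names.txt",
    sorted_path="sorted_names.txt",
):
    """Write the first TSV column to names_path, and a sorted copy."""
    names = []
    with open(tsv_path, encoding="utf-8") as tsv_file:
        next(tsv_file)  # skip the header row
        for line in tsv_file:
            # Keep only the first tab-separated field of each row
            names.append(line.split("\t")[0])
    with open(names_path, "w", encoding="utf-8") as out:
        out.write("\n".join(names))
    with open(sorted_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(names)))
```

This is just an illustration of the idea; the download_imdb.py script in the sample code handles the details for you.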
01:00
But otherwise, you can use this download_imdb.py script from the sample code that you can download below this video. That’ll be a lot easier, but it could also be a great exercise to try to do some of this data wrangling on your own. But be careful however you do it, because these files will take up a large amount of space on your computer, and they’ll be difficult to open with a traditional spreadsheet viewer like Excel or something because they’re just so darn big.
01:27
Here’s what the download_imdb script looks like in practice when you run it. Be warned that it’ll probably take a while to download all this data unless you have a lightning-fast connection. So after a couple of minutes, there we go, and if I use ls to show the contents of my directory, you can see that there are names.txt and sorted_names.txt.
01:50
Now you can load those files using the with open() as pattern, and then you can simply read the text in those files into actual Python lists.
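The loading step can be sketched in a few lines, assuming each file contains one name per line:

```python
def load_names(path):
    """Read a text file into a list of lines, one name per entry."""
    with open(path, encoding="utf-8") as text_file:
        # splitlines() drops the trailing newline characters for us
        return text_file.read().splitlines()

# names = load_names("names.txt")
# sorted_names = load_names("sorted_names.txt")
```

The sample code that comes with this lesson does essentially the same thing, so check it against the version you download.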
02:01 Now, all of that of course is covered in the source code that comes with this lesson, so make sure to download that so you can play with this on your own.
02:09
Now, once you have those names.txt and sorted_names.txt files, there are many different ways to measure the performance of your code, meaning your binary search or linear search algorithms that I’ll be showing you how to implement throughout the rest of this series.
02:24 You can analyze the performance of an algorithm based on its time complexity or its space complexity. You could do control-flow analysis and many other kinds of analysis, but I’ll mostly be talking about runtime over the course of this series.
02:37
There are a lot of Python libraries for timing your code, including the built-in time module, the timeit module, and various other libraries, but the runtimes that I’ll show you were generated by a custom script using the time.perf_counter_ns() function, which measures in nanoseconds, under the hood.
02:52
This is a really precise timer for measuring short increments of time, and it’s built into the time module. But if you’re interested in doing some timing of your own code in your own work, then below this course I’ll provide some links to resources with more information on how to time your code. So, bottom line: I’ll be showing you a bunch of runtimes throughout the rest of this series to illustrate the performance of these different algorithms. That’s how they were generated, but if you want to measure your own code in some other way, that’s totally fine.
03:22 Just make sure that you’re careful about it. Okay, now that all that preamble is done with, let’s move into understanding some actual search algorithms.