Introducing the Dataset and Benchmarking

Creating a Binary Search in Python Liam Pulsifer 03:33

00:22 The dataset I’ll be using for this tutorial is a subset of the IMDb, the Internet Movie Database. This subset includes millions of actor names. The whole database is available for non-commercial use at this address, and it looks something like this when you get there.

00:39 The data that I’ll be using is from this first file here, and it’s actually the first column of this tab-separated values (TSV) file, which is laid out like a spreadsheet. So if you want to download that data and separate it into a file called names.txt and a sorted copy called sorted_names.txt on your own, you’re welcome to.
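If you do want to try that split yourself, here's a minimal sketch of one way it could go. It is not the download_imdb script from this lesson. The helper names extract_column and write_lines, the input filename in the usage comments, and the assumption that the file starts with a header row are all made up for illustration.

```python
import csv
from pathlib import Path


def extract_column(tsv_path, column=0, has_header=True):
    """Return the values of one column of a tab-separated file as a list.

    column=0 (the first column) and has_header=True are assumptions;
    adjust them to match the file you actually downloaded.
    """
    values = []
    with open(tsv_path, encoding="utf-8", newline="") as tsv_file:
        # QUOTE_NONE treats quote characters inside names as ordinary text.
        reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # Skip the header row, if there is one.
        for row in reader:
            if len(row) > column:
                values.append(row[column])
    return values


def write_lines(path, lines):
    """Write one item per line to a text file."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical usage, assuming the unpacked file is saved as "names.tsv":
# names = extract_column("names.tsv")
# write_lines("names.txt", names)
# write_lines("sorted_names.txt", sorted(names))
```

Keeping both an unsorted and a sorted copy matters because binary search only works on sorted data, while linear search works on either.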

01:27 Here’s what the download_imdb script looks like in practice when you run it. Be warned that it’ll probably take a while to download all this data unless you have lightning speed wireless. So after a couple of minutes, there we go, and if I use ls to show the contents of my directory, you can see that there is names.txt and sorted_names.txt.

01:50 Now you can load those files using the with open() as pattern, and then you can simply read the text in those files into actual Python lists.
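Here's a rough sketch of what that loading step might look like. The load_names helper is a hypothetical name used for illustration, and it assumes each file holds one name per line in UTF-8. The code that ships with the lesson may do this differently.

```python
def load_names(path):
    """Read a text file with one name per line into a list of strings."""
    with open(path, encoding="utf-8") as text_file:
        # splitlines() drops the trailing newline from every name.
        return text_file.read().splitlines()


# Hypothetical usage, assuming both files are in the current directory:
# names = load_names("names.txt")
# sorted_names = load_names("sorted_names.txt")
```

The with statement closes each file automatically once the block finishes, even if reading raises an error.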

02:01 Now, all of that of course is covered in the source code that comes with this lesson, so make sure to download that so you can play with this on your own.

02:09 Now, once you have those names.txt and sorted_names.txt files, there are many different ways to measure the performance of your code, meaning your binary search or linear search algorithms that I’ll be showing you how to implement throughout the rest of this series.

02:24 You can analyze the performance of an algorithm based on its time complexity or its space complexity. You could also do control-flow analysis and many other kinds of analysis, but I’ll mostly be talking about runtime over the course of this series.

02:37 There are a lot of Python libraries for timing your code, including the time and timeit modules from the standard library and various other libraries, but the runtimes that I’ll show you were generated by a custom script that uses the time.perf_counter_ns() function, which reports time in nanoseconds, under the hood.
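To give you a feel for that approach, here's a small sketch of timing a function call with time.perf_counter_ns(). This is not the custom script mentioned above, which isn't shown here. The time_call helper and its repeat parameter are assumptions made for illustration.

```python
import time


def time_call(func, *args, repeat=5):
    """Call func(*args) several times.

    Returns a tuple of (result of the last call, fastest run in nanoseconds).
    """
    best = None
    result = None
    for _ in range(repeat):
        start = time.perf_counter_ns()
        result = func(*args)
        elapsed = time.perf_counter_ns() - start
        # Keep the fastest run, which is least disturbed by other processes.
        if best is None or elapsed < best:
            best = elapsed
    return result, best


# Example: time a membership test on a small placeholder list.
sample = ["Fred Astaire", "Lauren Bacall", "Marlon Brando"]
found, nanoseconds = time_call(sample.__contains__, "Marlon Brando")
print(f"found={found}, fastest run: {nanoseconds} ns")
```

Taking the best of several runs is just one reasonable choice. You could report the average or the median instead, and the timeit module handles repetition for you if you'd rather not write the loop yourself.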

03:22 Just make sure that you’re careful about it. Okay, now that all that preamble is done with, let’s move into understanding some actual search algorithms.
