Here are resources for more information about Timsort:
The final algorithm I’m going to talk about today is actually the native sorting algorithm for Python, which is called Timsort. It’s named for Tim Peters, who is the guy who implemented the algorithm in the first place. Now when you use in Python the
sorted() function, which is part of the standard library, Timsort is the sorting function that that uses.
00:40 So, let’s take a look at how it works. Timsort doesn’t lend itself super easily to diagram, so I’m just going to explain it as best I can and hopefully your experience with insertion sort and merge sort will help you understand this, because Timsort is kind of like a combination of those two kinds of sorts.
00:57 So the basic idea is you divide your list into super small sections—say, for example, smaller than 32 items. So small sublists, and then you use insertion sort to sort all of those sections because insertion sort is really fast on small lists because of its low overhead, it doesn’t have to allocate any new space, and it sorts in place, which is really nice.
01:18 Then you can merge those sections together using the merge sorted lists algorithm that I demonstrated earlier as part of merge sort. And you merge those together two at a time until the whole list is sorted.
01:29 Timsort is an example of what some call a block merge sorting algorithm, which is you divide the data into blocks, you sort those blocks with some other algorithm, and then you merge them using the ability to merge sorted lists in O(n) time.
01:52 So if the list is already sorted, insertion sort will see that really quickly and it won’t ever need to do any extra work. And then it also uses the strengths of merge sort, which is that merge sort is much faster on larger lists, it has a worst-case performance of O(n log n), and the great thing is that it mitigates the weaknesses of both of these algorithms, right?
02:14 Merge sort can sometimes be a little slow in practice as an algorithm because it often needs to allocate extra space. In fact, it always needs to allocate extra space unless you do some serious extra work.
02:26 And this algorithm allows you to use the merging sorted lists part of merge sort without necessarily needing to use the recursive part of it. Whereas with insertion sort, which is pretty slow on large lists, you don’t actually have to use that part of insertion sort here.
02:40 So it combines these two algorithms in a really nice way. What I’m going to do now is I’m going to go through an implementation of this and it will require that I do some slight modifications to insertion sort, but then I’ll be able to use that insertion sort with some extra logic to implement Timsort.
left defaults to
right defaults to the length of
1, and now I just want to go through—instead of going through the whole list, I want to go through from
left + 1 to
right + 1, and it will only go up to but not including
right + 1, right?
So, Timsort is going to start out just like the other sorting functions that I’ve used here. So I’m going to take in—and I’m not going to call it an
array, I’m going to call it
items—and I’ll say
timsort(items), and there are a couple of different things to do here. The first is to define, let’s say, a
min_subsection_size, and I’ll define it to be
05:15 I’m not going to worry too much about that right now because it’s just to show you how the general part of the algorithm works, but if you’re interested, I would encourage you to check out the Python source code, which is readily available online, and you can read the logic for how this actually works in the Python source code. Be warned, though, it is a little dense and it can be kind of obtuse, so there are a wealth of resources online as well that will explain this algorithm in minute detail. So go check those out for sure.
So now, first thing to do is you want to go through the whole list and say
for i in range(0, len(items)), so go through the whole array, and essentially what you’re trying to do is divide it into chunks based on the
Then what you want to do there is you want to insertion sort each of these subsections. And as you recall, I can put in
items and then I can put in
i, and I can put in the minimum of, let’s see, it should be
i + min_subsection_size - 1 because this is just to make the arithmetic work out, and then
1. And this is a super long line, but I hope you’ll forgive me for that.
Essentially, this is just saying “Insertion sort
i to the minimum of
i plus this subsection size minus
1 or length of
items minus 1.” So if the last subsection goes over the end of the array, then you want to just use the length of the
1, instead of adding the next
I’ll start out here and I’ll say the size of the lists that I’m working with is going to be the
min_subsection_size. That’s the initial size of each array that I’m working with. And I’ll say
while size < n, so of course, when the size of the full array becomes the size of
items—and I wrote
n just because I originally had that written as the size of the array, but I’ll just say
len(items) here. And now that I’m looking at this, it occurs to me that I wrote
len(items - 1), which is not an expression with a length. This is what it should be.
while size < len(items): I need to first determine which subarrays are going to be merged together. So I’ll say
for start in range(0, len(items)), and going up by steps of
size * 2, because I’m trying to merge each array together, I’ll calculate the
midpoint. So essentially what that is is just the end of the first subarray that I’m trying to merge here.
So now I have a
midpoint and the
end, so I know where the two subarrays start, so I can say
merged_array = merge_sorted_lists() where the left sorted list is
midpoint + 1, and then the next subarray is
midpoint + 1 to
end + 1.
And again, we want to add that
+ 1 in there to make this so that it goes up to but not including the actual end. And if you’re a stickler for Python style, I’m going to put these like this so that it’s kind of more clear what’s happening here.
10:22 So, this is kind of a complicated implementation, as I’m sure you’re becoming aware here. But really, it’s doing simple things. It just requires some logic to make sure that your indices are all right.
So now I just need to multiply the
2, because now I’ve merged all of these arrays into chunks that are twice the size of the previous chunks, and you’ll just keep doubling the size of those chunks.
And then, finally, you’ll just return
items at the end of it. So, that is Timsort! And I need to make sure I remember to say
from merge_sort import merge_sorted_lists, and that should be good because it’ll allow me to actually merge these together.
And then I’ll just say
timsort(items) and the timer will tell me how long it took. As you can see, it takes about five thousandths of a second to run for a list of a thousand elements, which is pretty darn good. Let’s crank it up to
10000, see how that goes. And so, as you can see, this has very fast performance on the order of merge and quicksort. For
100000, it takes a little bit longer. It takes about a second.
And then I’ll just bump it up to
1000000 and we’ll see what happens. As you can see, that one took a little bit longer to run—about 13 seconds—but for a million elements in a pure Python sorting algorithm, that’s pretty darn good. And the real Timsort, of course—the one that’s actually used in Python—is much, much faster because it’s written purely in C on the backend and then Python simply calls those C functions.
12:26 So that’s an important performance optimization that’s done under the hood in the actual version. And as I said, you’re welcome to check out that algorithm as it’s actually implemented in the Python source code because that’s all online.
12:39 There are many great things about Timsort. The first is that it’s stable, unlike quicksort, and its worst-case performance is O(n log n), so it doesn’t have any really terrible cases, any pathological cases where the complexity soars.
12:55 And finally, its best-case performance when the list is already sorted is actually order of n, so it’s super fast with almost-sorted lists. And in reality, it turns out, for general purpose programming languages most of the things that people try to sort have some elements of sorted data already, have some subsections.
13:12 And this algorithm really feeds on that because whenever a subsection is sorted, insertion sort can sort it incredibly quickly. So, Timsort—I mean, I wouldn’t go so far as to say it’s a perfect sorting algorithm because we’re always improving sorting algorithms, but it is pretty darn good and it’s about the best that you can find out there.
13:30 Luckily, it comes standard in Python, so hopefully you won’t ever have to implement it on your own. But hey, if you write your own programming language at some point, which is something that is really fun and interesting to do, you might very well want to write a Timsort implementation for it, and that would be super cool. Next up is the summary.
Become a Member to join the conversation.