Typed Arrays and Strings
00:00
In the previous lesson, I discussed Python’s list and tuple built-in types. In this lesson, I’m going to discuss typed arrays and how strings themselves are also arrays.
00:12
Python provides an implementation of a typed array in the array module, as a class that is also called array. Because it’s typed, you have to specify what kind of information is being stored in the array at construction. The list type, although it’s an array, isn’t very densely packed, because any item in the list can be of a different type and therefore a different size. The array.array class is typed, so sizing is predictable and memory efficient.
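A rough way to see the difference in packing is to look at element sizes directly. This is only a sketch: the 1000 values are made up, and the exact numbers printed depend on the platform and Python build.

```python
import array
import sys

values = list(range(1000))
typed = array.array("I", values)

# Every element in the typed array occupies the same number of bytes.
print(typed.itemsize)

# The raw payload is simply itemsize * number of elements.
payload_bytes = typed.itemsize * len(typed)
print(payload_bytes)

# getsizeof() on a list counts the list object and its references only;
# the int objects those references point to live elsewhere in memory.
print(sys.getsizeof(values))
print(sys.getsizeof(typed))
```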
00:42
Similar to list, the array is still mutable and dynamic. You can change any part of it and you can add things to the end of it and it’s dynamically resized.
00:53
Because it’s not a built-in type, you’ll have to import the array module to use it. You construct a new array by instantiating the class and specifying what is going to be contained inside of it.
01:12
The capital "I" here indicates that this array is going to store integers of an unsigned type. That means integers that are 0 or greater.
01:23
Looking at the .__repr__() representation of extensions, you can see that it’s an array with a type code of capital 'I', followed by its contents.
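The construction step might look something like the following sketch. The variable name extensions comes from the lesson; the numbers stored in it are made up for illustration.

```python
import array

# "I" is the type code for unsigned integers; the values are illustrative only.
extensions = array.array("I", [1001, 1002, 1003])

# The repr shows the type code followed by the contents.
print(repr(extensions))  # array('I', [1001, 1002, 1003])
```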
01:33 As you would expect with an array, you can get at information through a subscript.
01:40 You can assign the subscript, changing the values.
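As a small sketch of subscript access and assignment, again assuming made-up contents for the extensions array:

```python
import array

extensions = array.array("I", [1001, 1002, 1003])

# Read an element by index.
second = extensions[1]
print(second)  # 1002

# Assign through the subscript to change a value in place.
extensions[1] = 2002
print(extensions)  # array('I', [1001, 2002, 1003])
```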
01:55
And then, similar to the list, you can append. Because it’s typed, if you try to put something in that doesn’t fit to the typing, you will get an error.
02:08 A string isn’t an integer, so that’s a problem.
02:15
Notice that it’s also very specific. 3.14 doesn’t get converted into an integer. It sees that as a float, and it also can’t be put inside of the array.
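The append behavior and the two rejected values can be sketched as below. The array contents and the string "1005" are assumptions for illustration, and the sketch assumes a recent Python 3 release, where both bad values raise TypeError.

```python
import array

extensions = array.array("I", [1001, 1002, 1003])

# Appending a value of the right type works just like list.append().
extensions.append(1004)

# A string and a float are both rejected by an unsigned-integer array.
errors = []
for bad_value in ("1005", 3.14):
    try:
        extensions.append(bad_value)
    except TypeError as exc:
        errors.append(type(exc).__name__)
        print(f"{bad_value!r} rejected: {exc}")

print(extensions)
```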
02:27
The array class supports many different types of content. Little 'b' means a signed character, and capital 'B' means an unsigned character.
02:38
Little 'u' is a Unicode character. Notice that the size column here is the minimum size in bytes. This is because, depending on what platform you’re running on, the number of bytes to represent certain things may be larger.
02:51
32-bit versus 64-bit platforms will have different sizes for some of these types. 'h' and 'H' are for signed and unsigned shorts, 'i' and 'I' for signed and unsigned integers, 'l' and 'L' for signed and unsigned longs, 'q' and 'Q' for signed and unsigned long longs, 'f' for float, and 'd' for double. With the exception of the Unicode character and the floats, everything maps to Python’s integer type.
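To check the actual sizes on a given machine, one option is to ask each array for its itemsize. This sketch leaves out the 'u' code, which is deprecated in recent Python versions, and the sizes printed will vary by platform.

```python
import array

# Numeric type codes listed in the lesson (the 'u' code is omitted here).
sizes = {}
for code in "bBhHiIlLqQfd":
    sizes[code] = array.array(code).itemsize
    print(code, sizes[code])

# Integer codes hand back Python ints; 'f' and 'd' hand back Python floats.
int_item = array.array("h", [7])[0]
float_item = array.array("d", [7.5])[0]
print(type(int_item), type(float_item))
```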
03:23 The reason there are so many has to do with how each one is translated into C. If you’re reading in a stream of content from another program, in order to pack it into the array, you may need to know what the C type is.
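As a sketch of that situation, the bytes below stand in for data produced by another program. Using struct.pack to fake the stream is an assumption for illustration; the point is that the array type code has to match the C type that wrote the bytes.

```python
import array
import struct

# Stand-in for a byte stream from another program: three C unsigned shorts
# in this machine's native size and byte order.
raw = struct.pack("3H", 10, 20, 30)

# 'H' is the array type code for unsigned short, so the bytes line up.
readings = array.array("H")
readings.frombytes(raw)
print(readings)  # array('H', [10, 20, 30])
```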
03:35 This is also important if you’re working with a Python extension written in C. Strings themselves are actually arrays. They’re contiguous memory representations of characters. Prior to Python 3, the string was based on ASCII.
03:53 Starting with Python 3 and moving forward, it’s based on Unicode. Strings are immutable: they can’t be changed. Now you might be thinking to yourself, “But I do things to strings all the time!” Well, that’s actually an illusion. Anytime you make a change to a string, a newly created string replaces the original. Interestingly enough, there’s no concept of a character in Python. There are only strings of length 1.
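A quick sketch of that illusion, using a string method as the "change". The variable names are made up for illustration.

```python
name = "Jaspreet"

# upper() does not modify name; it builds and returns a new string.
shouted = name.upper()

print(name)             # Jaspreet
print(shouted)          # JASPREET
print(shouted is name)  # False
```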
04:22 A string that holds a word is made up of an array of strings of length 1. It’s a recursive definition. Because the length of each character is well known, this kind of array is very tightly packed.
04:37 Everything’s right next to each other inside of memory. Let me start by creating an example string. Like any other array, you can get at the parts of this string using the subscripts.
04:53
Because strings aren’t mutable, you can’t use the subscript to do assignment. This results in a TypeError. You also can’t remove parts of the string.
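These string operations can be sketched as follows. The variable name is an assumption; the example string is the one used later in the lesson.

```python
name = "Jaspreet"

# Reading through a subscript works like any other array.
print(name[1])  # a

# Assignment and deletion through a subscript both raise TypeError.
failures = []
try:
    name[1] = "e"
except TypeError as exc:
    failures.append("assign")
    print(exc)

try:
    del name[1]
except TypeError as exc:
    failures.append("delete")
    print(exc)
```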
05:06 If you wish to do this kind of manipulation, the most common approach is to convert the string into a list, do the manipulation, and then convert it back.
05:18
Passing the string to list() iterates over each of the characters in the string and puts it into a new list. The variable letters is a list containing each one of the characters from the original string.
05:34
Because this is now a list, you can assign using subscripts. Everything in Python is an object, including strings. One of the methods on a string is .join(). Calling .join() on the empty string and passing it the list converts the list back into a string.
05:54
So now you’ve gone from the original string 'Jaspreet', with an 'a', taken it into a list, changed the second letter, and then put it back into a string as 'Jespreet', with an 'e'.
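The whole round trip can be sketched like this. The name letters comes from the lesson; the other variable names are assumptions.

```python
name = "Jaspreet"

# list() walks the string and produces one single-character string per item.
letters = list(name)

# Lists are mutable, so subscript assignment is allowed.
letters[1] = "e"

# Join the pieces back together with an empty separator.
changed = "".join(letters)
print(changed)  # Jespreet
```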
06:06
And here’s that little oddity of the recursive definition. If I ask Python what is the type of the string "abc", I get back the <class 'str'> (string).
06:20
If I do the same thing with the first letter in that string, I also get back the <class 'str'>. So characters are single-letter strings and strings are multi-letter strings.
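A short sketch of that check (the variable name is made up):

```python
word = "abc"

# Both the whole string and a single element of it are of type str.
print(type(word))     # <class 'str'>
print(type(word[0]))  # <class 'str'>

# The "character" is just a string of length 1.
print(len(word[0]))   # 1
```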
06:33 In the next lesson, I’m going to talk about dealing with arrays of binary data.
Christopher Trudeau RP Team on Sept. 24, 2022
Hi Michael,
The references in a list are dense, but that actually isn’t usually what is meant by being “dense”. Because it is a list of references and those references can point anywhere, this is not considered “dense”.
An array of a single type has each item being the same size and is allocated in chunks. If I have an array of integers, I can skip directly to the fifth integer by moving ahead 4 * sizeof(int) in memory. Python is doing this for you when you access array[4].
Each integer in the array is next to each other. This is “dense” packing.
With the list, you can get at the reference the same way, but then you have to de-reference it.
This is a computing trade-off. It is much harder for the processor to pre-load the list into a cache, for example. Whereas with an array, you just grab the chunk of continuous memory.
On the other hand, with the dense representation, when you’re out of space, you’re out of space.
Hope that clarifies. Happy coding!
michael on Sept. 25, 2022
Hi Chris,
Ok, I think I got it. The chunks of memory, the references in a list are pointing to, are not a continuous chunk of memory, but somehow “distributed” over the memory space. Loading of the list cannot be done by a memcpy() call.
Thanks for the explanation.
Michael
Christopher Trudeau RP Team on Sept. 25, 2022
Yep, you got it Michael.
michael on Sept. 23, 2022
Hi Chris,
very clear presentation of the data structures. Thank you very much.
I have a question regarding the list array. On the slide with title ‘Typed Arrays’ you wrote that the list isn’t densely packed.
In my understanding, the list consists of an array of references. These references are located in a contiguous memory block and therefore are densely packed.
What do you mean exactly regarding the list isn’t densely packed?
Thank you very much in advance.
Cheers
Michael