Byte Order and Bit Packing

Binary, Bytes, and Bitwise Operators in Python Christopher Trudeau 10:25

00:00 In the previous lesson, I showed you some examples of bitwise operators in practice. In this lesson, I’m going to talk about the two different ways that bytes are strung together to make larger numbers.

00:12 Before digging into endianness, let’s take a look at ways of getting at binary information for different types of data. In a previous lesson, I showed you how to use a bitmask as well as the struct module. Well, masking only works for integers, so let’s dig into struct some more.

00:31 Let me import the math module so I can get at a rather famous float. Who doesn’t want some pie? No, I’m a cake man, myself. Anyway, as I mentioned, masking only works with integers. In case you were curious, floats aren’t integers.

00:52 So, what if you want to see the bits that make up a float? The struct module is your friend. Let me start by importing the pack() function.

01:03 Then I’m going to use a template.

01:08 The ">d" (greater than d) string here indicates how the value is to be packed into bytes. The messy result is due to how Python shows raw binary data.

01:18 If a byte is within the printable ASCII range, Python shows it as the equivalent ASCII character. That first character is the at symbol (@) because the first byte grabbed from pi is the number 64, which is ASCII for @. Some of the other characters here are also printable ASCII, and some of them are hex escape sequences.

01:40 Putting the packed bytes into a list makes it a bit easier to see the raw values. As promised, the first element was 64. Same data, just a different format for printing it out.
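The packing steps described so far can be sketched as follows. The variable name packed is chosen here for illustration and is not necessarily what appears on screen in the video.

```python
import math
from struct import pack

# ">d" means: big-endian byte order, one double-precision float (8 bytes)
packed = pack(">d", math.pi)

# Printable ASCII bytes show as characters, the rest as escape sequences
print(packed)        # b'@\t!\xfbTD-\x18'

# The same data as a list of integers, one per byte
print(list(packed))  # [64, 9, 33, 251, 84, 68, 45, 24]
```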

01:53 What if you want to see the bits? Then you need to loop over each part of that list and change it into bits. I’m going to use a list comprehension with an eight-bit f-string format to do just that.

02:14 This loops over every byte inside of the list and formats it using the f-string. The result is the floating-point approximation of pi in binary. Well, technically it’s a string with the binary digits inside of it, but you get my meaning. pack() goes forward.
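A minimal sketch of that comprehension, assuming the pieces are then joined into a single string. The names bits and bit_string are illustrative.

```python
import math
from struct import pack

# Format each byte as eight binary digits, padded with leading zeros
bits = [f"{byte:08b}" for byte in pack(">d", math.pi)]
print(bits[0])  # 01000000 (the byte with value 64)

# Join the eight pieces into one 64-character string of 0s and 1s
bit_string = "".join(bits)
print(len(bit_string))  # 64
```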

02:31 You might guess that unpack() goes the other direction.

02:37 Just copy and paste that bit string into a variable and then unpack it.

02:47 Same template as before.

02:57 unpack() is expecting a series of bytes. The bytes() function provides this. I’m passing in a comprehension that converts eight bits of that bit string at a time into an integer, essentially reconstructing the output of pack() from just a moment ago.

03:15 And the end result is a tuple with pi inside. Why a tuple? Well, unpack() can return multiple chunks depending on the template. I’ll do the same thing, but with a different template to demonstrate this.
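The round trip might look something like this. To keep the sketch self-contained, the bit string is rebuilt with pack() rather than pasted in, and the names raw and value are illustrative.

```python
import math
from struct import pack, unpack

# Stand-in for the bit string copied from the earlier output
bit_string = "".join(f"{byte:08b}" for byte in pack(">d", math.pi))

# Convert eight binary digits at a time back into an integer,
# then collect those integers into a bytes object
raw = bytes(
    int(bit_string[i:i + 8], 2) for i in range(0, len(bit_string), 8)
)

# unpack() always returns a tuple, even for a single value
result = unpack(">d", raw)
(value,) = result
print(result)  # (3.141592653589793,)
```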

03:38 And this time, you get a tuple with four numbers inside. It’s the same binary data as before, but this time it isn’t interpreted as a float, so you just see the pieces.

03:50 That first number contains the float’s sign bit, the eleven exponent bits, and four bits of the mantissa. The remaining three numbers are sixteen bits each of the remainder of the mantissa, for a total of sixty-four bits in a double-precision float.
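The transcript doesn't state the second template, so the sketch below assumes ">4H" (four big-endian unsigned 16-bit integers), which matches the sixteen-bit pieces described. The shifts and masks that split the first number into its fields are added here for illustration.

```python
import math
from struct import pack, unpack

raw = pack(">d", math.pi)

# Assumed template: four unsigned shorts, big-endian
parts = unpack(">4H", raw)
print(parts)  # (16393, 8699, 21572, 11544)

# Split the first 16-bit number into the fields of a double
first = parts[0]
sign = first >> 15                 # 1 sign bit
exponent = (first >> 4) & 0x7FF    # 11 exponent bits (biased by 1023)
mantissa_top = first & 0xF         # top 4 bits of the 52-bit mantissa
print(sign, exponent, mantissa_top)  # 0 1024 9
```

An exponent field of 1024 minus the bias of 1023 gives 1, which fits pi being roughly 1.57 times 2 to the power of 1.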

04:07 Each of the pack and unpack templates that I’ve used so far has had a greater than symbol (>) in it. That symbol has to do with byte order, also known as endianness. That’s what I’ll talk about next.

04:21 When dealing with values comprised of multiple bytes, the order of the bytes in the value is a convention. It can either be left to right or right to left.

04:32 Big-endian is left to right, meaning the most significant byte is at the lowest address in memory. This format is common in mainframes and some of the POWER and ARM family of processors. On the other hand, little-endian is the opposite.

04:49 It’s right-to-left. The least significant byte is at the lowest address in memory. The x86 architecture family is little-endian. An example might shine some light on the differences between these two.

05:05 The Python language isn’t named after a snake, but after Monty Python, the comedy troupe. Monty Python’s first show aired in 1969. Seems like as good a value as any to show off endianness. I’ve broken 1969 into a four-byte integer here.

05:23 Each of these four bytes would have to be stored in memory. Assume that the starting location for storing these bytes is memory address 5000, with each address spot being one byte in width.

05:37 The big-endian case puts the bytes in Western reading order. The first address has the leftmost byte. By contrast, little-endian stores the least significant byte in the lowest address, acting like a stack.
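The two layouts can be imitated in Python. This is only a simulation: memory_layout is a hypothetical helper written for this sketch, and address 5000 is the made-up starting address from the example.

```python
def memory_layout(value, byteorder, base_address=5000, width=4):
    """Map each simulated address to the byte stored there."""
    raw = value.to_bytes(width, byteorder)
    return {base_address + offset: byte for offset, byte in enumerate(raw)}

for order in ("big", "little"):
    print(order)
    for address, byte in memory_layout(1969, order).items():
        print(f"  {address}: 0x{byte:02x}")

# big
#   5000: 0x00
#   5001: 0x00
#   5002: 0x07
#   5003: 0xb1
# little
#   5000: 0xb1
#   5001: 0x07
#   5002: 0x00
#   5003: 0x00
```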

05:53 This is the same four bytes of content but stored in two different fashions. The terms big- and little-endian are a reference to the book Gulliver’s Travels, where Gulliver meets two warring factions whose primary argument was over which end of a boiled egg to crack first: the big end or the little end.

06:13 Depending on who you ask, this was biting satire about religious wars or a child’s book filled with nonsense. So, why would you prefer one of these over the other? Well, neither of them is clearly better, though each has a small advantage.

06:28 Little-endian allows you to treat different sized values the same way. One-, two-, four-, and eight-byte values all get pulled out of memory the same way, popping the byte stack one at a time.

06:40 This makes certain kinds of math operations a tiny bit more efficient in the CPU. Big-endian, on the other hand, has the advantage of knowing the sign of a number easily. The most significant byte is always the first one, and of course, the most significant bit in that gives you easy access to the sign bit of a number. This difference isn’t a big deal when you’re dealing with a modern high-level language.

07:04 It’s all abstracted away. In the early days of the internet, it was decided that byte order on networks was big-endian. I suspect this is because most of the networking stuff was done in the early days at universities, most of which had mainframes, which were big-endian.

07:20 Some processors are actually configurable. These are called bi-endian, and a setting changes their mode. Some of the newer ARM processors that are common in mobile phones are bi-endian.

07:33 Let’s crack open the REPL and take a look at endianness.

07:42 The byteorder value inside of the sys module will tell you what endianness you have. I’m recording this on an Intel-based Mac, so little-endian it is.
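Checking this on your own machine takes two lines. The value depends on your hardware, so the output noted in the comment is only what an x86 machine would show.

```python
import sys

# "little" on x86 machines; "big" on big-endian hardware
print(sys.byteorder)
```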

08:00 The .to_bytes() method on an integer object turns the value into its raw byte representation. When doing so, you specify how many bytes you want and in which endianness.

08:14 Remember that big-endian means the most significant bytes come first. Our number is mostly zeros for the first few bytes, so the result starts with zeros.

08:29 You can reconstruct the int from those raw bytes, getting 1969 back.
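A sketch of that round trip, assuming a four-byte width (the transcript only says the first few bytes are zeros). The names raw and restored are illustrative.

```python
# Convert 1969 to four raw bytes, most significant byte first
raw = (1969).to_bytes(4, "big")
print(raw)  # b'\x00\x00\x07\xb1'

# Rebuild the integer using the same byte order
restored = int.from_bytes(raw, "big")
print(restored)  # 1969
```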

08:41 And this is why you have to be careful with endianness. If you assume the wrong one and reconstruct, you’ll get the wrong value. Building from the raw bytes with little-endian order gives you something that is definitely not 1969 when the value was converted to bytes using big-endian order.
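The mismatch is easy to reproduce. This sketch again assumes a four-byte width, and the name wrong is illustrative.

```python
# Bytes produced with big-endian order
raw = (1969).to_bytes(4, "big")

# Reading them back as little-endian reverses their significance
wrong = int.from_bytes(raw, "little")
print(wrong)       # 2970025984
print(hex(wrong))  # 0xb1070000
```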

09:03 The socket module is used for doing network stuff. Within the module are four functions for switching values back and forth between endian orders.

09:14 This is htons() (host to network short). It converts a short int (16 bits) from the host’s format, whatever that is, to the network’s format. This is htonl() (host to network long). Same thing, but for a long int (32 bits).

09:31 And, of course, you can go the other way. This is ntohs() (network to host short),

09:40 and ntohl() (network to host long). If you’re using Python to interact directly with the network, the socket library has your endianness needs covered.
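Here is a short sketch of the four conversions. The printed result of htons() depends on your machine's byte order, so the comment covers both cases, and the round trips at the end return the original value on any platform.

```python
import socket

value = 1969

# Host order to network (big-endian) order
net_short = socket.htons(value)  # 16-bit
net_long = socket.htonl(value)   # 32-bit

# 45319 on a little-endian host (bytes swapped), 1969 on a big-endian host
print(net_short)

# Network order back to host order recovers the original value
print(socket.ntohs(net_short))  # 1969
print(socket.ntohl(net_long))   # 1969
```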

09:51 One final warning on endianness. Some file formats have specific byte orders. If you’re reading raw data from a Windows bitmap file, you’ll be using little-endian bytes no matter your platform.
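As a rough illustration, a BMP file begins with the two-byte signature BM followed by the file size as a four-byte little-endian integer. The header bytes below are hand-built for this sketch rather than read from a real file, and the size of 70 is an arbitrary made-up value.

```python
from struct import unpack

# Hand-built stand-in for the first six bytes of a bitmap file
header = b"BM" + (70).to_bytes(4, "little")

# "<" forces little-endian no matter what the host platform uses
signature, file_size = unpack("<2sI", header)
print(signature, file_size)  # b'BM' 70
```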

10:04 JPEG is even trickier, as it supports both. Tired yet? Next up, the penultimate lesson: overloading bitwise operators for fun and profit. Hmm, sorry. My producer’s telling me that profit thing is optional.

10:20 Guess we’ll just have to stick with the fun part then.
