Using Built-In Functions

Christopher Trudeau

Unicode in Python: Working With Character Encodings Christopher Trudeau 05:38

In this lesson, you’ll take a tour of Python built-in functions that can help you with ASCII, Unicode, and conversion between numeric representations. These can be used individually or logically grouped together based on their purpose. Here are links to the Python 3 documentation for built-in functions that relate to numbering systems and character encoding:

00:00 In the previous lesson, I talked about digraphs, ligatures, and other ways of combining characters in Unicode. In this lesson, I’m going to give you a tour of built-in functions that are useful when you’re dealing with text and code points.

00:12 The first function I’m going to show you is ascii(). ascii() returns a string like the one repr() produces, but with any non-ASCII characters escaped. This means what comes out of it is something that could be passed to eval() to get the original contents back.

00:28 Passing in a simple string that contains ASCII characters results in a string that is quoted. Notice in here that the single quotes are actually part of the string. Because this is for ASCII only, any code point of 128 or higher gets converted into an escape sequence. Again, because this is a repr() compatible string, even the escape sequence gets escaped, so you get the \\ in here.

00:56 And passing in an integer value returns its decimal digits as a quoted string. Next up is bin(). bin() gives you the string representation of a binary number.
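The ascii() behavior described above can be sketched as follows. The sample strings are assumptions chosen for illustration, not the ones typed in the video.

```python
# ascii() returns a repr()-style string with non-ASCII characters escaped.
plain = ascii("hello")
euro = ascii("price: €5")   # the euro sign is code point U+20AC
number = ascii(42)

print(plain)    # 'hello'  (the quotes are part of the returned string)
print(euro)     # 'price: \u20ac5'
print(number)   # 42

# Because the result is repr()-compatible, eval() recovers the original.
round_trip = eval(euro)
```

Typing these expressions at the REPL instead of printing them shows the doubled backslash, because the REPL displays the repr() of the returned string.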

01:10 0 is easy. 9 decimal gets converted. And E6 hex gets converted as well. The bytes() function returns a bytes object—raw byte data.
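A minimal sketch of the three bin() calls just mentioned. The variable names are only for illustration.

```python
# bin() returns a string with a "0b" prefix followed by the binary digits.
zero = bin(0)
nine = bin(9)
hex_literal = bin(0xE6)   # a hex literal is just another way to write an int

print(zero)          # 0b0
print(nine)          # 0b1001
print(hex_literal)   # 0b11100110
```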

01:30 This function takes a variety of inputs. If you pass in an iterable it will construct the byte data based on the content.

01:43 Those are the decimal values for the ASCII code points for the word 'hello'. You get the binary data object with the ASCII values 'hello' inside of it.

01:56 Same thing going on here. You can also pass in a string, as long as you specify the encoding. If you pass in an integer, it will give you that many null bytes. So, bytes(5) returns five null bytes.
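The three kinds of input to bytes() described above might look like this. The tuple holds the code points for 'hello' mentioned in the lesson; the other values are assumptions for illustration.

```python
# From an iterable of integers in the range 0-255.
from_iterable = bytes((104, 101, 108, 108, 111))

# From a string; the encoding argument is required in this case.
from_string = bytes("hello", "utf-8")

# From an integer: that many null bytes.
from_int = bytes(5)

print(from_iterable)   # b'hello'
print(from_string)     # b'hello'
print(from_int)        # b'\x00\x00\x00\x00\x00'
```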

02:18 You can also call its companion class method, bytes.fromhex(). This gives you a bytes object built from the hex digits in the string you pass in.
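As a small sketch of bytes.fromhex(), assuming an arbitrary hex string as input:

```python
# Each pair of hex digits becomes one byte; spaces between pairs are ignored.
data = bytes.fromhex("c0 ff ee")

print(data)        # b'\xc0\xff\xee'
print(list(data))  # [192, 255, 238]
```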

02:31 You’ve seen the chr() function before—it takes an integer code point value and returns the Unicode character.
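A quick sketch of chr(), with code points picked only as examples:

```python
# chr() maps an integer code point to a one-character string.
letter = chr(97)
pi = chr(960)          # U+03C0, Greek small letter pi
snake = chr(0x1F40D)   # code points can also be written as hex literals

print(letter)  # a
print(pi)      # π
```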

02:48 Any valid code point can be used. The hex() function is similar to the bin() function, but instead of returning a string representation of a binary number, it returns the string representation of a hex number.
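A minimal sketch of hex(), using sample values that are assumptions rather than the ones from the video:

```python
# hex() returns a string with a "0x" prefix followed by lowercase hex digits.
small = hex(0)
hundred = hex(100)
byte_max = hex(255)

print(small)     # 0x0
print(hundred)   # 0x64
print(byte_max)  # 0xff
```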

03:02 int() returns an integer based on what you pass in.

03:12 Passing in an integer gives you that integer. Passing in a float truncates it, dropping the fractional part. Passing in a string containing an integer parses it and returns the integer.

03:30 If you pass in a string, you can also specify the base. '11' in base 2 is 3. '11' in base 8 is 9.

03:44 And '11' in base 16 is 17. You can also pass int() any numeric expression, including arithmetic, and it returns the resulting integer. int() also has a companion class method, int.from_bytes(), that takes bytes as input.
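The int() conversions described above can be sketched like this. The float and the arithmetic expression are assumptions for illustration; the '11' cases are the ones from the lesson.

```python
# Integers pass through unchanged; floats are truncated toward zero.
same = int(7)
truncated = int(3.9)
truncated_negative = int(-3.9)

# Strings are parsed, in base 10 unless another base is given.
base_10 = int("11")
base_2 = int("11", base=2)
base_8 = int("11", base=8)
base_16 = int("11", base=16)

# Any numeric expression works, including arithmetic on hex literals.
from_math = int(0x10 + 0x01)

print(base_10, base_2, base_8, base_16)  # 11 3 9 17
```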

04:00 Within your computer, multi-byte numbers are stored in one of two ways: either big-endian or little-endian. This specifies the order in which the bytes are read together. In big-endian order, the most significant byte comes first, so the order of the bytes matches the order of the digits in the number.

04:14 So the byte string \x00\x10 is read as 00 10 in hex, which is 16 in decimal. By contrast, little-endian byte order puts the least significant byte first, so the bytes are read in the opposite order.

04:27 Read as little-endian, those same bytes mean 10 00 in hex, which when converted into decimal is 4096. Big-endian and little-endian are processor architecture choices.
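A short sketch of the two byte orders using int.from_bytes(), with the two-byte value discussed above. The to_bytes() and sys.byteorder lines go slightly beyond what the lesson shows and are included only to make the round trip visible.

```python
import sys

raw = b"\x00\x10"

# Big-endian: most significant byte first, so the bytes read as 0x0010.
big = int.from_bytes(raw, byteorder="big")

# Little-endian: least significant byte first, so the bytes read as 0x1000.
little = int.from_bytes(raw, byteorder="little")

print(big)     # 16
print(little)  # 4096

# Going the other way: the same integer laid out in each byte order.
as_big = (16).to_bytes(2, byteorder="big")        # b'\x00\x10'
as_little = (16).to_bytes(2, byteorder="little")  # b'\x10\x00'

# The native byte order of the machine running this code.
native = sys.byteorder  # 'little' or 'big'
```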

04:41 Programmers generally don’t have to pay attention to them—they’re abstracted away. But depending on how the processor uses certain kinds of special data, if you’re fiddling with bits, you might need to know which order they come in.

04:53 You’ve seen the ord() function before. It converts a single character into its integer code point, which makes it the inverse of chr().
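A minimal sketch of ord(), with characters picked only as examples:

```python
# ord() maps a one-character string to its integer code point.
letter_code = ord("a")
pi_code = ord("π")

print(letter_code)  # 97
print(pi_code)      # 960

# chr() reverses the mapping.
back_again = chr(ord("π"))
```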

05:04 Last on the list, the str() function converts its input into a string.

05:12 Passing in a string returns the string. Passing in a number converts the number into a string. A number written in another base, such as a hex literal, comes back as its decimal digits.

05:24 And finally, you can specify the encoding type in order to decode binary information into a string. UTF-8 isn’t the only way to go. Next up, I’ll show you a couple other encodings.
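The str() cases described above might look like the sketch below. The sample values are assumptions for illustration.

```python
# Strings come back unchanged; numbers become their decimal digits.
text = str("abc")
number = str(5)
from_hex_literal = str(0xC0FFEE)

# With an encoding, str() decodes bytes into text.
decoded = str(b"\xcf\x80", "utf-8")   # the UTF-8 bytes for π

# Without an encoding, you get the repr of the bytes object instead.
not_decoded = str(b"abc")

print(from_hex_literal)  # 12648430
print(decoded)           # π
print(not_decoded)       # b'abc'
```

Calling b"\xcf\x80".decode("utf-8") gives the same result as the two-argument form of str().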
