When you use urllib.request.urlopen(), the body of the response is a bytes object. In this lesson, you’ll decode those bytes into strings using their character encoding. You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8.
Go From Bytes to Strings
00:09 The first thing you may want to do is convert the bytes object to a string. To do this, you need to decode the bytes. With Python, all you need to do is find out the character encoding used. A character encoding is often also called a character set.
00:27 You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8. This number is the result of a recent web technology survey.
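Here is a minimal sketch of the decoding step. To keep it runnable offline, a bytes literal stands in for what response.read() would return. The sample content is an assumption, not the page used in the lesson.

```python
# Stand-in for the bytes that response.read() would return (assumed sample content).
body = "<!doctype html><title>Café</title>".encode("utf-8")

# Decode the bytes into a string, defaulting to UTF-8.
decoded_body = body.decode("utf-8")

print(type(body))          # <class 'bytes'>
print(type(decoded_body))  # <class 'str'>
print(decoded_body[:30])   # only the first thirty characters
```

The é takes two bytes in UTF-8 but is a single character once decoded, which is why the bytes object is one item longer than the string.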
When you print decoded_body, you can see that it’s now a string. That said, leaving it up to chance is really not a good strategy. Fortunately, headers are a great place to get character set information.
You can see what this looks like by making some modifications to your previous example. This time, create a character_set variable and assign it the result of response.headers.get_content_charset(). Then call body.decode(character_set), so instead of passing in utf-8, you’re passing in the character set that the server reported. Finally, you can print the result like before, showing just the first thirty characters.
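The modified example might look like the sketch below. As an assumption made so that it runs without a network connection, a data: URL stands in for a real web address. urlopen() turns its media type into a Content-Type header, so the header lookup works the same way it would for an HTTP response.

```python
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real web address so the sketch
# runs offline. Its media type becomes the response's Content-Type header.
url = "data:text/html;charset=utf-8,%3Ch1%3ECaf%C3%A9%3C%2Fh1%3E"

with urlopen(url) as response:
    body = response.read()
    # Read the charset parameter from the Content-Type header.
    character_set = response.headers.get_content_charset()

# Decode with the reported character set instead of a hard-coded "utf-8".
decoded_body = body.decode(character_set)

print(character_set)
print(decoded_body[:30])
```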
In this example, you call .get_content_charset() on the headers object of response, and use that to decode. This is a convenient method that parses the Content-Type header so that you can painlessly decode bytes into text.
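One caveat: .get_content_charset() returns None when the Content-Type header carries no charset parameter, and passing None to .decode() raises a TypeError. A hedged sketch of one way to handle that is to supply UTF-8 as the fallback through the method's failobj argument. The data: URL below is again an assumed stand-in, this time for a server that omits the charset.

```python
from urllib.request import urlopen

# Assumption: this data: URL declares no charset, mimicking a server that
# leaves it out of the Content-Type header.
url = "data:text/html,%3Cp%3Ehello%3C%2Fp%3E"

with urlopen(url) as response:
    body = response.read()
    # failobj is returned when the header has no charset parameter.
    character_set = response.headers.get_content_charset(failobj="utf-8")

decoded_body = body.decode(character_set)

print(character_set)
print(decoded_body)
```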