Go From Bytes to Strings
When you use urllib.request.urlopen(), the body of the response is a bytes object. In this lesson, you’ll decode those bytes into strings using their character encoding. You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8.
00:00 In this lesson, you will learn how to go from bytes to strings. When you use urllib.request.urlopen(), the body of the response is a bytes object.
00:09 The first thing you may want to do is convert the bytes object to a string. To do this, you need to decode the bytes. With Python, all you need to do is find out the character encoding used. A character encoding is often also referred to as a character set.
00:27 You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8. This number is the result of a recent web technology survey.
00:38 You can find a link to it in the text below this video. You can see what decoding looks like in the code. First, make sure you have urlopen imported from the urllib.request module.
00:52 Next, you’ll use the with statement to create a context in which you can open the website and read the response.
01:01 You’ll set the body variable to the result of calling the response.read() method. The contents of the website are now stored in the body variable, but they’re in binary format.
01:10 To convert them to a string, you’ll need to decode the binary data using the .decode() method.
01:20 Pass "utf-8" into the .decode() method. To output the decoded content, you can use the print() function, and to avoid showing the whole content, you can slice it, for example, to the first 30 characters.
01:33 This will give you a good impression of how the decoded body looks.
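The steps described so far can be sketched roughly as follows. The helper name fetch_body and the sample bytes are assumptions for illustration and are not taken from the video. The decoding step runs on a hard-coded bytes literal so the snippet works without a network connection.

```python
from urllib.request import urlopen


def fetch_body(url: str) -> bytes:
    """Open the URL and return the raw response body as bytes.

    Hypothetical helper; the lesson's script performs these steps directly.
    """
    with urlopen(url) as response:
        return response.read()


# Offline stand-in for what response.read() might return (assumed sample data).
body = b"<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>"

decoded_body = body.decode("utf-8")  # bytes -> str
print(decoded_body[:30])  # show only the first 30 characters
```

To try this against a live site, you could replace the literal with a call such as body = fetch_body("https://www.example.com"), assuming that URL is reachable from your machine.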
01:38 Once again, you can run your script by typing py urllib_requests.py and hitting Enter. You’ll get back the first thirty characters of the HTML document as a decoded string.
01:54 The output will show <!doctype html>, <html>, and <head>, all of which are the beginning of an HTML document.
02:04 In this example, you take the bytes object returned from response.read(), and decode it with the bytes object’s .decode() method, passing in utf-8 as an argument.
02:14 When you print decoded_body, you can see that it’s now a string. That said, leaving the choice of encoding up to chance is really not a good strategy. Fortunately, headers are a great place to get character set information.
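To see why guessing the encoding is risky, here is a small offline sketch. The sample word and the variable names are assumptions chosen for illustration. It shows that a wrong guess can either silently produce garbled text or raise an error.

```python
text = "café"
utf8_bytes = text.encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # é becomes one byte: 0xE9

# Correct guess: the original text comes back.
print(utf8_bytes.decode("utf-8"))

# Wrong guess that silently produces garbled text.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Wrong guess that fails loudly instead.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    decode_failed = True
    print(f"Could not decode: {error}")
else:
    decode_failed = False
```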
02:26 How about you jump back into the code and see what this looks like?
02:30 You can see what this looks like by making some modifications to your previous example. This time you can make a character_set variable and assign the result of response.headers.get_content_charset() to it, so character_set equals response.headers.get_content_charset().
02:52 Next, your example will look similar to before: decoded_body equals body.decode(character_set). This means that now, instead of passing in utf-8, you’re passing in the character set declared in the response headers. Finally, you can print this like before and just print out the first thirty characters.
03:15 Thirty is just an arbitrary number used here. You can honestly print out any amount, whether that’s the first thirty characters, twenty, or the entire thing, in which case you just omit the brackets altogether.
03:28 Once again, you can run your script. The output for this will look exactly like before, but this time you can be sure that you’re always using the correct character set.
03:40 In this example, you call .get_content_charset() on the headers object of response, and use that to decode. This is a convenient method that parses the Content-Type header so that you can painlessly decode bytes into text.
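The header-based approach can be sketched as follows. The helper names decode_response and fetch_text, the hand-built headers, and the UTF-8 fallback are assumptions added for illustration and are not part of the lesson's script. The fallback is there because .get_content_charset() returns None when the Content-Type header declares no charset, and passing None to .decode() raises a TypeError.

```python
from email.message import Message
from urllib.request import urlopen


def decode_response(body: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode body using the charset declared in the Content-Type header.

    Falls back to the given encoding (an assumption, not from the lesson)
    when the header declares no charset.
    """
    character_set = headers.get_content_charset() or fallback
    return body.decode(character_set)


def fetch_text(url: str) -> str:
    """Hypothetical helper tying urlopen() and decode_response() together."""
    with urlopen(url) as response:
        # response.headers is a Message subclass, so it fits decode_response().
        return decode_response(response.read(), response.headers)


# Offline illustration with hand-built headers (assumed sample values).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
print(headers.get_content_charset())  # the charset name, lowercased
print(decode_response(b"caf\xe9", headers))
```

With a reachable URL, a call such as fetch_text("https://www.example.com")[:30] would mirror the slicing used in the lesson.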