Go From Bytes to Strings

When you use urllib.request.urlopen(), the body of the response is a bytes object. In this lesson, you’ll decode those bytes into strings using their character encoding. You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8.

00:00 In this lesson, you will learn how to go from bytes to strings. When you use urllib.request.urlopen(), the body of the response is a bytes object.

00:09 The first thing you may want to do is convert the bytes object to a string. To do this, you need to decode the bytes. With Python, all you need to do is find out which character encoding was used. A character encoding is often referred to as a character set.
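
To make the difference between bytes and strings concrete, here is a minimal sketch that doesn't need a network connection. The sample text and the variable names text and raw are assumptions chosen for illustration and aren't part of the lesson's script. It shows the same bytes decoded once with the encoding they were written in and once with a wrong guess:

```python
# Illustrative sample text, not taken from the lesson.
text = "café"

# Encoding turns a str into bytes.
raw = text.encode("utf-8")
print(type(raw).__name__, raw)

# Decoding with the matching encoding recovers the original text.
print(raw.decode("utf-8"))

# Decoding with a different encoding doesn't fail here,
# but it produces garbled characters instead of the original text.
print(raw.decode("latin-1"))
```

The last line prints garbled text rather than raising an error, which is why knowing the actual encoding matters.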

00:27 You’ll probably be safe defaulting to UTF-8 because it is the dominant character encoding, with 98 percent of web pages today being encoded with UTF-8. This number is the result of a recent web technology survey.

00:38 You can find a link to it in the text below this video. You can see what decoding this looks like in the code. First, make sure you have urlopen imported from the urllib.request module.

00:52 Next, you’ll use the with statement to create a context in which you can open the website and read the response.

01:01 You’ll set the body variable to the result of calling response.read(). The contents of the website are now stored in the body variable, but as raw bytes.

01:10 To convert them to a string, you’ll need to decode the binary data using the .decode() method.

01:20 Pass "utf-8" into the .decode() method. To output the decoded content, you can use the print() function. To avoid showing the whole content, you can slice it, for example keeping only the first thirty characters.

01:33 This will give you a good impression of how the decoded body looks.

01:38 Once again, you can run your script, so that will be py urllib_requests.py and hit Enter. The script decodes the body as UTF-8 and prints the first thirty characters of the HTML document.

01:54 The output will show <!doctype html> <html> and <head>, all of which are the beginning of an HTML document.

02:04 In this example, you take the bytes object returned from response.read(), and decode it with the bytes object’s .decode() method, passing in utf-8 as an argument.
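
The steps described so far can be sketched roughly as follows. This is a reconstruction from the narration, not the exact script shown in the video. The function names decode_preview and fetch_preview and the sample bytes are assumptions added so that the decoding step can be tried without a network connection:

```python
from urllib.request import urlopen


def decode_preview(body, length=30):
    """Decode a bytes body as UTF-8 and return its first `length` characters."""
    decoded_body = body.decode("utf-8")
    return decoded_body[:length]


def fetch_preview(url, length=30):
    """Fetch `url` and return a short decoded preview (needs network access)."""
    with urlopen(url) as response:
        body = response.read()  # The body arrives as a bytes object.
    return decode_preview(body, length)


# Offline stand-in for response.read(), made up for illustration.
sample = b"<!doctype html>\n<html>\n<head>\n    <title>Example</title>"
print(decode_preview(sample))
```

With a working connection, you could call fetch_preview() with any URL that returns HTML to get the same kind of preview from a live page.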

02:14 When you print decoded_body, you can see that it’s now a string. That said, leaving it up to chance is really not a good strategy. Fortunately, headers are a great place to get character set information.

02:26 How about you jump back into the code and see what this looks like?

02:30 You can see what this looks like by making some modifications to your previous example. This time you can make a character_set variable and assign it the result of calling response.headers.get_content_charset(), so character_set equals response.headers.get_content_charset().

02:52 Next, your example will look similar to before: decoded_body equals body.decode(character_set). This means that instead of passing in utf-8, you’re now passing in the character set that the server declared. Finally, you can print this like before and just print out the first thirty characters.

03:15 Thirty is just an arbitrary number used here. You can print out any amount, whether that’s the first thirty characters or the first twenty, or you can print the entire thing by omitting the brackets altogether.

03:28 Once again, you can run your script. The output for this will look exactly like before, but this time you’re using the character set that the server declares instead of guessing.

03:40 In this example, you call .get_content_charset() on the headers object of response, and use that to decode. This is a convenient method that parses the Content-Type header so that you can painlessly decode bytes into text.
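
A rough sketch of this header-based approach could look like the code below. The function names and the hand-built headers object are assumptions for illustration. An HTTPMessage created by hand stands in for response.headers so that the decoding step can run without a network connection:

```python
from http.client import HTTPMessage
from urllib.request import urlopen


def decode_with_header_charset(body, headers):
    """Decode `body` using the charset named in the Content-Type header."""
    character_set = headers.get_content_charset()
    return body.decode(character_set)


def fetch_decoded(url):
    """Fetch `url` and decode its body (needs network access)."""
    with urlopen(url) as response:
        body = response.read()
        return decode_with_header_charset(body, response.headers)


# Offline stand-in for response.headers, built by hand for illustration.
headers = HTTPMessage()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
body = "<p>café</p>".encode("iso-8859-1")

print(headers.get_content_charset())
print(decode_with_header_charset(body, headers))
```

Note that .get_content_charset() returns the charset name in lowercase, and that this sketch assumes the header actually declares a charset.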

03:54 Next up, you’ll learn how to go from bytes to file.

titimoby on Sept. 22, 2025

As of September 22, 2025, I do not get any content charset from example.com.

A quick check:

from urllib.request import urlopen

with urlopen("https://www.example.com") as response:
    body = response.read()

character_set = response.headers.get_content_charset()
print(character_set is None)

returns:

True

Bartosz Zaczyński RP Team on Nov. 18, 2025

@titimoby Thanks for flagging this. You’re right. The host serving example.com no longer includes an explicit charset in its response headers. Instead, it now returns this header:

Content-Type: text/html

If you want to test with a domain that does specify a charset, then you can use realpython.com:

>>> from urllib.request import urlopen

>>> with urlopen("https://realpython.com/") as response:
...     body = response.read()
...
>>> response.headers.get_content_charset()
'utf-8'
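
When the header carries no charset, as in the example.com case reported above, .get_content_charset() returns None, and passing None to .decode() raises a TypeError. One possible way to handle this, sketched here as an assumption rather than as part of the lesson, is to fall back to UTF-8. The helper name pick_charset and the hand-built headers are hypothetical:

```python
from http.client import HTTPMessage


def pick_charset(headers, default="utf-8"):
    """Return the declared charset, or `default` when none is declared."""
    # failobj is the value returned when no charset parameter is present.
    return headers.get_content_charset(failobj=default)


# Hand-built stand-in for a response whose header has no charset.
bare = HTTPMessage()
bare["Content-Type"] = "text/html"

print(bare.get_content_charset())  # No charset declared.
print(pick_charset(bare))          # Falls back to the default.

body = b"<!doctype html>"
print(body.decode(pick_charset(bare)))
```

Falling back to UTF-8 is still a guess, so treat it as a reasonable default rather than a guarantee.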

VisibleHand on Feb. 5, 2026

You have exchanged one error for another. When I run your code above for “realpython.com/”, I get urllib.error.HTTPError: HTTP Error 403: Forbidden on the second line.

Instead of opening the url that you provided, I tried the url for Harvard University: “www.harvard.edu”. All worked well, but be warned, it is a big website.

Martin Breuss RP Team on Feb. 6, 2026

Whoops, you’re right VisibleHand 😅 thanks for reporting this!

urlopen("https://realpython.com/") now returns a 403 Forbidden error. This happens because some websites (including realpython.com) block requests that come with urllib’s default Python-urllib/3.x User-Agent header.

Your workaround of using a different URL is a perfectly fine solution for following along with this lesson. If you’d like to stick with the original URL, you can set a custom User-Agent by using a Request object:

from urllib.request import urlopen, Request

request = Request(
    "https://realpython.com/",
    headers={"User-Agent": "realpython-learning"},
)

with urlopen(request) as response:
    body = response.read()

That said, for this lesson the important thing is just getting any HTML response to practice decoding bytes to strings, so any URL that works for you is fine.

Thanks for flagging this and also noting a solution!
