Handle Non-ASCII Unicode Characters (Task)

In this task, you’ll enhance your wordcount command to handle non-ASCII Unicode characters correctly. Your program will decode multi-byte UTF-8 characters, such as the accented letters and non-Latin scripts used in many languages, while still reporting an accurate byte count.

Acceptance Criteria

Your program should correctly decode multi-byte characters from standard input using the UTF-8 encoding.
The reported byte count should reflect the number of bytes in the input, not the number of characters.
You should treat each whitespace-delimited sequence of characters as a single word, even if it contains non-ASCII characters.
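The first and third criteria can be illustrated with a minimal sketch. It is not the course's sample solution, and the function name `count_words` is a hypothetical helper introduced only for this example. It assumes the input arrives as raw bytes that are valid UTF-8:

```python
def count_words(raw: bytes) -> int:
    """Count whitespace-delimited words in UTF-8 encoded input.

    Hypothetical helper for illustration, not part of the course code.
    """
    # Decode the raw bytes into a Python string (Unicode text).
    text = raw.decode("utf-8")
    # With no arguments, str.split() splits on runs of whitespace,
    # so non-ASCII letters stay inside their words.
    return len(text.split())


print(count_words("caffè".encode("utf-8")))          # 1
print(count_words("caffè latte\n".encode("utf-8")))  # 2
```

Decoding before splitting means a word is never cut in the middle of a multi-byte character.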

Examples

Multi-byte character without a trailing newline:

Shell
      
$ echo -n "caffè" | wordcount
0 1 6

Multi-byte character with a trailing newline:

Shell
      
$ echo "caffè" | wordcount
1 1 7
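The byte counts in these examples follow from how UTF-8 encodes each character. The short sketch below breaks the word down character by character. It assumes that "è" is the single precomposed code point U+00E8, which is why it is written as an escape sequence:

```python
# "caffè" with è written as the precomposed code point U+00E8 (assumption).
word = "caff\u00e8"

# Pair each character with the number of bytes it occupies in UTF-8.
sizes = [(char, len(char.encode("utf-8"))) for char in word]
total = sum(size for _, size in sizes)

print(sizes)  # [('c', 1), ('a', 1), ('f', 1), ('f', 1), ('è', 2)]
print(total)  # 6
```

The four ASCII letters take one byte each and "è" takes two, which gives 6 bytes. The trailing newline added by `echo` without `-n` contributes one more byte, for a total of 7.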

Additional Resources

To better understand how to handle Unicode in Python, consider exploring these resources:

If you need further assistance, you can go to the next lesson for a step-by-step solution.

Python represents text as Unicode, but the underlying byte count depends on the character encoding. UTF-8 uses a variable number of bytes per character. How can you ensure that your program correctly counts the actual bytes rather than just the number of characters?
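One possible way to approach this question is to read the input as raw bytes and decode it yourself, so that both the byte count and the character count are available. The sketch below is only an illustration under that assumption. The function name `char_and_byte_counts` is hypothetical, and `io.BytesIO` stands in for a binary input stream such as `sys.stdin.buffer`:

```python
import io


def char_and_byte_counts(stream):
    """Return (characters, bytes) for a binary stream holding UTF-8 text.

    Hypothetical helper for illustration only.
    """
    raw = stream.read()         # bytes, exactly as they arrived
    text = raw.decode("utf-8")  # str, one item per Unicode code point
    return len(text), len(raw)


# io.BytesIO simulates piped input; a real command could pass
# sys.stdin.buffer instead.
fake_stdin = io.BytesIO("caff\u00e8\n".encode("utf-8"))
print(char_and_byte_counts(fake_stdin))  # (6, 7)
```

Here the text has six characters, including the newline, but seven bytes, because "è" occupies two bytes in UTF-8. Calling `len()` on the decoded string alone would under-report the byte count.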

What’s Next?

🕵️‍♂️ Continue to the next lesson to review the sample solution and compare your approach to solving this task.
