In this task, you’ll enhance your wordcount command to handle non-ASCII Unicode characters correctly. Your program will decode multi-byte characters, such as those found in various languages spoken around the globe, ensuring an accurate byte count.
Acceptance Criteria
- Your program should correctly decode multi-byte characters from standard input using the UTF-8 encoding.
- The reported byte count should accurately reflect the number of bytes in the text.
- You should treat sequences of characters between whitespace as single words, even if they contain non-ASCII characters.
Examples
Multi-byte character without a trailing newline:
$ echo -n "caffè" | wordcount
0 1 6
Multi-byte character with a trailing newline:
$ echo "caffè" | wordcount
1 1 7
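To see where the 6 and 7 in these examples come from, here is a small sketch comparing the character count with the UTF-8 byte count. It assumes the è is the single precomposed code point U+00E8, which is written as an escape so the result doesn't depend on how your editor stores the character:

```python
# "caffè" with è as the single code point U+00E8 (an assumption;
# a decomposed "e" + combining accent would give different counts).
text = "caff\u00e8"
encoded = text.encode("utf-8")

num_chars = len(text)      # counts Unicode code points
num_bytes = len(encoded)   # counts bytes after UTF-8 encoding

print(num_chars, num_bytes)
print(len((text + "\n").encode("utf-8")))  # with a trailing newline
```

The four ASCII letters take one byte each, while è takes two bytes in UTF-8, which gives 6 bytes without the newline and 7 with it.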
Additional Resources
To better understand how to handle Unicode in Python, consider exploring these resources:
- Python’s Unicode Support
- Reading and Writing Files in Python (Guide)
- Unicode & Character Encodings in Python: A Painless Guide
If you need further assistance, you can go to the next lesson for a step-by-step solution.
Hint: Python represents text as Unicode, but the underlying byte count depends on the character encoding. UTF-8 uses a variable number of bytes per character. How can you ensure that your program correctly counts the actual bytes rather than just the number of characters?
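One possible approach, sketched below, is to work from the raw bytes: take the byte count before decoding, then decode as UTF-8 to count lines and words. The function name count_text is hypothetical and not part of the task, and this is only one way to do it, not necessarily the approach used in the sample solution:

```python
def count_text(data: bytes) -> tuple[int, int, int]:
    """Return (lines, words, bytes) for raw UTF-8 encoded input.

    Hypothetical helper for illustration only.
    """
    num_bytes = len(data)          # byte count comes from the raw input
    text = data.decode("utf-8")    # raises UnicodeDecodeError on invalid input
    num_lines = text.count("\n")   # lines are counted by newline characters
    num_words = len(text.split())  # split() with no arguments splits on any whitespace
    return num_lines, num_words, num_bytes


print(count_text("caff\u00e8".encode("utf-8")))
print(count_text("caff\u00e8\n".encode("utf-8")))
```

In a script, the raw bytes could come from sys.stdin.buffer.read(), which returns bytes rather than already decoded text.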
What’s Next?
🕵️ Continue to the next lesson to review the sample solution and compare your approach to solving this task.