In this lesson, you’ll walk through one possible solution for mimicking how the Unix wc
command handles data supplied through standard input. Such input most often comes from your keyboard.
Know Your Starting Point
As a quick reminder, here’s where you left off in your previous task:
src/wordcount.py
def main():
    pass
When you run pytest now, you’ll get to see the acceptance criteria for the current task, as well as those for any previous tasks that you might have completed.
After having read the acceptance criteria for the current task, you can start tackling them one by one. However, remember that you don’t need to strictly follow them in order. Pick a criterion that feels most approachable to you at first, and then revisit other criteria as needed.
Handle Empty Input
One of the acceptance criteria for this task states the following:
If the input stream is empty, then your program should return zero for lines, words, and bytes.
This is illustrated by a complementary example:
$ echo -n | wordcount
0 0 0
When the standard input data stream is empty, your wordcount command should report three zeros in its output. As a temporary solution, you can hard-code the expected output from the above example in your Python script to satisfy this first requirement:
src/wordcount.py
def main():
    print("0 0 0")
You replaced the pass statement in your main() function’s body with a call to the built-in print() function, passing the string literal "0 0 0" as an argument.
When you save the file and rerun pytest, you’ll see the first acceptance criterion of the task light up green. This indicates that you’re already making some progress! Naturally, as you add more constraints, they’ll nudge you to further refine the code of your solution.
Another useful step is to manually test your command and compare its output to that of its wc counterpart. You can try running both commands from the terminal using the provided example:
$ echo -n | wc
0 0 0
$ echo -n | wordcount
0 0 0
That looks promising if you ignore the formatting differences for a minute. However, if you change the input slightly by letting the echo command append its usual newline character at the end of the stream, then your output won’t yet reflect this:
$ echo | wc
1 0 1
$ echo | wordcount
0 0 0
The expected outcome should account for the invisible newline character that you’ve just introduced. More specifically, the output should indicate one line, zero words, and one byte. This mismatch happens because you haven’t yet read anything from standard input. You’ll address that now.
Read Text From Standard Input
In Python, there are a couple of ways you can access standard input. One of the most straightforward ones involves the sys module, which exposes the standard input stream through its .stdin attribute. It’s a read-only, file-like object that comes with the .read() method, which you can call to obtain string data:
src/wordcount.py
import sys


def main():
    text = sys.stdin.read()
If you don’t pass any arguments to sys.stdin.read(), then it’ll read everything until it reaches the end of the stream (EOF), which is the point where the data runs out. This method returns a Python string object, which you can assign to a variable for further processing.
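To get a feel for how .read() behaves without typing anything into your terminal, you can experiment with an in-memory stream. The following is only a sketch: it assumes that io.StringIO is a reasonable stand-in for sys.stdin, and the names empty_stream and newline_stream are made up for illustration:

```python
import io

# io.StringIO is an in-memory text stream that stands in for sys.stdin here
empty_stream = io.StringIO("")
newline_stream = io.StringIO("\n")

empty_text = empty_stream.read()      # Nothing to read, so this is an empty string
newline_text = newline_stream.read()  # A single newline character

print(repr(empty_text))
print(repr(newline_text))

# Once a stream is exhausted, reading it again yields an empty string
print(repr(newline_stream.read()))
```

Because a second read returns nothing, it makes sense to read the stream once and keep the result in a variable.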
Note: It’s always a good idea to give your variables descriptive names, reflecting their purpose. In this case, text conveys well enough that you’re most likely dealing with a string of characters.
Having the text stored in a variable lets you refer to it from different places in your code without having to read the data stream again. You’ll take advantage of this fact to compute the number of lines, words, and bytes next.
Find the Number of Lines
To determine the number of lines in the text, you can count the number of line breaks. A line break, also known as a line feed (LF) or simply a newline, is an invisible control character that tells your terminal emulator to start a new line. In Python, you can count the number of occurrences of a given substring in a string by calling the string object’s .count() method.
But how do you represent a line break using a string literal in Python? To do that, you must use a special sequence of characters:
src/wordcount.py
import sys


def main():
    text = sys.stdin.read()
    num_lines = text.count("\n")
A backslash (\) in a string literal starts an escape sequence, allowing you to change how the following letter is interpreted. In this case, the combination of the backslash character and the lowercase letter n represents a single newline character (\n).
In other words, the line of code you just added counts the number of newline occurrences in the text. Because it looks only for line feed characters, it also determines how the different newline styles used across operating systems get counted, as you’ll soon find out. Now, it’s time to count the number of words in the text.
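If you’d like to check the counting logic in isolation, here’s a small sketch that applies .count() to a handful of sample strings. The samples list and the newline_counts dictionary are hypothetical names used only for this illustration:

```python
samples = ["", "\n", "Hello\n", "Hello\r", "Hello\r\n", "one\ntwo\nthree"]

# Map each sample to the number of line feed characters that it contains
newline_counts = {sample: sample.count("\n") for sample in samples}

for sample, count in newline_counts.items():
    print(repr(sample), count)
```

Notice that the last sample spans three visible lines but contains only two line feed characters, so it counts as two lines. A lone carriage return (\r) doesn’t count at all.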
Count Words in the Text
Counting words in the text is a bit trickier. According to the definition provided in the task’s description, a word is any sequence of characters delimited by whitespace, like spaces or tabs. That’s not exactly how the original Unix wc command defines words, but it’ll do for the purposes of this challenge.
To extract words from a piece of text in Python, you can leverage the string object’s .split() method, which divides a string into substrings based on the given separator. By default, if you don’t provide any arguments to this method, then it splits the text on whitespace, which is precisely what you need. Consider this example:
>>> text = "One, two\t\tthree\nfour\n\n five."
>>> text.split()
['One,', 'two', 'three', 'four', 'five.']
The sample text above contains a sentence whose individual words are delimited with various kinds of whitespace, including spaces, tabs, and newlines, some of which are mixed or repeated. What you get in response is a list of words with all types of whitespace removed, effectively splitting the text into its constituent words.
Note: While the whitespace is removed, any punctuation marks directly attached to the words, like commas and periods, remain part of those words in the resulting list. But that’s fine since you’re only interested in the number of words rather than their individual characters.
To count the number of words, you can call the built-in len() function on the resulting list, which returns its size:
>>> len(text.split())
5
It’s the correct answer, as there are five words in the sentence.
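To see why calling .split() without arguments matters, you can compare it with splitting on an explicit space character. This is a minimal sketch that reuses the sample sentence from above. The variable names default_words, space_words, and empty_count are assumptions made for the sake of the example:

```python
text = "One, two\t\tthree\nfour\n\n five."

default_words = text.split()   # Splits on any run of whitespace
space_words = text.split(" ")  # Splits on each single space character only

print(len(default_words), default_words)
print(len(space_words), space_words)

# Empty or whitespace-only text contains no words at all
empty_count = len("".split())
print(empty_count, len(" \t\n".split()))
```

With an explicit separator, the tabs and newlines stay glued to the neighboring words, so you’d get three chunks instead of five words. The default behavior also yields an empty list for empty input, which lines up with the zero word count expected for an empty stream.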
And here’s how you can incorporate the .split() method and the len() function into your solution in the script:
src/wordcount.py
import sys


def main():
    text = sys.stdin.read()
    num_lines = text.count("\n")
    num_words = len(text.split())
First, you split the text on whitespace, then calculate the length of the resulting list of words, and assign the result to yet another variable, num_words. With the number of lines and words in place, you’re only missing the number of bytes, which you’ll find now.
Get the Number of Bytes
To keep things simple, you’ll limit yourself to ASCII characters, which cover the English alphabet. When you do, you can equate the byte count with the length of the string stored in your text variable. That’s because each ASCII character occupies exactly one byte in the input stream.
Therefore, finding the number of bytes in the text boils down to getting the length of the string using the len() function:
src/wordcount.py
import sys


def main():
    text = sys.stdin.read()
    num_lines = text.count("\n")
    num_words = len(text.split())
    num_bytes = len(text)
Note how you can call len() on two different data types. One is a list and the other one is a string. Both are Python sequence types, which share a common interface that lets you treat them somewhat similarly.
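The ASCII assumption is what makes len(text) a valid byte count. As a rough illustration that goes slightly beyond what this task requires, the sketch below compares the number of characters with the number of bytes, assuming the text is encoded as UTF-8. The sample strings are made up:

```python
ascii_text = "Hello\n"
accented_text = "café\n"

# For ASCII-only text, characters and UTF-8 bytes match one to one
ascii_chars = len(ascii_text)
ascii_bytes = len(ascii_text.encode("utf-8"))
print(ascii_chars, ascii_bytes)

# The accented letter takes two bytes in UTF-8, so the counts diverge
accented_chars = len(accented_text)
accented_bytes = len(accented_text.encode("utf-8"))
print(accented_chars, accented_bytes)
```

As long as your input sticks to ASCII, both numbers agree, so the simpler len(text) is good enough for now.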
That’s it! You now have all three essential pieces of information, which you can display in the output.
Display the Line, Word, and Byte Counts
Finally, you can print the three variables in the expected order, starting with the number of lines, followed by the number of words and bytes. When you pass those variables as arguments to print(), it’ll automatically separate them with a single space for you:
src/wordcount.py
import sys


def main():
    text = sys.stdin.read()
    num_lines = text.count("\n")
    num_words = len(text.split())
    num_bytes = len(text)
    print(num_lines, num_words, num_bytes)
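If you’re curious about what print() does with several arguments, here’s a short sketch that compares it against a string built by hand. The hard-coded counts and the line variable are placeholders chosen for illustration:

```python
num_lines, num_words, num_bytes = 1, 1, 6

# By default, print() joins its arguments with a single space
print(num_lines, num_words, num_bytes)

# An equivalent string assembled manually with an f-string
line = f"{num_lines} {num_words} {num_bytes}"
print(line)
```

Both calls produce the same text, so passing the variables directly to print() saves you from formatting the string yourself.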
Save your file and rerun pytest one more time to verify that all acceptance criteria are met and that you’ve unlocked the next task. You can also use the examples provided in the task’s description to double-check your implementation’s correctness. Specifically, see whether your solution can cope with the different styles of newline characters:
$ echo -n -e "Hello\n" | wordcount
1 1 6
$ echo -n -e "Hello\r" | wordcount
0 1 6
$ echo -n -e "Hello\r\n" | wordcount
1 1 7
If everything went according to plan, then only the line feed character (\n) should be recognized as a legitimate line break.
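You can also reproduce these checks without a terminal. The sketch below wraps the three expressions from main() in a helper function, count_text(), which is a hypothetical name that isn’t part of the solution above, and runs it against the same inputs:

```python
def count_text(text):
    """Return the (lines, words, bytes) counts for ASCII-only text."""
    return text.count("\n"), len(text.split()), len(text)


# The same inputs that you piped into wordcount in this lesson
for sample in ["", "\n", "Hello\n", "Hello\r", "Hello\r\n"]:
    print(repr(sample), *count_text(sample))
```

Keeping the counting logic in a function that takes a string makes it easy to try out inputs that would be awkward to type by hand.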
Summary
That concludes this task. Now, you’ve got a rudimentary word count implementation that can read text from standard input and report the corresponding number of lines, words, and bytes.
💬 Have you arrived at a similar solution? If not, what did you do differently? Tell us in the comments section!
What’s Next?
🎯 Jump into the next lesson to get started with your next task.