Understand the Goal of the Challenge

In this lesson, you’ll get a bird’s-eye view of the coding challenge ahead of you. It’ll help you understand the intended outcome, your learning objectives, and any prerequisites you might need to review.

Prerequisites

This is an intermediate coding challenge in which you’re going to mimic the Unix wc command, so it helps to know how wc works. Don’t worry if you haven’t used it before, as you’ll find a brief explanation below. It also helps to be familiar with how command-line interfaces work and to have a basic grasp of file handling in programming.

Learning Outcomes

Solving this coding challenge is an excellent opportunity to practice your Python coding skills. By the end of it, you’ll have revisited or learned about the following concepts and techniques:

You’ll see how these might be used in a real-life project by following the best Python practices.

Project Demo

Once you’ve completed the challenge, you’ll have a command-line utility named wordcount, which you can run in the terminal to count the number of lines, words, bytes, and characters in one or more files. Don’t worry if the demonstration below feels overwhelming at first. You’ll get there in small steps!

Reading Data From Standard Input

When you run the wordcount command without any arguments, it’ll read bytes from standard input (stdin), letting you either directly type characters on the keyboard or use a Unix pipeline (|) to send data from another command:

Shell
      
$ wordcount
caffe
latte
 2  2 12

$ echo "caffe latte" | wordcount
 1  2 12

In the first case, you typed the words caffe and latte on two separate lines and terminated the data stream by pressing Ctrl+D, which signals End-of-File (EOF) to the program. According to the command’s output, there were two lines, two words, and twelve bytes in total. Then, you echoed the same words but on a single line and piped them to the wordcount command, which reported only one line this time.

Note that the echo command automatically appends a trailing newline character to the text. If you were to type the same text directly from standard input but without pressing the Enter key when you’re done, then you’d get a different result:

Shell

$ wordcount
caffe latte 0  2 11

The number of reported lines is zero because there were no newline characters (\n) found in the data stream. Because of this, the output appears on the same line as the input.

Note: Pressing Ctrl+D makes the terminal hand over whatever you’ve typed so far to the program. If the current line isn’t terminated by a newline, then the program receives that text and keeps waiting for more input. Pressing Ctrl+D a second time, with nothing left to hand over, signals end-of-file and confirms there’s nothing more to read.
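
The three results above can be reproduced with a few lines of Python. This is only a minimal sketch, not the sample solution: the function name count is hypothetical, and it assumes that a line is counted for every newline byte and that words are runs of bytes separated by whitespace:

```python
def count(data: bytes) -> tuple[int, int, int]:
    """Return the number of lines, words, and bytes in the given data."""
    lines = data.count(b"\n")  # A line only counts once it ends with a newline
    words = len(data.split())  # Split on any run of ASCII whitespace
    return lines, words, len(data)


print(count(b"caffe\nlatte\n"))  # (2, 2, 12)
print(count(b"caffe latte\n"))   # (1, 2, 12)
print(count(b"caffe latte"))     # (0, 2, 11)
```

Working on bytes rather than text keeps the byte count exact, which matters once non-ASCII characters come into play.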

Now, if you want to be more explicit, then you can optionally pass a special dash character (-) as an argument to achieve a similar effect:

Shell
      
$ wordcount -
caffe
latte
 2  2 12

Indicating standard input with a dash character is a common Unix convention. It also has the additional benefit of allowing you to read from standard input more than once. To do so, you can pass the dash character several times, for example:

Shell
      
$ wordcount - - -

Such an invocation would report three separate outputs, one for each data stream, which are treated individually.
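
One way to honor the dash convention is to turn each command-line argument into bytes before counting. In the sketch below, read_bytes is a hypothetical helper, and its stdin parameter is an assumption added only so that you can try the function without a real terminal:

```python
import io
import sys
from pathlib import Path
from typing import BinaryIO, Optional


def read_bytes(argument: str, stdin: Optional[BinaryIO] = None) -> bytes:
    """Read all bytes from standard input for "-" or from a file otherwise."""
    if argument == "-":
        stream = sys.stdin.buffer if stdin is None else stdin
        return stream.read()  # Returns once the stream reaches end-of-file
    return Path(argument).read_bytes()


fake_stdin = io.BytesIO(b"caffe\nlatte\n")
print(read_bytes("-", fake_stdin))  # b'caffe\nlatte\n'
```

Calling such a helper once per argument means that every dash gets its own read from standard input.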

Reading Data From a File

When you pass a valid path to an existing file in your file system instead of the dash character, then the wordcount command will read its contents and display the corresponding number of lines, words, and bytes, respectively:

Shell
      
$ wordcount file.txt
 1  2 12 file.txt

Notice that it also includes the path itself next to the numbers. That’s because you can specify more than one path in one go.

Reading Data From Multiple Files

You may supply the wordcount command with more than one path at a time. When you do, the command will append an extra line with the total number of lines, words, and bytes of all the specified files:

Shell
      
$ wordcount file.txt ~/books/index.csv
 1  2 12 file.txt
 1  3 16 /home/user/books/index.csv
 2  5 28 total

It’s also possible to repeat the same file—or the dash character—more than once:

Shell
      
$ wordcount file.txt file.txt file.txt
 1  2 12 file.txt
 1  2 12 file.txt
 1  2 12 file.txt
 3  6 36 total
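
The total line is a column-wise sum of the individual counts. Here’s a minimal sketch of that arithmetic, assuming each file’s counts are stored as a tuple of lines, words, and bytes. The name total is hypothetical:

```python
def total(counts: list[tuple[int, int, int]]) -> tuple[int, ...]:
    """Add up the counts column by column."""
    return tuple(sum(column) for column in zip(*counts))


print(total([(1, 2, 12), (1, 3, 16)]))  # (2, 5, 28)
print(total([(1, 2, 12)] * 3))          # (3, 6, 36)
```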

But what if a file doesn’t exist?

Handling Directories and Missing Files

When you specify an invalid path or a directory instead of a file, then you’ll see the following:

Shell
      
$ wordcount missing.txt
0 0 0 missing.txt (no such file or directory)

$ wordcount src/
0 0 0 src/ (is a directory)

The counts in these outputs indicate zero lines, words, and bytes, as the specified paths don’t point to valid files.
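
A program can tell these two cases apart before it tries to read anything. The sketch below uses pathlib for that. The describe function and its return convention are assumptions made for illustration, not necessarily how the sample solution does it:

```python
import tempfile
from pathlib import Path


def describe(argument: str) -> str:
    """Return an error note for an unreadable path, or an empty string."""
    path = Path(argument)
    if path.is_dir():
        return "is a directory"
    if not path.exists():
        return "no such file or directory"
    return ""


# Use a temporary directory so the demonstration doesn't depend on your files
with tempfile.TemporaryDirectory() as folder:
    missing = str(Path(folder) / "missing.txt")
    print(describe(folder))   # is a directory
    print(describe(missing))  # no such file or directory
```

When the note isn’t empty, the command can print zeros for all counts followed by the path and the note in parentheses.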

Selecting the Counts

By default, the wordcount command reports the number of lines, words, and bytes. However, you can request a specific count by using one of the optional flags:

Shell
      
$ wordcount file.txt
 1  2 12 file.txt

$ wordcount file.txt --lines
1 file.txt

$ wordcount file.txt --words
2 file.txt

$ wordcount file.txt --bytes
12 file.txt

You can also combine these flags to request a subset of the corresponding counts:

Shell
      
$ wordcount file.txt --lines --bytes
 1 12 file.txt

$ wordcount file.txt --words --bytes --lines
 1  2 12 file.txt

The selected counts carry over to all specified file paths and the sum at the bottom:

Shell
      
$ wordcount file.txt ~/books/index.csv --words --bytes
 2 12 file.txt
 3 16 /home/user/books/index.csv
 5 28 total

Note that you can’t change their order. The counts always appear in the same order regardless of the flags used.
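
A fixed display order is straightforward to enforce if you iterate over the known count names rather than over the flags as typed. The following sketch assumes the counts live in a dictionary and the flags arrive as a set of names. Both choices, as well as the names ORDER and select, are hypothetical:

```python
ORDER = ("lines", "words", "bytes")  # Fixed display order


def select(counts: dict[str, int], flags: set[str]) -> list[int]:
    """Pick the requested counts, falling back to all three by default."""
    requested = flags or set(ORDER)
    return [counts[name] for name in ORDER if name in requested]


counts = {"lines": 1, "words": 2, "bytes": 12}
print(select(counts, set()))                        # [1, 2, 12]
print(select(counts, {"bytes", "lines"}))           # [1, 12]
print(select(counts, {"words", "bytes", "lines"}))  # [1, 2, 12]
```

Because the loop walks through ORDER, the order in which you pass the flags has no effect on the output.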

Dealing With Unicode Characters

Under most circumstances, the number of bytes coincides with the number of characters. However, that’s not always the case. Consider the following Python string:

Python
      
>>> text = "caffè latte\n"

>>> len(text)
12

>>> len(text.encode("utf-8"))
13

There are twelve characters in the text above, including the newline (\n). But when you convert that string to a binary representation using the common UTF-8 encoding, then you end up with thirteen bytes. That’s because, unlike the plain Latin letter e, the letter è with a grave accent falls outside the ASCII range, so UTF-8 needs two bytes instead of one to represent its Unicode code point.

To account for that, your wordcount command gives you another option with which you can differentiate between the number of bytes and characters:

Shell
      
$ echo "caffè latte" | wordcount
 1  2 13

$ echo "caffè latte" | wordcount --chars
12

$ echo "caffè latte" | wordcount --lines --words --chars
 1  2 12

As a result, you can request a superset of the default counts with this additional flag:

Shell
      
$ echo "caffè latte" | wordcount --lines --words --bytes --chars
 1  2 12 13

Now, you have four distinct numbers describing the data. Their order is almost the same as before, except that the number of characters is now second to last, between the number of words and the number of bytes.
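
To get both numbers from the same input, you can count bytes on the raw data and characters on the decoded text. This is a minimal sketch that assumes the input is valid UTF-8. The name count_all is hypothetical:

```python
def count_all(data: bytes) -> tuple[int, int, int, int]:
    """Return lines, words, characters, and bytes, in that order."""
    text = data.decode("utf-8")  # Assumes the input is valid UTF-8
    return text.count("\n"), len(text.split()), len(text), len(data)


print(count_all("caffè latte\n".encode("utf-8")))  # (1, 2, 12, 13)
```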

Formatting the Numbers

Each number is right-aligned in a column whose width equals the number of digits in the largest of the selected counts:

Shell
      
$ wordcount ~/books/frankenstein.txt
  7742  78545 448937 /home/user/books/frankenstein.txt

Since the number of bytes is the largest in this case, with six digits, all numbers are aligned to match its width, occupying six characters each. This makes it easier to read the output, especially when processing multiple files:

Shell
      
$ wordcount file.txt ~/books/frankenstein.txt
     1      2     12 file.txt
  7742  78545 448937 /home/user/books/frankenstein.txt
  7743  78547 448949 total

Notice that the same formatting applies to all numbers across the whole output, including the line with the total count. However, when you select a different set of counts, then the alignment might look different:

Shell
      
$ wordcount file.txt ~/books/frankenstein.txt --lines --words
    1     2 file.txt
 7742 78545 /home/user/books/frankenstein.txt
 7743 78547 total

Now, all numbers occupy five spaces instead of six as before.
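
The alignment rule boils down to finding the widest number across all rows and padding every number to that width. Here’s a sketch of that idea, assuming each row is a pair of a list of numbers and a label. The name format_rows is hypothetical:

```python
def format_rows(rows: list[tuple[list[int], str]]) -> list[str]:
    """Right-align all numbers to the width of the largest one."""
    width = max(len(str(number)) for numbers, _ in rows for number in numbers)
    lines = []
    for numbers, label in rows:
        columns = " ".join(f"{number:>{width}}" for number in numbers)
        lines.append(f"{columns} {label}".rstrip())  # No label for stdin
    return lines


rows = [
    ([1, 2, 12], "file.txt"),
    ([7742, 78545, 448937], "frankenstein.txt"),
    ([7743, 78547, 448949], "total"),
]
for line in format_rows(rows):
    print(line)
```

Computing the width once, over every row including the total, is what keeps the columns lined up across the whole output.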

Non-Goals

As mentioned in the introduction, your custom wordcount command will mimic most but not all features of the original wc counterpart. Here’s what you won’t cover.

Binary Files

Although the original wc command can handle binary files to some extent, it wasn’t designed with this in mind. Therefore, your wordcount command will always assume that the supplied paths refer to text files.

Short Parameters

When invoking the wc command, you can either use the long or short form of its options. For example, you can type -l instead of --lines or -c instead of --bytes. In contrast, your custom wordcount command won’t support these short parameters. Instead, it’ll only recognize the long-form options like --lines and --bytes.

Other Parameters

Apart from selecting the counts, wc lets you specify when to print a line with the total counts (--total). It can also print the maximum display width, which is the length of the longest line (-L or --max-line-length). Finally, you can tell wc to read the list of input files to process from another file or from standard input using the --files0-from option. Again, you won’t cover any of these.

Sample Solution

At any point, you can download the complete source code of a sample solution of this coding challenge. However, you’re encouraged to try solving the problem on your own first to enhance your learning experience.

What’s Next?

🔧 Continue to the next lesson, where you’ll set up your work environment to tackle this challenge.