The purpose of this tutorial is to get an experienced Python programmer up to speed with the basics of the C language and how it’s used in the CPython source code. It assumes you already have an intermediate understanding of Python syntax.
That said, C is a fairly limited language, and most of its usage in CPython falls under a small set of syntax rules. Getting to the point where you understand the code is a much smaller step than being able to write C effectively. This tutorial is aimed at the first goal but not the second.
In this tutorial, you’ll learn:
- What the C preprocessor is and what role it plays in building C programs
- How you can use preprocessor directives to manipulate source files
- How C syntax compares to Python syntax
- How to create loops, functions, strings, and other features in C
One of the first things that stands out as a big difference between Python and C is the C preprocessor. You’ll look at that first.
Note: This tutorial is adapted from the appendix, “Introduction to C for Python Programmers,” in CPython Internals: Your Guide to the Python Interpreter.
The C Preprocessor
The preprocessor, as the name suggests, is run on your source files before the compiler runs. It has very limited abilities, but you can use them to great advantage in building C programs.
The preprocessor produces a new file, which is what the compiler will actually process. All the commands to the preprocessor start at the beginning of a line, with a # symbol as the first non-whitespace character.
The main purpose of the preprocessor is to do text substitution in the source file, but it can also include or exclude sections of code conditionally with #if or similar directives.
You’ll start with the most frequent preprocessor directive: #include.
#include
#include is used to pull the contents of one file into the current source file. There’s nothing sophisticated about #include. It reads a file from the file system, runs the preprocessor on that file, and puts the results into the output file. This is done recursively for each #include directive.
For example, if you look at CPython’s Modules/_multiprocessing/semaphore.c file, then near the top you’ll see the following line:
#include "multiprocessing.h"
This tells the preprocessor to pull in the entire contents of multiprocessing.h and put them into the output file at this position.
You’ll notice two different forms for the #include statement. One of them uses quotes ("") to specify the name of the include file, and the other uses angle brackets (<>). The difference comes from which paths are searched when looking for the file on the file system.
If you use <> for the filename, then the preprocessor will look only at system include files. Using quotes around the filename instead will force the preprocessor to look in the local directory first and then fall back to the system directories.
#define
#define allows you to do simple text substitution and also plays into the #if directives you’ll see below.
At its most basic, #define lets you define a new symbol that gets replaced with a text string in the preprocessor output.
Continuing in semaphore.c, you’ll find this line:
#define SEM_FAILED NULL
This tells the preprocessor to replace every instance of SEM_FAILED below this point with the literal text NULL before the code is sent to the compiler.
#define items can also take parameters, as in this Windows-specific version of SEM_CREATE:
#define SEM_CREATE(name, val, max) CreateSemaphore(NULL, val, max, NULL)
In this case, the preprocessor will expect SEM_CREATE() to look like a function call and have three parameters. This is generally referred to as a macro. The preprocessor substitutes the text of the three arguments directly into the output code.
For example, on line 460 of semaphore.c, the SEM_CREATE macro is used like this:
handle = SEM_CREATE(name, value, max);
When you’re compiling for Windows, this macro will be expanded so that line looks like this:
handle = CreateSemaphore(NULL, value, max, NULL);
In a later section, you’ll see how this macro is defined differently on Windows and other operating systems.
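To see both kinds of #define side by side, here’s a minimal sketch. The names BUFFER_SIZE, MAX_OF, and at_least_buffer_size are made up for illustration and don’t come from CPython:

```c
/* Hypothetical names for illustration; none of these come from CPython. */

/* Object-like macro: every later BUFFER_SIZE becomes the text 64. */
#define BUFFER_SIZE 64

/* Function-like macro: the arguments are pasted in as text, so each
   use of a parameter is wrapped in parentheses to keep the expansion
   from regrouping when an argument is itself an expression. */
#define MAX_OF(a, b) ((a) > (b) ? (a) : (b))

int at_least_buffer_size(int requested)
{
    /* After preprocessing, the compiler sees:
       return ((requested) > (64) ? (requested) : (64)); */
    return MAX_OF(requested, BUFFER_SIZE);
}
```

Because the substitution is purely textual, an argument with side effects, such as MAX_OF(i++, limit), may be evaluated more than once. That’s one of the ways a macro differs from a real function call.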
#undef
This directive erases any previous preprocessor definition from #define. This makes it possible to have a #define in effect for only part of a file.
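As a rough illustration of a definition that’s in effect for only part of a file, the sketch below defines a hypothetical SCALE macro, removes it, and then defines it again with a different value. The macro and function names are assumptions made up for this example:

```c
/* SCALE is a hypothetical macro used only for this illustration. */
#define SCALE 10

int scale_small(int x)
{
    return x * SCALE;   /* expands to x * 10 */
}

#undef SCALE            /* SCALE is no longer defined from here on */
#define SCALE 1000      /* a fresh definition for the rest of the file */

int scale_large(int x)
{
    return x * SCALE;   /* expands to x * 1000 */
}
```

Each function keeps the value that SCALE had at the point where the function’s text was preprocessed.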
#if
The preprocessor also supports conditional statements, which let you include or exclude sections of text based on certain conditions. Conditional statements are closed with the #endif directive and can also make use of #elif and #else for fine-tuned adjustments.
There are three basic forms of #if that you’ll see in the CPython source:
- #ifdef <macro> includes the subsequent block of text if the specified macro is defined. You may also see it written as #if defined(<macro>).
- #ifndef <macro> includes the subsequent block of text if the specified macro is not defined.
- #if <macro> includes the subsequent block of text if the macro is defined and evaluates to a nonzero value, which is what the preprocessor treats as true.
Note the use of “text” instead of “code” to describe what’s included or excluded from the file. The preprocessor knows nothing of C syntax and doesn’t care what the specified text is.
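The three forms can be sketched as follows. _WIN32 is a macro that compilers targeting Windows predefine, while PATH_SEPARATOR, DEBUG_LEVEL, and the two function names are hypothetical and exist only for this example:

```c
/* A sketch of the three forms. _WIN32 is predefined by compilers that
   target Windows; PATH_SEPARATOR, DEBUG_LEVEL, and the function names
   are hypothetical. */

#ifdef _WIN32
#define PATH_SEPARATOR '\\'
#else
#define PATH_SEPARATOR '/'
#endif

/* Supply a default only when nothing defined DEBUG_LEVEL earlier,
   for example on the compiler command line. */
#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 0
#endif

char path_separator(void)
{
    return PATH_SEPARATOR;
}

int debug_enabled(void)
{
#if DEBUG_LEVEL
    return 1;   /* kept only when DEBUG_LEVEL is nonzero */
#else
    return 0;   /* kept otherwise */
#endif
}
```

Only one branch of each conditional ever reaches the compiler. The other branch is dropped from the preprocessor output as if it had never been written.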
#pragma
Pragmas are instructions or hints to the compiler. In general, you can ignore these while reading the code as they usually deal with how the code is compiled, not how the code runs.
#error
Finally, #error displays a message and causes the preprocessor to stop executing. Again, you can safely ignore these for reading the CPython source code.
Basic C Syntax for Python Programmers
This section won’t cover all aspects of C, nor is it intended to teach you how to write C. It will focus on aspects of C that are different or confusing for Python developers the first time they see them.
General
Unlike in Python, whitespace isn’t important to the C compiler. The compiler doesn’t care if you split statements across lines or jam your entire program into a single, very long line. This is because it uses delimiters for all statements and blocks.
There are, of course, very specific rules for the parser, but in general you’ll be able to understand the CPython source just knowing that each statement ends with a semicolon (;), and all blocks of code are surrounded by curly braces ({}).
The exception to this rule is that if a block has only a single statement, then the curly braces can be omitted.
All variables in C must be declared, meaning there needs to be a single statement indicating the type of that variable. Note that, unlike Python, the data type that a single variable can hold can’t change.
Here are a few examples:
/* Comments are included between slash-asterisk and asterisk-slash */
/* This style of comment can span several lines -
   so this part is still a comment. */
// Comments can also come after two slashes
// This type of comment only goes until the end of the line, so new
// lines must start with double slashes (//).
int x = 0; // Declares x to be of type 'int' and initializes it to 0
if (x == 0) {
    // This is a block of code
    int y = 1; // y is only a valid variable name until the closing }
    // More statements here
    printf("x is %d y is %d\n", x, y);
}
// Single-line blocks do not require curly brackets
if (x == 13)
    printf("x is 13!\n");
printf("past the if block\n");
In general, you’ll see that the CPython code is very cleanly formatted and typically sticks to a single style within a given module.
if Statements
In C, if works generally like it does in Python. If the condition is true, then the following block is executed. The else and else if syntax should be familiar enough to Python programmers. Note that C if statements don’t need an endif because blocks are delimited by {}.
There’s a shorthand in C for short if … else statements called the ternary operator:
condition ? true_result : false_result
You can find it in semaphore.c where, for Windows, it defines a macro for SEM_CLOSE():
#define SEM_CLOSE(sem) (CloseHandle(sem) ? 0 : -1)
The return value of this macro will be 0 if the function CloseHandle() returns true and -1 otherwise.
Note: Boolean variable types are supported and used in parts of the CPython source, but they aren’t part of the original language. C interprets Boolean conditions using a simple rule: 0 or NULL is false, and everything else is true.
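Here’s a small sketch of the ternary operator and of the truth rule described in the note. The helper names are hypothetical and were chosen only for this example:

```c
#include <stddef.h>   /* NULL */

/* Hypothetical helpers for illustration. */

/* Ternary operator: an if/else that produces a value. */
int absolute_value(int n)
{
    return n < 0 ? -n : n;
}

/* Any nonzero integer counts as true. */
int is_true(int value)
{
    if (value) {
        return 1;
    }
    return 0;
}

/* A pointer is false only when it is NULL. */
int has_target(const void *pointer)
{
    return pointer ? 1 : 0;
}
```

Note that is_true(-1) returns 1. Unlike a comparison such as value > 0, a bare condition only asks whether the value is different from zero.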
switch Statements
Unlike Python, C also supports switch. Using switch can be viewed as a shortcut for extended if … else if chains. This example is from semaphore.c:
switch (WaitForSingleObjectEx(handle, 0, FALSE)) {
case WAIT_OBJECT_0:
    if (!ReleaseSemaphore(handle, 1, &previous))
        return MP_STANDARD_ERROR;
    *value = previous + 1;
    return 0;
case WAIT_TIMEOUT:
    *value = 0;
    return 0;
default:
    return MP_STANDARD_ERROR;
}
This performs a switch on the return value from WaitForSingleObjectEx(). If the value is WAIT_OBJECT_0, then the first block is executed. The WAIT_TIMEOUT value results in the second block, and anything else matches the default block.
Note that the value being tested, in this case the return value from WaitForSingleObjectEx(), must be an integral value or an enumerated type, and each case must be a constant value.
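Every case in the semaphore.c example ends with return. When a case doesn’t return, it normally ends with break, because otherwise execution continues into the next case. The sketch below uses a hypothetical days_in_month() function to show both break and several case labels sharing one block:

```c
/* days_in_month is a hypothetical function for illustration.
   It ignores leap years and returns -1 for an invalid month. */
int days_in_month(int month)
{
    int days;

    if (month < 1 || month > 12) {
        return -1;
    }

    switch (month) {
    case 2:
        days = 28;
        break;          /* without break, execution would continue into case 4 */
    case 4:
    case 6:
    case 9:
    case 11:            /* four labels share the same block */
        days = 30;
        break;
    default:            /* every other valid month */
        days = 31;
        break;
    }
    return days;
}
```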
Loops
There are three looping structures in C:
- for loops
- while loops
- do … while loops
for loops have syntax that’s quite different from Python:
for ( <initialization>; <condition>; <increment>) {
    <code to be looped over>
}
In addition to the code to be executed in the loop, there are three blocks of code that control the for loop:
- The <initialization> section runs exactly once when the loop is started. It’s typically used to set a loop counter to an initial value (and possibly to declare the loop counter).
- The <increment> code runs immediately after each pass through the main block of the loop. Traditionally, this will increment the loop counter.
- Finally, the <condition> is evaluated before the first pass and again after each <increment>. The loop ends as soon as this condition evaluates to false.
Here’s an example from Modules/sha512module.c:
for (i = 0; i < 8; ++i) {
    S[i] = sha_info->digest[i];
}
This loop will run 8 times, with i incrementing from 0 to 7, and will terminate when the condition is checked and i is 8.
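One way to internalize the order in which the three control sections run is to write the same loop as a while loop. The two hypothetical functions below are a sketch of that equivalence and compute the same result:

```c
/* Hypothetical functions for illustration. Both add up 0, 1, ..., limit - 1. */

int sum_below_for(int limit)
{
    int total = 0;
    for (int i = 0; i < limit; ++i) {
        total += i;
    }
    return total;
}

int sum_below_while(int limit)
{
    int total = 0;
    int i = 0;              /* <initialization>: runs once */
    while (i < limit) {     /* <condition>: checked before every pass */
        total += i;         /* body of the loop */
        ++i;                /* <increment>: runs after every pass */
    }
    return total;
}
```

With a limit of 0, the condition is false on the very first check, so the body never runs in either version.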
while loops are virtually identical to their Python counterparts. The do … while syntax is a little different, however. The condition on a do … while loop isn’t checked until after the body of the loop is executed for the first time.
There are many instances of for loops and while loops in the CPython code base, but do … while is far less common. Where it does appear, it’s often in the do { ... } while (0) idiom used to wrap multi-statement macros.
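The difference between while and do … while only shows up when the condition is false from the start. This sketch uses two hypothetical functions that count how many times a number can be halved before it reaches zero:

```c
/* Hypothetical functions for illustration. */

/* The condition is checked first, so the body may run zero times. */
int halvings_while(int n)
{
    int passes = 0;
    while (n > 0) {
        n /= 2;
        ++passes;
    }
    return passes;
}

/* The body runs once before the condition is checked for the first time. */
int halvings_do_while(int n)
{
    int passes = 0;
    do {
        n /= 2;
        ++passes;
    } while (n > 0);
    return passes;
}
```

For a starting value of 8, both functions return 4. For a starting value of 0, the while version returns 0 and the do … while version returns 1.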
Functions
The syntax for functions in C is similar to that in Python, with the addition that the return type and parameter types must be specified. The C syntax looks like this:
<return_type> function_name(<parameters>) {
    <function_body>
}
The return type can be any valid type in C, including built-in types like int and double as well as custom types like PyObject, as in this example from semaphore.c:
static PyObject *
semlock_release(SemLockObject *self, PyObject *args)
{
    <statements of function body here>
}
Here you see a couple of C-specific features in play. First, remember that whitespace doesn’t matter. Much of the CPython source code puts the return type of a function on the line above the rest of the function declaration. That’s the PyObject * part. You’ll take a closer look at the use of * a little later, but for now it’s important to know that there are several modifiers that you can place on functions and variables.
static is one of these modifiers. There are some complex rules governing how modifiers operate. For instance, the static modifier here means something very different than if you placed it in front of a variable declaration.
Fortunately, you can generally ignore these modifiers while trying to read and understand the CPython source code.
The parameter list for functions is a comma-separated list of variables, similar to what you use in Python. Again, C requires specific types for each parameter, so SemLockObject *self says that the first parameter is a pointer to a SemLockObject and is called self. Note that all parameters in C are positional.
Let’s look at what the “pointer” part of that statement means.
To give some context, the parameters that are passed to C functions are all passed by value, meaning the function operates on a copy of the value and not on the original value in the calling function. To work around this, callers will frequently pass in the address of some data so that the function can modify it.
These addresses are called pointers and have types, so int * is a pointer to an integer value and is of a different type than double *, which is a pointer to a double-precision floating-point number.
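The following sketch contrasts passing a value with passing its address. All four function names are hypothetical. In the code, &x takes the address of x, and *number reads or writes the value stored at the address held in number:

```c
/* Hypothetical functions for illustration. */

/* Receives a copy of the caller's int; the caller's variable is untouched. */
void add_one_to_copy(int number)
{
    number = number + 1;
}

/* Receives the address of the caller's int; *number is the int itself. */
void add_one_in_place(int *number)
{
    *number = *number + 1;
}

int result_after_copy(void)
{
    int x = 5;
    add_one_to_copy(x);
    return x;               /* still 5 */
}

int result_after_in_place(void)
{
    int x = 5;
    add_one_in_place(&x);   /* &x is the address of x */
    return x;               /* now 6 */
}
```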
Pointers
As mentioned above, pointers are variables that hold the address of a value. These are used frequently in C, as seen in this example:
static PyObject *
semlock_release(SemLockObject *self, PyObject *args)
{
    <statements of function body here>
}
Here, the self parameter will hold the address of, or a pointer to, a SemLockObject value. Also note that the function will return a pointer to a PyObject value.
Note: For an in-depth look at how to simulate pointers in Python, check out Pointers in Python: What’s the Point?
There’s a special value in C called NULL that indicates a pointer doesn’t point to anything. You’ll see pointers assigned to NULL and checked against NULL throughout the CPython source. This is important since there are very few limitations as to what values a pointer can have, and accessing a memory location that isn’t part of your program can cause very strange behavior.
On the other hand, if you try to access the memory at NULL, then on most platforms your program will crash immediately. This may not seem better, but it’s generally easier to figure out a memory bug if NULL is accessed than if a random memory address is modified.
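A common shape for this kind of check is a function that returns NULL to signal “nothing found” and a caller that tests for NULL before using the result. The sketch below assumes two hypothetical functions, find_value() and contains_value(), which aren’t part of CPython:

```c
#include <stddef.h>   /* NULL, size_t */

/* Hypothetical functions for illustration. */

/* Returns a pointer to the first matching item, or NULL if there is none. */
const int *find_value(const int *items, size_t count, int wanted)
{
    for (size_t i = 0; i < count; ++i) {
        if (items[i] == wanted) {
            return &items[i];
        }
    }
    return NULL;
}

int contains_value(int wanted)
{
    const int items[] = {3, 5, 8};
    const int *found = find_value(items, 3, wanted);

    if (found == NULL) {        /* check before dereferencing */
        return 0;
    }
    return *found == wanted;    /* safe: found points into items */
}
```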
Strings
C doesn’t have a string type. There’s a convention around which many standard library functions are written, but there’s no actual type. Rather, strings in C are stored as arrays of char (for ASCII) or wchar_t (for Unicode) values, each of which holds a single character. Strings are marked with a null terminator, which has a value 0 and is usually shown in code as \0.
Basic string operations like strlen() rely on this null terminator to mark the end of the string.
Because strings are just arrays of values, they cannot be directly copied or compared with operators like = or ==. The standard library has the strcpy() and strcmp() functions (and their wchar_t cousins) for doing these operations and more.
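Here’s a minimal sketch of how the null terminator and the standard library functions fit together. strlen(), strcpy(), and strcmp() are real functions from string.h, while manual_length(), same_text(), and copy_matches() are hypothetical helpers written for this example:

```c
#include <string.h>   /* strlen, strcpy, strcmp */

/* Counts characters up to, but not including, the '\0' terminator. */
size_t manual_length(const char *text)
{
    size_t length = 0;
    while (text[length] != '\0') {
        ++length;
    }
    return length;
}

/* Compares contents; a == b would only compare the two addresses. */
int same_text(const char *a, const char *b)
{
    return strcmp(a, b) == 0;   /* strcmp returns 0 when the strings match */
}

/* Copies source into a local array, then checks that the copy matches. */
int copy_matches(const char *source)
{
    char buffer[32];

    if (strlen(source) >= sizeof(buffer)) {
        return 0;               /* no room for the text plus its '\0' */
    }
    strcpy(buffer, source);     /* copies the '\0' as well */
    return same_text(buffer, source);
}
```

The length check before strcpy() matters because strcpy() copies until it finds the terminator and has no idea how large the destination array is.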
Structs
Your final stop on this mini-tour of C is how you can create new types in C: structs. The struct keyword allows you to group a set of different data types together into a new, custom data type:
struct <struct_name> {
    <type> <member_name>;
    <type> <member_name>;
    ...
};
This partial example from Modules/arraymodule.c shows a struct declaration:
struct arraydescr {
    char typecode;
    int itemsize;
    ...
};
This creates a new data type called arraydescr which has many members, the first two of which are a char typecode and an int itemsize.
Frequently structs will be used as part of a typedef, which provides a simple alias for the name. In the example above, all variables of the new type must be declared with the full name, as in struct arraydescr x;
You’ll frequently see syntax like this:
typedef struct {
    PyObject_HEAD
    SEM_HANDLE handle;
    unsigned long last_tid;
    int count;
    int maxvalue;
    int kind;
    char *name;
} SemLockObject;
This creates a new, custom struct type and gives it the name SemLockObject. To declare a variable of this type, you can simply use the alias, as in SemLockObject x;
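To make the two declaration styles concrete, the sketch below declares one struct with a tag and one through typedef, then reads and writes their members. The types tagged_point and Point and all three functions are hypothetical. The -> operator used here is the same one that appears in the sha_info->digest[i] expression from the loop example:

```c
/* tagged_point and Point are hypothetical types for illustration. */

struct tagged_point {       /* must be written as 'struct tagged_point' */
    int x;
    int y;
};

typedef struct {
    int x;
    int y;
} Point;                    /* the alias 'Point' is enough on its own */

/* With a pointer to a struct, members are reached with ->. */
void move_right(Point *point, int distance)
{
    point->x += distance;   /* shorthand for (*point).x += distance */
}

int moved_x(void)
{
    Point p = {1, 2};       /* x = 1, y = 2 */
    move_right(&p, 3);
    return p.x;             /* 4; a struct value uses the dot */
}

int tagged_sum(void)
{
    struct tagged_point q = {10, 20};
    return q.x + q.y;       /* 30 */
}
```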
Conclusion
This wraps up your quick walk through C syntax. Although this description barely scratches the surface of the C language, you now have sufficient knowledge to read and understand the CPython source code.
In this tutorial, you learned:
- What the C preprocessor is and what role it plays in building C programs
- How you can use preprocessor directives to manipulate source files
- How C syntax compares to Python syntax
- How to create loops, functions, strings, and other features in C
Now that you’re familiar with C, you can deepen your knowledge of the inner workings of Python by exploring the CPython source code. Happy Pythoning!