Combining Characters
In this lesson, you’ll learn about digraphs, ligatures, and accent symbols. The Unicode standard allows for the combination of characters into a single glyph. This is used for applying accents to characters and changing the skin tone of an emoji. Some kinds of common digraphs, or letter combinations, have their own code point. Others require the combination of two code points, with one dedicated to the accent symbol.
00:00 In the previous lesson, I took you deep into the guts of how UTF-8 encodes at the binary level. In this lesson, I’m going to be talking about digraphs and other ways of combining characters in Unicode. So, what’s a digraph?
00:12 A digraph is a pair of letters that make a single sound. This is similar to a ligature, which is the name for the actual combination of the glyphs to make the character.
00:20 Old English had a bunch of these inherited from Latin—you may have seen them before. The word archeology originally contained the grapheme “ash”, which is the combination of the a and the e. Some ligatures like these are single code points in Unicode. Others are combinations.
00:37 The 'æ' “ash” combination is a single character. It has a code point of E6. Hindi and many of the languages on the Indian subcontinent use Devanagari as their script.
00:50 The symbol for “ni” is one of these kinds of ligatures. It is a combination of two code points: code point 928 and 93F. You can see this in practice inside the REPL.
01:01 Starting with the code point for 'æ', this is a single code point with a single letter. Code point 928 is the symbol for the Devanagari sound “na”, which is part of the combined symbol for the sound “ni”. The 928 code point can be used on its own.
01:19 The 93F code point is not able to be used on its own. If you examine just this code point, you’ll notice that there’s a little dotted circle.
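As a rough sketch of what that REPL session shows, the snippet below builds the same characters from their code points. The variable names are made up for illustration and are not from the lesson.

```python
# Code points are written as escapes so the source stays plain ASCII.
ash = "\u00e6"           # 'æ': one code point, one letter
na = "\u0928"            # Devanagari "na": usable on its own
vowel_sign_i = "\u093f"  # combining vowel sign: shows a dotted circle alone
ni = na + vowel_sign_i   # two code points that render as one symbol

print(ash, len(ash))
print(ni, len(ni))
print([hex(ord(ch)) for ch in ni])
```

Although “ni” displays as a single symbol, `len()` counts code points, so it reports two.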
01:28 That dotted circle is the placeholder for the character that it’s being combined with. This ability to make combinations is also used to adjust emojis. The original emojis were all Simpsons-esque yellow.
01:42 Using code points, you can combine these to change the skin tone. 1F3FB is the lightest color possible.
02:00 By working your way up, you can continue to make the skin tone darker until you reach 1F3FF. This allows our Vulcan salute to be multicultural.
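Here is a minimal sketch of that combination. It assumes the Vulcan salute is code point 1F596, which is not stated in the transcript, and the names are illustrative only. Whether each pair renders as a single tinted emoji depends on your terminal and font.

```python
VULCAN_SALUTE = "\U0001F596"  # assumed code point for the Vulcan salute emoji

# The five skin tone modifiers run from 1F3FB (lightest) to 1F3FF (darkest).
modifiers = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# Appending a modifier to the base emoji changes its skin tone.
salutes = [VULCAN_SALUTE + modifier for modifier in modifiers]

for salute in salutes:
    print(salute, len(salute), [hex(ord(ch)) for ch in salute])
```

Each result is two code points long even though it displays as one symbol.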
02:12 A quick note on terminology. Although for this course I’ve strictly defined what a character is, using that word amongst non-programmers is going to cause some confusion.
02:22 If you speak to somebody like a graphic designer or a typographer, they may use the word character loosely to mean the symbol that shows up on the screen. When you’re combining characters in ligatures and using digraphs, this may not actually represent a single code point. As a word of caution, be careful when you’re talking to somebody about how this works.
02:42 Character, glyph, grapheme are all words that might be used to mean the symbol. Occasionally you’ll need to be very clear about whether or not you’re talking about the symbol or a code point. This ability to combine characters can also cause you problems.
02:59 You may recall this from an earlier lesson. It’s the phrase “da svidaniya” in Russian.
03:05 It’s 11 characters long. When I was first putting the lesson together, I copied and pasted this from the web, and while working with it, I noticed something odd.
03:20 The length didn’t actually match the number of characters. So, what’s going on here? That string looks the same. Well, let’s look at it with the hex_code_points() method.
03:36 There’s the first string.
03:41 And now the second. Everything looks good up until the eighth character. You’ll notice the difference here between code point 430 and code point e1. Let’s examine these more closely.
03:55 e1 is the symbol 'á', a single code point that already includes the accent. 430 and the code point that follows it combine to give the same result. This is why the two strings have different lengths: there are actually two different ways of showing this character.
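The comparison can be approximated with the sketch below. The two strings are a reconstruction from the code points mentioned above, not the exact text used in the video, and this `hex_code_points()` is a hypothetical stand-in for the course's helper.

```python
def hex_code_points(text):
    """Return each code point in text as a hex string (hypothetical helper)."""
    return [format(ord(ch), "x") for ch in text]

# Assumed reconstruction: the eighth character is the only difference.
precomposed = "до свид\u00e1ния"     # e1: 'á' as a single code point
combined = "до свид\u0430\u0301ния"  # 430 followed by the combining accent 301

print(len(precomposed), hex_code_points(precomposed))
print(len(combined), hex_code_points(combined))
print(precomposed == combined)
```

The strings look alike on screen but compare as unequal, and their lengths differ by one.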
04:09 This is called a homograph, or homoglyph: symbols from two different character sets that look the same or close to the same.
04:18 To give maximum flexibility in combining accents with characters, there’s an entire block in Unicode dedicated to just the accent symbols. The code point 301 seen here is the accent combined with the 'a' from the Cyrillic alphabet in the code earlier.
04:35 This is the longer combination from the string with the length 12.
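One way to inspect these code points yourself is with the standard library's `unicodedata` module, which the transcript does not use here. This is only a sketch, and the variable names are illustrative.

```python
import unicodedata

ACUTE = "\u0301"       # the accent on its own, from the combining marks block
cyrillic_a = "\u0430"  # the Cyrillic 'а'
latin_a_acute = "\u00e1"

for ch in (ACUTE, cyrillic_a, latin_a_acute):
    print(format(ord(ch), "x"), unicodedata.name(ch))

# A nonzero combining class marks a code point that attaches to its neighbor.
print(unicodedata.combining(ACUTE), unicodedata.combining(cyrillic_a))

accented = cyrillic_a + ACUTE
print(accented, len(accented))
```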
04:41 Homographs allow you to play some rather nasty tricks. Nothing special going on here. Now, what do you expect in value? Even someone who’s just learning how to program would now expect value to contain 4.
04:56 You may recall Python 3 allows you to specify an identifier using Unicode. The 'a' in the first value is from the ASCII table, and the 'а' from the second value is from the Cyrillic alphabet.
05:09 This homograph actually causes the first value and the second value to be two different identifiers. This is not a bug you’re going to be able to find. There’s an entire phishing attack based on this concept, where you manipulate URLs using these kinds of characters to convince someone they’re going to a safe site when they’re not. That’s enough dirty tricks for one day. In the next lesson, I’m going to talk about the functions built into Python you can use to help you examine and manipulate strings and code points.
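The identifier trick can be reproduced with the sketch below. It assumes the on-screen code assigned 3 and then 4, which the transcript implies but never shows. The Cyrillic letter is written as an escape and run through `exec()` so the difference stays visible in the source.

```python
latin_name = "value"          # all ASCII letters
cyrillic_name = "v\u0430lue"  # second letter is the Cyrillic 'а' (code point 430)

# Run two assignments that look identical when displayed.
namespace = {}
exec(f"{latin_name} = 3\n{cyrillic_name} = 4", namespace)

print(latin_name == cyrillic_name)  # the names only look equal
print(namespace[latin_name])        # the first assignment was never overwritten
print(namespace[cyrillic_name])
```

Because the two names are different identifiers, the second assignment creates a new variable instead of updating the first.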
William on July 2, 2020
I’m surprised that there aren’t any comments yet! You deserve better, Mr. Trudeau.
This was an •excellent• progression of lesson videos, Christopher.
Thank you.