In this lesson, you’ll learn about digraphs, ligatures, and accent symbols. The Unicode standard allows for the combination of characters into a single glyph. This is used for applying accents to characters and changing the skin tone of an emoji. Some kinds of common digraphs, or letter combinations, have their own code point. Others require the combination of two code points, with one dedicated to the accent symbol.
00:00 In the previous lesson, I took you deep into the guts of how UTF-8 encodes at the binary level. In this lesson, I’m going to be talking about digraphs and other ways of combining characters in Unicode. So, what’s a digraph?
00:20 Old English had a bunch of these inherited from Latin—you may have seen them before. The word archeology originally contained the grapheme “ash”, which is the combination of the a and the e. Some ligatures like these are single code points in Unicode. Others are combinations.
Starting with the code point for 'æ', this is a single code point with a single letter. This is the symbol for the Devanagari sound “na”, which is part of the combined symbol for the sound “ni”. The 0928 code point can be used on its own.
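As a rough sketch of the difference between a single code point and a combination, the snippet below prints the length and the Unicode names of each string. The transcript only names the 0928 code point, so the vowel sign used here to build “ni” (U+093F) is an assumption made for illustration.

```python
import unicodedata

ash = "\u00e6"        # 'æ', one code point for the whole ligature
na = "\u0928"         # Devanagari "na", usable on its own
ni = "\u0928\u093f"   # "na" plus a vowel sign (U+093F is assumed here)

for label, text in [("ash", ash), ("na", na), ("ni", ni)]:
    # len() counts code points, not the symbols you see on screen
    names = [unicodedata.name(ch) for ch in text]
    print(label, text, len(text), names)
```

The “ni” string should render as one symbol, yet it reports a length of 2 because it is built from two code points.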
01:28 That dotted circle is the placeholder for the character that it’s being combined with. This ability to make combinations is also used to adjust emojis. The original emojis were all Simpsons-esque yellow.
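Here is a minimal sketch of how an emoji can be adjusted by combination. The specific emoji (thumbs up, U+1F44D) and skin tone modifier (U+1F3FD) are picked for illustration and are not taken from the lesson. How the result renders depends on your terminal and font.

```python
thumbs_up = "\U0001F44D"     # the default yellow emoji (assumed example)
medium_tone = "\U0001F3FD"   # a skin tone modifier code point (assumed example)

toned = thumbs_up + medium_tone  # two code points, meant to render as one symbol

print(thumbs_up, len(thumbs_up))
print(toned, len(toned))
print([f"U+{ord(ch):04X}" for ch in toned])
```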
02:22 If you speak to somebody like a graphic designer or a typographer, they may use the word character loosely to mean the symbol that shows up on the screen. When you’re combining characters in ligatures and using digraphs, this may not actually represent a single code point. As a word of caution, be careful when you’re talking to somebody about how this works.
02:42 Character, glyph, grapheme are all words that might be used to mean the symbol. Occasionally you’ll need to be very clear about whether or not you’re talking about the symbol or a code point. This ability to combine characters can also cause you problems.
The e1 code point is the symbol 'á', an a with an accent, stored as a single code point. The 430 code point and the one following it are a combination that gets the same-looking result. This is why the two strings have different lengths: there are actually two different ways of showing this character.
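The comparison described here can be reproduced with something like the following sketch. The variable names are made up for illustration. The code points are the ones mentioned in the lesson: e1 on its own, and 430 followed by 301.

```python
import unicodedata

single = "\u00e1"          # one code point: a with an accent
combined = "\u0430\u0301"  # Cyrillic a followed by a combining accent

print(single, len(single))      # length 1
print(combined, len(combined))  # length 2, though it looks the same
print(single == combined)       # False: the code points differ

for ch in combined:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The two strings look alike on screen, but they compare as unequal and report different lengths.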
To give maximum flexibility in combining accents with characters, there’s an entire block in Unicode dedicated to just the accent symbols. The code point 301 seen here is the accent combined with the 'a' from the Cyrillic alphabet in the code earlier.
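To see that the accent is its own code point that can be attached to different letters, you could try a sketch like this one. The list of base letters is arbitrary, and `unicodedata.combining()` is used only to show that Python treats 301 as a combining mark (it returns a nonzero class for those).

```python
import unicodedata

ACUTE = "\u0301"  # the accent code point from the lesson

# A nonzero combining class marks this as a character that attaches
# to the one before it instead of standing alone.
print(unicodedata.name(ACUTE), unicodedata.combining(ACUTE))

# The same accent can follow any base letter (letters chosen arbitrarily).
accented = [base + ACUTE for base in "aeiou"]
print(accented)
print([len(text) for text in accented])
```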
Homographs allow you to play some rather nasty tricks. Nothing special going on here. Now, what do you expect in value? Even someone who’s just learning how to program would now expect value to contain
05:09 This homograph actually causes the first value and the second value to be two different identifiers. This is not a bug you’re going to be able to find. There’s an entire phishing attack based on this concept, where you manipulate URLs using these kinds of characters to convince someone they’re going to a safe site when they’re not. That’s enough dirty tricks for one day. In the next lesson, I’m going to talk about the functions built into Python you can use to help you examine and manipulate strings and code points.
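The homograph trick with identifiers can be sketched as follows. This is not the code from the lesson: the assigned numbers are placeholders, and swapping the Latin a for the Cyrillic 430 code point is an assumption about which letter was replaced. The names are built from escape sequences and run through `exec()` so the difference stays visible in the source.

```python
latin_name = "value"           # all Latin letters
lookalike_name = "v\u0430lue"  # Cyrillic a (430) in place of the Latin a

# Build two assignments that look like they target the same variable.
source = f"{latin_name} = 1\n{lookalike_name} = 2\n"

namespace = {}
exec(source, namespace)

print(latin_name == lookalike_name)  # False: different code points
print(namespace[latin_name], namespace[lookalike_name])  # two separate variables
```

The second assignment does not overwrite the first, because Python sees two distinct identifiers even though they render the same way.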