September 30, 2020
UTF-8, which stands for 8-bit Unicode Transformation Format, is an encoding method for Unicode characters. It represents each character as a sequence of one to four 8-bit units known as code units.
Since a computer only understands numbers, it needs a system that can process characters and symbols in binaries. Character encoding is a way to do that.
For the longest time, we used the American Standard Code for Information Interchange (ASCII) — an encoding standard for electronic communication. However it can only represent English characters.
Consequently, it has meant that non-English speaking countries have had to invent their own standards. You can’t open a document written in Japanese characters, as an example, if your computer doesn’t support the encoding it uses.
There's also the issue of the same number being assigned to different characters. For example, in Latin/Cyrillic encoding (ISO 8859-5), the letter "Л" is assigned the number 187. However, in Latin-1 (ISO 8859-1), that same number represents the symbol "»".
Enter UTF-8, a method developed by Ken Thompson and Rob Pike that overcomes these limitations and unifies the competing encoding methods. It solves the three problems described above: ASCII's English-only coverage, incompatible national standards, and overlapping number assignments.
UTF-8 is an encoding method, while Unicode is a character set: a map that groups characters and assigns each one a numeric value.
These values are called code points and are written in hexadecimal notation to simplify how binaries are expressed.
A single hexadecimal digit represents a 'nibble', which is equivalent to half a byte. Therefore, one byte can be written with only two hex digits.
For example, the decimal value of “A” in Unicode is 65, but its code point is written as U+0041. The “U” stands for Unicode, and 41 is the hex value.
Meanwhile, the leading zeros pad the value to four hex digits, the conventional minimum for writing a code point in U+ notation.
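A quick Python sketch (used here purely for illustration) shows the relationship between a character, its decimal value, and its U+ code point notation:

```python
# Inspect a character's Unicode code point and its U+ notation.
ch = "A"
code_point = ord(ch)          # decimal value: 65
print(code_point)             # 65
print(hex(code_point))        # 0x41
print(f"U+{code_point:04X}")  # U+0041 — zero-padded to four hex digits
print(chr(0x0041))            # A — back from the code point to the character
```

The `:04X` format spec does exactly what the article describes: it converts to hex and pads with leading zeros to four digits.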
Unicode can support up to 1,114,112 characters. However, only 1,111,998 of them can be encoded in UTF-8. The reason is that 2,048 of these code points are surrogates: values reserved so that UTF-16 can represent characters beyond the first plane as high/low pairs. They are only meaningful in UTF-16 and are invalid on their own.
Additionally, 66 of them are noncharacters, reserved for internal use only.
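In Python, trying to encode a lone surrogate demonstrates that these code points are invalid in UTF-8 (a small sketch, not part of the original article):

```python
# Lone surrogate code points (U+D800 through U+DFFF) cannot be encoded in UTF-8.
surrogate = "\ud800"  # a high surrogate with no matching low surrogate
try:
    surrogate.encode("utf-8")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)

# The surrogate range covers exactly the 2,048 code points the article mentions.
print(0xDFFF - 0xD800 + 1)  # 2048
```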
As a result, Unicode can cover pretty much all language systems currently known. Aside from UTF-8, it can also be encoded using the UTF-16 or UTF-32 formats.
Unicode has 17 levels known as planes, each of them containing 65,536 characters. The first plane (plane 0) is called the Basic Multilingual Plane. It covers commonly used languages and writing systems.
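Since each plane holds exactly 65,536 (0x10000) code points, you can find a character's plane by integer-dividing its code point by 0x10000. A brief Python illustration (the helper name `plane` is my own):

```python
# Each plane holds 0x10000 (65,536) code points, so integer division
# by 0x10000 gives the plane number of any character.
def plane(ch):
    return ord(ch) // 0x10000

print(plane("A"))    # 0 — Basic Multilingual Plane
print(plane("€"))    # 0 — still the BMP
print(plane("😀"))   # 1 — a supplementary plane
print(17 * 0x10000)  # 1114112 total code points across 17 planes
```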
The character set is constantly updated to include more characters. Currently, there’s a total of 143,859 mapped characters in Unicode 13.0.
Character encoding transforms a character's code point into binary. The code point is usually written in hexadecimal first, since each hex digit converts directly into four bits.
UTF-8 then lays the character's bits into a byte sequence that matches its format. The first byte carries a prefix code that marks where a character's sequence begins and how many bytes (1 to 4) it needs.
Characters that need only one byte have 0 as their first bit. Meanwhile, characters that need multiple bytes start with a run of 1s: a character that needs three bytes, for instance, begins with 1110xxxx. Each following byte in a multi-byte sequence starts with 10, which marks it as a continuation of the current character.
Based on our example, the character's binary sequence will look like this: 1110xxxx 10xxxxxx 10xxxxxx.
The encoding fills in the x placeholders with the code point's binary digits, starting from the right. The bit value increases as you move to the left, which is why the leftmost digit is called the most significant bit.
Once this process is done, the computer will display the intended character.
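The whole walkthrough can be checked in a few lines of Python (an illustrative sketch; "€", code point U+20AC, is a convenient three-byte example):

```python
# Verify the 1110xxxx 10xxxxxx 10xxxxxx layout for a three-byte character.
ch = "€"                       # code point U+20AC
data = ch.encode("utf-8")
print([f"{b:08b}" for b in data])
# ['11100010', '10000010', '10101100']
#   ^^^^ leading byte 1110...   ^^ continuation bytes both start with 10

# Reassemble the code point from the payload bits (the x's):
cp = ((data[0] & 0b00001111) << 12) \
   | ((data[1] & 0b00111111) << 6) \
   | (data[2] & 0b00111111)
print(hex(cp))                 # 0x20ac — we got the original code point back
```

Masking off each byte's prefix and concatenating the remaining bits recovers U+20AC exactly, confirming that only the x positions carry the character's value.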
ASCII is another character set for encoding. It precedes Unicode and uses the 7-bit system, which limits it to 128 code points.
Unlike UTF-8, ASCII covers only English characters. It was later extended to 8 bits so it could fit 256 characters. However, that's still not enough for non-English text.
UTF-8, on the other hand, uses a minimum of 8 binary digits to a maximum of 32 digits. Thanks to this attribute, it can contain more characters and binary combinations. It’s not surprising that UTF-8 is the most popular encoding system used on the internet.
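The variable width is easy to see in Python (a small sketch with one example character per byte length):

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["A", "Л", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex())
# A  U+0041  1 byte
# Л  U+041B  2 bytes
# €  U+20AC  3 bytes
# 😀 U+1F600 4 bytes
```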
UTF-8 is also a superset of ASCII: it is backward compatible because it encodes the English characters exactly as ASCII does.
To accommodate ASCII's 7-bit system, UTF-8 places a 0 in front of the seven bits of an ASCII character's binary value. That same leading 0 also tells the system to use only one byte, because every ASCII character fits in one.
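A short Python check (illustrative only) confirms both points: the bytes are identical, and the first bit is 0:

```python
# Every ASCII character encodes to the identical single byte in UTF-8,
# which is why valid ASCII text is automatically valid UTF-8.
text = "Hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True
print(f"{utf8_bytes[0]:08b}")     # 01001000 — the leading 0 marks a one-byte character
```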
Another difference is that the 8-bit extensions of ASCII were never standardized, while UTF-8 is a single universal standard. Before Unicode, non-English-speaking countries each invented their own extended encodings, causing incompatibility issues.