§11.5. Internationalization

Designing a web application that is ready for an international audience involves more than text translation:

Problem → Consequence

  • Dates, times, currencies and addresses vary between countries in subtle ways → each locale requires separate formatting and parsing
  • Most languages require non-ASCII characters → applications must support Unicode
  • Translations in some languages can be shorter or longer than others → fixed-size layouts may look strange in different translations
  • Some languages (e.g., Arabic, Hebrew, Urdu) are written right-to-left → these languages may require a site’s entire page layout to be reversed or mirrored
  • Colors and images have different meanings in different cultures → each locale may require separate style sheets
  • Languages have different sort orders for names and words → each locale needs a sorting function
  • Some languages vary based on gender, status and quantity → translations may depend on contextual information that English web applications typically do not capture

There are two challenges:

  1. Designing an application for international audiences. This is called internationalization or i18n (an ‘i’, then 18 letters, then an ‘n’).

  2. Taking an internationalized application and adapting it to a specific location or locale. This is called localization or l10n (an ‘l’, then 10 letters, then an ‘n’).

Small steps have a large impact

As with a11y, small changes can have an enormous impact.

In particular, two simple techniques can assist internationalization efforts:

  • Avoid making assumptions about user input (e.g., ask for a user’s ‘name’, rather than separately requesting ‘given name’ and ‘family name’)

  • Use Unicode to store text wherever possible (fortunately, JavaScript does this by default!)

Unicode

In the 1960s and 1970s, the United States pioneered much of the technology of the internet and telecommunications. The market power of American technology companies ensured the global adoption of their standards. The legacy of this influence is technology biased toward English.

All networks today transmit data in 8-bit bytes. Many of the original protocols of the internet (including HTTP, SMTP, POP, IMAP, Gopher, Finger, FTP) assumed text would be encoded and transmitted in ASCII (the American Standard Code for Information Interchange).

An 8-bit byte can represent 256 different values. In ASCII, each byte represents a separate character. ASCII defines a mapping for the first 128 values in a byte (the mapping appears in the following table). The remaining 128 values were free for application-specific uses. For example, the extended-ASCII character set defines mathematical symbols and diacritics/accents (áàä) for these remaining values.

 0 → NUL    32 → SP     64 → “@”    96 → “`”
 1 → SOH    33 → “!”    65 → “A”    97 → “a”
 2 → STX    34 → “"”    66 → “B”    98 → “b”
 3 → ETX    35 → “#”    67 → “C”    99 → “c”
 4 → EOT    36 → “$”    68 → “D”   100 → “d”
 5 → ENQ    37 → “%”    69 → “E”   101 → “e”
 6 → ACK    38 → “&”    70 → “F”   102 → “f”
 7 → BEL    39 → “'”    71 → “G”   103 → “g”
 8 → BS     40 → “(”    72 → “H”   104 → “h”
 9 → HT     41 → “)”    73 → “I”   105 → “i”
10 → LF     42 → “*”    74 → “J”   106 → “j”
11 → VT     43 → “+”    75 → “K”   107 → “k”
12 → FF     44 → “,”    76 → “L”   108 → “l”
13 → CR     45 → “-”    77 → “M”   109 → “m”
14 → SO     46 → “.”    78 → “N”   110 → “n”
15 → SI     47 → “/”    79 → “O”   111 → “o”
16 → DLE    48 → “0”    80 → “P”   112 → “p”
17 → DC1    49 → “1”    81 → “Q”   113 → “q”
18 → DC2    50 → “2”    82 → “R”   114 → “r”
19 → DC3    51 → “3”    83 → “S”   115 → “s”
20 → DC4    52 → “4”    84 → “T”   116 → “t”
21 → NAK    53 → “5”    85 → “U”   117 → “u”
22 → SYN    54 → “6”    86 → “V”   118 → “v”
23 → ETB    55 → “7”    87 → “W”   119 → “w”
24 → CAN    56 → “8”    88 → “X”   120 → “x”
25 → EM     57 → “9”    89 → “Y”   121 → “y”
26 → SUB    58 → “:”    90 → “Z”   122 → “z”
27 → ESC    59 → “;”    91 → “[”   123 → “{”
28 → FS     60 → “<”    92 → “\”   124 → “|”
29 → GS     61 → “=”    93 → “]”   125 → “}”
30 → RS     62 → “>”    94 → “^”   126 → “~”
31 → US     63 → “?”    95 → “_”   127 → DEL

ASCII does not support international languages. For example, Japan expects students to know the 2,136 characters in the Jōyō kanji (常用漢字) character list. The Zhonghua Zihai (中華字海) Chinese dictionary is said to contain over 85,500 different characters. None of these characters appear in the ASCII table. They could not possibly appear, because the byte-per-character assumption of ASCII allows for only 256 possible characters.

Many nations around the world attempted to resolve this problem by defining new character encoding standards. Countries with small alphabets defined ASCII variations, adopting new meanings for bytes with values greater than 127. East Asian nations with large character sets introduced multiple-byte encodings such as Extended Unix Code (EUC), which provided a way for Japanese, Korean and simplified Chinese characters to be encoded in two-byte combinations. Unfortunately, any given character encoding standard was incompatible with the others. The incompatibilities resulted in garbled text (or Mojibake) and excluded the possibility of writing documents containing multiple languages.

In the late 1980s, efforts began to produce a global character encoding standard that is today known as Unicode.

Initially, Unicode aimed to encode all living human languages in a two-byte encoding. Two bytes is enough to represent 65,536 distinct values. Thus, many technologies designed in the 1990s (including JavaScript, Java and Windows NT) represent text strings as arrays of 16-bit numbers (rather than the 8-bit arrays used in ASCII).

16-bit Unicode quickly ran out of space. Consequently, Unicode was expanded to 21 bits, theoretically allowing for 2,097,152 characters (though not all of these are valid, for technical reasons).

Unicode assigns each possible character (or ‘code point’) a numeric value. However, those values must be stored in memory and transmitted over the internet, so Unicode defines conversions of code points into sequences of 8-bit bytes.

There are three encodings in widespread use:

UTF-8

UTF-8 has become ubiquitous as the standard encoding of Unicode for transmission over a network or storage in files.

UTF-8 uses a variable-length encoding, storing each code point in one, two, three or four bytes depending on the value. More frequent characters on the internet (such as the English alphabet) take up less space than more obscure characters (such as Egyptian Hieroglyphics): [1]

  • Code points 0–127 (7-bits) stored as-is in a single byte: 0xxxxxxx

  • Code points 128–2047 (11 bits) stored in two bytes: 110xxxxx 10xxxxxx

  • Code points 2048–65535 (16 bits) are stored in three bytes: 1110xxxx 10xxxxxx 10xxxxxx

  • Code points 65536–1114111 (21 bits) are stored in four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
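For example, the standard TextEncoder API (available in modern browsers and Node.js) always encodes JavaScript strings to UTF-8, so the variable length can be observed directly:

// TextEncoder always produces UTF-8, so the byte count per character is visible:
const utf8 = new TextEncoder();
console.log(utf8.encode('a').length);  // 1 byte  (code point 97)
console.log(utf8.encode('é').length);  // 2 bytes (code point 233)
console.log(utf8.encode('愛').length);  // 3 bytes (code point 24859)
console.log(utf8.encode('😺').length); // 4 bytes (code point 128570)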

UTF-16

UTF-16 is a scheme whose main application today is allowing for full Unicode support on systems originally designed when Unicode was only 16 bits. [2] UTF-16 is rarely used in network transmission, but it is extremely relevant to JavaScript developers because JavaScript strings are UTF-16 encoded Unicode. UTF-16 is a variable-length encoding. The vast majority of characters on the internet take two bytes, whereas rare code points require four bytes: [3]

  • Code points 0–55295 and 57344–65535 are stored as-is, in two bytes: xxxxxxxx xxxxxxxx

  • Code points 65536–1114111 are stored by first subtracting 65536 (leaving a 20-bit value) and then splitting those 20 bits across four bytes: 110110xx xxxxxxxx 110111xx xxxxxxxx (these two 16-bit units are the high and low surrogates; a sketch of this arithmetic follows)
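The arithmetic is small enough to sketch directly in JavaScript (the toSurrogatePair helper below is illustrative, not part of any library):

// Illustrative sketch: split a code point above 65535 into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits    -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);      // U+1F600 is '😀'
console.log(high.toString(16), low.toString(16));  // d83d de00
console.log(String.fromCharCode(high, low));       // '😀'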

UTF-32

UTF-32 is a fixed-length encoding of Unicode. In UTF-32, each Unicode code point is stored directly as a 32-bit integer (i.e., in 4 bytes). This encoding is not memory efficient. However, it is sometimes used in text-processing libraries because its fixed-length can speed up text processing.

Reflection: JavaScript string encodings

JavaScript strings offer two functions with similar names:

  • str.charCodeAt(i)

  • str.codePointAt(i)

What is the difference between these methods? (You may need to read the documentation for these methods)

Reflection: JavaScript string lengths

Can you use character encoding to explain the following:

  • 'a'.length is equal to 1. So, why does '😺'.length equal 2 in JavaScript, when '😺' is just one character?

  • Why does 'ab'.charCodeAt(1) equal 98 but '😺b'.charCodeAt(1) is equal to 56890?

Unicode is a very complex system. It has support for many challenges of dealing with human languages:

  • Languages vary between left-to-right (e.g., English) or right-to-left (e.g., Arabic and Hebrew) writing systems. Sometimes both directions are used in a single sentence.

  • Languages have different rules for translating between uppercase and lowercase

  • Some languages do not have concepts of uppercase or lowercase.

  • The same character might be rendered slightly differently in different languages.

  • Characters in one language might look similar or identical to characters in another language [4]

  • In some languages, the appearance of some characters depends on other nearby characters

  • Characters in some languages (e.g., Korean) are combinations of simpler symbols

  • Some languages use diacritics in formal or religious texts but not in informal writing

Exercise: Bidirectional text

How does Unicode handle bidirectional text (i.e., text that includes left-to-right and right-to-left writing systems)? Spend a moment researching on the web.

Exercise: Internationalization audit

Perform a quick internationalization audit on your web application or an application that you regularly use:

Enter non-English text into user inputs and form fields (e.g., create a username out of Thai and Emoji: สวัสดี😃). Does the text appear correctly when the system shows it later?

Internationalization Frameworks

After you have checked that your application works with Unicode, the next step in internationalization is to translate the application into multiple languages.

Fortunately, there are many frameworks for managing translations. Translation typically involves extracting strings into separate resource files and then deploying the application with the translated resources.

In Angular’s internationalization framework, this is achieved by adding the i18n attribute to HTML elements in the template:

<h1 i18n>Welcome</h1>
<p i18n>Hello, world!</p>

Angular provides the ng extract-i18n command for extracting all of these strings into a translation file (an XLIFF file, by default):

...
    <source>Welcome</source>
    <target>Bienvenue</target>
...
    <source>Hello, world!</source>
    <target>Bonjour, le monde!</target>
...

Angular will then (with an appropriate configuration in angular.json) build translated versions of the project with the ng build command.
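A minimal sketch of the relevant angular.json entries might look like the following (the project name my-app and the file path are illustrative, and the build target normally contains other options as well):

...
    "projects": {
        "my-app": {
            "i18n": {
                "sourceLocale": "en-US",
                "locales": {
                    "fr": "src/locale/messages.fr.xlf"
                }
            },
            "architect": {
                "build": {
                    "options": {
                        "localize": true
                    }
                }
            }
        }
    }
...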

In React, the i18next framework has a similar approach. Translatable strings are defined using either the t function or <Trans> component:

<h1>{t('Welcome')}</h1>
<p><Trans>Hello, world!</Trans></p>

Translations are defined using JavaScript dictionaries:

...
    resources: {
        fr: {
            translations: {
                "Welcome": "Bienvenue",
                "Hello, world!": "Bonjour, le monde!"
            }
        }
    }
...
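A minimal sketch of wiring such a dictionary into i18next for a React application might look like this (the locale and fallback choices are illustrative; the option names follow i18next’s documented API):

import i18n from 'i18next';
import { initReactI18next } from 'react-i18next';

i18n
  .use(initReactI18next)         // makes the t function and <Trans> available to React
  .init({
    lng: 'fr',                   // active locale
    fallbackLng: 'en',           // used when a translation is missing
    ns: ['translations'],
    defaultNS: 'translations',   // matches the namespace used in the dictionary above
    resources: {
      fr: {
        translations: {
          "Welcome": "Bienvenue",
          "Hello, world!": "Bonjour, le monde!"
        }
      }
    }
  });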

1. In the encodings, I use 'x' to denote bits of the original code point.
2. 16-bit Unicode is correctly known as ‘UCS-2’.
3. In the encodings, I use 'x' to denote bits of the original code point.
4. This issue raises philosophical questions about language: is an ‘e’ in French the same as an ‘e’ in English? What about ‘爱’ in simplified Chinese versus ‘愛’ in traditional Chinese? Should a mathematical subtraction (‘-’) be the same character as the hyphen used to create compound-words (‘-’)?