Unicode Glossary

Glyph	A glyph is a particular image which represents a character or part of a 
character.

Coded Character Set	A mapping from a set of abstract characters to the set of 
non-negative integers. This range of integers need not be contiguous.

Locale	A specific language, geographic location and character set (and 
sometimes script). An example is 'fr_FR.ISO-8859-1', although locale strings 
are seldom standardized across platforms at present. Often only 
language_country is specified, or even just language. In a client-server 
environment (like the web), three locales are usually considered: the server, 
data, and client locales.
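
As a rough illustration of the language_country.charset layout described above, 
the following Python sketch splits a locale string into its parts. The helper 
name parse_locale and the LocaleParts type are hypothetical, and real locale 
strings vary by platform, so this covers only the simple form shown in the 
example.

```python
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    country: Optional[str]
    charset: Optional[str]


def parse_locale(tag: str) -> LocaleParts:
    """Split 'language[_country][.charset]' into its parts (hypothetical helper)."""
    base, _, charset = tag.partition(".")      # charset follows the first dot
    language, _, country = base.partition("_")  # country follows the underscore
    return LocaleParts(language, country or None, charset or None)


print(parse_locale("fr_FR.ISO-8859-1"))
print(parse_locale("fr_FR"))
print(parse_locale("fr"))
```

Missing parts come back as None, which matches the note that often only 
language_country, or just language, is given.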

Internationalization (i18N)	The technical aspects (character sets, date 
formats, sorting, number formatting, string resources) of supporting multiple 
locales in one product.

Localization (L10N)	The practical aspects (language, custom, fashion, color, 
etc.) of expressing an application in a particular locale. Roughly, i18N is 
considered an engineering process while L10N is considered a translation 
process.

Globalization (g11n)	The cultural aspects of supporting multiple locales in a 
non-offensive and universally intuitive manner. Think Olympic or airport 
signage.

Canonicalization (c14n)	The process of standardizing text according to 
well-defined rules.

Japanization	The conversion of a product into the Japanese language and 
character set. Four scripts are used in Japanese computer text: hiragana, 
katakana, kanji and romaji (the Latin alphabet). Kanji characters sometimes 
carry ruby, also known as furigana ("attached character"), annotation above 
them to indicate irregular or difficult readings, such as those of personal or 
geographic names, and to help school children. Mojibake means "scrambled 
character" and describes the unreadable text that appears on electronic 
displays when the wrong character decoding is used. Because kanji may have 
multiple readings (pronunciations) depending on context, machine conversion to 
hiragana is unreliable. The kinsoku rules forbid certain characters from 
starting or ending a line; for example, the period that ends a Japanese 
sentence may not wrap to the beginning of a line.
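
Mojibake is easy to reproduce. The Python sketch below encodes a Japanese word 
as Shift_JIS and then decodes the bytes as ISO-8859-1; the choice of these two 
encodings is only an assumption for illustration, since any mismatched pair 
produces a similar effect.

```python
original = "\u6587\u5b57\u5316\u3051"  # the word "mojibake" written in Japanese

# Encode with one character set, then decode with the wrong one.
raw = original.encode("shift_jis")
garbled = raw.decode("latin-1")
print(ascii(garbled))  # unrelated Latin-1 characters, shown as escapes

# The bytes themselves are undamaged, so decoding them correctly recovers the text.
repaired = garbled.encode("latin-1").decode("shift_jis")
print(repaired == original)
```

The text is recoverable here only because ISO-8859-1 maps every byte value to 
some character; with other wrong decodings, information can be lost.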

CJKV	Chinese, Japanese, Korean and Vietnamese are often considered together 
because they all use multi-byte encodings.

Unicode	Unified Code for characters. Current version is 3.0. Most modern 
character sets have been incorporated already, and also many ancient ones. (I 
have been informed that Indonesian Jawi is represented by Arabic and Extended 
Arabic codepoints. I need to double-check that, since there is also an Indic 
script used on Java.) Unicode is a complex character set that is unlike ASCII 
in many ways, among them: a glyph may be composed from multiple codepoints in 
more than one ordering; national character sets may not consist of contiguous 
codepoints; symbols such as bullets, smiley faces and braille are included; 
binary sorting of Unicode character sequences is likely to be meaningless 
unless the sequences are normalized first.

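UTF-32	32-bit Unicode 3.0 Transformation Format, a fixed-width encoding that 
uses one 32-bit unit per codepoint.
UTF-16	16-bit Unicode 3.0 Transformation Format is a 16-bit encoding in which 
pairs of 16-bit surrogates address codepoints above U+FFFF, a range set aside 
for private characters and future use (Chinese characters, ancient languages, 
special symbols, etc.).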
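
The surrogate mechanism is plain arithmetic. The Python sketch below splits a 
codepoint above U+FFFF into a high and a low surrogate; the function name 
to_surrogates is hypothetical, and the result is cross-checked against Python's 
built-in UTF-16 encoder.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)      # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")          # D834 DD1E

# Cross-check against Python's own encoder (big-endian, no byte order mark).
assert chr(0x1D11E).encode("utf-16-be") == bytes.fromhex("D834DD1E")
```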

UTF-8	8-bit Unicode 3.0 Transformation Format is a variable width encoding form 
updated from UTF-2, often used with older C libraries and to save space with 
European text. A character encoded in UTF-8 may be 1 to 6 octets long, although 
codepoints in the Unicode range (up to U+10FFFF) need at most 4 octets. It is 
defined in RFC 2279.
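
To make the variable width concrete, this short Python sketch measures how many 
octets a few sample characters occupy in UTF-8. The sample characters are 
chosen only for illustration.

```python
samples = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+20AC EURO SIGN": "\u20ac",
    "U+1D11E MUSICAL SYMBOL G CLEF": "\U0001d11e",
}

# Length of each character's UTF-8 encoding, in octets.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, octets in utf8_lengths.items():
    print(f"{name}: {octets} octet(s)")
```

ASCII characters stay at one octet each, which is why UTF-8 coexists well with 
older byte-oriented C libraries.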

UTF-2	8-bit Unicode 1.1 Transformation Format is a variable width encoding form 
that was superseded by UTF-8. UTF-2 was used in Oracle 7.3.4.

UTF-7	7-bit Unicode 2.0 Transformation Format is a variable width encoding form 
(used with older email that was not 8-bit clean and not MIME). See RFC 2152.
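
A quick way to see what "7-bit" means is to encode some non-ASCII text with 
Python's built-in utf-7 codec and check that every resulting byte is below 128. 
This is only a sketch; the sample string is arbitrary.

```python
greeting = "Gr\u00fc\u00dfe aus K\u00f6ln"   # German text with non-ASCII letters

encoded = greeting.encode("utf-7")
print(encoded)                               # only 7-bit ASCII bytes

# Every byte survives a channel that is not 8-bit clean.
assert all(byte < 0x80 for byte in encoded)

decoded = encoded.decode("utf-7")
print(decoded == greeting)
```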

Unicode compliant	Character encoding implementation that conforms to a 
particular version of the Unicode spec, for certain features. May only 
implement a subset if so documented. For example, a Unicode compliant app 
might support only certain languages (typically Western European), or even 
allow only US-ASCII!

Code value (codepoint)	Unicode value for a character that is all or part of a 
glyph. The same codepoint may represent multiple glyphs, especially in Han 
unification (Chinese, Japanese, Korean.) Accents alone may have their own 
codepoint.

Pre-composed character	A Unicode character consisting of one code value. Some 
accented characters, notably Western European, have their own codepoints.

Base character	A Unicode character that does not graphically combine with 
preceding characters, and is not a control or format character.

Combining character	A Unicode character that graphically combines with a 
preceding base character. Typically accents and diacritical marks.

Composed Character	A Unicode character made of combined codepoints, usually 
non-spacing mark (accent) characters. Often the same accented glyph may consist 
of codepoints in different orderings, for example a character with accents 
above and below the character (like Thai.)

Compatibility character	A character included in the Unicode standard for 
compatibility with a legacy encoding. Usually it looks similar enough to 
another non-compatibility character to be replaced with it when appropriate. An 
example is the set of Japanese half-width katakana code values, which were 
included for round-trip compatibility with other character set encodings that 
use smaller character cells, even though a Unicode application could achieve 
the same appearance with application-defined font rendering.
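
The relationship between a compatibility character and its ordinary 
counterpart can be inspected with Python's unicodedata module. The sketch below 
uses the half-width katakana letter KA (U+FF76) as its example; the variable 
names are arbitrary.

```python
import unicodedata

halfwidth_ka = "\uff76"   # HALFWIDTH KATAKANA LETTER KA
fullwidth_ka = "\u30ab"   # KATAKANA LETTER KA

# The character database records a compatibility mapping tagged <narrow>.
print(unicodedata.decomposition(halfwidth_ka))   # <narrow> 30AB

# Canonical normalization leaves the compatibility character alone...
canonical = unicodedata.normalize("NFC", halfwidth_ka)
# ...while compatibility normalization replaces it with the ordinary letter.
compat_folded = unicodedata.normalize("NFKC", halfwidth_ka)

print(canonical == halfwidth_ka, compat_folded == fullwidth_ka)
```

Because the replacement discards the width distinction, it is not reversible, 
so it suits searching and comparison better than storage.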

Normalization	There are four normalization forms (NFC, NFD, NFKC and NFKD) that 
can be applied to Unicode character sequences so that two sequences may be 
compared in a meaningful way. Normalization is necessary because decomposed 
characters may have accents in different orders before normalization, yet 
represent the same glyph. Normalization is especially important for computer 
language identifiers, filenames, mail folder names and digital signatures, and 
when emitting XML or JavaScript.
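
The effect on comparison can be shown with Python's unicodedata module. In this 
sketch, same_text is a hypothetical helper, and the choice of NFC as the default 
form is an assumption; any one form works as long as both sides use it.

```python
import unicodedata

precomposed = "caf\u00e9"    # e-acute stored as a single codepoint
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two strings display identically but differ codepoint by codepoint.
print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 4 5


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(precomposed, decomposed))    # True
```

For identifiers and filenames, normalizing once on input and storing the result 
avoids repeating the work on every comparison.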

Collation Order	Table and/or algorithm for sorting strings specific to a locale 
and usage (dictionary, phonebook, etc.) Unicode has not specified collation 
order at this time, but should in 3.0.

UCS	ISO/IEC 10646 Universal Multiple-Octet Coded Character Set. Both UCS and 
Unicode standards now share identical code values. The major difference between 
UCS and Unicode is that UCS is mostly concerned with defining code values, 
while Unicode adds semantics to the code values.

UCS-2	16-bit Universal Character Set (no surrogate pairs).
UCS-4	31-bit Universal Character Set.

Character Property	Unicode code values have default properties such as case, 
numeric value, directionality and mirroring, as defined in the Unicode 
Character Database.
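
Python exposes part of the Unicode Character Database through its unicodedata 
module, which makes these properties easy to inspect. The helper name describe 
and the sample characters below are chosen only for illustration.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),        # e.g. Lu = uppercase letter
        "numeric": unicodedata.numeric(ch, None),    # None when not numeric
        "bidi": unicodedata.bidirectional(ch),       # directionality class
        "mirrored": bool(unicodedata.mirrored(ch)),  # mirrored in right-to-left text
    }


# Capital A, vulgar fraction one half, Hebrew alef, left parenthesis.
for ch in ("A", "\u00bd", "\u05d0", "("):
    print(describe(ch))
```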

Combining Class	A numeric value given to each combining Unicode character that 
determines with which other combining characters it typographically interacts.
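
The following Python sketch illustrates how combining classes affect 
normalization: marks with different classes do not interact and are put into a 
fixed order, while marks with the same class keep the order in which they were 
typed. The specific marks are chosen only as examples.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT
DIAERESIS = "\u0308"   # COMBINING DIAERESIS

for mark in (DOT_BELOW, ACUTE, DIAERESIS):
    print(unicodedata.name(mark), unicodedata.combining(mark))


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# Different classes (220 and 230): both orders normalize to the same sequence.
print(nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE))   # True

# Same class (230 and 230): the marks stack, so their order is significant.
print(nfd("a" + ACUTE + DIAERESIS) == nfd("a" + DIAERESIS + ACUTE))   # False
```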

Byte Order Mark (BOM)	Unicode code value U+FEFF may optionally be prepended to 
serialized forms (files, streams) of Unicode characters. By default, files are 
assumed to be in network byte ordering (big-endian). BOM is discussed at 
greater length in the document.
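
As a minimal sketch of how a reader might use the BOM, the Python function 
below checks the start of a byte string against the BOM constants in the 
standard codecs module. The function name sniff_bom is hypothetical, and 
falling back to big-endian UTF-16 when no BOM is present is an assumption based 
on the network-byte-order default mentioned above.

```python
import codecs
from typing import Tuple

# UTF-32 marks are tested first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-16-be") -> Tuple[str, bytes]:
    """Return (encoding, payload without BOM); use `default` if no BOM is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return default, data


encoding, payload = sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
print(encoding, payload.decode(encoding))   # utf-16-le hi
```

One caveat: little-endian UTF-16 text whose first character after the BOM is 
U+0000 is indistinguishable from a UTF-32 little-endian BOM, so a sniffer like 
this remains a heuristic.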