Glyph A glyph is a particular image which represents a character or part of a
character.
Coded Character Set A mapping from a set of abstract characters to the set of
non-negative integers. This range of integers need not be contiguous.
Locale A specific language, geographic location and character set (and
sometimes script). An example is 'fr_FR.ISO-8859-1', although locale strings
are seldom standardized across platforms at present. Often only language_country
is specified, or even just language. In a client-server environment (like the
web), three locales are usually considered: the server, data, and client locales.
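The language_country.charset layout of the example string can be illustrated with a small parser. This is only a sketch: the `Locale` tuple and `parse_locale` function are hypothetical names, and real platforms use locale strings that differ from this simple pattern.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    """Hypothetical container for the parts of a locale string."""
    language: str
    country: Optional[str]
    charset: Optional[str]


def parse_locale(text: str) -> Locale:
    """Split 'language[_country][.charset]' into its parts.

    Assumes the simple layout used in the example above; missing parts
    are returned as None.
    """
    base, _, charset = text.partition(".")
    language, _, country = base.partition("_")
    return Locale(language, country or None, charset or None)


print(parse_locale("fr_FR.ISO-8859-1"))
print(parse_locale("fr_FR"))
print(parse_locale("fr"))
```

A server handling the three locales mentioned above might keep one such value for each of the server, the data and the client.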
Internationalization (i18n) The technical aspects (character sets, date
formats, sorting, number formatting, string resources) of supporting multiple
locales in one product.
Localization (L10n) The practical aspects (language, custom, fashion, color,
etc.) of expressing an application in a particular locale. Roughly, i18n is
considered an engineering process while L10n is considered a translation
process.
Globalization (g11n) The cultural aspects of supporting multiple locales in a
non-offensive and universally intuitive manner. Think Olympic or airport
signage.
Canonicalization (c14n) The process of standardizing text according to
well-defined rules.
Japanization The conversion of a product into the Japanese language and
character set. Four scripts are used in Japanese computer text:
hiragana, katakana, kanji and romaji (the Latin alphabet). Kanji
characters sometimes carry ruby, also known as furigana ("attached character"),
annotation above them to aid with irregular or difficult readings of personal or
geographic names, and for school children. Mojibake means "scrambled character"
and describes the unreadable appearance of electronic displays when the wrong
character decoding is used. Because kanji may have multiple readings
(pronunciations) depending on context, machine conversion to hiragana is
unreliable. The kinsoku rule says that the period ending a Japanese sentence may
not wrap to the beginning of a line.
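Mojibake is easy to reproduce. The sketch below encodes a short hiragana string as Shift_JIS and then decodes the bytes with the wrong character set (ISO-8859-1 is chosen here only because it accepts every byte value); the sample string and variable names are illustrative assumptions.

```python
original = "こんにちは"  # five hiragana characters

# Encode with a Japanese multi-byte encoding: two bytes per character here.
shift_jis_bytes = original.encode("shift_jis")

# Decoding with the wrong character set yields mojibake, not an error.
mojibake = shift_jis_bytes.decode("latin-1")

# Decoding with the matching character set recovers the text.
recovered = shift_jis_bytes.decode("shift_jis")

print(len(shift_jis_bytes), "bytes")
print(repr(mojibake))
print(recovered)
```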
CJKV Chinese, Japanese, Korean and Vietnamese are often considered together
because they all use multi-byte encodings.
Unicode Unified Code for characters. Current version is 3.0. Most modern
character sets have been incorporated already, and also many ancient ones. (I
have been informed that Indonesian Jawi is represented by Arabic and Extended
Arabic codepoints. I need to double-check that, since there is also an Indic
script used on Java.) Unicode is a complex character set that is unlike ASCII
in many ways, among them: a glyph may be composed from multiple
codepoints in more than one ordering; national character sets may not consist of
contiguous codepoints; symbols such as bullets, smiley faces and braille are
included; binary sorting of Unicode character sequences is likely to be
meaningless unless the sequences are normalized first.
UTF-32 32-bit Unicode 3.0 Transformation Format
UTF-16 16-bit Unicode 3.0 Transformation Format uses one 16-bit code unit for
most characters, and pairs of 16-bit surrogates for private characters and
future use (Chinese characters, ancient languages, special symbols, etc.)
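The surrogate mechanism amounts to a little arithmetic: a code value above U+FFFF is reduced by 0x10000 and its remaining 20 bits are split between a high and a low surrogate. A minimal sketch follows (the function name `to_surrogates` is a hypothetical helper, not part of any library):

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 surrogate pair for a code point
    in the range U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))

# Cross-check against Python's own big-endian UTF-16 encoder.
print(chr(0x1D11E).encode("utf-16-be").hex())
```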
UTF-8 8-bit Unicode 3.0 Transformation Format is a variable width encoding form
updated from UTF-2, often used with older C libraries and to save space with
European text. A UTF-8 sequence may be 1 to 6 octets long, although code values
in the 16-bit range need at most 3 octets and characters reached through UTF-16
surrogates need 4. It is defined in RFC 2279.
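The variable width can be made concrete by computing how many octets a given code value needs under the ranges in RFC 2279. This is a sketch; `utf8_length` is a hypothetical helper. Note that Python's built-in encoder stops at U+10FFFF, so it never produces the 5- and 6-octet forms.

```python
def utf8_length(code_point: int) -> int:
    """Number of octets RFC 2279 UTF-8 uses for a code value."""
    if code_point < 0:
        raise ValueError("code values are non-negative")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point < 0x200000:
        return 4
    if code_point < 0x4000000:
        return 5
    if code_point <= 0x7FFFFFFF:
        return 6
    raise ValueError("beyond the 31-bit UCS-4 range")


for ch in ["A", "\u00E9", "\u20AC", "\U00010000"]:
    # Compare the table against the octets Python actually emits.
    print(hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode("utf-8")))
```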
UTF-2 8-bit Unicode 1.1 Transformation Format is a variable width encoding form
that was superseded by UTF-8. UTF-2 was used in Oracle 7.3.4.
UTF-7 7-bit Unicode 2.0 Transformation Format is a variable width encoding form
(used with older email that was not 8-bit clean and not MIME). See RFC 2152.
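The point of UTF-7 is that every octet on the wire stays below 0x80. A short sketch using Python's standard "utf-7" codec shows this; the sample text is an arbitrary assumption.

```python
text = "Hi Mom \u263A!"  # contains one non-ASCII character (a smiley face)

encoded = text.encode("utf-7")

# Every octet fits in 7 bits, so the data survives a 7-bit mail transport.
seven_bit_clean = all(octet < 0x80 for octet in encoded)

decoded = encoded.decode("utf-7")

print(encoded)
print(seven_bit_clean, decoded == text)
```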
Unicode compliant Character encoding implementation that conforms to a
particular version of the Unicode spec, for certain features. May only
implement a subset if so documented. (For example, a Unicode compliant app
might only support certain languages (typically Western European), or even
allow only US-ASCII!)
Code value (codepoint) Unicode value for a character that is all or part of a
glyph. The same codepoint may represent multiple glyphs, especially in Han
unification (Chinese, Japanese, Korean.) Accents alone may have their own
codepoint.
Pre-composed character A Unicode character consisting of one code value. Some
accented characters, notably Western European, have their own codepoints.
Base character A Unicode character that does not graphically combine with
preceding characters, and is not a control or format character.
Combining character A Unicode character that graphically combines with a
preceding base character. Typically accents and diacritical marks.
Composed Character A Unicode character made of combined codepoints, usually
non-spacing mark (accent) characters. Often the same accented glyph may consist
of codepoints in different orderings, for example a character with accents
above and below the character (like Thai.)
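The difference between a pre-composed character and a base character followed by a combining character can be seen with Python's standard `unicodedata` module. This is a sketch using e with acute accent as an assumed example.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code value
combined = "e\u0301"     # base character 'e' + COMBINING ACUTE ACCENT

# Same glyph on screen, but different code value sequences.
print(len(precomposed), len(combined))
print(precomposed == combined)

# The combining character has a non-zero combining class; the base has zero.
print(unicodedata.combining("e"), unicodedata.combining("\u0301"))

# Decomposing the pre-composed form yields the combined sequence, and
# composing the combined sequence yields the pre-composed form.
print(unicodedata.normalize("NFD", precomposed) == combined)
print(unicodedata.normalize("NFC", combined) == precomposed)
```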
Combining character A character that normally appears after a base character,
and is an accent or other diacritical mark that is added to the previous base
character.
Compatibility character A character included in the Unicode standard that has
been included for compatibility with a legacy encoding. Usually it looks
similar enough to another non-compatibility character to be replaced with it
when appropriate. An example is the set of Japanese half-width katakana code
values, which were included for round-trip compatibility with other character
set encodings for use in smaller character cells, even though a Unicode
application could achieve the same appearance with application-defined font
rendering.
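Replacing a compatibility character with its ordinary counterpart can be sketched with the compatibility normalization form NFKC in Python's `unicodedata` module. The half-width kana that Unicode encodes are katakana, so the example uses U+FF76 (HALFWIDTH KATAKANA LETTER KA); the variable names are illustrative.

```python
import unicodedata

halfwidth_ka = "\uFF76"   # HALFWIDTH KATAKANA LETTER KA (compatibility)
ordinary_ka = "\u30AB"    # KATAKANA LETTER KA

# Canonical normalization leaves the compatibility character alone...
print(unicodedata.normalize("NFC", halfwidth_ka) == halfwidth_ka)

# ...while compatibility normalization replaces it.
print(unicodedata.normalize("NFKC", halfwidth_ka) == ordinary_ka)

print(unicodedata.name(halfwidth_ka))
print(unicodedata.name(ordinary_ka))
```

Because the replacement discards the width distinction, an application that needs round-trip conversion to a legacy encoding would not apply it blindly.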
Normalization One of four functions (the normalization forms NFC, NFD, NFKC
and NFKD) applied to Unicode character sequences so that two sequences may be
compared in a meaningful way. Normalization is necessary because decomposed
characters may have accents in different orders before normalization, yet
render as the same glyph. Normalization is especially important when handling
computer language identifiers, filenames, mail folder names and digital
signatures, and when emitting XML or JavaScript.
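The ordering problem can be shown with one base character carrying an accent above and a dot below, typed in two different orders. This is a sketch using Python's `unicodedata` module; the choice of marks is an assumption made for illustration.

```python
import unicodedata

acute_first = "a\u0301\u0323"  # a + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
dot_first = "a\u0323\u0301"    # a + COMBINING DOT BELOW + COMBINING ACUTE ACCENT

# Binary comparison sees two different strings.
print(acute_first == dot_first)

# After normalization the two sequences compare equal.
nfd_equal = (unicodedata.normalize("NFD", acute_first)
             == unicodedata.normalize("NFD", dot_first))
nfc_equal = (unicodedata.normalize("NFC", acute_first)
             == unicodedata.normalize("NFC", dot_first))
print(nfd_equal, nfc_equal)

# Canonical ordering puts the mark with the lower combining class first.
print([hex(ord(ch)) for ch in unicodedata.normalize("NFD", acute_first)])
```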
Collation Order Table and/or algorithm for sorting strings specific to a locale
and usage (dictionary, phonebook, etc.) Unicode has not specified collation
order at this time, but should in 3.0.
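To see why a collation table or algorithm is needed, compare a plain binary sort with a sort that uses a key. The key below is a deliberately crude stand-in (it strips combining marks and ignores case) and is not a real locale collation; `crude_sort_key` and the word list are hypothetical.

```python
import unicodedata


def crude_sort_key(word: str) -> str:
    """Rough approximation only: drop accents and ignore case.
    A real collation order is locale- and usage-specific."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "Apple", "\u00E9clair", "banana"]

binary_order = sorted(words)                      # by raw code values
approx_order = sorted(words, key=crude_sort_key)  # by the crude key

print(binary_order)   # the accented word sorts after 'zebra'
print(approx_order)   # the accented word sorts between 'banana' and 'zebra'
```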
UCS ISO/IEC 10646 Universal Multiple-Octet Coded Character Set. Both UCS and
Unicode standards now share identical code values. The major difference between
UCS and Unicode is that UCS is mostly concerned with defining code values,
while Unicode adds semantics to the code values.
UCS-2 16-bit Universal Character Set (no surrogate pairs)
UCS-4 31-bit Universal Character Set.
Character Property Unicode code values have default properties such as case,
numeric value, directionality and mirrored as defined in the Unicode Character
Database.
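Python's standard `unicodedata` module exposes several of these default properties, which makes for a quick illustration. The `describe` helper and the sample characters are assumptions; the module reflects whichever Unicode Character Database version ships with the interpreter.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few default properties of one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # e.g. 'Lu' = uppercase letter
        "numeric": unicodedata.numeric(ch, None),         # None if not numeric
        "directionality": unicodedata.bidirectional(ch),  # e.g. 'L', 'R'
        "mirrored": bool(unicodedata.mirrored(ch)),
    }


for sample in ["A", "\u00BD", "\u05D0", "("]:
    print(describe(sample))
```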
Combining Class A numeric value given to each combining Unicode character that
determines with which other combining characters it typographically interacts.
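The effect of the combining class can be sketched as follows: marks with different classes do not interact, so normalization may reorder them, while marks with the same class do interact and keep their typed order. The example uses Python's `unicodedata.combining`; the particular marks are assumptions.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT (above)
GRAVE = "\u0300"      # COMBINING GRAVE ACCENT (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

for mark in (ACUTE, GRAVE, DOT_BELOW):
    print(unicodedata.name(mark), unicodedata.combining(mark))


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


# Different classes: order is not significant, sequences normalize alike.
different_class_equal = nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)

# Same class: the marks stack, so order is significant and is preserved.
same_class_equal = nfd("a" + ACUTE + GRAVE) == nfd("a" + GRAVE + ACUTE)

print(different_class_equal, same_class_equal)
```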
Byte Order Mark (BOM) Unicode code value U+FEFF may optionally be prepended in
serialized forms (files, streams) of Unicode characters. By default, files are
assumed to be in network byte ordering (big-endian). BOM is discussed at
greater length in the document.
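A reader of serialized Unicode might check for a BOM along the following lines. This is a sketch under stated assumptions: `sniff_bom` is a hypothetical helper, the fallback of big-endian UTF-16 reflects the default described above, and the UTF-32 marks are tested first because the little-endian UTF-32 mark begins with the same two octets as the little-endian UTF-16 mark.

```python
import codecs
from typing import Tuple

# Longest marks first, so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-16-be") -> Tuple[str, bytes]:
    """Return (encoding name, data with any BOM removed).

    Without a BOM, fall back to `default` (assumed: big-endian UTF-16).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, payload = sniff_bom(little)
print(encoding, payload.decode(encoding))

encoding, payload = sniff_bom(b"\x00h\x00i")  # no BOM present
print(encoding, payload.decode(encoding))
```

A UTF-16 little-endian stream whose first character is U+0000 would be indistinguishable from a UTF-32 little-endian mark, so the result is a guess, not a guarantee.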