There is no such thing as a "UTF-*" character. There are Unicode
Characters, and Unicode Strings, and there are UTF-encoded strings (UTF
stands for Unicode Transformation Format).
All characters in Squeak use Unicode now. For example, the Cyrillic Б:
char := Character value: 16r0411.
This can be made into a String:
wideString := String with: char.
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the byte values there:
utf8String asArray collect: [:each | each hex]
which is the UTF-8 representation of the character we began with (but
if you try to print utf8String directly you get nonsense, because
Squeak does not know it is UTF-8 encoded).
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the
first place - but please try to understand and use the Unicode terms
correctly to minimize confusion.
Anyway, to convert between a String in UTF-8 and a regular Squeak
String, it's simplest to use utf8ToSqueak and squeakToUtf8.
- Bert -
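The steps above can be combined into one workspace snippet. This is only a sketch: the temporary names are illustrative, and it assumes that Character responds to charCode and leadingChar in your image. It compares code points rather than whole characters, since a decoded character may be tagged differently from the original depending on the image.

```smalltalk
"Workspace sketch: encode one character as UTF-8 and decode it again."
| char wideString utf8String decoded |
char := Character value: 16r0411.          "Cyrillic capital letter Be"
wideString := String with: char.
utf8String := wideString squeakToUtf8.     "one element per UTF-8 byte"

"UTF-8 encodes U+0411 as the two bytes D0 91."
utf8String asArray collect: [:each | each hex].

decoded := utf8String utf8ToSqueak.

"Compare code points, not whole characters: depending on the image,
 the decoded character may carry a leading char the original lacks."
decoded first charCode = char charCode.
decoded first leadingChar.
```

Evaluate the last lines one at a time with "print it" to inspect each result.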
Convert from "Squeak" to UTF-8: aString convertToEncoding: 'utf-8'
Convert from UTF-8 to "Squeak": aString convertFromEncoding: 'utf-8'
For checking out all the encodings your image supports: TextConverter
> If you look at UTF8TextConverter it will give every
> incoming character with an index higher than 255 the language of the
> image. I don't need to explain why this is problematic in the context
> of a web application, do I?
Actually, it *is* worthwhile to explain this. The problem is that
since UTF-8 doesn't have the notion of a leading char there is no way
to tag incoming data correctly. The leading char will be taken from
the running image, so an image running in the US (like our servers)
will tag input coming from Chinese browsers as Latin1. In these
situations the leading char isn't just useless, it is actively