
There is no such thing as a "UTF-*" character. There are Unicode 
Characters and Unicode Strings, and there are UTF-encoded strings (UTF 
means Unicode Transformation Format).

All characters in Squeak use Unicode now. For example, the Cyrillic Б 
is

    char := Character value: 16r0411.

this can be made into a String:

    wideString := String with: char.

which of course has the same Unicode code points:

    wideString asArray collect: [:each | each hex]

gives

     #('16r411')

The string can be encoded as UTF-8:

    utf8String := wideString squeakToUtf8.

and to see the values there

    utf8String asArray collect: [:each | each hex]

yields

     #('16rD0' '16r91')

which is the UTF-8 representation of the character we began with (but 
if you try to print utf8String directly you get nonsense, because 
Squeak does not know it is UTF-8 encoded).
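
To see where those two bytes come from, the two-byte UTF-8 form can be 
worked out by hand. This is only a sketch, valid for code points from 
16r80 to 16r7FF; the variable names are made up for illustration and 
nothing but integer bit operations is used:

    codePoint := 16r0411.
    "lead byte: marker bits 110xxxxx plus the top five bits"
    lead := 2r11000000 bitOr: (codePoint bitShift: -6).
    "trail byte: marker bits 10xxxxxx plus the low six bits"
    trail := 2r10000000 bitOr: (codePoint bitAnd: 2r00111111).
    lead = 16rD0 and: [trail = 16r91]

The last line should answer true, matching the two values shown above.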

The decoding of UTF-8 to a String is similar:

    #(16rC3 16rBC) asByteArray asString utf8ToSqueak

which returns the String 'ü' and probably is what you wanted in the 
first place - but please try to understand and use the Unicode terms 
correctly to minimize confusion.
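
The same decoding can be checked by hand. A minimal sketch, assuming a 
well-formed two-byte sequence (the variable names are illustrative 
only):

    lead := 16rC3.
    trail := 16rBC.
    "strip the 110 and 10 marker bits, then glue the payload bits together"
    codePoint := ((lead bitAnd: 2r00011111) bitShift: 6)
        bitOr: (trail bitAnd: 2r00111111).
    codePoint = 16rFC

The last line should answer true; 16rFC is the code point of ü.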

Anyway, to convert between a String in UTF-8 and a regular Squeak 
String, it's simplest to use utf8ToSqueak and squeakToUtf8.
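
As a quick round-trip check, something like the following should work 
in a workspace. It assumes that Character>>charCode is available in 
your image and answers the plain code point; selectors can differ 
between Squeak versions, so treat this as a sketch:

    original := String with: (Character value: 16r0411).
    roundTrip := original squeakToUtf8 utf8ToSqueak.
    "one character comes back, with the code point we started from"
    roundTrip size = 1 and: [roundTrip first charCode = 16r0411]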

- Bert -

-----

Convert from "Squeak" to UTF-8:

    aString convertToEncoding: 'utf-8'

Convert from UTF-8 to "Squeak":

    aString convertFromEncoding: 'utf-8'

To check all the encodings your image supports:

    TextConverter allEncodingNames

Cheers Philippe
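
A small sketch of the converter route, reusing the Cyrillic character 
from the first example. It assumes 'utf-8' is among the encoding names 
your image reports:

    original := String with: (Character value: 16r0411).
    viaConverter := original convertToEncoding: 'utf-8'.
    "compare the raw bytes against the expected UTF-8 sequence"
    viaConverter asByteArray = #(16rD0 16r91) asByteArray

If the converter behaves like squeakToUtf8, the last line answers true.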

> If you look at UTF8TextConverter it will give every
> incoming character with an index higher than 255 the language of the
> image. I don't need to explain why this is problematic in the context
> of a web application, do I?

Actually, it *is* worthwhile to explain this. The problem is that 
since UTF-8 doesn't have the notion of a leading char there is no way 
to tag incoming data correctly. The leading char will be taken from 
the running image, so an image running in the US (like our servers) 
will tag input coming from Chinese browsers as Latin1. In these 
situations the leading char isn't just useless, it is actively 
misleading.
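
A rough sketch of why the tag cannot survive a trip through UTF-8. It 
assumes a representation in which a character value carries the leading 
char in the bits above a 22-bit code point; that layout, the tag value 5 
and the variable names are assumptions for illustration, so check them 
against your own image:

    codePoint := 16r4E2D.
    "a character value tagged with a (hypothetical) leading char of 5"
    tagged := (5 bitShift: 22) bitOr: codePoint.
    "UTF-8 encodes only the code point, so the tag is dropped"
    encodable := tagged bitAnd: 16r3FFFFF.
    encodable = codePoint and: [(tagged bitShift: -22) = 5]

After decoding, only the code point is available. The tag has to come 
from somewhere else, and the converter takes it from the running image.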