Re: Mix encodings in a document?

John Cowan (cowan@locke.ccil.org)
Wed, 23 Sep 1998 12:27:14 -0400


Jerome McDonough wrote:

> Under Unicode version 2.0,
> what I should've said is:
>
> Unicode == ISO-10646-UCS-2 != UTF-16
>
> as Unicode and 10646 in UCS-2 format should be identical, but UTF-16
> differs from both of these in it allows the use of code surrogate
> pairs to enable encoding the BMP and next 16 planes of UCS-4. From
> what I can see at Unicode's home page, it now looks like Unicode is
> dropping UCS-2 character encoding and now only endorses UTF-8 and
> UTF-16, so that the situation now is:
>
> Unicode != ISO-10646-UCS-2
>
> and Unicode sometimes does/sometimes does not equal UTF-16. Is that
> more or less the case at the moment?

"Unicode 2.0" and "Unicode 2.1" always mean UTF-16. UCS-2 proper
(that is, the encoding that does not allow references to what
10646 calls Planes 01 to 10 hex, i.e. 1 to 16) has never been Unicode since the
distinction between UCS-2 and UTF-16 was invented. Before that,
there was only UCS-2 and Unicode = UCS-2.

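So Unicode = UTF-16 != UCS-2, but the distinction is usually
trivial: the two differ only in text that contains surrogate
code units (D800 to DFFF hex). UCS-2 per se does not define any
meaning for those values, whereas UTF-16 reads a high/low pair
of them as one character outside the BMP.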
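
To make the surrogate arithmetic concrete, here is a minimal
sketch in Python. The function names are mine, chosen only for
illustration; they are not taken from either standard. The first
function turns a code point into UTF-16 code units, and the second
shows what a strict UCS-2 reader could reach, namely the BMP only.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (as ints) for one code point."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if cp < 0x10000:
        return [cp]                      # BMP: same unit as in UCS-2
    if cp > 0x10FFFF:
        raise ValueError("beyond Plane 16 (hex 10)")
    v = cp - 0x10000                     # 20 bits to split in two
    high = 0xD800 | (v >> 10)            # top 10 bits
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits
    return [high, low]


def to_ucs2_units(cp):
    """Hypothetical strict UCS-2 encoder: BMP only, no surrogates."""
    if cp >= 0x10000 or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not representable in UCS-2")
    return [cp]
```

For any BMP character both functions give the same single unit,
which is why the distinction is usually trivial. They part ways
only above FFFF hex: to_utf16_units(0x10000) gives D800 DC00,
while to_ucs2_units(0x10000) has nothing to return.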

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)