Re: How best to represent unrepresentable characters in NAME tokens?

James Clark (jjc@jclark.com)
Tue, 04 Nov 1997 10:46:40 +0700


Andrew Greene wrote:
>
> If you have a Unicode-friendly XML environment, then users can create
> elements whose GIs or attribute names contain "interesting"
> characters. (Yes? A NAME token can contain "BaseChars", which includes
> characters beyond ASCII and even beyond Latin-1.)
>
> So, if a user requests that the document instance be saved as an ASCII
> file, what is the best way for a Unicode-aware and standards-compliant
> application to represent these characters?

I would use numeric character references wherever XML allows them; if
there are non-ASCII characters in places where numeric character
references aren't allowed I would use UTF-8 and give a warning to the
user. The ASCII characters will still be there as ASCII, and the
non-ASCII characters won't get lost, although they will look a bit funny
in an 8-bit editor. An interesting case is when there are non-ASCII
characters in places where numeric character references are not
recognized but do not cause an error (e.g. PIs and comments); one could
adopt an application convention that recognizes numeric character
references in these cases.
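
As a rough illustration of the first step (numeric character references
where XML allows them), here is a minimal Python sketch. The function
name to_ascii_char_refs is hypothetical, and the sketch assumes its
input is character data or an attribute value; it must not be applied
to element or attribute names, where references are not allowed.

```python
def to_ascii_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a hex numeric character reference.

    Assumption: `text` is character data or an attribute value, i.e. a
    place where XML recognizes numeric character references.
    """
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII characters are kept as they are.
            pieces.append(ch)
        else:
            # At least four hex digits, e.g. U+00DF becomes &#x00DF;
            pieces.append(f"&#x{ord(ch):04X};")
    return "".join(pieces)


print(to_ascii_char_refs("Straße"))
```

Python's built-in "xmlcharrefreplace" error handler for str.encode does
much the same job, but emits decimal references rather than hexadecimal
ones.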

> 2. Rename all the offending elements and attributes, and use PIs to
> ensure that when they're read back in we can patch things up.
> So, for example, the file could contain:
>
> <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
> <Strae1>foo bar</Strae1>
>
> Advantages: It's fully compliant.

If I were going to do this sort of thing, I think I would use a variation
on URL % encoding. I would have a convention that underscore (say)
followed by 4 hex digits represented the Unicode character with that hex
code.
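
A minimal Python sketch of such a convention might look like the
following. The names encode_name and decode_name are hypothetical. Two
details are assumptions that go beyond the description above: a literal
underscore is itself escaped (as _005F) so that decoding is unambiguous,
and characters above U+FFFF are rejected because they do not fit in 4
hex digits.

```python
import re


def encode_name(name: str) -> str:
    """Encode a name as ASCII, writing each non-ASCII character as _XXXX."""
    pieces = []
    for ch in name:
        code = ord(ch)
        if code > 0xFFFF:
            # Assumption: the 4-digit convention covers only U+0000..U+FFFF.
            raise ValueError(f"cannot encode U+{code:X} in 4 hex digits")
        if ch == "_" or code > 127:
            # Assumption: a literal underscore is escaped too, so that an
            # underscore in the output always starts an escape sequence.
            pieces.append(f"_{code:04X}")
        else:
            pieces.append(ch)
    return "".join(pieces)


def decode_name(encoded: str) -> str:
    """Reverse encode_name: turn each _XXXX back into its character."""
    return re.sub(
        r"_([0-9A-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        encoded,
    )


print(encode_name("Straße"))
print(decode_name("Stra_00DFe"))
```

Because underscore is a legal character anywhere in an XML name,
including the first position, the encoded form is still a valid name
whenever the original one was.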

James