expat and encodings

Steve Kearon (stevek@fineline-software.co.uk)
Tue, 15 Dec 1998 10:07:56 -0000


Can someone clarify the issue of character encodings for me - I think this
is an expat issue, but it may be a more general thing.

I'm trying to save/load text that might contain accented characters (>127),
running on Windows 95. I realise that when writing XML, I either have to
convert such characters to "&#xxx;" form, or declare the file's encoding
as "iso-8859-1"; otherwise the XML parser (expat) objects when
subsequently reading the file.
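
To make the first option concrete, here is a minimal sketch of the kind of
escaping I mean. The function name escape_latin1 is my own invention, not
anything from expat, and it assumes the input bytes really are ISO-8859-1
(so each byte value equals its Unicode code point). Text in the Windows
code page would need extra care for bytes 128-159.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of expat): copy Latin-1 text `src` into
   `dst`, writing each byte above 127 as a decimal character reference
   "&#NNN;". Returns the number of characters written (excluding the
   terminator), or (size_t)-1 if `dst` is too small. Markup characters
   such as '&' and '<' are NOT escaped here; real code must handle them. */
size_t escape_latin1(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;
    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;
        if (c > 127) {
            /* "&#NNN;" is at most 6 characters, plus sprintf's terminator. */
            if (dst_size - out < 7) return (size_t)-1;
            out += (size_t)sprintf(dst + out, "&#%u;", (unsigned)c);
        } else {
            /* One character, plus room for the final terminator. */
            if (dst_size - out < 2) return (size_t)-1;
            dst[out++] = (char)c;
        }
    }
    if (out >= dst_size) return (size_t)-1;  /* only when dst_size is 0 */
    dst[out] = '\0';
    return out;
}
```

The alternative is to leave the bytes alone and start the file with
<?xml version="1.0" encoding="iso-8859-1"?>.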

The snag is that whether the file has UTF-8 or ISO-8859-1 encoding, the text
the application receives from the parser always seems to be UTF-8. I've
tried specifying "iso-8859-1" as the encoding to the XML_ParserCreate()
call, but this seems to have no effect (I guess the parameter actually
overrides the default (UTF-8) file encoding, rather than specifying the
encoding the client would like to see).

The questions...
1. Is my understanding correct - does expat feed UTF-8 text to clients when
   parsing?
2. Can expat be asked to feed clients ISO-8859-1?
3. If the client must convert manually, are there any helper functions in
   expat/xmltok?
4. If I use the Unicode build of expat, does it feed UTF-8, Unicode or UTF-16?
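
In case manual conversion turns out to be the answer, this is roughly what I
would expect to write on the client side. It is only a sketch:
utf8_to_latin1 is a hypothetical name of my own, not an expat or xmltok
function. It takes a pointer and a length because the text passed to a
character data handler is not NUL-terminated. Anything outside the
ISO-8859-1 range is replaced with '?', which may or may not be acceptable.

```c
#include <stddef.h>

/* Hypothetical helper (not part of expat): convert `len` bytes of UTF-8 in
   `src` to ISO-8859-1 in `dst`. Code points above U+00FF, and malformed
   sequences, are replaced with '?'. `dst` must have room for `len` + 1
   bytes. Returns the number of Latin-1 bytes written. */
size_t utf8_to_latin1(const char *src, size_t len, char *dst)
{
    size_t i = 0, out = 0;
    while (i < len) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x80) {
            /* ASCII: copy through unchanged. */
            dst[out++] = (char)c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < len &&
                   ((unsigned char)src[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            dst[out++] = (char)(((c & 0x03) << 6) |
                                ((unsigned char)src[i + 1] & 0x3F));
            i += 2;
        } else {
            /* Not representable in Latin-1, or malformed: substitute. */
            dst[out++] = '?';
            i += 1;
            /* Skip any continuation bytes belonging to this sequence. */
            while (i < len && ((unsigned char)src[i] & 0xC0) == 0x80)
                i += 1;
        }
    }
    dst[out] = '\0';
    return out;
}
```

A character data handler could call this on each chunk it receives, though
a multi-byte sequence split across two chunks would need buffering that the
sketch does not attempt.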

Many thanks,
Steve Kearon
FineLine Software