Re: SAX: New Idea for Entity Resolution

David Megginson (ak117@freenet.carleton.ca)
Fri, 17 Apr 1998 07:43:55 -0400

Previous message: David Megginson: "SAX: Character Stream vs. Byte Stream proposal..."

James Clark writes:

[my example omitted]

> This is fine except that it should use byte streams not character
> streams. What you get if you are reading from the net or from an
> archive or a database or whatever is bytes not characters and it is
> part of the function of an XML processor to manage the conversion
> into characters using the encoding declaration and the XML-specified
> mechanisms for encoding auto-detection. You could provide both,
> but the fundamental one is for a stream of bytes. Also the
> EntityResolver needs to be able to indicate an externally specified
> encoding (as with the additional argument for parse with a
> SAXByteStream). In other words SAXEntityResolver needs to return
> an object with two members: a SAXByteStream and a (possibly null)
> String.

I hope that people will at least admire my wisdom if I admit that I am
not smart enough to figure this one out myself. I suspect that this
will be the Last Great Issue with SAX before we can finalise it, so
help will be appreciated.

Here are what seem to me to be the costs and benefits of supporting
character streams, byte streams, or both:

* Character streams only

  Pro: - the application writer has specialised knowledge about the
         information source that the parser writer lacks; as a
         result, the application writer can better optimise the
         conversion, if necessary
       - information from dialogue boxes, internal buffers, and
         (eventually, with internationalisation) databases will all be
         characters rather than bytes
       - most programming languages are moving towards characters and
         away from processing raw bytes
       - many programming languages (such as Java) already have
         standard methods for converting byte streams to character
         streams, and application writers can use these if needed or
         desired

  Con: - the application will have to convert from bytes to characters
         itself whenever its input source is available only as bytes
       - the parser may have its own, internal, efficient mechanism
         for byte-stream conversion, which would go unused
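
As a minimal sketch of the "standard methods" point above, this is roughly
what the application-side conversion could look like in Java. The class and
method names (ByteToCharSketch, toCharacterStream, readAll) are hypothetical
and not part of SAX; the sketch assumes the application already knows the
encoding of its bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical application-side helper; not part of SAX.
final class ByteToCharSketch {

    // Wrap a byte stream in a character stream, using an encoding the
    // application already knows from its own context.
    static Reader toCharacterStream(InputStream bytes, String encodingName) {
        return new InputStreamReader(bytes, Charset.forName(encodingName));
    }

    // Drain a character stream into a String (for demonstration only).
    static String readAll(Reader reader) {
        try {
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes as they might arrive from a file or the network.
        byte[] raw = "<doc>caf\u00e9</doc>".getBytes(StandardCharsets.ISO_8859_1);
        Reader chars = toCharacterStream(new ByteArrayInputStream(raw), "ISO-8859-1");
        System.out.println(readAll(chars));
    }
}
```

Under a character-only design, the parser would be handed the Reader and
would never see the raw bytes or the encoding name.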

* Byte streams only

  Pro: - supports the lowest common denominator: all platforms have
         some concept of a byte stream
       - allows parsers to use their own, efficient, internal methods
         for byte-stream conversion

  Con: - adds serious inefficiencies, since characters (say, from a
         dialogue box, an internal buffer, or a database with I18N
         support) will have to be decomposed back into bytes to be
         passed to the parser, then reassembled back into characters
         by the parser
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding
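
To make the two costs above concrete, here is a rough sketch of a holder
class pairing a byte stream with a (possibly null) encoding, together with
the round trip an application holding characters would be forced into. All
names (EncodedByteSource, fromCharacters, decode) are hypothetical, and the
UTF-8 fallback in decode() is a simplifying assumption standing in for the
parser's real encoding auto-detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder: a byte stream plus its recommended encoding.
final class EncodedByteSource {
    private final InputStream byteStream;
    private final String encoding; // may be null if nothing is known externally

    EncodedByteSource(InputStream byteStream, String encoding) {
        this.byteStream = byteStream;
        this.encoding = encoding;
    }

    InputStream getByteStream() { return byteStream; }

    String getEncoding() { return encoding; }

    // Application side: text that is already characters must be
    // decomposed into bytes before it can be handed to the parser.
    static EncodedByteSource fromCharacters(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new EncodedByteSource(new ByteArrayInputStream(bytes), "UTF-8");
    }

    // Parser side: the same bytes are reassembled into characters.
    // Assumption: fall back to UTF-8 when no encoding was supplied.
    static String decode(EncodedByteSource source) {
        Charset charset = (source.getEncoding() != null)
                ? Charset.forName(source.getEncoding())
                : StandardCharsets.UTF_8;
        try {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = source.getByteStream().read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
            return new String(collected.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The encode step in fromCharacters() and the decode step in decode() cancel
each other out, which is exactly the wasted work described in the first
"Con" item.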

* Both Byte and Character streams

  Pro: - keeps everyone happy

  Con: - requires more interfaces
       - requires another method in the Parser interface
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding (or perhaps the ByteStream interface
         will have a getEncoding() method)
       - will greatly complicate the EntityResolver mechanism (the
         application will need to be able to return a byte stream _or_
         a character stream -- how could I handle this?)
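
One possible shape for the "byte stream _or_ character stream" return value,
offered only as a sketch: a single source object that carries one kind of
stream or the other, with the parser preferring characters when they are
present. EitherStreamSource and SketchEntityResolver are hypothetical names,
not proposed SAX interfaces, and the UTF-8 fallback again stands in for real
encoding auto-detection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical source object holding either kind of stream.
final class EitherStreamSource {
    private final Reader characterStream; // null when only bytes are available
    private final InputStream byteStream; // null when characters are available
    private final String encoding;        // optional hint, used for bytes only

    private EitherStreamSource(Reader chars, InputStream bytes, String encoding) {
        this.characterStream = chars;
        this.byteStream = bytes;
        this.encoding = encoding;
    }

    static EitherStreamSource ofCharacters(Reader chars) {
        return new EitherStreamSource(chars, null, null);
    }

    static EitherStreamSource ofBytes(InputStream bytes, String encoding) {
        return new EitherStreamSource(null, bytes, encoding);
    }

    // Parser side: use characters directly if present, otherwise decode.
    // Assumption: fall back to UTF-8 when no encoding hint was given.
    Reader open() {
        if (characterStream != null) {
            return characterStream;
        }
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new InputStreamReader(byteStream, charset);
    }

    // Drain a character stream into a String (for demonstration only).
    static String readAll(Reader reader) {
        try {
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical resolver: the application returns whichever form it has.
interface SketchEntityResolver {
    EitherStreamSource resolveEntity(String systemId);
}
```

With this shape the resolver has a single return type, and the choice
between bytes and characters is confined to open() on the parser side.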

Thanks, and all the best,

David

--
David Megginson                 ak117@freenet.carleton.ca
Microstar Software Ltd.         dmeggins@microstar.com
      http://home.sprynet.com/sprynet/dmeggins/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)
