Re: Simple approaches to XML implementation

Gavin Nicol (gtn@ebt.com)
Thu, 6 Mar 1997 08:52:18 -0500

>>class XMLParser {
>>...
>>parser(XMLEventHandler handler);
>>...
>>}
>
>That's one way of doing things. The main problem I see with this interface
>is that there are quite a few possible methods (I count 71 classdefs in
>the SGML property set, though of course not all of those are applicable to
>XML), and it becomes difficult to expand the set of events.

I use about 8 event handlers for most of my APIs...

>As much as possible, a good reusable component should not force the
>user's hand when choosing what node to grab onto. As an example,
>YACC is pretty bad about this. You supply it with a lexer (with a
>fixed name) and a set of handlers to be called when productions are
>reduced. The YACC-generated parser insists on being in charge.

Sure. The important thing is that if you want to query into
a document, you have to have parsed at least as far as the nodes you
want to access, and that having a tree representation for such cases
makes it a *lot* easier. For cases where you "want to be in control",
I would have the event handler be a grove constructor, and have the
application work upon the grove. Note that accessing a grove, or
querying a document, is *different* from *parsing* a document.
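
To make the "event handler as grove constructor" idea concrete, here is a
rough sketch. The interface, its three methods, and the Node and
GroveBuilder classes are all names assumed for illustration; they are not
the actual events or classes of any existing API, and a real grove carries
far more than element names and text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical event interface: a deliberately tiny set of events,
// invented for this sketch.
interface XMLEventHandler {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

// A minimal tree node standing in for a grove node (assumption).
class Node {
    final String name;
    final StringBuilder text = new StringBuilder();
    final List<Node> children = new ArrayList<>();

    Node(String name) { this.name = name; }
}

// An event handler whose only job is to build the tree. The parser
// drives it; the application later works on root() and never sees
// the parse events themselves.
class GroveBuilder implements XMLEventHandler {
    private final Deque<Node> open = new ArrayDeque<>();
    private Node root;

    public void startElement(String name) {
        Node n = new Node(name);
        if (open.isEmpty()) {
            root = n;                       // first element is the root
        } else {
            open.peek().children.add(n);    // attach to innermost open element
        }
        open.push(n);
    }

    public void characters(String text) {
        if (!open.isEmpty()) {
            open.peek().text.append(text);
        }
    }

    public void endElement(String name) {
        open.pop();                         // close the innermost element
    }

    Node root() { return root; }
}
```

The point of the split is that the parser only knows about
XMLEventHandler; whether the handler builds a tree, counts elements, or
throws everything away is the application's choice.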

>1. An external entity manager, responsible for obtaining document
> instances (the "start" document and others), DTD's, etc. from
> local storage, the web, some database, etc. This should probably
> be user-customizable.

I'm not sure about this. In some ways, I cannot see the reason for
*exposing* an entity manager, but then again, I cannot imagine an
implementation without one either....

>2. An encoding manager, responsible for mapping one of the possible
> XML document encodings (Latin-n, UTF-7, UTF-8, UCS-2, UTF-16, whatever)
> onto ISO10646 characters.

Streams...
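
As a sketch of what I mean by that: if the decoding lives in a stream
layer, the parser only ever sees characters. The example below uses
Java's standard InputStreamReader for the decoding; the DecodingDemo
class and its open/readAll helpers are made up for illustration, and how
the encoding gets chosen in the first place is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class DecodingDemo {
    // Wrap raw bytes in a decoding stream. Everything downstream
    // (the parser, in this discussion) reads characters from the Reader
    // and never needs to know which encoding was on the wire.
    static Reader open(byte[] bytes, Charset encoding) {
        return new InputStreamReader(new ByteArrayInputStream(bytes), encoding);
    }

    // Stand-in for the parser's input loop: drain the Reader into a String.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The same text encoded as UTF-8 or as Latin-1 comes out of readAll as the
same characters, provided the matching Charset is passed to open.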

>3. The parser itself, responsible for turning characters into XML events,
> and possibly into grove structures.

Push grove building off to later stages.

>[Browser] gives the most complicated parser, since it has to asynchronously
>handle information from several different documents.
>
>[YACC] is the easiest to write, but it's less flexible. Given [Browser],
>it's easy to write [YACC]. (Given [XMLEventStream] you can also derive
>[YACC], but with greater overhead.)
>
>[XMLEventStream] and [Grove] give you the most flexibility with respect to
>the grove plan.

I think these conflate many different processing layers.

>languages, but the only firm conclusion I've come to is that I really wish
>I could use coroutines.

Amen to that sentiment.