Re: YAXPAPI (Yet Another XML Parser API)- an XDEV proposal

David Megginson (ak117@freenet.carleton.ca)
Sat, 13 Dec 1997 21:01:04 -0500


Tim Bray writes:

> > attribute(XmlParser, String, String, boolean)
>
> It seems completely wrong to have an attribute event separate from
> start-element events.

I have worried about this myself. My design goal with Ælfred has been
to limit myself to two class files: one for the parser itself, and one
for the interface for the callbacks -- hence the separate event for
attributes. This decision has forced some pretty severely hacked-up
internal code accompanied by very careful documentation.

I could send a hashtable of attribute names and values with the
startElement() callback, and let users look up types (etc.) with my
query methods, but I would have to lose a bit on two counts:

1) Allocating a new hashtable for every start tag will slow down the
parser a fair bit.

2) I'd have no way to show which attributes were specified and which
were defaulted (see below).

> What's the boolean? I don't think the application author should
> have to deal with anything but the name and value of attributes.

The boolean tells whether the attribute was specified or defaulted. I
include this to allow people to do useful XML-to-XML transformations.
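
A minimal sketch of why the flag matters for XML-to-XML work: a handler
that echoes start tags and writes out only the attributes that were
actually specified in the source. The interface here is hypothetical and
simplified (no XmlParser argument, the two Strings taken to be name and
value), and it assumes that attribute events arrive before the
startElement event they belong to. Escaping of attribute values is left
out for brevity.

```java
// Hypothetical callback interface; names follow the thread, signatures are simplified.
interface AttributeEvents {
    void attribute(String name, String value, boolean specified);
    void startElement(String name);
}

// Echoes start tags, keeping only attributes that were specified in the source.
class SpecifiedOnlyWriter implements AttributeEvents {
    private final StringBuilder pending = new StringBuilder(); // attributes buffered so far
    private final StringBuilder out = new StringBuilder();     // serialized output

    public void attribute(String name, String value, boolean specified) {
        if (specified) {
            // Assumes the value needs no escaping (no quotes, '<' or '&').
            pending.append(' ').append(name).append("=\"").append(value).append('"');
        }
        // Defaulted attributes are dropped; the DTD will supply them again.
    }

    public void startElement(String name) {
        out.append('<').append(name).append(pending).append('>');
        pending.setLength(0); // reset for the next element
    }

    String result() {
        return out.toString();
    }
}

class SpecifiedOnlyDemo {
    public static void main(String[] args) {
        SpecifiedOnlyWriter w = new SpecifiedOnlyWriter();
        w.attribute("id", "a1", true);         // present in the document
        w.attribute("status", "draft", false); // supplied by a DTD default
        w.startElement("doc");
        System.out.println(w.result());
    }
}
```

With a plain hashtable of names and values, the handler could not tell
the two attributes apart and would have to write both.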

> > data(XmlParser, String)
>
> I feel that the 2nd argument should not be a String. It is a recipe
> for disastrous inefficiency if the processor has to cook up a
> java.lang.String object for every little chunk of text.

The overhead isn't that bad with Ælfred because I coalesce my data
into the largest chunks possible before allocating the String. I
think that returning a char[] array would be confusing for users, and
would lead to many bugs in their code as they ignored our warnings not
to rely on the value in the char[] array outlasting the callback.
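
The kind of bug meant here can be sketched as follows. This is a made-up
illustration, not code from Lark or Ælfred: ToyParser and CharHandler are
hypothetical names, and the only assumption is that the parser reuses one
internal buffer from one callback to the next.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical char[]-based callback: a buffer plus a character count.
interface CharHandler {
    void charData(char[] buf, int length);
}

// Toy "parser" that reuses a single buffer for every chunk of text.
class ToyParser {
    private final char[] buffer = new char[16];

    // Each chunk must fit in the 16-character buffer.
    void parse(String[] chunks, CharHandler handler) {
        for (String chunk : chunks) {
            chunk.getChars(0, chunk.length(), buffer, 0); // overwrite the shared buffer
            handler.charData(buffer, chunk.length());
        }
    }
}

class BufferReuseDemo {
    public static void main(String[] args) {
        final List<char[]> kept = new ArrayList<>();   // BUG: holds on to the parser's array
        final List<String> copied = new ArrayList<>(); // safe: copies inside the callback

        new ToyParser().parse(new String[] {"first", "later"}, (buf, len) -> {
            kept.add(buf);
            copied.add(new String(buf, 0, len));
        });

        // The kept array now holds the second chunk, not the first.
        System.out.println(new String(kept.get(0), 0, 5));
        System.out.println(copied.get(0));
    }
}
```

The copy made inside the callback survives; the retained array silently
changes under the application once the next chunk arrives.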

> Lark uses two
> arguments, a char[] array and a character count; the app can
> make a String if it needs to. If you find this awkward, create
> a new data type called Text so that if you need a String you
> can make it with lazy-evaluation in Text.toString(), but if you
> don't need it you don't build it.

Again, I'm reluctant to create new classes beyond XmlParser and
XmlProcessor.

> Also, it shouldn't be named "data" - it should be named
> characterData or charData or text or some such term that can
> be mapped directly to the spec.

Agreed. I will not change Ælfred now, but I think that this is a good
idea.

> > resolveEntity(XmlParser, String, String, URL)
>
> I don't think entities have any place in the first cut of this
> interface. The processor exists to make these problems go away.

Normally, you should just return the URL argument; however, this
callback gives users a chance to do public-identifier resolution, URL
substitution, etc., and to return a different URL if desired. For
example, if we had a DTD at

http://www.microstar.com/XML/msldoc.dtd

and you had a local copy, you could substitute a local URL on your own
computer. Likewise, you could do a catalogue lookup on the public
identifier "-//microstar//DTD Microstar Sample Document//EN" and
choose a different system identifier than the default supplied in the
document.

That said, I agree that this probably doesn't belong in the common
event API.
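
For concreteness, a rough sketch of such a substitution. LocalCatalog is a
hypothetical helper, not part of any parser; its resolveEntity takes only
a public identifier and the default URL, which is a simplification of the
callback signature quoted above. The local file path in the demo is
likewise made up.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver: maps public identifiers and remote URLs to local copies.
class LocalCatalog {
    private final Map<String, URL> byPublicId = new HashMap<>();
    // Keyed by the URL's string form to avoid URL.equals(), which may do DNS lookups.
    private final Map<String, URL> bySystemId = new HashMap<>();

    void mapPublicId(String publicId, URL local) {
        byPublicId.put(publicId, local);
    }

    void mapSystemId(URL remote, URL local) {
        bySystemId.put(remote.toExternalForm(), local);
    }

    // Returns a substitute URL if one is registered, otherwise the URL passed in.
    URL resolveEntity(String publicId, URL systemId) {
        if (publicId != null && byPublicId.containsKey(publicId)) {
            return byPublicId.get(publicId);
        }
        URL local = bySystemId.get(systemId.toExternalForm());
        return local != null ? local : systemId;
    }

    // Convenience wrapper that turns the checked exception into an unchecked one.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}

class LocalCatalogDemo {
    public static void main(String[] args) {
        URL remote = LocalCatalog.url("http://www.microstar.com/XML/msldoc.dtd");
        URL local = LocalCatalog.url("file:///usr/local/lib/dtd/msldoc.dtd"); // made-up path

        LocalCatalog catalog = new LocalCatalog();
        catalog.mapSystemId(remote, local);
        catalog.mapPublicId("-//microstar//DTD Microstar Sample Document//EN", local);

        System.out.println(catalog.resolveEntity(null, remote));
    }
}
```

An application with nothing registered gets the original URL back, which
matches the "just return the URL argument" default.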

> Generalities:
> Lark has a thing where if any callback returns 'true', the
> parser drops out of its loop... which is awfully useful and easy
> I think. Lark will also re-enter, but this need not be a requirement.

Awfully easy with a DFA-driven parser, but trickier with a
recursive-descent parser like Ælfred. I'd probably have to throw an
exception, and could not allow any kind of re-entry.
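
A toy sketch of the exception route, under stated assumptions: it is not
Ælfred code, all names (StopHandler, ToyDescentParser, StopParsing) are
hypothetical, and the input is a tiny made-up grammar where a node is a
single letter followed by its children in parentheses. The callback
returns true to stop, in the style described for Lark.

```java
// Callback in the Lark style: return true to ask the parser to stop.
interface StopHandler {
    boolean startElement(String name);
}

class ToyDescentParser {
    // Private unchecked exception used only to unwind the recursion.
    private static final class StopParsing extends RuntimeException {
        StopParsing() {
            super(null, null, false, false); // no message, no stack trace needed
        }
    }

    private final String input;
    private final StopHandler handler;
    private int pos = 0;

    ToyDescentParser(String input, StopHandler handler) {
        this.input = input;
        this.handler = handler;
    }

    // Returns true if the whole input was parsed, false if the handler stopped it.
    boolean parse() {
        try {
            while (pos < input.length()) {
                node();
            }
            return true;
        } catch (StopParsing stop) {
            // The call stack that held the parse state is gone: no re-entry.
            return false;
        }
    }

    // node := letter '(' node* ')'
    private void node() {
        String name = String.valueOf(input.charAt(pos++));
        if (handler.startElement(name)) {
            throw new StopParsing();
        }
        expect('(');
        while (pos < input.length() && input.charAt(pos) != ')') {
            node();
        }
        expect(')');
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at offset " + pos);
        }
        pos++;
    }
}

class StopDemo {
    public static void main(String[] args) {
        StringBuilder seen = new StringBuilder();
        boolean finished = new ToyDescentParser("a(b(c())d())", name -> {
            seen.append(name);
            return name.equals("c"); // stop as soon as "c" starts
        }).parse();
        System.out.println(seen + " finished=" + finished);
    }
}
```

Because the parser's position in the grammar lives on the Java call
stack, unwinding it with an exception discards exactly the state that a
re-entrant parser would need to keep.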

> Also, for application programmers, especially dealing with smallish
> objects, a tree interface is very natural. I've written both
> event-stream and tree apps using Lark, and the trees are a lot
> easier to use for anything even moderately complex. So the API
> should have Element, Attribute, and Text classes.

Perhaps -- I may have to give in and allow Ælfred to use more than one
class file; or alternatively, these would be an optional extra, along
with the SAX-J layer.

> And it shouldn't (sorry Peter) be called YAXPAPI - how about SAX, Simple
> API for XML? Maybe SAX-J for the Java bindings. -Tim

How about RUSTY?

All the best,

David

-- 
David Megginson ak117@freenet.carleton.ca
Microstar Software Ltd. dmeggins@microstar.com
http://home.sprynet.com/sprynet/dmeggins/