Re: XML vs the Dreaded Whitespace

Peter Murray-Rust (peter@ursus.demon.co.uk)
Thu, 11 Dec 1997 14:37:39


At 06:41 11/12/97 -0500, David Megginson wrote:
>Peter Murray-Rust writes:
>
> > As a corollary: Is anyone testing the ESIS output of the current crop of
> > XML parsers (4 Java + nsgmls, I think)? Regardless of the whitespace model
> > or the value of xml:space they should all produce identical ESIS (right?)
> > If not, then one or more is wrong. And all applications should (IMO) be
> > prepared to work with ESIS which I think is isomorphous with a WF XML
> > document.
>
>There are quite a few more XML parsers out there, including at least
>one in TCL -- see
>
> http://www.sil.org/sgml/XML.html#xmlSoftware

Apologies to anyone I missed. I am a great fan of tcl and wrote costwish in
it to sit on top of Joe English's CoST...

>
>As for ESIS, there are some problems that we'd have to overcome first:

Are there? How does a WF document differ from the corresponding ESIS
stream? IOW if I do the transformation:
WF -> ESIS -> WF shouldn't I be able to recover the original?

>
>1) How should empty elements be represented? Right now, Ælfred generates a
> startElement event immediately followed by an endElement event.

Yes - and JUMBO is happy with that. As far as JUMBO is concerned
<FOO></FOO> and <FOO/> are processed in the same way and I will need a very
clear argument to convince me that it should do otherwise.
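
To illustrate the point, here is a minimal sketch in Python using the
standard xml.sax module. The EventRecorder class, the element_events
helper and the ESIS-like "(NAME" / ")NAME" notation are only illustrative
names chosen for this sketch; they are not taken from JUMBO or Ælfred.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Record element events as ESIS-like strings (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # "(" marks a start event, as in nsgmls-style ESIS output
        self.events.append("(" + name)

    def endElement(self, name):
        # ")" marks an end event
        self.events.append(")" + name)


def element_events(document):
    """Parse a document string and return its list of element events."""
    recorder = EventRecorder()
    xml.sax.parseString(document.encode("utf-8"), recorder)
    return recorder.events


# Both spellings of an empty element give the same event sequence.
print(element_events("<FOO></FOO>"))
print(element_events("<FOO/>"))
```

With an event-based parser of this kind, both forms arrive as a start
event immediately followed by an end event, so the application cannot tell
them apart unless the parser reports something extra.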

>
>2) How should the XML declaration be represented? Should it appear as
> a processing instruction, or should it be ignored?

JUMBO regards it as a PI. I hang all PIs off the preceding ELEMENT (not
PCDATA). In that way the tree can be processed with these intact. JUMBO
understands namespace PIs, <?JUMBO ...?> PIs and will also store the
others. It's useful to store them in case one wants to compare trees. BTW,
although it is nowhere stated, most people seem to create PIs as name-value
pairs and JUMBO expects this.
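
A rough sketch of what reading a PI as name-value pairs might look like
(Python). This is an assumption about the convention, not JUMBO's actual
code: the parse_pi_pairs function, the pattern and the example PI content
are all hypothetical.

```python
import re

# name = "value" or name = 'value'; the name pattern is a simplification
_PAIR = re.compile(r"""([A-Za-z_:][\w.:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pi_pairs(data):
    """Return the name-value pairs found in the data part of a PI.

    Text that does not look like name="value" is silently skipped, so a
    PI with free-form content yields an empty dict.
    """
    pairs = {}
    for match in _PAIR.finditer(data):
        name, double_quoted, single_quoted = match.groups()
        # Exactly one of the two quoted groups matched; take that one
        pairs[name] = double_quoted if double_quoted is not None else single_quoted
    return pairs


# Hypothetical content of a <?JUMBO ...?> PI
print(parse_pi_pairs('display="tree" depth=\'2\''))
```

A PI whose content is not in this form would simply be stored as an opaque
string, which is enough for comparing trees.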

>
>3) How should space in element content be handled? According to the
> spec, a DTD-aware parser should handle whitespace in element
> content differently from whitespace in mixed content (Ælfred just
> ignores whitespace in element content right now).

This is a critical area for the parser writers to agree on. I assume that
for the DTD-aware stuff there has to be a validating parser (i.e. one that
matches contentspec against element content). I am not sure what algorithms
are being used - JUMBO wants a Java one for its birthday, please - but I
can imagine that with certain contentspecs they might get different answers.

>
>4) DTD-aware and non-DTD-aware parsers will handle whitespace in
> attribute values differently. Non-DTD-aware parsers will treat all
> attributes as CDATA, but DTD-aware parsers will treat tokenised
> attributes specially, by stripping all leading and trailing
> whitespace, and normalising internal whitespace to single spaces.

In this case presumably only the TYPE in the ATTLIST is needed.
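
As a sketch of why the declared type alone would be enough (Python): the
normalize_attribute function is hypothetical, and it makes the simplifying
assumptions that CDATA values are passed through untouched and that any
other declared type counts as tokenised.

```python
import re


def normalize_attribute(value, declared_type="CDATA"):
    """Normalise an attribute value according to its declared type.

    A non-DTD-aware parser has no ATTLIST to consult, so it would always
    call this with the default "CDATA".
    """
    if declared_type == "CDATA":
        # Simplification for this sketch: leave CDATA values as they are
        return value
    # Tokenised types: collapse runs of whitespace to a single space,
    # then drop leading and trailing spaces
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(" ")


print(repr(normalize_attribute("  a   b\n c ", "NMTOKENS")))
print(repr(normalize_attribute("  a   b\n c ")))
```

The only input from the DTD is declared_type, so the two kinds of parser
differ only for attributes declared with a non-CDATA type.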

P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg