Re: SAX and whitespace (was Re: Problems with whitespace and

Peter Murray-Rust (peter@ursus.demon.co.uk)
Thu, 01 Jan 1998 18:07:39


[I think this discussion is another good reason why SAX is urgently needed]

At 09:57 01/01/98 -0500, David Megginson wrote:
> > > An XML processor must always pass all characters in a document
> > > that are not markup through to the application. A validating
> > > XML processor must distinguish white space in element content
> > > from other non-markup
>
>What the PR means to say here is that a DTD-driven XML parser has to
>treat whitespace in element content differently than whitespace in
>mixed content -- this, of course, has nothing to do with xml:space.
>If there is no DTD, then all element types are assumed to allow mixed
>content, so a DTD-driven XML parser ("validating XML processor") would
>report all whitespace as significant.

I would agree with this interpretation and prefer the phrase "DTD-driven
XML parser (?processor?)". I interpret this to mean:
"a processor which uses any DTD information given in the document, and
which uses it to do as much validation as it and the document are capable
of."

However, having read the spec more carefully, I am having great difficulty
in deciding *where* it allows whitespace in element content. Take the
document:
<!ELEMENT FOO (BAR)>
<!ELEMENT BAR EMPTY>
...
<FOO>
<BAR>
</BAR>
</FOO>

My reading of the spec suggests that this is an *invalid* document. Please
show me where I have gone wrong...
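
To make the question concrete, here is a rough sketch of where the
whitespace sits in that document, as seen by a parser. It assumes Python's
standard xml.sax module (expat underneath), which reads the internal DTD
subset but does not validate, so it says nothing about validity. The
DOCTYPE wrapper and the names DOC and TraceHandler are mine, added only so
the fragment parses as a complete document.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The example document, wrapped in a DOCTYPE so the declarations are legal XML.
DOC = """<!DOCTYPE FOO [
<!ELEMENT FOO (BAR)>
<!ELEMENT BAR EMPTY>
]>
<FOO>
<BAR>
</BAR>
</FOO>"""


class TraceHandler(ContentHandler):
    """Records element boundaries and every run of character data."""

    def __init__(self):
        super().__init__()
        self.trace = []

    def startElement(self, name, attrs):
        self.trace.append(("start", name))

    def endElement(self, name):
        self.trace.append(("end", name))

    def characters(self, content):
        # The parser may deliver one run in several chunks; merge them.
        if self.trace and self.trace[-1][0] == "text":
            self.trace[-1] = ("text", self.trace[-1][1] + content)
        else:
            self.trace.append(("text", content))


handler = TraceHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
for event in handler.trace:
    print(event)
```

The trace shows a newline inside FOO on either side of BAR, and another
newline inside BAR itself, which is exactly the character data whose status
is in question.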

FOO has declared element content [3.2.1]. "... elements of that type must
contain only child elements ***(no character data)*** [my asterisks]..."

for BAR:
[3.2] An element is valid if there is a declaration matching elementdecl
where the Name matches the element type and ...
1. the declaration matches EMPTY and the element has ***no content***

the context of content is [39]
STag content ETag <!-- no S? --->
and its definition is: [43]
(element | CharData | Reference | CDSect | PI | Comment)*

Again there is no place for whitespace.

Therefore I cannot see where (apart from [2.10] which raises the whitespace
question) whitespace can be defined as 'non-significant'. IOW whitespace
***in the content of an element*** is only formally allowed as CharData in
mixed content, and in mixed content it must be significant.

I am *sure* I've missed something here as the WG has debated this for ages,
but I can't see where.
>
>What should SAX do with ignorable whitespace?

Assuming that ignorable WS is found only in element content...

>
>1) Report it as a distinct event, like Ælfred does?
>2) Treat it as regular character data?
>3) Ignore it (as in regular SGML)?
>
>(1) seems to be what the PR requires. Either (2) or (3) could cause
>strange results.

(3) is forbidden - it has to be passed through. I think it has to be (2)
and (1) simultaneously. IOW in an event mode you must report that whitespace
(space, 3 tabs, one newline, 10 spaces) occurs "now"; in tree mode you
report "I have made you an element/node consisting of PCDATA, all
whitespace - it's up to you to keep/destroy it..."
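
A minimal sketch of what "(1) and (2) simultaneously" might look like in
event mode, assuming a handler interface like the one Python's xml.sax
later adopted (characters and ignorableWhitespace callbacks). The class
WhitespaceAwareHandler and the helper text_for_application are hypothetical
names of mine. The handler is driven by hand here, because a non-validating
parser such as expat never issues the separate whitespace callback itself.

```python
from xml.sax.handler import ContentHandler


class WhitespaceAwareHandler(ContentHandler):
    """Keeps every character, but tags how the parser reported it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def characters(self, content):
        # Option (2): ordinary character data.
        self.events.append(("characters", content))

    def ignorableWhitespace(self, whitespace):
        # Option (1): a distinct event, still passed through in full.
        self.events.append(("ignorable", whitespace))


def text_for_application(events, keep_ignorable):
    """The application, not the parser, decides whether to keep it."""
    return "".join(
        data for kind, data in events
        if keep_ignorable or kind != "ignorable"
    )


# Simulate a parser reporting the run described above, then some text.
run = " " + "\t" * 3 + "\n" + " " * 10
handler = WhitespaceAwareHandler()
handler.ignorableWhitespace(run)
handler.characters("text")

print(repr(text_for_application(handler.events, keep_ignorable=True)))
print(repr(text_for_application(handler.events, keep_ignorable=False)))
```

Nothing is lost at the parser level, so option (3) is left as a choice for
the application rather than something the processor does on its behalf.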

P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg