XML syntax (was Re: external subset syntax)

james anderson (mecom-gmbh@mixx.de)
Tue, 16 Dec 1997 20:31:30 +0100


greetings,

perhaps it's time for a new role to complement the mcsgs, namely the npw - or
niggling parser writer - not rebelling, just niggling.
i admit to that fault.

my problem is, whenever i come to a point in the proposed recommendation at
which a parser is required to report an error and "must not continue normal
processing" even though the result which the stream would denote would be
sufficiently unambiguous if allowed, then i feel compelled to ask, "why does one
have to exclude this"?
which does not mean "in which production does the standard exclude or prescribe
it", but rather why does the standard exclude or prescribe it. what is the
useful purpose? particularly when excluding it makes the parser more complex and
the document encoding more exacting.

more than likely, when i've followed discussions of similar questions, the
design goal #3 gets hoisted like a commandment: "XML shall be compatible with
SGML". as an npw i tend to adhere more to #s 1, 4, 6, and 9: it should be easy to
generate, easy to program, and easy to read. SGML processors are already pretty
complex, so an argument to increase the complexity of XML strictly in order to
keep SGML processors simpler is difficult to accept on logical terms. (i know
i'm being naive here, and i'm ignoring the past, but i would wager that the
future is going to bear me out...)

the simplest thing would have been a document form which distinguished inline
definitions, external references (ie XLL built-in), content, and (maybe) a
declaration (autorecognition of encoding being the criterion for the latter). it
is true that all of that is there, but the standard requires at least twice as
many syntactic forms as are necessary. so despite having read mr murray-rust's
note on background to the list itself (re: XML-DEV (was Re: YAXPAPI)) which
gave me some sense of the effort which has gone into the proposed
recommendation, the distance between the simple form of the denoted data and the
complexity of the syntactic form often leads me to ask "why?"

one such example concerns the external subset, xml declaration, doctype
declaration, and text declaration. in particular, the productions
[24] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[29] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' (markupdecl |
PEReference | S)* ']' S?)? '>'
[78] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
[80] ExtPE ::= TextDecl? extSubset

i observe that, while one can well label the XMLDecl and TextDecl productions
differently, lexically speaking they are not disjoint, and practically speaking
there is no difference between their situation and that concerning the presence
of a doctype form at a location analogous to that of the textdecl. yet one is
"standard" and the other is "nonsense". not to a niggling parser writer. from
the stream content, the permitted case (almost) appears (by analogy to the
remarks below) as one xml document within another. the other thing which is
disconcerting is that the standard goes to great lengths to, on one hand,
specify that the presence of an xml document may be introduced by a form with
the (not)PI keyword 'xml' (all lower case only) but on the other hand engenders
lexical ambiguity where it does not introduce a distinct keyword for the
distinctly different purpose and context of specifying the encoding of the
external dtd subset. why?
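
to make the overlap between [24] and [78] concrete, here is a minimal python
sketch. the regular expressions are simplified stand-ins for VersionInfo,
EncodingDecl and SDDecl (an assumption for illustration, not the full grammar),
and the classify helper is a hypothetical name. it only shows that a single
string can satisfy both productions, so the two can be told apart by position
in the stream but not by lexical form.

```python
import re

# simplified stand-ins for the sub-productions (assumptions, not the full grammar)
S = r"[ \t\r\n]+"
EQ = r"[ \t\r\n]*=[ \t\r\n]*"
VERSION_INFO = S + r"version" + EQ + r"""(?:"[A-Za-z0-9_.:-]+"|'[A-Za-z0-9_.:-]+')"""
ENCODING_DECL = S + r"encoding" + EQ + r"""(?:"[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')"""
SD_DECL = S + r"standalone" + EQ + r"""(?:"(?:yes|no)"|'(?:yes|no)')"""

# [24] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r"<\?xml" + VERSION_INFO + f"(?:{ENCODING_DECL})?(?:{SD_DECL})?(?:{S})?" + r"\?>"
)
# [78] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
TEXT_DECL = re.compile(
    r"<\?xml" + f"(?:{VERSION_INFO})?" + ENCODING_DECL + f"(?:{S})?" + r"\?>"
)


def classify(decl):
    """Return the names of the productions that the whole string matches."""
    names = set()
    if XML_DECL.fullmatch(decl):
        names.add("XMLDecl")
    if TEXT_DECL.fullmatch(decl):
        names.add("TextDecl")
    return names


# a declaration carrying both version and encoding satisfies both productions
both = classify('<?xml version="1.0" encoding="UTF-8"?>')
```

under these assumptions, a form with version and encoding is classified as both
an XMLDecl and a TextDecl, while a version-only form matches only [24] and an
encoding-only form matches only [78].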

Per-Ake Ling wrote:

> > From jjc@jclark.com Mon Dec 15 11:59:21 1997

...

> > It is a requirement that the external subset *not* begin with a document
> > type declaration.
> >
> If it were permitted, it would mean that there is a doctype declaration
> within a doctype declaration, which is clearly nonsense. It is a common
> misunderstanding that DTD means "document type declaration" instead of
> "document type definition".
>
> Per-Åke
> --

(as an aside, i didn't - and still don't - see that as, in itself, a sufficient
explanation, since the case would comprise two instances of a "document type
declaration": one in the xml document and the other in the prolog of the
external portion of the "document type definition", which was referred to from
the first, but is not contained in the first, and which serves to constrain the
root element <em>if</em> so desired.)

another example is the MDC (']]>') exclusion in CharData which means that one
needs a state machine to scan character data. why?
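
a minimal python sketch of what that state machine amounts to, assuming
CharData is any run of characters up to the next '<' or '&' that must not
contain ']]>'. the function name scan_char_data and its error handling are
hypothetical, chosen only for illustration.

```python
def scan_char_data(text, start=0):
    """Scan character data from text[start:] and return the index where it stops.

    Scanning stops before '<' or '&', or at end of input. A ValueError is
    raised if the run contains the excluded sequence ']]>'.
    """
    brackets = 0  # state: consecutive ']' characters just seen, capped at 2
    i = start
    while i < len(text):
        ch = text[i]
        if ch in "<&":
            break
        if ch == "]":
            brackets = min(brackets + 1, 2)
        elif ch == ">" and brackets == 2:
            raise ValueError(f"']]>' not allowed in character data (offset {i - 2})")
        else:
            brackets = 0
        i += 1
    return i
```

without the exclusion, the scan would reduce to finding the next '<' or '&',
with no state carried from one character to the next.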

another example is that of [24], in itself, where the npw believes his point (in
a previous posting) was misunderstood, and can only repeat the question
<em>why</em> is a PI-close specified to be '?>' and not '>', which would be
easier, or ('?>' | '>'), which would be more robust. he also observes (wrt 'XML'
itself) that the standard, cf #6 with irony, engenders an encoding where of the
four obvious humanly legible encodings (that is, neglecting 'xMl' et al.:
('<?XML' | '<?xml') ... ('?>' | '>')) only one is legitimized. why?
if the precision of an encoding depends so much on uniqueness, then why does one
start out with such a level of lexical complexity in the first place, only to
then exclude much of it as 'malformed'? all you need is <, >, ', & and / (if
you allow element recursion) - and even the distinction between < and > is more
for the eye than anything else.
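
the four legible spellings can be enumerated in a few lines of python. the two
checks below look only at the opening keyword and the closing delimiter;
is_strict mirrors what [24] permits, and is_lenient is a hypothetical relaxed
check of the kind asked about above, not anything the standard defines.

```python
from itertools import product

OPENERS = ("<?XML", "<?xml")
CLOSERS = ("?>", ">")


def is_strict(decl):
    """True if decl uses the one opener/closer pair that production [24] permits."""
    return decl.startswith("<?xml") and decl.endswith("?>")


def is_lenient(decl):
    """Hypothetical relaxed check: either case of the keyword, either closer."""
    return decl.startswith(OPENERS) and decl.endswith(CLOSERS)


# the four spellings: ('<?XML' | '<?xml') ... ('?>' | '>')
candidates = [f'{o} version="1.0"{c}' for o, c in product(OPENERS, CLOSERS)]
accepted = [d for d in candidates if is_strict(d)]
```

of the four candidates, the strict check keeps one and the lenient check keeps
all four.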

Ingo Macherius wrote:

> ...
> > how about
> > <?XML version="1.0" ?>
>
> This is wrong, too. "xml" must be lower-case.
>
> > i've yet to understand why, but isn't that the way it needs to be?
>
> Why ? Productions [24] and [25] in section 2.8 !
>
> [24] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
> [25] VersionInfo ::= S 'version' Eq
> ('"VersionNum"'| "'VersionNum'")
>
> So the minimal correct PI is: <?xml version="1.0"?>
>
> ++im
> --