RE: Weak DTDs

akirkpatrick@ims-global.com
Fri, 17 Oct 1997 11:04:25 +0000

Previous message: Rick Jelliffe: "Re: Weak DTDs"

The strength of the DTD is in giving a limited set of possibilities for
a processing engine to work with. There are obviously other ways
to do this (see below) but for a lot of applications, the DTD provides
sufficient constraints for authors of the information. A common
example is a title element. Often a title is required to provide feedback
in a UI, to act as link text in a hypertext link, etc. If your DTD says:

<!ELEMENT anything (title, anything.else+)>

then you know for a fact that you can pick out the title, given a
valid document. Also, the parser will tell you if the document is valid
or not and you can then decide whether to attempt processing it.
In our application, the RTF processing engine will still attempt to
process a document but says "hey, you might not get what you
expect". In other situations, an application just says "go away
and come back with something valid".
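
A minimal sketch of the "go away and come back with something valid" behaviour, using the JDK's built-in DOM parser with DTD validation switched on. The content model is the one quoted above; the #PCDATA declarations for title and anything.else, the sample documents and the class and method names are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class TitleCheck {
    // Internal DTD subset. Only the first declaration comes from the post;
    // the two #PCDATA declarations are assumed so the DTD is complete.
    static final String DTD =
        "<!DOCTYPE anything [\n"
      + "  <!ELEMENT anything (title, anything.else+)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "  <!ELEMENT anything.else (#PCDATA)>\n"
      + "]>\n";

    /** Returns the title text if the document is valid, or null otherwise. */
    static String titleIfValid(String body) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(true); // ask the parser to check the DTD
            DocumentBuilder b = f.newDocumentBuilder();
            final boolean[] valid = {true};
            b.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity errors are recoverable; just remember we saw one.
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            Document d = b.parse(new InputSource(new StringReader(DTD + body)));
            if (!valid[0]) {
                return null; // strict policy: refuse invalid input
            }
            // In a valid document the title is guaranteed to exist.
            return d.getDocumentElement()
                    .getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            return null; // not well-formed, or the parser could not be set up
        }
    }

    public static void main(String[] args) {
        System.out.println(titleIfValid(
            "<anything><title>Water</title><anything.else>H2O</anything.else></anything>"));
        System.out.println(titleIfValid(
            "<anything><anything.else>no title</anything.else></anything>"));
    }
}
```

A more forgiving application, like the RTF engine mentioned above, could log the validity error and carry on instead of returning null.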

It sounds like in your situation, you aren't worried about the vast
majority of elements but just want to pick up on key things like
<atom>, <bond>, etc. The "Eliot" way to do this would be with an
architecture DTD which defines attributes to identify important
elements. Your derived DTD can then use any content model (or
even element names) you want.

For example:

<!element atom - - (bond+)>
<!attlist atom
CMLNAME NAME #FIXED atom>

<!element bond - - EMPTY>
<!attlist bond
CMLNAME NAME #FIXED bond>

Your derived DTD might then go something like:

<!ELEMENT myatom - - (title, mybond+, otherstuff)>
<!ATTLIST myatom
CMLNAME NAME #FIXED atom>
<!ELEMENT mybond - - (title, description)>
<!ATTLIST mybond
CMLNAME NAME #FIXED bond>

(I'm still new to AFs, but this is the basic idea)

Now your processing engine can identify items by their fixed
attributes and process accordingly, ignoring all other elements.
Other people can happily derive from your architecture DTD to
add their application-specific elements.

If you are using XML without a DTD, things are exactly the
same except that you need to explicitly set the attribute on
the relevant elements (as I understand it). It should be trivial
to write a normaliser which would generate XML from an SGML
instance (SGMLNORM would probably do it).
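
As a rough sketch of how a processing engine might pick out elements by their CMLNAME attribute rather than by element name, here is a small DOM example. It assumes an XML rendering of the derived DTD above: the SGML declared value NAME has no direct XML counterpart, so NMTOKEN is substituted, "otherstuff" is dropped, and title and description are assumed to hold #PCDATA. The class name, method name and sample data are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ArchForms {
    // Document with an internal DTD subset; CMLNAME is never written in the
    // instance, the parser supplies it from the #FIXED defaults.
    static final String WITH_DTD =
        "<!DOCTYPE myatom [\n"
      + "  <!ELEMENT myatom (title, mybond+)>\n"
      + "  <!ATTLIST myatom CMLNAME NMTOKEN #FIXED \"atom\">\n"
      + "  <!ELEMENT mybond (title, description)>\n"
      + "  <!ATTLIST mybond CMLNAME NMTOKEN #FIXED \"bond\">\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "  <!ELEMENT description (#PCDATA)>\n"
      + "]>\n"
      + "<myatom><title>C1</title>"
      + "<mybond><title>b1</title><description>single</description></mybond>"
      + "<mybond><title>b2</title><description>double</description></mybond>"
      + "</myatom>";

    // Same idea without a DTD: the attribute must be set explicitly.
    static final String WITHOUT_DTD =
        "<myatom CMLNAME=\"atom\"><title>C1</title>"
      + "<mybond CMLNAME=\"bond\"/></myatom>";

    /** Counts elements whose CMLNAME attribute equals the given form name. */
    static int countByForm(String xml, String form) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = d.getElementsByTagName("*"); // every element
            int n = 0;
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                // getAttribute returns "" when the attribute is absent
                if (form.equals(e.getAttribute("CMLNAME"))) {
                    n++;
                }
            }
            return n;
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(countByForm(WITH_DTD, "atom"));
        System.out.println(countByForm(WITH_DTD, "bond"));
        System.out.println(countByForm(WITHOUT_DTD, "bond"));
    }
}
```

The engine never looks at the names myatom or mybond, so a derived DTD is free to rename them.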

I think one of the major problems with the Web today is the
plethora of badly formed HTML pages which have been allowed
to grow and flourish by browsers which don't check for validity
in any way at all. There is a danger that lack of DTDs in XML
documents will lead to even greater "tag soup".

----------
From: peter@ursus.demon.co.uk
Sent: 17 October 1997 08:21
To: xml-dev@ic.ac.uk
Subject: Weak DTDs

I am in the throes of revising CML (Chemical Markup Language - an XML-based
application) and trying to work out what the value of conventional DTDs
is. The previous version has a traditional SGML-like DTD - lots of
parameter entities and other clever stuff. I am finding this too
restrictive for several reasons, mainly because:
(a) XML-* is moving so rapidly (e.g. LINK, STYLE, etc.). This is a Good
Thing, but CML has to react to it.
(b) RDF, DC, MathML etc. will be involved in CML and I can't say exactly
how at present.
(c) My ideas on CML itself keep changing as I gain experience of new
problems.

I'd like *constructive* views on the value of DTDs in XML. [I know that the
community has strongly held ones, so please avoid too much passion :-).
There was a very interesting discussion a few weeks back on the aesthetics
of DTDs - a good DTD is a thing of beauty.] I can see the following reasons
for DTDs.
(a) the author has to conform to a pre-defined spectrum of ideas (e.g. a
tax-return). [This is not required for CML, and any conformance is outside
what a DTD can deliver - e.g. value verification.]
(b) the document may get corrupted in transmission or elsewhere. I suspect
this is not a very important reason these days.
(c) it *may* make it easier to develop authoring tools.
(d) it *may* give guidance to implementers of applications.
(e) it should (but doesn't always) act as an incentive to develop
human-readable documentation of the semantics.
(f) it shows that the author has defined the language at some point in time.

I'd be grateful for other reasons. For CML I expect that (c-e) have some
limited value. (f) may impress some people and horrify others.

In creating CML documents I find myself:
(a) wanting to introduce foreign names (e.g. <DC:author>, or <MathML:EQN>).
These could reasonably come at many places in the document.
(b) forgetting my own 'rules', e.g. order of elements within a content
model. So I can't expect others to follow them :-)
(c) adding new components to content models - for good reasons. There is
no reason why a <MOLECULE> cannot contain a <FIGURE>, but I didn't think
of that earlier. I don't want to have to think of all combinations and ask
'is that reasonable?'.

However the power of structured documents means that I can often use very
fuzzily constructed documents. Thus:
'if a MOLECULE contains ATOMS and BONDS, the software can draw a picture'
'if any parent contains a FIGURE, allow that to be displayed by the reader'.
'if a VARiable has attribute BUILTIN=FOO, inform the software that it
could process this with special FOO-specific code'
and so on.

These are powerful conditions, but if we try to express them in DTDs,
validation will fail. What I'd like to have is a wildcard #ANY (this has
already been suggested) which can be used for content models something like
the (currently illegal) XML:

<!ELEMENT MOL (#ANY,ATOMS,BONDS)*>

This says that MOL can contain anything, but that ATOMS and BONDS have a
special role. The authoring tool might present a menu with the items ATOMS,
BONDS, Other. The software for MOL.java could contain routines to identify
children:
    int natom = 0;
    int nbond = 0;
    for (int i = 0; i < this.getChildCount(); i++) {
        Node n = getNode(i);
        if (n instanceof ATOMS) {
            /* atom-specific stuff */
            natom++;
        } else if (n instanceof BONDS) {
            /* bond-specific stuff */
            nbond++;
        }
        /* any other child is simply ignored */
    }
    if (natom > 0 && nbond > 0) {
        displayMol();
    }

Obviously this can't be written automatically, but the 'DTD' helps the author.
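
For readers who want to try the idea, here is a self-contained sketch of the same child scan written against the standard DOM API rather than the CML classes. It matches on element names instead of Java classes; the class name, method name and sample documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class MolScan {
    /**
     * Returns true when the root element has at least one ATOMS child and
     * at least one BONDS child. Every other child is tolerated and ignored.
     */
    static boolean canDraw(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            int natom = 0;
            int nbond = 0;
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip text, comments and so on
                }
                String name = n.getNodeName();
                if (name.equals("ATOMS")) {
                    natom++;
                } else if (name.equals("BONDS")) {
                    nbond++;
                }
                // FIGURE, notes, foreign elements: no special role, no error
            }
            return natom > 0 && nbond > 0;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canDraw("<MOL><FIGURE/><ATOMS/><NOTE>x</NOTE><BONDS/></MOL>"));
        System.out.println(canDraw("<MOL><ATOMS/><FIGURE/></MOL>"));
    }
}
```

No DTD is involved here, so the document only has to be well-formed; the "special role" of ATOMS and BONDS lives entirely in the application code.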

In some cases there will be stricter rules such as:

<!ELEMENT VAR (#PCDATA)>
<!ATTLIST VAR
BUILTIN CDATA #IMPLIED
TYPE (INTEGER|FLOAT|STRING) "STRING" ...>

which clearly help both authoring tool authors and applications authors.

At present I would like to keep a simple DTD but most of the content models
will be 'ANY' and most of the attribute values will be CDATA. It would be
nice to have attribute values which could take a list of values *and* CDATA
:-) - like:
<!ATTLIST VAR TYPE (INTEGER,FLOAT,STRING,#ANY)>
which would inform the software that it should cater for three specific
values, but that the user can add FOO if they really want.
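
One way an application might cater for the three specific values while still tolerating a user-supplied FOO is sketched below. This is only an illustration of the idea: the class name, the method name and the choice to fall back to the raw text for unknown types are assumptions, not part of CML.

```java
final class VarValue {
    /**
     * Converts the text content of a VAR according to its TYPE attribute.
     * A missing TYPE is treated as STRING, matching the default in the
     * stricter declaration; an unrecognised TYPE keeps the raw text.
     */
    static Object convert(String type, String text) {
        String t = (type == null || type.isEmpty()) ? "STRING" : type;
        switch (t) {
            case "INTEGER":
                return Integer.valueOf(text.trim());
            case "FLOAT":
                return Double.valueOf(text.trim());
            case "STRING":
                return text;
            default:
                // User-defined type such as FOO: no special handling known,
                // so hand the text through unchanged.
                return text;
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("INTEGER", " 42 "));
        System.out.println(convert("FLOAT", "1.5"));
        System.out.println(convert("FOO", "some value"));
    }
}
```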

Any sympathisers out there :-)?

P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)
