PROPOSAL: XSchema declarations and constraints

Paul Prescod (papresco@technologist.com)
Wed, 10 Jun 1998 10:50:25 -0400


John Cowan wrote:
>
> I was attempting to adopt the distinction between "what IS so" and
> "what MUST BE so" (which DTDs conflate) and see where it led.

This is an excellent way of phrasing the distinction. Its clarity raises
a troubling question, but also a big opportunity. Metadata is an important
part of what we want XSchema to support, yet metadata is inherently about
"what IS so" rather than "what MUST BE so". The big opportunity is the
idea that went "click" in my head.

--

In the XML SIG, I have promoted the idea that "what IS so" should be cleanly separated conceptually from "what MUST BE so" in the namespaces specification. The "what IS so" document could be called a "vocabulary" or "dictionary" or "namespace" or "directory." Vocabulary documents would have textual descriptions of element types and attribute types, knowledge-representation information, and so forth: metadata. They could also contain or link to a "default" stylesheet or "default" schema. SGML always had dictionaries, but they never existed in standardized formats. TEI's dictionary was in the TEI DTD ("the TEI Guidelines"), DocBook's was in DocBook, HTML's is in HTML ("the HTML 4.0 spec").

A schema is about "what MUST BE so". It defines a class of languages. Of course schemas will (almost?) always be tied to vocabularies, just as stylesheets will be tied to vocabularies. That doesn't mean that they are the same thing, but merely that they are often related. It is undeniably convenient to be able to specify metadata and schema information in the same document.

The distinction is philosophical, technical and also practical. I've just described the philosophical distinction. Now on to the technical and practical ones:

#1. We don't want to duplicate document-type metadata across closely related schemas. HTML 1.0, 2.0, 3.2 and 4.0 all have more or less the same semantics for the A element. It exists, it stands for "anchor", it is a type of hypertext link, it marks both sides of the links, etc. etc. etc. But they all have different content models and attributes for A. If schemas are separate from metadata, then we don't have to repeat the metadata in each document. Repetition gets us back to the issue of redundancy/conflict. What happens if two versions of HTML accidentally describe a different semantic for the A element? That strikes me as a problem.
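To make point #1 concrete, here is a minimal Python sketch; it is not part of any XSchema draft. It assumes a hypothetical VOCABULARY table holding the single shared description of A, and a hypothetical SCHEMAS table holding per-version constraints. The version labels and attribute sets are illustrative only and are not copied from the real HTML DTDs.

```python
# Hypothetical shared vocabulary: "what IS so", stated exactly once.
VOCABULARY = {
    "A": "anchor; a type of hypertext link",
}

# Hypothetical per-version schemas: "what MUST BE so". Only constraints live
# here (illustrative attribute sets, not the real HTML DTD contents).
SCHEMAS = {
    "html-2.0": {"A": {"HREF", "NAME"}},
    "html-4.0": {"A": {"HREF", "NAME", "TARGET"}},
}


def describe(version, element):
    """Combine the shared description with one version's constraints."""
    return {
        "element": element,
        "meaning": VOCABULARY[element],
        "allowed_attributes": sorted(SCHEMAS[version][element]),
    }


old = describe("html-2.0", "A")
new = describe("html-4.0", "A")
# The meaning cannot drift between versions because it is stored once.
print(old["meaning"] == new["meaning"])
print(old["allowed_attributes"], new["allowed_attributes"])
```

Because the description lives in one place, two schema versions cannot end up with conflicting semantics for A; only their constraints differ.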

#2. Elements can have multiple constraints applied to them. The constraints are simply tested one after another. But an element can only have a single definition. That's why I originally suggested an XSL-like syntax for describing constraints:

<RULE>
  <PATTERN><TARGET-ELEMENT TYPE="FOO"/></PATTERN>
  <CONSTRAINT>...</CONSTRAINT>
</RULE>

<RULE>
  <PATTERN><BAR><TARGET-ELEMENT TYPE="FOO"/></BAR></PATTERN>
  <CONSTRAINT>...</CONSTRAINT>
</RULE>

but now we have ended up with an "element-definition"-like syntax:

<ELEMENTTYPE> ... </ELEMENTTYPE>

The first is from a "this must be true and maybe there are other things around that must be true" mentality. The second is from a "this and only this is true" mentality.

I'm not proposing the first TODAY, but I am asking that we leave ourselves open to full patterns and multiple constraint rules per element type, just as there can be full patterns and multiple stylesheet rules per element type in XSL. The only difference would be that in XSL a single matching rule is chosen, whereas in our schema language ALL matching rules would be tested. When you combine this with schema inclusion (in some future version), you also get the ability to "tighten up" constraints from an imported DTD.

Note that the way I describe it above is inherently extensible in two ways. Patterns can become more advanced. Constraints can also become more advanced.
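The "test ALL matching rules" behaviour described in #2 can be sketched in Python. This is an illustration under assumptions, not a proposed implementation: the rule representation (a name, a (parent, target) pattern, and a constraint function), the rule names, and the sample document are all hypothetical. Patterns here are deliberately limited to "FOO anywhere" and "FOO directly inside BAR", mirroring the two RULE examples.

```python
import xml.etree.ElementTree as ET


def matches(element, parent, pattern):
    """A pattern is (parent_tag_or_None, target_tag); None matches any parent."""
    parent_tag, target_tag = pattern
    if element.tag != target_tag:
        return False
    return parent_tag is None or (parent is not None and parent.tag == parent_tag)


def validate(root, rules):
    """Test every matching rule against every element and collect all failures."""
    failures = []
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in root.iter():
        parent = parents.get(element)
        for name, pattern, constraint in rules:
            # Unlike XSL, no single "best" rule is chosen: each match is tested.
            if matches(element, parent, pattern) and not constraint(element):
                failures.append((element.tag, name))
    return failures


# Hypothetical rules: every FOO needs an id; a FOO inside BAR must be empty.
RULES = [
    ("foo-has-id", (None, "FOO"), lambda e: "id" in e.attrib),
    ("foo-in-bar-is-empty", ("BAR", "FOO"),
     lambda e: len(e) == 0 and not (e.text or "").strip()),
]

doc = ET.fromstring('<ROOT><FOO id="1">text</FOO><BAR><FOO>text</FOO></BAR></ROOT>')
print(validate(doc, RULES))
# prints [('FOO', 'foo-has-id'), ('FOO', 'foo-in-bar-is-empty')]
```

The second FOO fails both rules at once, which is the point: rules accumulate rather than override each other, so an including schema could add a stricter rule without redefining the element.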

#3. It would be nice to have a clear way of distinguishing metadata about element types from metadata about constraints. For instance we might want to describe an element's semantics separately from where we describe why it has a particular content model in the schema, or allows or doesn't allow a particular attribute.

---

Now I realize that we often want to put schema and vocabulary information in the same document. Simple uses of XML should not be penalized for the complexity of complex applications of it. In fact, simple uses of XML will only have a single stylesheet: maybe those stylesheet rules should be able to go "cheek to cheek" with schema and type declaration information.

What we need is to maintain the distinction between these different kinds of information: type information, constraint information, style information etc., but not require a separate file per information type. We probably also want to allow the declarations to go right beside each other, for maintenance simplicity. In other words, we do not want to *conflate* the types of declarations, but we do want to allow them in the same document.

I propose that the namespaces facility is the perfect solution to this problem. We can break the XSchema specification into three parts (or perhaps three sections of the same specification).

XDocTypeInfo -- a "bag" DTD that contains a list of declarations.
XSchema -- the definition for the XSC:Rule, XSC:ContentModel, etc.
XVocab -- XVC:ElementType, XVC:Attribute declarations

Here's what I think that this would look like:

<XDTI:XDocTypeInfo>

<XVC:ElementType Name="A">This is the anchor element.</XVC:ElementType>

<XSC:Rule>
  <TargetElement Name="A"/>
  <ContentModel><Mixed>...</Mixed></ContentModel>
</XSC:Rule>

<XSC:Rule>
  <A><TargetAttribute Name="HREF"/></A>
  <CDATA/>
</XSC:Rule>

<XSL:Rule>
  <TargetElement Name="A"/>
  <sequence color="blue"> <children/> </sequence>
</XSL:Rule>

<XLink:Rule>
  <TargetElement Name="A"/>
  <XLink:Role Role="Simple"/>
</XLink:Rule>

</XDTI:XDocTypeInfo>
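As a rough illustration of how a processor might keep the kinds of declaration apart while reading one "bag" document, here is a Python sketch that groups the top-level declarations by namespace. The namespace URIs are hypothetical (the proposal names prefixes but no URIs), the function name group_by_namespace is invented for this example, and the bag is cut down to the XVC and XSC declarations.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical namespace URIs, declared only so the sample parses.
BAG = """
<XDTI:XDocTypeInfo xmlns:XDTI="urn:example:xdti"
                   xmlns:XVC="urn:example:xvocab"
                   xmlns:XSC="urn:example:xschema">
  <XVC:ElementType Name="A">This is the anchor element.</XVC:ElementType>
  <XSC:Rule><TargetElement Name="A"/></XSC:Rule>
  <XSC:Rule><A><TargetAttribute Name="HREF"/></A></XSC:Rule>
</XDTI:XDocTypeInfo>
"""


def group_by_namespace(xml_text):
    """Map each namespace URI to the local names of the bag's top-level children."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for child in root:
        # ElementTree spells qualified names as "{uri}local".
        uri, _, local = child.tag[1:].partition("}")
        groups[uri].append(local)
    return dict(groups)


print(group_by_namespace(BAG))
# prints {'urn:example:xvocab': ['ElementType'], 'urn:example:xschema': ['Rule', 'Rule']}
```

A vocabulary tool would read only the first group and a validator only the second, even though both kinds of declaration sit side by side in the same file.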

A short form for a set of declarations that all apply to the same element type might be:

<XDTI:ElementInfo>

</XDTI:ElementInfo>

In this case, you could leave out the patterns in the various rules, because the target would be implied by the context. We can't force the XSL and XLink people to go along with this plan, but we can certainly use it to differentiate our own schema information from our element type information.

I won't have much time in the near future to defend this proposal, so I hope that it stands on its own merits.

No, I still don't think that it necessarily means we want to go back into entity provision territory. At least not text entities. There is still no legal communication mechanism between the parser and the XSchema processor. XML would have to be updated to allow the application to supply entity replacement texts "automatically." And verifying that entities have a particular value still doesn't seem very interesting.

Paul Prescod - http://itrc.uwaterloo.ca/~papresco

Three things are most perilous:
Connectors that corrode
Unproven algorithms, and self-modifying code
http://www.geezjan.org/humor/computers/threes.html