I do not believe that a person with the knowledge level you have described
is going to succeed at the task you have set for him or her.
Entities are going to kill them.
Whitespace in end-tags is going to toast them.
CDATA sections are going to confuse them.
Elements (and tags!) broken across lines are going to destroy them.
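
To make three of these failure modes concrete, here is a minimal sketch in Python. The pattern NAIVE, the function naive_rename, and the sample strings are all hypothetical, chosen only for illustration; they stand for the kind of expression a newcomer might plausibly write, not for any particular tool.

```python
import re

# Hypothetical naive pattern: assumes every tag is written exactly
# as <FOO> or </FOO>, with nothing else between the name and '>'.
NAIVE = re.compile(r"<(/?)FOO>")

def naive_rename(doc: str) -> str:
    """Rename FOO to BAR wherever the naive pattern matches."""
    return NAIVE.sub(r"<\1BAR>", doc)

# Whitespace before '>' in an end-tag is legal XML, but the pattern
# misses it, so only the start-tag gets renamed.
ws_end_tag = "<FOO>text</FOO >"

# A start-tag broken across lines before an attribute: this time the
# start-tag is missed and only the end-tag gets renamed.
split_tag = '<FOO\n  id="a">text</FOO>'

# Inside a CDATA section "<FOO>" is character data, not markup,
# yet the pattern rewrites it anyway.
cdata = "<doc><![CDATA[<FOO>]]></doc>"

for sample in (ws_end_tag, split_tag, cdata):
    print(repr(naive_rename(sample)))
```

The first two results have mismatched start- and end-tags, so they are no longer well-formed; the third silently changes the document's text content.
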
This person can only succeed if either
a) the data is already normalized, probably due to a corporate standard
such as the one you mention, or
b) they download a normalizer.
If I am wrong, it would be easy to prove it. All someone has to do is
provide a regular expression that can (for instance) change all
occurrences of the GI "FOO" into "BAR" in any XML document corresponding
to a DTD of their choice (but which I can extend in the internal subset).
On the other hand, I can do this *trivially* with a regular expression on
data that has been normalized.
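
As a sketch of the normalized case, assume (and this is only an assumption about what "normalized" means here) that entities have been expanded, CDATA sections have been replaced by escaped text, comments, processing instructions and the DOCTYPE have been stripped, and each tag sits on one line with the name immediately after "<" or "</". The names RENAME and rename_normalized are hypothetical.

```python
import re

# Match "<FOO" or "</FOO" only when the name ends there, i.e. the next
# character is a space, "/" or ">". This avoids touching names such as FOOD.
RENAME = re.compile(r"<(/?)FOO(?=[ />])")

def rename_normalized(doc: str) -> str:
    """Rename the GI FOO to BAR in a document normalized as assumed above."""
    return RENAME.sub(r"<\1BAR", doc)

# In normalized data a literal "<FOO>" in text appears as "&lt;FOO&gt;",
# so it cannot be confused with markup.
normalized = (
    '<doc><FOO id="a">&lt;FOO&gt; is text</FOO>'
    "<FOO/><FOOD>x</FOOD></doc>"
)

print(rename_normalized(normalized))
```

Under those assumptions every "<" in the data starts a tag, which is what makes the one-line substitution safe; on arbitrary XML the same pattern gives no such guarantee.
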
> SGML gives you the option of using empty end tags, and the
> historical fact is that most large users, given this option and a
> sufficient amount of experience with it, choose not to use it.
These "large users" have expensive SGML editors that they have paid
someone thousands of dollars to customize to perfection. Under those
conditions, I would legislate redundancy also -- not just fully expanded
end-tags, but probably redundant IDs in comments of end-tags, public
identifiers on all entity declarations, perhaps even unique identifiers on
all elements.
But XML is about a different world than that.
Paul Prescod - http://itrc.uwaterloo.ca/~papresco
"A writer is also a citizen, a political animal, whether he likes it or
not. But I do not accept that a writer has a greater obligation
to society than a musician or a mason or a teacher. Everyone has
a citizen's commitment." - Wole Soyinka, Africa's first Nobel Laureate