So your proposal is:
(1) transcode into UTF-16 if necessary
(2) digitally sign what you get after (1).
I think this is a sensible way to go. Obviously, there are
anomalies:
<a foo='1' bar="2"/>
will not be the same as
<a
foo="1"
bar='2'
></a>
which is surprising, but trying to find solutions may well not be
cost-effective.
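To make the anomaly concrete, here is a minimal sketch in Python (my choice; nothing in the proposal names a language). A SHA-256 digest stands in for the actual signature, the helper name `digest_utf16` is hypothetical, and big-endian UTF-16 without a byte-order mark is an assumption, since the proposal does not say which flavor of UTF-16 is meant.

```python
import hashlib

def digest_utf16(text: str) -> str:
    """Transcode to UTF-16 and hash the resulting bytes.

    Assumptions: big-endian, no BOM; a plain digest stands in for the
    signature (a real signer would sign this digest with a private key).
    """
    return hashlib.sha256(text.encode("utf-16-be")).hexdigest()

# The two spellings of the same element from the example above.
form_a = "<a foo='1' bar=\"2\"/>"
form_b = "<a\nfoo=\"1\"\nbar='2'\n></a>"

# The characters differ, so the digests differ.
print(digest_utf16(form_a) == digest_utf16(form_b))  # False
```

Any change in quoting, line breaks, or empty-element syntax changes the bytes and therefore the digest, even though an XML processor would report the same element either way.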
You *might* want to consider dropping the prolog and starting the check
just at the root element.
You *might* want to consider normalizing namespace prefixes.
You *might* want to normalize whitespace in markup.
You *might*, etc etc etc etc; unless you are willing to commit to
a full grove/property-set model a la SGML's extended facilities, you
may well be better off signing the instance as it sits.
In particular, I think there are lots of things that would be easier
and less trouble-prone to work around than line-breaking, which is well
known to be highly error-prone. For example, in the line-break HERE->
how many space characters that you can't see follow the ">"?
There might be a useful halfway point, as follows: run the document
through an XML processor and sign just the combination of element types,
attribute name-value pairs, and textual content that the processor emits.
This allows you to finesse a lot of quoting/white-space/line-end issues;
it also allows authors to use tricks like default attributes and
internal entities that don't "really" change the content.
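A rough sketch of that halfway point, in Python with the standard-library `xml.etree.ElementTree` parser. Everything here is an illustration under stated assumptions, not a worked-out scheme: the names `emit` and `content_digest` are hypothetical, SHA-256 stands in for the signature, attributes are sorted by name (my assumption, on the grounds that attribute order carries no meaning in XML), and the record list is serialized with `repr` purely for convenience. Whether default attributes and internal entities get expanded depends on the processor you use.

```python
import hashlib
import xml.etree.ElementTree as ET

def emit(elem, out):
    """Append what the parser reports for elem: type, attributes, text."""
    out.append(("start", elem.tag))
    # Assumption: sort attributes so their source order does not matter.
    for name in sorted(elem.attrib):
        out.append(("attr", name, elem.attrib[name]))
    if elem.text:
        out.append(("text", elem.text))
    for child in elem:
        emit(child, out)
        if child.tail:  # text that follows the child inside this element
            out.append(("text", child.tail))
    out.append(("end", elem.tag))

def content_digest(xml_text: str) -> str:
    """Parse the document, then hash the emitted records, not the source."""
    out = []
    emit(ET.fromstring(xml_text), out)
    # repr() of a list of string tuples is deterministic; a digest
    # stands in for the real signature.
    return hashlib.sha256(repr(out).encode("utf-16-be")).hexdigest()

form_a = "<a foo='1' bar=\"2\"/>"
form_b = "<a\nfoo=\"1\"\nbar='2'\n></a>"
print(content_digest(form_a) == content_digest(form_b))  # True
```

The two spellings now produce the same digest, because quoting style, white space inside tags, and the empty-element shorthand never reach the parser's output. White space in textual content still counts, so this does not make the line-break problem go away entirely.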
On the other hand, I'd say that off the top, just digitally signing the
UTF-i-fied characters as they sit is a reasonable way to go. -Tim