Re: Whitespace rules (v2)

Liam Quin (liamquin@interlog.com)
Sat, 16 Aug 1997 01:27:11 -0400 (EDT)

Previous message: Eric Baatz - Sun Microsystems Labs BOS: "Re: purpose CDATA sections"
In reply to: Andrew Greene: "Re: Whitespace rules (v2)"
Next in thread: Paul Grosso: "Re: Whitespace rules (v2)"

On Sun, 10 Aug 1997, Neil Bradley wrote:

> [...]
> RULE 2. All whitespace preceding the start-tag and following the end-tag
> of a 'block enclosing' element is discarded.
> ---
> Note: a non-validating application must refer to a style sheet or
> configuration file to identify 'block enclosing' elements (perhaps by
> applying this rule to elements not specified as in-line elements).

No -- "blockness" is not at all the same as element content.
For example, you have to allow for a run-in heading, which starts out
looking like an HTML H3 (say) except that the rest of the paragraph
follows on the same line. So it isn't a block in the paragraph sense.

> As a validating application cannot easily determine this rule from the
> content model (the first mixed content element in the hierarchy is
> block enclosing, as well as all outer layers), it may choose the same
> approach.

I think this is too complicated, as well as being not 100% right.
I don't think there's a single "right" solution. This is why it's
best to allow the parser to pass _all_ whitespace back to the application,
although it is certainly useful if a DTD-aware parser, even if it isn't
validating, distinguishes element content whitespace from PCDATA whitespace
in some way.

More than this is a bad idea, I think.

> Note: If PI's, comments or empty elements remain in the data stream,
> they are deemed transparent to this process, so:
> [SP]<p>Some text...
>
> becomes:
>
> <p>Some text...

Note that if you have a very large comment, you might need a lot of
lookahead here.

> RULE 3. A sequence of one or more line-end codes immediately
> following a start-tag, or immediately preceding an end-tag, are
> discarded (except in preserved content).

This means that
<Paragraph>This is<Emphasis>
very
</Emphasis>strange.</Paragraph>

becomes
<Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>

or, if you format without distinguishing emphasis,
<Paragraph>This isverystrange.</Paragraph>

which I don't think is what you want.
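
To make the effect concrete, here is a minimal Python sketch of Rule 3 as
I read it. The names apply_rule_3 and strip_tags are made up for this
illustration and do not come from any parser, and the regular expressions
are a deliberate simplification: they ignore comments, PIs, preserved
content, and any tag containing "/" or ">" inside an attribute value.

```python
import re


def apply_rule_3(text: str) -> str:
    """Drop runs of line-end codes right after a start-tag or right before an end-tag."""
    # Line ends immediately following a start-tag such as <Emphasis>.
    text = re.sub(r"(<[A-Za-z][^<>/]*>)[\r\n]+", r"\1", text)
    # Line ends immediately preceding an end-tag such as </Emphasis>.
    text = re.sub(r"[\r\n]+(</[A-Za-z][^<>]*>)", r"\1", text)
    return text


def strip_tags(text: str) -> str:
    """Hypothetical formatter that renders every element as plain text."""
    return re.sub(r"<[^<>]+>", "", text)


source = "<Paragraph>This is<Emphasis>\nvery\n</Emphasis>strange.</Paragraph>"
flattened = apply_rule_3(source)

print(flattened)
# <Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>
print(strip_tags(flattened))
# This isverystrange.
```

The word boundaries on either side of "very" existed only as line ends,
so once the rule discards them nothing downstream can put them back.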

But SGML itself is broken in this regard.

> RULE 4. A remaining line-end code is converted into a space, except when it is
> preceded by a normal (hard) hyphen, or by a soft hyphen ('°'),
> in which case it is removed (a soft hyphen is also then removed).
> ---
> Note:
>
> A[CR]
> line-[CR]
> end code sep°[CR]
> erates lines.
>
> becomes:
>
> A line-end code seperates lines.

Well, note that there is no hyphen in that paragraph!!
The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen.
It is a minus sign.

The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen.
There is no soft hyphen in Latin 1.

I don't have the necessary copy of Unicode in front of me, but last time
I checked (Unicode 1.1) it was the same in this regard, and also in having
the ` character be a spacing grave accent, not a single quote.

This should be done by applications. I wouldn't want your message:
----------
RULE 5. Consecutive whitespace characters (including translated
turning into
----------RULE 5. Consecutive whitespace characters (including translated
for example.
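
A rough Python sketch of Rule 4 as quoted shows how this happens. The
function name apply_rule_4 and its soft_hyphen parameter are assumptions
made for the illustration; in particular, the default of code 173 for the
soft hyphen is only a guess at which character the rule means.

```python
def apply_rule_4(text: str, soft_hyphen: str = "\u00ad") -> str:
    """Join lines: a line end becomes a space unless a hyphen precedes it."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        if i == len(lines) - 1:
            # No line end follows the final line.
            out.append(line)
        elif line.endswith(soft_hyphen):
            # Soft hyphen: remove both the hyphen and the line end.
            out.append(line[:-1])
        elif line.endswith("-"):
            # Hard hyphen: keep the hyphen, remove the line end.
            out.append(line)
        else:
            # Any other line end is converted into a space.
            out.append(line + " ")
    return "".join(out)


# The example from the quoted note behaves as described there.
print(apply_rule_4("A\nline-\nend code sep\u00ad\nerates lines."))
# A line-end code seperates lines.

# A rule of dashes is glued to whatever follows it.
print(apply_rule_4("----------\nRULE 5. Consecutive whitespace characters"))
# ----------RULE 5. Consecutive whitespace characters
```

The rule cannot tell a hyphenated word from a line that merely ends in
"-", which is why I would leave this decision to the application.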

> Note: Multiple spaces can be preserved using the non-break space
> character (' ').
>
> <p>Some   spaces.

Er, is this defined in Unicode or in ISO 10646??

Lee

-- Liam Quin -- the barefoot typographer -- Toronto lq-text: freely available Unix text retrieval

email address: l i a m q u i n at host: i n t e r l o g dot c o m
