Character and byte offsets

Richard Tobin (richard@cogsci.ed.ac.uk)
Thu, 5 Nov 1998 13:39:53 GMT

Next message: Simon North: "XML and IE5 beta PR2"
Previous message: Kurt Helenelund: "Creation of XML documents"
In reply to: Tim Bray: "Re: SAX, DOM, and Search Engines (was Re: xml parser)"

> Secondly, for proximity, you're worried about counting characters, not
> bytes, but for addressing back into the entity, you're worried about byte,
> not character, offsets. So it's even harder than it looks.

This reminds me - are there good techniques for maintaining a byte
offset in conjunction with character-set translations? Ideally you
want the translation done in big blocks at a low level, but then how
do you access the byte offsets? In RXP/LTXML I keep the offset of the
start of the block (which is actually a line), and then (in the case
of UTF-8) effectively reverse-translate to calculate how much to add
(this relies on UTF-8 being invertible). Surely there must be a better
way...
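
A minimal sketch of the reverse-translation idea described above, in Python rather than the C of RXP/LTXML, and not taken from that code. It assumes the line has already been decoded to characters, that the byte offset of the start of the line was recorded, and that the input was well-formed UTF-8 (so each character's encoded length can be recomputed from its code point). The names utf8_len and byte_offset are hypothetical.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes well-formed UTF-8 uses to encode code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def byte_offset(line_start: int, line: str, char_index: int) -> int:
    """Byte offset in the entity of the character at char_index in line.

    line_start is the recorded byte offset of the first byte of the line.
    The amount to add is recovered by recomputing the encoded length of
    every character that precedes char_index (the "reverse translation").
    """
    return line_start + sum(utf8_len(ord(ch)) for ch in line[:char_index])


# Usage: locate the euro sign in the raw bytes of a line starting at offset 0.
line = "naïve café €5"
raw = line.encode("utf-8")
pos = byte_offset(0, line, line.index("€"))
assert raw[pos:pos + 3] == "€".encode("utf-8")
```

The cost is proportional to the number of characters between the start of the line and the position of interest, which is why keeping the recorded offset per line (rather than per file) keeps the recomputation short.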

-- Richard
