Character and byte offsets

Richard Tobin (richard@cogsci.ed.ac.uk)
Thu, 5 Nov 1998 13:39:53 GMT


> Secondly, for proximity, you're worried about counting characters, not
> bytes, but for addressing back into the entity, you're worried about byte,
> not character, offsets. So it's even harder than it looks.

This reminds me - are there good techniques for maintaining a byte
offset in conjunction with character-set translations? Ideally you
want the translation done in big blocks at a low level, but then how
do you access the byte offsets? In RXP/LTXML I keep the byte offset of
the start of the block (which is actually a line), and then (in the
case of UTF-8) effectively reverse-translate the characters read so far
to calculate how much to add (this relies on the UTF-8 encoding being
invertible, so each character's byte length can be recovered from the
character itself). Surely there must be a better way...
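
A minimal sketch of the line-start-plus-reverse-translation idea, in
Python rather than the C that RXP is written in. It is not RXP's
actual code: the names utf8_len, decoded_lines and byte_offset are
made up for illustration, and it assumes the input is valid UTF-8 with
lines ending in "\n".

```python
import io


def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses to encode the code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def decoded_lines(data: bytes):
    """Yield (byte offset of line start, decoded line) pairs.

    Each line is translated as one block; only the starting byte
    offset of the block is remembered.
    """
    offset = 0
    for raw in io.BytesIO(data):
        yield offset, raw.decode("utf-8")
        offset += len(raw)


def byte_offset(line_start: int, line: str, char_index: int) -> int:
    """Byte offset in the entity of the character at char_index.

    "Reverse-translates" the characters before char_index by summing
    the UTF-8 length of each one, then adds the line's start offset.
    """
    return line_start + sum(utf8_len(ord(c)) for c in line[:char_index])
```

For example, with data = "h\u00e9llo\n\u03c0 \u2248 3\n".encode("utf-8"),
the second line starts at byte 7, and byte_offset(7, line, 2) gives the
byte position of the U+2248 character. The cost is proportional to the
distance from the start of the line, which is why keeping blocks short
(one line) matters here.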

-- Richard