> "what's the most immediate containing element of offset X in
> file Y?"
>
> "traverse up the logical structure from offset X until a DIV
> element with a HEAD is found, and return me the offsets of
> that HEAD"
>
> Exact expression language is, uh, gee. These are the kinds of
> questions we could ask with "some XML query language", but if i have a
> gigabyte or so of variously-structured English text marked up this
> way, i really don't want to have to parse the document entity just to
> answer these kinds of simple questions. This is a weak specification
> of what I'm trying to do, i realize. (this all largely because i am
Our LT XML tool set and API were designed for precisely this sort of
application (we regularly work with SGML-encoded language corpora of
more than 1GB, such as the BNC). We get good performance because:
1) Our parser is written in C, and our search and retrieval tools use
it directly via a stream-based API; only custom UI code, which looks
at whole trees, tends to get written in a scripting language;
2) We only produce tree fragments when we get to the interesting bits:
our query processor is optimised to avoid building large amounts of
tree unnecessarily;
3) For REALLY big datasets, we do produce and use offset-based
indices.
For more information, see http://www.ltg.ed.ac.uk/software/xml/.
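
As an illustration of the stream-plus-offset-index idea in points 1) to 3)
(and not the LT XML API itself), here is a minimal sketch that answers the
two quoted questions without building a tree. Assumptions: it uses
Python's standard expat binding rather than a C toolchain, it expects
well-formed XML rather than SGML, the whole input is held in memory as
bytes, and the names build_index, innermost and head_of_enclosing_div are
hypothetical, chosen only for this example.

```python
import bisect
import xml.parsers.expat


def build_index(data: bytes):
    """Scan `data` once and return element records in document order.

    Each record is a dict with: name, start (byte offset of the start tag),
    end (byte offset at which the end tag begins) and parent (index of the
    parent record, or None for the root). No tree is built.
    """
    parser = xml.parsers.expat.ParserCreate()
    records, open_stack = [], []

    def on_start(name, attrs):
        parent = open_stack[-1] if open_stack else None
        records.append({"name": name, "start": parser.CurrentByteIndex,
                        "end": None, "parent": parent})
        open_stack.append(len(records) - 1)

    def on_end(name):
        # Inside a handler, CurrentByteIndex is where the current event starts.
        records[open_stack.pop()]["end"] = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return records


def innermost(records, offset):
    """Index of the most immediate element containing `offset`, or None."""
    # Recomputed per call for brevity; a stored index would keep this list.
    starts = [r["start"] for r in records]
    i = bisect.bisect_right(starts, offset) - 1
    # The last element opened at or before `offset` either contains it or
    # has already closed; in the latter case one of its ancestors does.
    while i is not None and i >= 0 and records[i]["end"] < offset:
        i = records[i]["parent"]
    return i if i is not None and i >= 0 else None


def head_of_enclosing_div(records, offset):
    """Walk up from `offset` to the nearest DIV that has a HEAD child.

    Returns (start, end) of that HEAD as defined in build_index, or None.
    """
    i = innermost(records, offset)
    while i is not None:
        if records[i]["name"] == "DIV":
            for j in range(i + 1, len(records)):
                child = records[j]
                if child["start"] > records[i]["end"]:
                    break  # past the end of this DIV
                if child["parent"] == i and child["name"] == "HEAD":
                    return child["start"], child["end"]
        i = records[i]["parent"]
    return None
```

Calling build_index once and then innermost(records, X) or
head_of_enclosing_div(records, X) for each offset X means the document is
parsed a single time; later queries touch only the offset records. An
offset counts as inside an element if it lies between the start of the
start tag and the start of the end tag, which is a choice made for this
sketch.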
ht
-- 
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/