JUMBO

Peter Murray-Rust (Peter@ursus.demon.co.uk)
Sat, 22 Mar 1997 19:07:22 GMT


JUMBO is a prototype browser/editor/search/transformation tool for
XML documents. I have now managed to bolt in both Lark and NXP
instead of my parser (which was crude and did not support some of the
XML constructs). The bolting-in is still rather crude and concentrates
my mind on the need for a simple API at this level. Here are some comments
which may be useful.

NXP.
----
NXP has an interface Esis, with functions such as open_tag, close_tag,
process_instruction, etc. [I think they would be more properly called
start_element??]. JUMBO uses this to build up a Vector representing the
ESIS event stream, something like:
"_START_TAG" "CML" AttributeList "_START_TAG" "MOL" ... "_END_TAG" "MOL"...
JUMBO then builds a tree out of this, adding attributes, etc.
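
As an illustration of the step just described, here is a minimal sketch of
turning such a flat event stream into a tree. It is not JUMBO's actual code.
The class names EventStreamTree and Node are hypothetical, and the stream
layout is an assumption: "_START_TAG" is followed by a name and an attribute
map, "_END_TAG" is followed by a name, and any other entry is treated as
character content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

class EventStreamTree {
    /** Hypothetical tree node; not JUMBO's real tree class. */
    static class Node {
        final String name;
        final Map<String, String> attributes;
        final List<Node> children = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        Node(String name, Map<String, String> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    /** Builds a tree from the assumed flat event layout and returns the root. */
    @SuppressWarnings("unchecked")
    static Node build(List<Object> events) {
        Deque<Node> open = new ArrayDeque<>(); // stack of currently open elements
        Node root = null;
        int i = 0;
        while (i < events.size()) {
            Object ev = events.get(i);
            if ("_START_TAG".equals(ev)) {
                // Assumed layout: marker, element name, attribute map.
                String name = (String) events.get(i + 1);
                Map<String, String> atts = (Map<String, String>) events.get(i + 2);
                Node node = new Node(name, atts);
                if (open.isEmpty()) {
                    if (root != null) {
                        throw new IllegalStateException("second root element: " + name);
                    }
                    root = node;
                } else {
                    open.peek().children.add(node);
                }
                open.push(node);
                i += 3;
            } else if ("_END_TAG".equals(ev)) {
                // Assumed layout: marker, element name.
                String name = (String) events.get(i + 1);
                if (open.isEmpty() || !open.peek().name.equals(name)) {
                    throw new IllegalStateException("unbalanced end tag: " + name);
                }
                open.pop();
                i += 2;
            } else {
                // Anything else is treated as character content of the open element.
                if (!open.isEmpty()) {
                    open.peek().text.append(ev);
                }
                i += 1;
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element: " + open.peek().name);
        }
        return root;
    }
}
```

One weakness of marker strings in a flat stream is that character content
which happens to equal "_START_TAG" cannot be told apart from a marker; typed
event objects would avoid that.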

NXP has a class XML which is built by JACC. This contains inter alia
an Esis_Stdout object (implements Esis). There are several objects in XML
which are private and therefore not easily accessed - I think they should
have accessors, but at present I have subclassed it to PMRXML, which has
the requisite accessors.

My test program then creates a PMRXML object, and extracts the event stream
which is then passed to JUMBO's existing tree object:
NXP.PMRXML xml = new PMRXML(NXP.Streams.load_File(file, true));
pmr.chemime.ChemTree chemTree = new ChemTree(xml.getStreamVector());
pmr.sgml.GeneralTOC toc = chemTree.createGeneralTOC(3);

Comments: I have still to work out what whitespace NXP creates - there seems
to be a lot of content which is simply white. Maybe we have to address
COLLAPSE and KEEP at this stage? Also it isn't easy to extract certain
info - for example I had to hack XML.java to get the doctype - this isn't a good
idea and we need an accessor. I am also still not clear how NXP does (or should)
behave with:
<!DOCTYPE CML>
and <!DOCTYPE CML SYSTEM "cml.dtd">
(the default on the latter is to try to validate, I think, even if validate
is set to false. I'd prefer to be able to turn off validation, but I may have
missed something).
In general I'd like to be able to treat NXP as a black box, and subclass
my Esis object. That could mean passing it as an argument to XML, e.g.:

public class PMREsis implements Esis {
    public void open_tag(String name) {
        ...
    }
}

PMREsis esis = new PMREsis();
NXP.XML xml = new NXP.XML(esis, NXP.Streams.load_File(file, true));
pmr.sgml.SGMLTree tree = new pmr.sgml.SGMLTree(xml);

and so on.
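
On the whitespace question raised above, one option is to decide in the
handler itself whether white-only content is kept. The following is only a
sketch under stated assumptions: EventHandler is a hypothetical stand-in and
not NXP's real Esis interface, the callback name character_data is invented
for illustration, and the boolean flag is a rough analogue of KEEP versus
COLLAPSE rather than an implementation of either.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceFilterSketch {
    /** Hypothetical callback interface; NOT the real NXP Esis interface. */
    interface EventHandler {
        void open_tag(String name);
        void close_tag(String name);
        void character_data(String text); // assumed callback name
    }

    /** Records events, optionally dropping content that is only whitespace. */
    static class CollectingHandler implements EventHandler {
        final List<String> events = new ArrayList<>();
        private final boolean keepWhitespace;

        CollectingHandler(boolean keepWhitespace) {
            this.keepWhitespace = keepWhitespace;
        }

        public void open_tag(String name) {
            events.add("_START_TAG");
            events.add(name);
        }

        public void close_tag(String name) {
            events.add("_END_TAG");
            events.add(name);
        }

        public void character_data(String text) {
            // trim() strips spaces, tabs, CR and LF; an empty result means
            // the content was white-only.
            if (keepWhitespace || !text.trim().isEmpty()) {
                events.add(text);
            }
        }
    }
}
```

Constructing the handler with false drops white-only content before it ever
reaches the tree; constructing it with true records everything as delivered.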

NXP is a validating parser, but my DTDs are still struggling with Parameter
Entities so I have no experience here.

Lark
----
Lark creates a tree (called Lark) and provides a handler for
the user to pick up a variety of events (e.g. doDoctype(), doPI()). The
tree contains Elements ('Nodes') which have Attributes and a type (String).

Rather than subclassing these elements, I process Lark by iterating through
the Elements and creating a JUMBO SGMLTree (this can be delayed if required).
The tree seems complete, but I am not sure I have got all the doFOO routines
working correctly. I have also had problems with PIs (if the ?> delimiter
is used) - these may be mine.

Lark does not validate. However it is easy to interface and is fast.

General
-------
I do not use PIs myself though I shall start to do so. If they are
kept in the document tree, is there a convention for where they live? (In the
last opened element? What if they occur in PCDATA?).

I intend to make JUMBO available with both Lark and NXP but it's a bit creaky
at present and the interface is a bit slow. I have been told that the larger
the number of classes, the slower the program - any comments? Also I don't
know whether I should be deliberately garbage-collecting at this stage.

Any general thoughts would be welcome. I intend to bolt a crude search tool
into JUMBO along the TEI lines. I shall also see whether I can extract the
bits of NXP that do the validating, because then we have a crude validating
editor.

Any feedback from the current JUMBO users would be appreciated. (I already know
it's slow, and the graphics creak in several places :-)

P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/