Re: SAX: New Idea for Entity Resolution

James Clark (jjc@jclark.com)
Sun, 19 Apr 1998 12:34:28 +0700

Previous message: James Clark: "Re: SAX: String Internalisation and a CORBA/DCOM Question"
Maybe in reply to: David Megginson: "SAX: New Idea for Entity Resolution"

David Megginson wrote:
>
> James Clark writes:
>
> > You could just have a class that encapsulates a structure with three
> > members:
> >
> > - a CharacterStream
> > - a ByteStream
> > - a String
> >
> > At least one of the CharacterStream and ByteStream must be non-null. If
> > the ByteStream is non-null the String can specify the encoding.
>
> [Read on to the bottom for a large-ish design change.]
>
> This implies, then, the following three interfaces:
>
> public interface ByteStream {
>   public abstract int read ()
>     throws SAXException;
>   public abstract int read (byte b[], int start, int count)
>     throws SAXException;
> }
>
> public interface CharacterStream {
>   public abstract int read ()
>     throws SAXException;
>   public abstract int read (char ch[], int start, int count)
>     throws SAXException;
> }

Why are the single-character read calls there? They unnecessarily
complicate the interface.
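
To illustrate the point, here is a minimal sketch in which the interface
keeps only the bulk read and a single-character read is layered on top by
a helper. SAXException below is a local stand-in, and
StringCharacterStream, CharacterStreams and readOne are hypothetical names
used only for this example; the "-1 at end of input" convention is also an
assumption, since the interface above does not state one.

```java
// Local stand-in for the exception type named in the thread (assumption).
class SAXException extends Exception {
    SAXException(String message) { super(message); }
}

// CharacterStream reduced to the bulk read only.
interface CharacterStream {
    // Assumed convention: returns the number of chars read, or -1 at end of input.
    int read(char[] ch, int start, int count) throws SAXException;
}

// Hypothetical in-memory stream, present only to make the sketch runnable.
class StringCharacterStream implements CharacterStream {
    private final String data;
    private int pos = 0;

    StringCharacterStream(String data) { this.data = data; }

    public int read(char[] ch, int start, int count) {
        if (pos >= data.length()) {
            return -1;
        }
        int n = Math.min(count, data.length() - pos);
        data.getChars(pos, pos + n, ch, start);
        pos += n;
        return n;
    }
}

// Hypothetical helper: a single-character read built from the bulk call.
final class CharacterStreams {
    private CharacterStreams() {}

    static int readOne(CharacterStream in) throws SAXException {
        char[] buf = new char[1];
        int n = in.read(buf, 0, 1);
        return n < 1 ? -1 : buf[0];
    }
}
```

A caller that really wants one character at a time can use the helper,
and implementors of the interface only have one method to write.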

> public class InputSource {
>   // For each variable, imagine a get/set pair instead...
>   public ByteStream byteStream;
>   public CharacterStream characterStream;
>   public String encoding;
> }
>
> The nice thing here is that all of these can live on separate systems
> in a distributed environment: the InputSource can be a C-program on a
> VAX, the CharacterStream can come from a Python program running under alpha
> Linux, and the parser can be running in Java on a Windows box. There
> is no dependency on language- or system-specific features (except for
> java.lang.String, which should be able to map predictably to other
> languages).
>
> Now, why not take this a step further?
>
> public class InputSource {
>   // For each variable, imagine a get/set pair instead...
>   public String publicId;
>   public String systemId;
>   public ByteStream byteStream;
>   public CharacterStream characterStream;
>   public String encoding;
> }
>
> We'd have to define rules of precedence:
>
> 1) if there is a character stream, use it;
>
> 2) if there is no character stream but there is a byte stream, use the
> byte stream;
>
> 3) if there is neither a character stream nor a byte stream but there
> is a system identifier, open a connection to the system identifier;
>
> 4) if there is no character stream, byte stream, or system identifier,
> throw an exception (or invoke the ErrorHandler).
>
> Now, we can get away with only one parse() method in
> org.xml.sax.Parser:
>
> public abstract void parse (InputSource source)
>   throws Exception;
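
The four precedence rules quoted above can be sketched as a small
selection routine. This is only an illustration under assumptions:
SourceKind, SourceSelector and select are hypothetical names, the stream
interfaces are cut down to one method each, and rule 4 is shown as a
thrown IllegalArgumentException although the proposal leaves open whether
to throw or to invoke the ErrorHandler.

```java
// Cut-down stand-ins for the interfaces quoted above (assumption).
interface ByteStream {
    int read(byte[] b, int start, int count) throws Exception;
}

interface CharacterStream {
    int read(char[] ch, int start, int count) throws Exception;
}

// Field layout follows the quoted proposal.
class InputSource {
    String publicId;
    String systemId;
    ByteStream byteStream;
    CharacterStream characterStream;
    String encoding;
}

// Hypothetical: which input the parser would end up reading from.
enum SourceKind { CHARACTER_STREAM, BYTE_STREAM, SYSTEM_ID }

final class SourceSelector {
    private SourceSelector() {}

    static SourceKind select(InputSource source) {
        // Rule 1: a character stream always wins.
        if (source.characterStream != null) {
            return SourceKind.CHARACTER_STREAM;
        }
        // Rule 2: otherwise fall back to the byte stream.
        if (source.byteStream != null) {
            return SourceKind.BYTE_STREAM;
        }
        // Rule 3: otherwise the parser opens the system identifier itself.
        if (source.systemId != null) {
            return SourceKind.SYSTEM_ID;
        }
        // Rule 4: nothing to read from.
        throw new IllegalArgumentException(
            "InputSource has no character stream, byte stream, or system identifier");
    }
}
```

Note that publicId and encoding play no part in the choice; encoding
only matters once the byte stream has been selected.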

I don't think this is a good idea: it makes SAX harder to use in the
simple case of reading from a URL.
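
A rough sketch of the extra step this puts on the caller, using cut-down
stand-ins for the proposed types. The Parsers.parse helper at the end is
hypothetical and is not something proposed in this thread; it is only
there to show what the single-argument URL call would otherwise look
like.

```java
// Cut-down stand-in: only the field this example needs (assumption).
class InputSource {
    String systemId;
}

// The single entry point from the quoted proposal.
interface Parser {
    void parse(InputSource source) throws Exception;
}

final class Parsers {
    private Parsers() {}

    // With only parse(InputSource), every caller reading from a URL
    // has to do this wrapping by hand.
    static InputSource fromSystemId(String systemId) {
        InputSource source = new InputSource();
        source.systemId = systemId;
        return source;
    }

    // Hypothetical convenience: the simple case as a one-argument call.
    static void parse(Parser parser, String systemId) throws Exception {
        parser.parse(fromSystemId(systemId));
    }
}
```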

James
