On the other hand, grove-based proximity search techniques have also been
used since the 1970s, when this was called a "semantic network". The
advantage is that it is language independent. To date, this hasn't been
terribly useful with HTML, as not many people care about indexing <p> tags,
for example. This is where XML has lots to offer and where efforts ought to
be, and are being, directed (IMHO).
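
To make the distinction concrete, here is a rough Python sketch of the three
notions of proximity under discussion: character distance, word distance, and
distance in the element tree (a crude stand-in for a grove). The function
names, the sample document, and the whitespace tokeniser are my own
assumptions for illustration, not taken from any existing engine. The
whitespace split in particular is exactly the step that fails for languages
written without spaces between words.

```python
import re
import xml.etree.ElementTree as ET


def char_proximity(text, a, b, max_chars):
    """True if some occurrences of a and b start within max_chars characters."""
    pos_a = [m.start() for m in re.finditer(re.escape(a), text)]
    pos_b = [m.start() for m in re.finditer(re.escape(b), text)]
    return any(abs(i - j) <= max_chars for i in pos_a for j in pos_b)


def word_proximity(text, a, b, max_words):
    """True if a and b occur within max_words word positions of each other.

    Assumption: words are separated by whitespace.
    """
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_words for i in pos_a for j in pos_b)


def tree_distance(root, a, b):
    """Fewest parent/child steps between an element whose text contains a
    and one whose text contains b; None if either term is absent."""
    parent = {child: p for p in root.iter() for child in p}

    def path_from_root(elem):
        path = [elem]
        while elem in parent:
            elem = parent[elem]
            path.append(elem)
        return path[::-1]

    def holders(term):
        return [e for e in root.iter() if term in (e.text or "")]

    best = None
    for ea in holders(a):
        for eb in holders(b):
            pa, pb = path_from_root(ea), path_from_root(eb)
            common = 0
            for x, y in zip(pa, pb):
                if x is not y:
                    break
                common += 1
            dist = (len(pa) - common) + (len(pb) - common)
            if best is None or dist < best:
                best = dist
    return best


# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    "<doc><sec><p>word proximity</p><p>character search</p></sec>"
    "<sec><p>grove</p></sec></doc>"
)
root = ET.fromstring(SAMPLE)
same_section = tree_distance(root, "word", "character")
other_section = tree_distance(root, "word", "grove")
```

The tree measure never splits the text into words; it only asks which element
a term sits in, which is why it does not depend on the language.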
Jonathan Borden
JABR Technology
http://jabr.ne.mediaone.net
Tim Bray wrote:
>
>
> At 12:27 PM 11/5/98 -0000, Michael Kay wrote:
> >Switching threads, I am a little surprised by Tim's remarks on word
> >proximity versus character proximity. Confining our attention to European
> >languages (as most search engines do), word proximity searching is a common
> >feature of the high-end search engines, whereas character proximity is
> >hardly found outside basic desktop tools like grep.
>
> What I said was:
> 1. I have not seen any research which demonstrates that word proximity
> achieves better results than character proximity based on any
> well-known IR metric.
> 2. Doing word proximity at all is a *very* hard problem in the languages
> used by a large majority of the world's population.
>
> >Apart from anything
> >else, once you've done the word normalisation (normalising different
> >linguistic forms or spellings of the same word), character proximity is
> >meaningless. In the older boolean engines word proximity is used rather
> >mechanistically, in the newer engines it is used more subtly as part of a
> >statistical or linguistic approach to relevance ranking
>
> If you go poking around either in the SIGIR world (that would be the
> Association for Computing Machinery's Special Interest Group on
> Information Retrieval) or in the actual commercial retrieval engine
> world, you find a distressing lack of technology progress. Yes, with
> modern engines, precision & recall are measurably better than they
> were in 1978. But 10 times as good? Hah! Twice as good? Maybe,
> for certain restricted application domains. Given all this, I'm
> less than impressed about the subtle techniques of modern engines.
> On top of which, most of the techniques used in the "advanced" engines
> are basically Anglocentric and fall apart once you get outside the
> English-speaking world.
>
> > but either way it
> >is an established feature of the scene, and it is not there on whim: the
> >search algorithms used are based on extensive research and benchmarking of
> >relevance and recall scores.
>
> Yeah, well, it's *not* an established feature of the scene in Asia. Maybe
> it's just an irrational prejudice, but I'm not all that interested in
> computing techniques that are not usable by a large majority of the
> world's population. And once again, I challenge the assertion that,
> for all these clever heuristics, real-world retrieval software is
> really much better than it was 20 years ago.
>
> >An interesting comparison of web search engines is at
> >http://www.netstrider.com/search/features.html ; this asserts that all the
> >well-known web search engines other than Lycos use word proximity matching.
>
> And we know what wonderful results they produce (that's in English; for
> real joy, go try a tricky query in German - even European languages sometimes
> leave out the spaces between the words - and see what happens). -Tim
>
> PS: Given my grouchy tone, I should say that I'm dazzled at the
> inventiveness, deep thought, and creativity that have been invested
> in the IR field in recent decades. The fact the results are so
> underwhelming is evidence of how hard the problems are... the real
> lesson is that we should marvel at the language-processing apparatus
> we carry around between our ears. -T
>