Re: [Koha-zebra] Koha Zebra Searching Report (from NPL)


From: Sebastian Hammer
Subject: Re: [Koha-zebra] Koha Zebra Searching Report (from NPL)
Date: Wed, 22 Mar 2006 22:43:40 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)

Joshua Ferraro wrote:

On Wed, Mar 22, 2006 at 08:28:26PM -0500, Sebastian Hammer wrote:
Can't do XOR today. I suppose it would be a possible new feature, but I've frankly never heard of it in an ILS.. can a XOR b be mapped to (a OR b) NOT (a AND b)? Or am I just showing my fading math skills to ill effect, here?
Yep, that's the correct mapping. Voyager's where NPL originally
saw the XOR function.
Ok. It can be faked in the front-end then, or implemented deeper in the guts of Zebra.
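For what it's worth, faking it in the front-end could look roughly like the sketch below. It assumes the front-end builds PQF (prefix query) strings, where @not is the binary AND-NOT operator; the function names are hypothetical, not anything in Koha or Zebra.

```python
def quote(term: str) -> str:
    """Quote a term for PQF if it contains whitespace (hypothetical helper).

    Embedded double quotes are not escaped here; a real front-end would
    need to handle them.
    """
    return f'"{term}"' if " " in term else term


def xor_to_pqf(a: str, b: str) -> str:
    """Rewrite 'a XOR b' as (a OR b) NOT (a AND b) in PQF prefix notation.

    PQF's @not takes two operands and means "left AND NOT right", so the
    left operand is (a OR b) and the right operand is (a AND b).
    """
    return f"@not @or {a} {b} @and {a} {b}"


print(xor_to_pqf(quote("cats"), quote("dogs")))
# prints: @not @or cats dogs @and cats dogs
```

The operands can themselves be attribute-qualified PQF sub-queries, as long as the same sub-query is passed in both places.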

Why do you see yourself limited to Bib-1? Within Koha, you can do whatever you want -- specifically extend Bib-1 into the 8000-range (IIRC) for local USE attributes or define a private set.
Right, I was just hoping there was some way to map it to bib-1 as
I assume that would be useful in cross-domain searching. If not we
can certainly do a locally defined attribute or set.
I think beyond what's in the Bath profile or the US national profile, you have little hope of interoperable search.. in my experience, cross-domain searching still entails the need to do query-mapping independently per target or for groups of targets with similar characteristics. I use the CCL parser that's available through the YAZ ZOOM API, and include a reference to a set of mapping directives as part of the configuration for each target.. that allows you to get pretty far towards an interoperable-feeling search with a minimum of code.
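To illustrate the per-target mapping idea (this is not the YAZ CCL parser, just the same principle in plain Python): each target gets its own table from a front-end field name to the attributes that target expects. The Bib-1 use values below (4 = title, 1 = personal name, 1003 = author, 1016 = any) are standard; the target names and which target wants which attributes are invented for the example.

```python
# Hypothetical per-target mapping tables: field name -> PQF attribute prefix.
TARGET_MAPPINGS = {
    "target-a": {
        "ti": "@attr 1=4",
        "au": "@attr 1=1003",
        "any": "@attr 1=1016",
    },
    "target-b": {
        "ti": "@attr 1=4 @attr 4=1",  # this target wants an explicit phrase structure
        "au": "@attr 1=1",            # and only indexes personal names
        "any": "@attr 1=1016",
    },
}


def map_query(target: str, field: str, term: str) -> str:
    """Translate one fielded search term into PQF for the given target."""
    mapping = TARGET_MAPPINGS[target]
    if field not in mapping:
        raise KeyError(f"target {target!r} has no mapping for field {field!r}")
    return f'{mapping[field]} "{term}"'


for name in TARGET_MAPPINGS:
    print(name, "->", map_query(name, "au", "joyce, james"))
```

The user types the same query everywhere; only the table changes per target or per group of similar targets.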

This would, I believe, require new development. It's possible that one of the experimental ranking algorithms that are included might provide better results for these people, but I *think* that boosting the score for one field in a ranked keyword search would require an extension to the index structure.
I've looked high and low for documentation on the ranking algorithms in
Zebra but haven't found much more than a few sentences in the official
docs and some list messages ...
It isn't documented beyond what's in the code, AFAIK.

AUTHOR SEARCHING

Again, the current relevance ranking doesn't quite cut it. A good
example is a relevance ranked author search on "James Joyce". Some
records sneak into high relevance because they have multiple authors
with names like "James Henry" and "Paul Joyce" (take "Bob the Builder"
in the NPL database as an example).

It might be worth checking whether one of the custom ranking algos did better on this.. you can look in the NEWS file for instructions on how to enable them.
Will do.

relevance ranking
should account for proximity and use that as the highest ranking
consideration to ensure that a search on "James Joyce" returns all the
books by "James Joyce" first. Also, they requested that the default
ranking secondarily sort the items by date as well because they often are asked to find the 'latest' book by so and so. We concluded that the copyright date stored in the 008 is probably the only date normalized enough to use for sorting though I'm not sure if zebra can use that for sorting.


It could with the XSLT index rules of Zebra 1.4.
Cool, and are there docs on that somewhere? :-)
There will be by the time Zebra 1.4 is released. For now, it's pre-release stuff. However, the CVS version of Zebra contains an example setup under examples/alvis-oai/conf. I think for really gnarly indexing schemes, this is probably the wave of the future, since it's pretty much infinitely flexible. It should also be pretty easy to perl-map one of the existing ABS files into this format.

Same thing. I don't know how hard it would be to add a score for proximity.. that data is at least in the index structure, but I've no idea how hard it would be to fit into the code. We can ask the Zebra wranglers what it would entail if you're interested.
Yes, please do, we're very interested in that particular one.
Ok.
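Until something like that exists inside Zebra, the requested ordering can at least be prototyped on a retrieved result set. The sketch below is not Zebra's ranking code; it is a front-end re-sort under the assumption that each record has already been reduced to a dict with an "authors" list and an integer "year" (both names, and the sample records, are made up).

```python
from typing import Dict, List


def has_adjacent_terms(name: str, terms: List[str]) -> bool:
    """True if all query terms sit next to each other, in any order, in one name."""
    words = [w.strip(",.").lower() for w in name.split()]
    wanted = sorted(t.lower() for t in terms)
    n = len(wanted)
    return any(sorted(words[i:i + n]) == wanted
               for i in range(len(words) - n + 1))


def rerank(records: List[Dict], terms: List[str]) -> List[Dict]:
    """Records with a proximity match in a single author first, then newest first."""
    def key(rec: Dict):
        prox = any(has_adjacent_terms(a, terms) for a in rec["authors"])
        # False sorts before True, so "not prox" puts proximity hits first;
        # the negated year puts newer records ahead of older ones.
        return (not prox, -rec["year"])
    return sorted(records, key=key)


# Invented sample data for illustration only.
records = [
    {"title": "Record A", "authors": ["Henry, James", "Joyce, Paul"], "year": 2003},
    {"title": "Record B", "authors": ["Joyce, James"], "year": 1914},
    {"title": "Record C", "authors": ["Joyce, James"], "year": 1922},
]
for rec in rerank(records, ["James", "Joyce"]):
    print(rec["year"], rec["title"])
```

With this data, Record A drops to the bottom even though both terms occur in it, because they never occur in the same author name.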

SUBJECT HEADING SEARCH

NPL would like to see a demonstration of a 'Subject Heading' search
using authorities generated from the data to compile a list of
authoritative headings (which would be compiled from multiple fields
within a given subject tag such as $650$a$v$x, etc.). So I think to do this right we'd need to look at putting our authority records
in Zebra as well.

Hmm. Not sure I fully grok the requirement here.. you seem to suggest both constructing a specific index key based on a concatenation of multiple fields (easy in the XSLT indexing rules of 1.4, but not compatible with the 'melm' directive).
I'm unclear about the differences between 'elm' and 'melm'. The docs
seem to indicate that they are the same...
They are actually described as being quite different, but I can see how the nature of the difference could be more clear.

The 'elm' directive is the original.. its parameter structure is based on the way that Z39.50 abstract record models were typically represented in the old days.. hence the weird ordering of elements, etc. It also has the limitation that you can't address attributes, because the old Z39.50 record model didn't have attributes. The xelm directive was introduced to fix that.. it allows you to express tag paths in the XPATH style, and to address attributes, either in [predicates] or directly, for indexing.

The usmarc.abs file that comes with Zebra assumes that records were ingested in ISO2709 using the record type grs.marc.<absfilename>. The grs.marc input filter actually generates an internal abstract structure which is incompatible with MARCXML.. it looks more like <245><11><a>content</a></11></245>. When MARCXML came along it became clear that it'd be nicer to work with that.. so the grs.marcxml input filter was introduced to parse ISO2709 records and map them internally to MARCXML. Of course, if you're starting with MARCXML, you can just use grs.xml with the same effect.

But now the old usmarc.abs file won't work anymore, because MARCXML is all about attributes for field names and subfield codes, and the 'elm' directive can't handle that... in fact, to index 245$a, you'd have to write something like

xelm /*/datafield[@tag='245']/subfield[@code='a']     title

At some point, we got a bit of money from the LoC to develop a simple set of Bath level 0 indexing rules for Zebra.. I started working on that, but got so fed up with the syntax above that I rebelled and implemented the 'melm' directive (and it takes a lot for me to touch the innards of Zebra, in my old days), so instead of the above, I could write

melm 245$a  title

Which is totally equivalent to the above, but nice and to the point.. however, none of these mechanisms allows you to construct phrase indexes that span multiple subfields.. and they don't allow you to do cool stuff like extract a date from the guts of 008... in fact, there are lots of situations where you'd like to do some form of massaging on the input before processing. In the past, I would sometimes translate MARC records to an ASCII-line based format, and use the magic of the regexp input filters (http://www.indexdata.com/zebra/doc/record-model.tkl#id2530050) to massage the data at index/retrieval time... because I can write Tcl code in the input filters to do stuff to the data, the sky is the limit.. but, because I have to write Tcl code to accomplish anything, I become sad and gray-haired. So when I build applications on Zebra these days, I am more likely to do some form of preprocessing of the records in Perl or similar BEFORE feeding them to Zebra.. not very satisfying, but it brings home the bacon.
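As an illustration of that kind of preprocessing (in Python here rather than Perl), the sketch below pulls Date 1 out of character positions 07-10 of the 008 and builds one phrase key per 650 across several subfields. It assumes MARCXML input in the standard MARC21 slim namespace; the sample record, the function names, the subfield selection and the " -- " separator are all made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, just enough fields to exercise the two helpers.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="008">060322s1922    fr            000 1 eng d</controlfield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Artists</subfield>
    <subfield code="z">Ireland</subfield>
    <subfield code="v">Fiction</subfield>
  </datafield>
</record>"""


def date1_from_008(record: ET.Element) -> str:
    """Return Date 1 (character positions 07-10 of the 008), or '' if absent."""
    field = record.find("marc:controlfield[@tag='008']", NS)
    text = field.text if field is not None and field.text else ""
    return text[7:11]


def subject_phrases(record: ET.Element, codes: str = "avxyz") -> list:
    """Join the selected subfields of each 650 into a single phrase key."""
    phrases = []
    for df in record.findall("marc:datafield[@tag='650']", NS):
        parts = [sf.text.strip() for sf in df.findall("marc:subfield", NS)
                 if sf.get("code") in codes and sf.text]
        if parts:
            phrases.append(" -- ".join(parts))
    return phrases


record = ET.fromstring(SAMPLE)
print(date1_from_008(record))   # prints: 1922
print(subject_phrases(record))  # prints: ['Artists -- Ireland -- Fiction']
```

The extracted values would then be written into whatever extra elements the indexing rules are set up to pick up before the record is handed to Zebra.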

Well, in Zebra 1.4, XSLT comes to the rescue, in a way that only XSLT can do it, with lots of angular brackets and much verbosity.... for instance, in an XSLT index filter,

melm 245$a title:w

becomes

<xsl:template match="marc:record/marc:datafield[@tag='245']/marc:subfield[@code='a']">
 <z:index name="title" type="w">
   <xsl:value-of select="."/>
 </z:index>
</xsl:template>

Eek.

But of course the magic of that is that you could put just about anything you could possibly imagine instead of that simple <xsl:value-of> in the middle... using substring() to extract a date from 008, a code from the leader, combining subfields, doing math, looking stuff up in supporting tables, etc... the sky is the limit, and I'd prefer this to programming in Tcl anytime. And of course, if you want a more compact configuration file, you could write something like

<koha:melm field="245$a" index="title:w"/>

and use XSLT to map that into the diatribe above before sending it to Zebra.. we might even offer some options like that as part of the software down the road. In addition to the stylesheet which maps records to 'index documents' like above, Zebra 1.4 can be configured to support multiple retrieval schemas (i.e. DC, MODS, MARCXML), simply by providing stylesheets for each desired schema -- the translation is done on the fly when records are retrieved.
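The compact-to-verbose expansion does not have to be done in XSLT. As a rough sketch in Python, assuming the compact form is a field spec like 245$a plus an index spec like title:w (the function name is hypothetical, and the output is only the template fragment, without the enclosing stylesheet or namespace declarations):

```python
TEMPLATE = """<xsl:template match="marc:record/marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']">
  <z:index name="{name}" type="{type}">
    <xsl:value-of select="."/>
  </z:index>
</xsl:template>"""


def expand(field: str, index: str) -> str:
    """Expand e.g. field='245$a', index='title:w' into an XSLT index template."""
    tag, sep, code = field.partition("$")
    if not (sep and len(tag) == 3 and tag.isdigit() and len(code) == 1):
        raise ValueError(f"expected a field like 245$a, got {field!r}")
    name, sep, type_ = index.partition(":")
    if not (sep and name and type_):
        raise ValueError(f"expected an index like title:w, got {index!r}")
    return TEMPLATE.format(tag=tag, code=code, name=name, type=type_)


print(expand("245$a", "title:w"))
```

Running a list of such specs through the function and wrapping the results in a stylesheet element would give a generated index filter from a short, melm-like configuration.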

--Sebastian


Thanks!


--
Sebastian Hammer, Index Data
address@hidden   www.indexdata.com
Ph: (603) 209-6853




