
[Koha-devel] switching from marc_words to zebra [LONG]


From: Paul POULAIN
Subject: [Koha-devel] switching from marc_words to zebra [LONG]
Date: Mon Jul 4 11:10:06 2005
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050317)

As discussed with Joshua on IRC, here are my views on how to move from the 2.2 MySQL-based DB to a 3.0 Zebra-based DB:

2.2 structure :
===============
In Koha 2.2, there are a lot of tables to manage biblios:
* biblio / items / biblioitems / additionalauthors / bibliosubtitle / bibliosubject : they contain the data in a "decoded" form, i.e. independent of the MARC flavour. A title is called "biblio.title", not NNN$x, where NNN is 245 in MARC21 and 200 in UNIMARC! The primary key for a biblio is biblio.biblionumber.

* marc_biblio : a table that contains only a few pieces of information:
- biblionumber (biblio PK)
- bibid (MARC PK; a design mistake I made, for sure)
- frameworkcode (used to know which cataloguing form Koha must use; see marc_*_structure below)

* marc_subfield_table : this table contains the MARC data, one row per subfield, with "order" columns to keep track of MARC field & subfield order (to be sure repeated fields are retrieved in the right order).
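To make the "order" mechanism concrete, here is a small Python sketch (not the actual Perl code; the row layout is an assumption based on the description above) of rebuilding ordered fields from subfield rows:

```python
from itertools import groupby

# Hypothetical rows from marc_subfield_table: one tuple per subfield,
# (tag, tagorder, subfieldcode, subfieldorder, value).
rows = [
    ("200", 1, "a", 1, "Title"),
    ("200", 1, "f", 2, "Author"),
    ("606", 1, "a", 1, "Subject one"),
    ("606", 2, "a", 1, "Subject two"),   # repeated field, kept in order
]

def rows_to_fields(rows):
    """Group subfield rows back into ordered (tag, subfields) fields."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[3]))
    fields = []
    for (tag, tagorder), grp in groupby(ordered, key=lambda r: (r[0], r[1])):
        fields.append((tag, [(r[2], r[4]) for r in grp]))
    return fields
```

Without the order columns, the two repeated 606 fields (and the subfields inside each field) could come back shuffled.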

* marc_word : the -huge- table that is the index for searches. The structure works correctly for small to medium data sizes, but around 50,000 complete MARC biblios is the upper limit the system can handle.

* marc_*_structure (* being tag & subfield) : the table where the library defines how its MARC flavour works. For each field/subfield it contains a lot of information: what the field/subfield contains ("1st responsibility statement"), where to put it in a given framework (in the MARC editor), whether the value must be copied to the "decoded" part of the DB (200$f => biblio.author in UNIMARC), and what kind of constraint applies during typing (for example, showing a list of possible values).

2.2 DB API
==========
The DB API is located in C4/Biblio.pm for biblio/items management & C4/SearchMarc.pm for search tools.

The key phrase here is: "heavy use of MARC::Record".

All biblios are stored in a MARC::Record object, as are items. Just be warned that all item information must be in a single MARC tag, so an item's MARC::Record contains only one MARC::Field. In UNIMARC it's usually the 995 field, and in MARC21 the 952. (It can be any field in Koha, but all item info must be in the same field, and that field must contain only item info.)


Biblio.pm :
------------
All record creation/modification is done through the NEWxxxyyyy subs.
perldoc C4/Biblio.pm will give you some info on how they work.

In a few words: when adding a biblio, Koha calls NEWnewbiblio with a MARC::Record. NEWnewbiblio calls MARCnewbiblio to handle MARC storage, then MARCmarc2koha to build the structure for the non-MARC part of the DB, then OLDnewbiblio to create the non-MARC (decoded) part of the biblio.
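That call chain can be sketched like this (a Python sketch with stand-in storage, not the real Perl code; the decoding rule shown is just one example):

```python
marc_store, biblio_table = [], []   # stand-ins for the marc_* and biblio tables

def MARCnewbiblio(record):
    """Store the MARC part (marc_biblio / marc_subfield_table in 2.2)."""
    marc_store.append(record)
    return len(marc_store)  # bibid, the separate MARC primary key

def MARCmarc2koha(record):
    """Decode the MARC record into a flavour-independent hash."""
    # e.g. 245$a in MARC21, 200$a in UNIMARC -> biblio.title
    return {"title": record.get("245a", record.get("200a"))}

def OLDnewbiblio(koha):
    """Store the decoded part (biblio / biblioitems tables)."""
    biblio_table.append(koha)
    return len(biblio_table)  # biblionumber, the biblio primary key

def NEWnewbiblio(record):
    MARCnewbiblio(record)           # 1. MARC storage
    koha = MARCmarc2koha(record)    # 2. decode MARC -> plain hash
    return OLDnewbiblio(koha)       # 3. decoded (non-MARC) storage
```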

SearchMarc.pm :
---------------
The sub that does the search is catalogsearch. The sub parameters:
 $dbh, => DB handle
 $tags, => array of MARC tags/subfields: ['200a','600f']
 $and_or, => array with 'and' or 'or' for each search term
 $excluding, => array with 'not' for each search term that must be excluded
 $operator, => =, <=, >=, ..., contains, start
 $value, => the values to search for; each word can end with * or %
 $offset, => the offset in the complete list, to return only the needed slice
 $length, => the number of results to return
 $orderby, => how to order the results
 $desc_or_asc, => order asc or desc
 $sqlstring => an alternate SQL string that can replace tags/and_or/excluding/operator

catalogsearch retrieves a list of bibids, then, for each bibid to return, finds the interesting values (the values shown in the result list). This includes the complete item status (available, issued, not for loan...)
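To visualise how those parallel arrays could combine, here is a rough Python sketch of query building against marc_word (the tagsubfield/word column names are assumptions, and a real implementation would use bind placeholders rather than string interpolation):

```python
def build_where(tags, and_or, excluding, operator, values):
    """Combine catalogsearch's parallel arrays into one SQL WHERE clause
    over a marc_word-style table (illustrative only)."""
    clauses = []
    for i, (tag, val) in enumerate(zip(tags, values)):
        op = operator[i]
        if op == "contains":
            cond = f"(tagsubfield = '{tag}' AND word LIKE '%{val}%')"
        elif op == "start":
            cond = f"(tagsubfield = '{tag}' AND word LIKE '{val}%')"
        else:
            cond = f"(tagsubfield = '{tag}' AND word {op} '{val}')"
        if excluding[i]:
            cond = f"NOT {cond}"
        clauses.append(cond if i == 0 else f"{and_or[i]} {cond}")
    return " ".join(clauses)
```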

move to ZEBRA
=============

DB structure : biblio handling
------------------------------
I think we can remove marc_biblio, marc_subfield_table, and (of course) marc_word.

marc_biblio contains only one important piece of information, the framework (used to know which cataloguing form Koha must use in the MARC editor). It can be moved to the biblio table.

marc_subfield_table contains the MARC data. We could either store it in raw iso2709 format in the biblio table, or only in Zebra. I suspect it's better to store it twice (in Zebra AND in the biblio table): when you do a search (in Zebra), you enter a query and get a list of results. This list can be built with the data returned by Zebra. Then the user clicks on a given biblio to see the detail. Here, we can read the raw MARC from the biblio table, avoiding a useless Zebra call (at the price of an SQL query, but it's based on the primary key, so as fast as possible).
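The "read the detail from MySQL, not Zebra" step could be as simple as this (a Python/SQLite sketch; the raw-MARC column name "marc" is an assumption):

```python
def biblio_detail(biblionumber, cursor):
    """Fetch the raw iso2709 record from the biblio table by primary key,
    instead of asking Zebra again for a record we already have."""
    cursor.execute(
        "SELECT marc FROM biblio WHERE biblionumber = ?", (biblionumber,)
    )
    row = cursor.fetchone()
    return row[0] if row else None
```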

marc_word is completely dropped, as searching is done with Zebra.

DB structure : items handling
-----------------------------
Item info can be stored in the same structure as biblios: save the raw item MARC data in the items table.

Koha <=> Zebra
--------------
It should really not be a pain to move to Zebra with this structure: every call with a MARC::Record (the NEWxxxxyyyy subs) manages the storing of the MARC::Record in the marc_* tables. We could replace this code with a Zebra insert/update, using biblio.biblionumber as the primary key.

How to manage biblios and items? My idea here would be to store the biblio + all item information in Zebra, using a full MARC::Record that contains both. When NEWnewitem (or NEWmoditem) is called, the full biblio MARC::Record is rebuilt from the biblio MARC::Record and all item MARC::Records, then updated in Zebra. Updating Zebra every time an item is modified can consume a little CPU, but it should not be much, as in libraries biblios & items don't change that often.

So we would have :
NEWnewbiblio :
* create biblio/biblioitems table entry (including MARC record in raw format)
* create zebra entry, with the provided Perl API.

NEWnewitem :
* create the items entry (including the MARC record in raw format)
* read biblio MARC record & previously existing items // append new item // update zebra entry with the provided Perl API.

NEWmodbiblio :
* modify the biblio entry (in biblio/biblioitems table)
* read the full MARC record (including items) // update Zebra entry

NEWmoditem :
* modify the item entry (in items table)
* read the full MARC record (including items) // update Zebra entry


Note that this relies on Zebra returning results in iso2709. We could use XML or other fancy, high-tech possibilities. But Koha makes heavy use of MARC::Record, so we don't need to reinvent the wheel. What is great with Zebra is that we can index iso2709 data but show users whatever we want (including XML). So Koha internals can be whatever we like ;-)

The MARC Editor
===============
Some users think the Koha MARC editor could be improved. The best solution would be, imho, to provide an API so a library can use an external MARC editor if it prefers. However, some libraries are happy with what exists, so the MARC editor should be kept (& improved where possible), and the marc_*_structure tables are still needed. Some fields could probably be removed, as they are related to search (like seealso) and will be handled by the Zebra config file. This still has to be investigated.

For libraries that prefer an external MARC editor, we could create a web service where the user makes an HTTP request with the iso2709 data in the parameters, plus the requested operation. This should be quite easy to do (the problem being how the external software can handle this; if someone has an idea or experience with this, feel free to post here ;-) )
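The dispatch side of such a web service could look like this (a hypothetical Python sketch; the "op" parameter and the operation names are assumptions, not an existing Koha interface):

```python
from urllib.parse import urlparse, parse_qs

def handle_marc_request(path, raw_iso2709):
    """Dispatch an incoming request: the 'op' query parameter names the
    operation, the request body carries the raw iso2709 record."""
    op = parse_qs(urlparse(path).query).get("op", [None])[0]
    if op not in ("newbiblio", "modbiblio", "delbiblio"):
        return (400, "unknown operation")
    # A real implementation would build a MARC record from raw_iso2709
    # and call the matching NEWxxx sub here.
    return (200, f"{op}: {len(raw_iso2709)} bytes")
```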

Data search
===========
I won't say much about search, as someone else has taken the ball on this ;-) I just think SearchMarc.pm should be deeply modified! As all the information will be in Zebra, it can use only the Zebra search API.

A question remains:
in a biblio/item, the item status (issued, transferred, returned, reserved, waiting...) changes quite often. So is it better to save the status in the Zebra DB, and thus update the Zebra entry (biblio+items) every time an item status is modified, or to keep this information only in the items/reserves/issues tables & read it from MySQL every time it's needed? An open question that the Zebra guys can probably answer. NPL, for example, has 600,000 issues per year (and hopefully 600,000 returns ;-) ), plus some (how many?) reserves, branch transfers...

The authority problem
=====================
Authorities have to be linked to the biblios that use them, so that when an authority is modified, all biblios using it are automatically updated (see the script misc/merge_authority.pl in Koha CVS & 2.2.x).

To keep track of the link, Koha uses a $9 local subfield. In UNIMARC, the $3 can also be used for this. I don't know if something equivalent to $3 exists in MARC21 (I could not find information on http://www.loc.gov/marc/). Many scripts make heavy use of the marc_subfield_table $9 data. For example, when you find an authority in the authority module, you get the number of biblios using this authority. This number is calculated with an SQL query on the $9 subfield.
To handle this with Zebra, we have 2 solutions:
- create a table with just the link (biblionumber / authority number) that we could query
- query Zebra for the exact $9 subfield value
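The first solution would reduce the "count biblios using this authority" question to a trivial SQL query (a Python/SQLite sketch; the auth_link table and its columns are hypothetical):

```python
def count_biblios_using(authid, cursor):
    """How many biblios are linked to this authority?"""
    cursor.execute(
        "SELECT COUNT(*) FROM auth_link WHERE authid = ?", (authid,)
    )
    return cursor.fetchone()[0]
```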

I don't know Zebra well enough to be sure of the best way to do it. Any suggestion/experience welcome.

The authority problem (another one...)
======================================
Authorities are MARC::Records too... (without items)
So they also have auth_structure & auth_word tables & all the infrastructure that biblios have (except the item level, as there are no "authority" items). So we could imagine having 2 Zebra databases: one for biblios and one for authorities. Everything said previously in this mail applies here too. That's something we could investigate after moving MARC biblios to Zebra, as we would then have more experience with this tool.

"Trivial" querying
==================
Someone may ask "why should we keep the biblio/biblioitems/items tables?", as everything is in Zebra. First, as Koha is multi-MARC, remember that it's very complex to know what a "title" is just from a MARC record. The same person will say "yes, but with Biblio/MARCmarc2koha, you can transform your MARC::Record into a semantically meaningful hash".

To which I answer:
Yes, but without those tables, SQL querying of the database would be completely impossible for developers, as we could not check in MySQL whether "the authors were filled by bulkmarcimport", or "the itemcallnumber was correctly modified for item #158763".
That's a second reason to keep those tables in MySQL.

--
Paul POULAIN
Independent free software consultant
French-speaking Koha coordinator (free ILS, http://www.koha-fr.org)


