
[Koha-devel] switching from marc_words to zebra [LONG]


From: Paul POULAIN
Subject: [Koha-devel] switching from marc_words to zebra [LONG]
Date: Mon Jul 4 11:10:06 2005
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050317)

As discussed with Joshua on IRC, here are my views on how to move from the 2.2 MySQL-based DB to a 3.0 Zebra-based DB:

2.2 structure :
===============
In Koha 2.2, there are a lot of tables to manage biblios:
* biblio / items / biblioitems / additionalauthors / bibliosubtitle / bibliosubject : they contain the data in a "decoded" form, i.e. independent of the MARC flavour. A title is called "biblio.title", not NNN$x, where NNN is 245 in MARC21 and 200 in UNIMARC! The primary key for a biblio is biblio.biblionumber.

* marc_biblio : a table that contains only a few pieces of information:
- biblionumber (biblio PK)
- bibid (MARC PK; a design mistake I made, for sure)
- frameworkcode (used to know which cataloguing form Koha must use; see marc_*_structure below)

* marc_subfield_table : this table contains the MARC data, one row per subfield, with "order" columns to keep track of MARC field & subfield order (to be sure repeated fields are retrieved in the right order).
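To make the "order" mechanism concrete, here is a small Python sketch (not the actual Perl code; the row layout is an assumption based on the description above) of rebuilding ordered fields from subfield rows:

```python
from itertools import groupby

# Hypothetical rows from marc_subfield_table: one tuple per subfield,
# (tag, tagorder, subfieldcode, subfieldorder, value).
rows = [
    ("200", 1, "a", 1, "Title"),
    ("200", 1, "f", 2, "Author"),
    ("606", 1, "a", 1, "Subject one"),
    ("606", 2, "a", 1, "Subject two"),   # repeated field, kept in order
]

def rows_to_fields(rows):
    """Group subfield rows back into ordered (tag, subfields) fields."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[3]))
    fields = []
    for (tag, tagorder), grp in groupby(ordered, key=lambda r: (r[0], r[1])):
        fields.append((tag, [(r[2], r[4]) for r in grp]))
    return fields
```

Without the order columns, the two repeated 606 fields (and the subfields inside each field) could come back shuffled.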

* marc_word : the -huge- table that is the index for searches. The structure works correctly for small to medium data sizes, but around 50,000 complete MARC biblios is the upper limit the system can handle.

* marc_*_structure (* being tag & subfield) : the table where the library defines how its MARC flavour works. For each field/subfield it contains a lot of information: what the field/subfield contains ("1st responsibility statement"), where to put it in a given framework (in the MARC editor), whether the value must be copied to the "decoded" part of the DB (200$f => biblio.author in UNIMARC), and what kind of constraint applies during typing (for example, showing a list of possible values).

2.2 DB API
==========
The DB API is located in C4/Biblio.pm for biblio/items management & C4/SearchMarc.pm for search tools.

The key phrase here is: "heavy use of MARC::Record".

All biblios are stored in a MARC::Record object, as are items. Just be warned that all item information must be in a single MARC tag, so an item's MARC::Record contains only one MARC::Field. In UNIMARC it's usually the 995 field, and in MARC21 the 952. (It can be any field in Koha, but all item info must be in the same field, and that field must contain only item info.)


Biblio.pm :
------------
All record creation/modification is done through the NEWxxxyyyy subs.
perldoc C4/Biblio.pm will give you some info on how they work.

In a few words: when adding a biblio, Koha calls NEWnewbiblio with a MARC::Record. NEWnewbiblio calls MARCnewbiblio to handle MARC storage, then MARCmarc2koha to build the structure for the non-MARC part of the DB, then OLDnewbiblio to create the non-MARC (decoded) part of the biblio.
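That call chain can be sketched like this (a Python sketch with stand-in storage, not the real Perl code; the decoding rule shown is just one example):

```python
marc_store, biblio_table = [], []   # stand-ins for the marc_* and biblio tables

def MARCnewbiblio(record):
    """Store the MARC part (marc_biblio / marc_subfield_table in 2.2)."""
    marc_store.append(record)
    return len(marc_store)  # bibid, the separate MARC primary key

def MARCmarc2koha(record):
    """Decode the MARC record into a flavour-independent hash."""
    # e.g. 245$a in MARC21, 200$a in UNIMARC -> biblio.title
    return {"title": record.get("245a", record.get("200a"))}

def OLDnewbiblio(koha):
    """Store the decoded part (biblio / biblioitems tables)."""
    biblio_table.append(koha)
    return len(biblio_table)  # biblionumber, the biblio primary key

def NEWnewbiblio(record):
    MARCnewbiblio(record)           # 1. MARC storage
    koha = MARCmarc2koha(record)    # 2. decode MARC -> plain hash
    return OLDnewbiblio(koha)       # 3. decoded (non-MARC) storage
```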

SearchMarc.pm :
---------------
The sub that does the search is catalogsearch. The sub parameters:
 $dbh, => DB handle
 $tags, => array of MARC tags/subfields: ['200a','600f']
 $and_or, => array with 'and' or 'or' for each search term
 $excluding, => array with 'not' for each search term that must be excluded
 $operator, => =, <=, >=, ..., contains, start
 $value, => the values to search for; each word can end with * or %
 $offset, => the offset in the complete list, to return only the needed slice
 $length, => the number of results to return
 $orderby, => how to order the results
 $desc_or_asc, => order asc or desc
 $sqlstring => an alternate SQL string that can replace tags/and_or/excluding/operator

catalogsearch retrieves a list of bibids, then, for each bibid to return, finds the interesting values (the values shown in the result list). This includes the complete item status (available, issued, not for loan...)
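To visualise how those parallel arrays could combine, here is a rough Python sketch of query building against marc_word (the tagsubfield/word column names are assumptions, and a real implementation would use bind placeholders rather than string interpolation):

```python
def build_where(tags, and_or, excluding, operator, values):
    """Combine catalogsearch's parallel arrays into one SQL WHERE clause
    over a marc_word-style table (illustrative only)."""
    clauses = []
    for i, (tag, val) in enumerate(zip(tags, values)):
        op = operator[i]
        if op == "contains":
            cond = f"(tagsubfield = '{tag}' AND word LIKE '%{val}%')"
        elif op == "start":
            cond = f"(tagsubfield = '{tag}' AND word LIKE '{val}%')"
        else:
            cond = f"(tagsubfield = '{tag}' AND word {op} '{val}')"
        if excluding[i]:
            cond = f"NOT {cond}"
        clauses.append(cond if i == 0 else f"{and_or[i]} {cond}")
    return " ".join(clauses)
```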

move to ZEBRA
=============

DB structure : biblio handling
------------------------------
I think we can remove marc_biblio, marc_subfield_table, and (of course) marc_word.

marc_biblio contains only one important piece of information, the framework (used to know which cataloguing form Koha must use in the MARC editor). It can be moved to the biblio table.

marc_subfield_table contains the MARC data. We could either store it in raw iso2709 format in the biblio table, or only in Zebra. I suspect it's better to store it twice (in Zebra AND in the biblio table): when you do a search (in Zebra), you enter a query and get a list of results. This list can be built with the data returned by Zebra. Then the user clicks on a given biblio to see the detail. Here, we can read the raw MARC from the biblio table, avoiding a useless Zebra call (at the price of an SQL query, but it's based on the primary key, so as fast as possible).
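The "read the detail from MySQL, not Zebra" step could be as simple as this (a Python/SQLite sketch; the raw-MARC column name "marc" is an assumption):

```python
def biblio_detail(biblionumber, cursor):
    """Fetch the raw iso2709 record from the biblio table by primary key,
    instead of asking Zebra again for a record we already have."""
    cursor.execute(
        "SELECT marc FROM biblio WHERE biblionumber = ?", (biblionumber,)
    )
    row = cursor.fetchone()
    return row[0] if row else None
```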

marc_word is completely dropped, as searching is done with Zebra.

DB structure : items handling
-----------------------------
Item info can be stored in the same structure as biblios: save the raw item MARC data in the items table.

Koha <=> Zebra
--------------
It should really not be a pain to move to Zebra with this structure: every call with a MARC::Record (the NEWxxxxyyyy subs) manages the storing of the MARC::Record in the marc_* tables. We could replace this code with a Zebra insert/update, using biblio.biblionumber as the primary key.

How to manage biblios and items? My idea here would be to store the biblio + all item information in Zebra, using a full MARC::Record that contains both. When NEWnewitem (or NEWmoditem) is called, the full biblio MARC::Record is rebuilt from the biblio MARC::Record and all item MARC::Records, then updated in Zebra. Updating Zebra every time an item is modified can consume a little CPU, but it should not be much, as in libraries biblios & items don't change that often.

So we would have :
NEWnewbiblio :
* create biblio/biblioitems table entry (including MARC record in raw format)
* create zebra entry, with the provided Perl API.

NEWnewitem :
* create the items entry (including the MARC record in raw format)
* read biblio MARC record & previously existing items // append new item // update zebra entry with the provided Perl API.

NEWmodbiblio :
* modify the biblio entry (in biblio/biblioitems table)
* read the full MARC record (including items) // update Zebra entry

NEWmoditem :
* modify the item entry (in items table)
* read the full MARC record (including items) // update Zebra entry


Note that this relies on Zebra returning results in iso2709. We could use XML or other fancy, high-tech possibilities. But Koha makes heavy use of MARC::Record, so we don't need to reinvent the wheel. What is great with Zebra is that we can index iso2709 data but show users whatever we want (including XML). So Koha internals can be whatever we like ;-)

The MARC Editor
===============
Some users think the Koha MARC editor could be improved. The best solution would be, imho, to provide an API so a library can use an external MARC editor if it prefers. However, some libraries are happy with what exists, so the MARC editor should be kept (& improved where possible), and the marc_*_structure tables are still needed. Some fields could probably be removed, as they are related to search (like seealso) and will be handled by the Zebra config file. This still has to be investigated.

For libraries that prefer an external MARC editor, we could create a web service where the user makes an HTTP request with the iso2709 data in the parameters, plus the requested operation. This should be quite easy to do (the problem being how the external software can handle this; if someone has an idea or experience with this, feel free to post here ;-) )
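The dispatch side of such a web service could look like this (a hypothetical Python sketch; the "op" parameter and the operation names are assumptions, not an existing Koha interface):

```python
from urllib.parse import urlparse, parse_qs

def handle_marc_request(path, raw_iso2709):
    """Dispatch an incoming request: the 'op' query parameter names the
    operation, the request body carries the raw iso2709 record."""
    op = parse_qs(urlparse(path).query).get("op", [None])[0]
    if op not in ("newbiblio", "modbiblio", "delbiblio"):
        return (400, "unknown operation")
    # A real implementation would build a MARC record from raw_iso2709
    # and call the matching NEWxxx sub here.
    return (200, f"{op}: {len(raw_iso2709)} bytes")
```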

Data search
===========
I won't say much about search, as someone else has taken the ball on this ;-) I just think SearchMarc.pm should be deeply modified! As all the information will be in Zebra, it can use only the Zebra search API.

A question remains:
in a biblio/item, the item status (issued, transferred, returned, reserved, waiting...) changes quite often. So is it better to save the status in the Zebra DB, and thus update the Zebra entry (biblio+items) every time an item status is modified, or to keep this information only in the items/reserves/issues tables & read it from MySQL every time it's needed? An open question that the Zebra guys can probably answer. NPL, for example, has 600,000 issues per year (and hopefully 600,000 returns ;-) ), plus some (how many?) reserves, branch transfers...

The authority problem
=====================
Authorities have to be linked to the biblios that use them, so that when an authority is modified, all biblios using it are automatically updated (see the script misc/merge_authority.pl in Koha CVS & 2.2.x).

To keep track of the link, Koha uses a $9 local subfield. In UNIMARC, the $3 can also be used for this. I don't know if something equivalent to $3 exists in MARC21 (I could not find information on http://www.loc.gov/marc/). Many scripts make heavy use of the marc_subfield_table $9 data. For example, when you find an authority in the authority module, you get the number of biblios using this authority. This number is calculated with an SQL query on the $9 subfield.
To handle this with Zebra, we have 2 solutions:
- create a table with just the link (biblionumber / authority number) that we could query
- query Zebra for the exact $9 subfield value
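The first solution would reduce the "count biblios using this authority" question to a trivial SQL query (a Python/SQLite sketch; the auth_link table and its columns are hypothetical):

```python
def count_biblios_using(authid, cursor):
    """How many biblios are linked to this authority?"""
    cursor.execute(
        "SELECT COUNT(*) FROM auth_link WHERE authid = ?", (authid,)
    )
    return cursor.fetchone()[0]
```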

I don't know Zebra well enough to be sure of the best way to do it. Any suggestion/experience welcome.

The authority problem (another one...)
======================================
Authorities are MARC::Records too... (without items)
So they also have auth_structure & auth_word tables & all the infrastructure that biblios have (except the item level, as there are no "authority" items). So we could imagine having 2 Zebra databases: one for biblios and one for authorities. Everything said previously in this mail applies here too. That's something we could investigate after moving MARC biblios to Zebra, as we would then have more experience with this tool.

"Trivial" querying
==================
Someone may ask "why should we keep the biblio/biblioitems/items tables?", as everything is in Zebra. First, as Koha is multi-MARC, remember that it's very complex to know what a "title" is just from a MARC record. The same person will say "yes, but with Biblio/MARCmarc2koha, you can transform your MARC::Record into a semantically meaningful hash".

To which I answer:
Yes, but without those tables, SQL querying of the database would be completely impossible for developers, as we could not check in MySQL whether "the authors were filled by bulkmarcimport", or "the itemcallnumber was correctly modified for item #158763".
That's a second reason to keep those tables in MySQL.

--
Paul POULAIN
Independent free software consultant
French-speaking Koha coordinator (free ILS, http://www.koha-fr.org)


