
Re: [Koha-zebra] address@hidden: [Koha-devel] Building zebradb]


From: Sebastian Hammer
Subject: Re: [Koha-zebra] address@hidden: [Koha-devel] Building zebradb]
Date: Sat, 11 Mar 2006 16:34:43 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)

Hi Joshua,

Thanks for this feedback, it's very interesting. Clearly some of the issues you describe (i.e. a lack of stability around updates) indicate software problems, but there are also some interesting ideas here for possible refinements or new developments which I think would be really useful to get into the general development plans for the software. There are more folks at ID involved in Zebra development whom I'd like to bring into these thoughts... I don't know if a wiki or just a larger zebra-dev list is in order, but it's something to think about.

Joshua Ferraro wrote:

----- Forwarded message from Tümer Garip <address@hidden> -----


Hi,
We have now put Zebra into production-level systems, so here is some
experience to share.

Building the Zebra database from single records is a veeeeery looong
process. (100K records, 150K items)
Yes, that confirms my expectations. We could think about building some kind of buffering in for first-time updating; otherwise the logic has to be in the application, as you've seen... the situation is particularly grim if shadow indexing is enabled during indexing and every record is committed, since each commit forces a sync of the disks, which can take up to a whole second.
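
As an illustration of doing that buffering in the application, here is a minimal Perl sketch using the ZOOM extended-services API that queues a batch of updates and commits once at the end; the server address, database name, and the idea of reading one MARCXML record per file are placeholders, not Koha's actual code:

    #!/usr/bin/perl
    # Hypothetical sketch: buffer updates in the application and commit
    # once, rather than committing (and syncing disk) per record.
    use strict;
    use warnings;
    use ZOOM;    # ZOOM-Perl, from the Net::Z3950::ZOOM distribution

    # Placeholder server address and database name.
    my $conn = ZOOM::Connection->new('localhost:9999/kohadb');

    # Assume each file named on the command line holds one MARCXML record.
    foreach my $file (@ARGV) {
        open my $fh, '<', $file or die "cannot open $file: $!";
        my $xml = do { local $/; <$fh> };
        close $fh;

        my $p = $conn->package();
        $p->option(action => 'specialUpdate');  # insert-or-replace
        $p->option(record => $xml);
        $p->send('update');                     # queued, not yet committed
        $p->destroy();
    }

    # A single commit -- one disk sync -- for the whole batch.
    my $commit = $conn->package();
    $commit->send('commit');
    $commit->destroy();
    $conn->destroy();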

Also, I'm not sure which version of Zebra you're using? I've been doing some performance testing of Zebra for the Internet Archive, and noted quite a difference between 1.3 and 1.4 (the CVS version), which is really where all the development happens.

Best method we found:

1- Change the zebra.cfg file to include

iso2709.recordType:grs.marcxml.collection
recordType:grs.xml.collection

2- Write (or hack export.pl) to export all the MARC records as one big
chunk to the correct directory with the extension .iso2709, and then
system-call "zebraidx -g iso2709 -d <dbnamehere> update records -n"
(a rough sketch of this step follows below).

This ensures that Zebra knows it's reading MARC records rather than XML,
and it builds 100K+ records at zooming speed.
Your ZOOM module still always uses the grs.xml filter, while you can at
any time update or reindex any big chunk of the database as long as you
have MARC records.
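
For concreteness, here is a rough sketch of that export-and-reindex step. The DBI connection details, the table and column holding raw ISO2709 records, and the Zebra paths are all assumptions for illustration; Koha's real export.pl is the authoritative version:

    #!/usr/bin/perl
    # Hypothetical sketch of the bulk export-and-reindex step.
    # Table/column names, paths, and credentials are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=koha', 'kohauser', 'secret',
                           { RaiseError => 1 });

    # Assume the raw ISO2709 record is stored in biblioitems.marc.
    my $sth = $dbh->prepare('SELECT marc FROM biblioitems');
    $sth->execute();

    # Write everything as one big chunk with the .iso2709 extension,
    # so the iso2709.recordType line in zebra.cfg applies to it.
    open my $out, '>', '/path/to/zebradb/records/export.iso2709'
        or die "cannot write export file: $!";
    while (my ($marc) = $sth->fetchrow_array()) {
        print {$out} $marc;   # ISO2709 records are self-delimiting
    }
    close $out;
    $dbh->disconnect();

    # Reindex the whole chunk in one zebraidx run (-n skips shadow files).
    system('zebraidx', '-n', '-g', 'iso2709', '-d', 'kohadb',
           'update', 'records') == 0
        or die "zebraidx failed: $?";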
Good strategy, I think... but of course it's weird and awkward to have to use two different formats, especially when they both have limitations. We really must look into handling ISO2709 from ZOOM.

Mind you, version 1.4 should be able to read multiple collection-wrapped MARCXML records in one file, but only (AFAIK) in conjunction with the new XSLT-based index rules. I *would* like to try to develop a good way to work with bibliographic data in that framework.

3- We are still using the old API, so we read the XML and use
MARC::Record->new_from_xml( $xmldata )
A note here that we did not have to upgrade MARC::Record or MARC::Charset
at all. Any MARC record created within Koha is UTF-8, and any MARC record
imported into Koha (old marc_subfield_tables) was correctly decoded to
UTF-8 with char_decode of biblio.
Why not ask Zebra for the records in ISO2709, if that's what you want to work with (when records are loaded as MARCXML, it can spit out either 2709 or XML)? Or is it just that you want to have them in MARC::Record? At any rate, ISO2709 is a more compact exchange format, so it seems to make sense to use it.
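
To illustrate that suggestion, a minimal ZOOM-Perl sketch (the server address, database name, and query are placeholders) that asks Zebra for ISO2709/USMARC and hands the raw record straight to MARC::Record, with no XML step:

    #!/usr/bin/perl
    # Hypothetical sketch: fetch ISO2709 from Zebra and parse it
    # directly with MARC::Record.
    use strict;
    use warnings;
    use ZOOM;
    use MARC::Record;
    use MARC::File::USMARC;

    my $conn = ZOOM::Connection->new('localhost:9999/kohadb');  # placeholder
    $conn->option(preferredRecordSyntax => 'usmarc');           # ask for ISO2709

    my $rs = $conn->search_pqf('@attr 1=4 "cooking"');  # example title search
    for my $i (0 .. $rs->size() - 1) {
        my $raw    = $rs->record($i)->raw();              # raw ISO2709 octets
        my $record = MARC::Record->new_from_usmarc($raw); # no MARCXML round-trip
        print $record->title(), "\n";
    }
    $conn->destroy();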

4- We modified circ2.pm and the items table to have an item onloan field
and mapped it to the MARC holdings data. Now our OPAC search does not
call MySQL for anything but the branch name.

5- The average number of updates per day is about 2000 (circulation +
cataloguing). I can say that the speed of the ZOOM search, which slows
down during a commit operation, is acceptable considering the speed gain
we get on the search.
Again, I'd be keen to know which Zebra version you're running?

Because the Internet Archive will be doing similar point-updates for records (only a *lot* more often than 2000 times per day), I have been looking a lot at the update speed for these small changes that only affect a single term in a record (like a circulation code). In my test database of 150K records, with a 10K average record size, I get update times around 0.5-0.7 seconds, which just seems intuitively slower than it should have to be. I'm going to nudge the coders to see if we can possibly do this better.
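
If anyone wants to reproduce that kind of measurement, a rough sketch (the server address and the input record are placeholders) that times a single update-plus-commit round trip:

    #!/usr/bin/perl
    # Hypothetical sketch: time one single-record update round trip,
    # the 0.5-0.7 second figure discussed above.
    use strict;
    use warnings;
    use ZOOM;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $conn = ZOOM::Connection->new('localhost:9999/testdb');  # placeholder

    my $xml = do { local $/; <STDIN> };   # one modified MARCXML record on stdin
    my $t0  = [gettimeofday];

    my $p = $conn->package();
    $p->option(action => 'specialUpdate');
    $p->option(record => $xml);
    $p->send('update');
    $p->destroy();

    my $c = $conn->package();
    $c->send('commit');                   # include the commit cost in the timing
    $c->destroy();

    printf "update+commit took %.3f s\n", tv_interval($t0);
    $conn->destroy();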

6- Zebra behaves very well with searches but is very temperamental with
updates. A queue of updates sometimes crashes the Zebra server. When the
database crashes we cannot save anything, even though we are using shadow
files. I'll be reporting on this issue once we can isolate the problems.
Now *that* is nasty. Do you have *any* way of consistently recreating the problem? Even if you don't, I'm sure Adam and the Zebra crew would like to see some stack traces of those crashes!!

--Sebastian

Regards,
Tumer



_______________________________________________
Koha-devel mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/koha-devel

----- End forwarded message -----


--
Sebastian Hammer, Index Data
address@hidden   www.indexdata.com
Ph: (603) 209-6853





