[Koha-zebra] Re: Import Speed


From: Sebastian Hammer
Subject: [Koha-zebra] Re: Import Speed
Date: Thu, 02 Mar 2006 11:05:44 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)

Joshua,

Importing records one at a time when first building a database, or when doing a batch update that is a substantial percentage of the size of the database, is not a good idea. The software has no way to optimize the layout of the index files, so for each record update, things get shuffled around, resulting in very sluggish update performance and a less-than-ideal layout inside the index files.

It would be highly advisable to do at least the initial import from the command-line. I think it would make a lot of sense if this could be done well from the protocol, but AFAIK, the extended service interface at the moment only allows you to insert one record at a time.
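
As a rough illustration of preparing input for a command-line import: binary MARC (ISO 2709) records each end with a record terminator byte (0x1D), so individual record files can simply be concatenated into one batch file. The sketch below assumes the records sit in a directory as separate `.mrc` files; `build_batch` and all file names are hypothetical and do not come from Zebra or from this thread.

```shell
# Sketch: gather many single-record MARC files into one batch file.
# build_batch is a hypothetical helper name, not part of Zebra or YAZ.
build_batch() {
    src_dir=$1
    out_file=$2                          # keep this outside src_dir
    : > "$out_file"                      # start with an empty batch file
    for f in "$src_dir"/*.mrc; do
        [ -e "$f" ] || continue          # the glob matched nothing
        cat "$f" >> "$out_file"
    done
    # Each ISO 2709 record ends with the terminator byte 0x1D (octal 035),
    # so counting those bytes gives the number of records in the batch.
    tr -cd '\035' < "$out_file" | wc -c | tr -d ' '
}
```

Calling `build_batch records batch.mrc` would write the batch file and print the record count, which gives a quick sanity check before the file is handed to the indexer.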

> Can we just process the raw MARC? Why did we choose the '.xml'
> storage method in Zebra and is it a good choice? Would '.sgml' or
> '.marc' be a better choice (because we could batch import directly
> instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
> the import and then switch to '.xml'?
That's a good question. You use .xml because extended services only work with XML. It *may* be possible to ingest records from the command-line as grs.marcxml (which reads MARC records and renders them internally as MARCXML), then do subsequent updates as XML, doing the conversion on the client side. I say *may* because I haven't tried that, but I think it'd be worth a shot, and it should be easy to run the experiment:

1: Start with a sample of MARC records
2: Build the initial index like so:

% zebraidx init
% zebraidx -f 10 -n -t grs.marcxml update recordfile

(-n disables the shadow system for this update.)

This should run pleasantly fast compared to what you see now.

3: Try to update some records as MARCXML.
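
The three steps above can be sketched as a small script. This is an untested outline under stated assumptions, not a verified procedure: the file names are hypothetical, and `yaz-marcdump` (from the YAZ toolkit) is an assumption for the client-side MARC-to-MARCXML conversion, since the thread does not name a tool. By default the helpers only print the commands; set `RUN=1` to execute them.

```shell
# Sketch of the experiment. run() echoes each command unless RUN=1 is set,
# so the steps can be inspected without touching a real Zebra database.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Step 2: build the initial index from a file of MARC records.
bulk_build() {
    records=$1                           # hypothetical file, e.g. sample.mrc
    run zebraidx init
    # -n disables the shadow system for this update
    run zebraidx -f 10 -n -t grs.marcxml update "$records"
}

# Step 3 (preparation): convert binary MARC to MARCXML on the client side.
# Assumption: yaz-marcdump is used; it writes the MARCXML to standard output.
to_marcxml() {
    run yaz-marcdump -i marc -o marcxml "$1"
}
```

For example, `bulk_build sample.mrc` followed by `to_marcxml updates.mrc` prints the commands that would be run. Submitting the resulting MARCXML as an update is left to whatever client code already talks to the server.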

--Seb

> Any suggestions on how to handle the connection in a more efficient
> way?

Cheers,


--
Sebastian Hammer, Index Data
address@hidden   www.indexdata.com
Ph: (603) 209-6853




