Re: [Pan-devel] More database thoughts


From: K. Haley
Subject: Re: [Pan-devel] More database thoughts
Date: Sun, 20 Jun 2004 00:00:41 -0600
User-agent: Mozilla Thunderbird 0.7 (Windows/20040616)

Let's see if it gets posted this time.

Tom wrote:

> I like Calin's incremental approach. If you add the DB stuff to what's
> already there, it makes it easy to cross-check for debugging purposes.
> Also, unless you're confident that SQLite will never corrupt the DB,
> having the files as a backup is an insurance factor. In any case, adding
> a "(re)build DB" menu option someplace might be a good starting point.

Pan currently uses one directory for each server. Each directory contains one file per group for which you've downloaded headers, holding the article info for that group. This would seem to make cross-checking complicated. As for corruption, Pan's current setup gets corrupted occasionally as it is. The only real solution here is to use more than one DB file. The first would hold the server, group, and group-server tables. The article and article-server stuff could be in one or more additional files. It's an interesting trade-off.

If stored in one table, article status would be tracked across cross-posts; however, all article data is lost if the file is corrupted.

If stored in one table per group, only that group's data would be lost, and the user could nuke the file if it's not wanted; however, article status would not be tracked across cross-posts. It would also be more difficult to implement.
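
A rough sketch of the per-group variant, using Python's sqlite3 module purely for illustration (Pan itself is not written in Python). The file names, table names, and columns here are assumptions made up for the example, not an agreed schema. It keeps the server/group tables in one core file and attaches a separate article file for a group, so that one group's file can be deleted without touching the rest:

```python
import os
import sqlite3


def group_path(directory, group):
    # Hypothetical naming scheme: one SQLite file per newsgroup.
    return os.path.join(directory, group + ".db")


def open_store(directory, group):
    """Open the core file and attach the article file for one group."""
    conn = sqlite3.connect(os.path.join(directory, "core.db"))
    # ATTACH must run outside a transaction, so do it before any writes.
    conn.execute("ATTACH DATABASE ? AS articles", (group_path(directory, group),))
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS server (id INTEGER PRIMARY KEY, host TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS newsgroup (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS group_server (
            group_id  INTEGER REFERENCES newsgroup(id),
            server_id INTEGER REFERENCES server(id)
        );
        CREATE TABLE IF NOT EXISTS articles.article (
            message_id TEXT PRIMARY KEY,
            subject    TEXT,
            is_read    INTEGER NOT NULL DEFAULT 0
        );
    """)
    conn.execute("INSERT OR IGNORE INTO newsgroup (name) VALUES (?)", (group,))
    conn.commit()
    return conn


def drop_group(conn, directory, group):
    """Throw away one group's articles; the core tables are untouched."""
    conn.commit()
    conn.execute("DETACH DATABASE articles")
    os.remove(group_path(directory, group))
```

Note that the read flag lives in the per-group file here, which is exactly why status would not carry over to cross-posts in this layout.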

> It looks to me that the Article structure is a big memory user. The
> thing is, you rarely if ever display anything other than part 0 or 1, so
> to me it makes sense to only keep the part 0/1 in memory, and retrieve
> from DB/display the others on an as-needed basis. Correct me if I'm
> wrong, but I suspect that few (text or binary) groups would have more
> than 100,000 "unique" subjects (part 0/1). It seems to me that trying to
> truncate the (xx/yy) from the subject string would be a small saving by
> comparison.

My idea is to extend the duplicate checking to include authors. This would offer more space savings in most groups. Whether or not the subject gets truncated is another matter. The same table that holds the subjects will hold the authors as well. There is no need for an additional authors table, since both are used for finding duplicates.

TABLE duplicates
text
ref_cnt
id

Article
subject    duplicates:id
author     duplicates:id
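
The layout above could be sketched roughly as follows, again in Python with sqlite3 for illustration only. The column types, the UNIQUE constraint, and the helper names (intern, add_article, delete_article) are assumptions for the example; only the table and column names come from the outline above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE duplicates (
    id      INTEGER PRIMARY KEY,
    text    TEXT NOT NULL UNIQUE,       -- a subject or an author string
    ref_cnt INTEGER NOT NULL DEFAULT 0  -- how many article fields point here
);
CREATE TABLE article (
    id      INTEGER PRIMARY KEY,
    subject INTEGER NOT NULL REFERENCES duplicates(id),
    author  INTEGER NOT NULL REFERENCES duplicates(id)
);
"""


def intern(conn, text):
    """Return the duplicates id for text, creating the row if needed."""
    conn.execute("INSERT OR IGNORE INTO duplicates (text) VALUES (?)", (text,))
    conn.execute("UPDATE duplicates SET ref_cnt = ref_cnt + 1 WHERE text = ?", (text,))
    return conn.execute("SELECT id FROM duplicates WHERE text = ?", (text,)).fetchone()[0]


def add_article(conn, subject, author):
    """Store an article as two references into the shared string table."""
    cur = conn.execute(
        "INSERT INTO article (subject, author) VALUES (?, ?)",
        (intern(conn, subject), intern(conn, author)),
    )
    return cur.lastrowid


def delete_article(conn, article_id):
    """Remove an article and drop any strings nobody references anymore."""
    row = conn.execute(
        "SELECT subject, author FROM article WHERE id = ?", (article_id,)
    ).fetchone()
    if row is None:
        return
    conn.execute("DELETE FROM article WHERE id = ?", (article_id,))
    for dup_id in row:
        conn.execute("UPDATE duplicates SET ref_cnt = ref_cnt - 1 WHERE id = ?", (dup_id,))
        conn.execute("DELETE FROM duplicates WHERE id = ? AND ref_cnt <= 0", (dup_id,))
```

With this, two articles sharing a subject store the string once, and ref_cnt tells us when a string can be dropped.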


As for the Article structure using a lot of memory, all we really need is a small cache of 100-200 entries for the visible articles.
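
One way such a cache might look, as a minimal least-recently-used sketch in Python. The class name, the load callback, and the eviction policy are all assumptions for illustration; the only thing taken from the text above is the idea of keeping a couple hundred entries in memory and fetching the rest from the DB on demand.

```python
from collections import OrderedDict


class ArticleCache:
    """Keep only the most recently used articles in memory."""

    def __init__(self, load, capacity=200):
        self._load = load            # hypothetical callback: article_id -> article
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, article_id):
        if article_id in self._entries:
            # Cache hit: mark as most recently used.
            self._entries.move_to_end(article_id)
            return self._entries[article_id]
        # Cache miss: fetch from the backing store (e.g. the DB).
        article = self._load(article_id)
        self._entries[article_id] = article
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return article

    def __len__(self):
        return len(self._entries)
```

The load callback would be whatever reads one article row from the DB; scrolling the header pane then only ever touches the entries that are actually on screen.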
