
[Pan-users] Re: saving new headers progress bar?

From: Duncan
Subject: [Pan-users] Re: saving new headers progress bar?
Date: Wed, 04 Feb 2004 02:20:28 -0700
User-agent: Pan/ (As She Crawled Across the Table)

Vadim Berezniker posted <address@hidden>, excerpted
below,  on Tue, 03 Feb 2004 23:22:04 -0500:

> So, about 3 hours ago I went to fetch new headers from a group, and pan 
> started downloading around 1.4 million headers. I'm on a rather fast 
> connection and that part was done in under a minute. That was 3 hours 
> ago and Pan is still saving the headers to the hard-drive and I'm not 
> really sure when it's going to be done.
> It would be nice if the progress bar was filled as the articles are 
> saved for cases like this.
> (The major slowness is due to the massive amount of swapping that is 
> happening ... I wonder if it would be better if pan freed the data as 
> soon as it was written to disk [i'm assuming the data is not needed 
> again in the process] because as it is now pan's memory usage is not 
> going down)
> [Note, I'm not complaining about Pan. I know it's a real pain to deal 
> with so much data.]

One thing that you might find helpful in huge-group situations like that
is to make sure no filters, rules, or scoring apply to the group.  In
fact, if you KNOW you are going to be d/ling a huge group, it can be
useful to point Pan at an empty score file temporarily, until it gets
everything filed.  If you REALLY need active scoring in the group, you
can then re-point it at the real score file, load a post, use view-score,
and hit rescore.  Or simply don't use scoring at all.
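A rough sketch of that swap, run against a temporary directory so it is safe to try as-is; the `Score` file name and `~/.pan`-style location are assumptions here, so check Pan's preferences for the real score-file path before doing this for real:

```python
# Sketch: temporarily replace the score file with an empty one, then restore.
# The directory and file name below are stand-ins, NOT Pan's guaranteed paths.
import tempfile
from pathlib import Path

pan_dir = Path(tempfile.mkdtemp())          # stands in for Pan's config dir
score = pan_dir / "Score"                   # hypothetical score-file name
score.write_text("% real scoring rules\n")  # pretend this is your score file

backup = score.with_suffix(".bak")
score.rename(backup)    # set the real score file aside
score.touch()           # leave an empty file so Pan has nothing to apply

# ... fetch the huge group in Pan, then restore the real rules:
score.unlink()
backup.rename(score)
```

The point is just that Pan only sees an empty rule set while the big fetch is filing, and your real rules come back untouched afterward.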

Another thing you could do is use the other download options dialog to
grab only a chunk of the overviews at a time.  Pan used to start lugging
at around 100K overviews; that threshold has more than doubled due to
tweaking, so it can handle several hundred K overviews now, but it still
bogs down somewhere over 200K to 500K for most users' memory configs
(half a gig or so of real memory; I believe someone posted a figure of
about 800K at a gig of memory, with rules/filters/scoring zeroed out),
especially if they are heavy with scoring and filters.  Thus, pick a
number, say 200K overviews, and d/l that many first, then a second batch
if desired, and a third, etc., until you either find what you are looking
for, reach the retention limit for the server, or go back as far as you
desire.
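As a back-of-the-envelope check on what that batching means for the case in this thread:

```python
# A 1.4M-header group (the example above) fetched in 200K-overview chunks:
total_headers = 1_400_000   # headers Pan started downloading
batch_size = 200_000        # a chunk size under the slowdown range
batches = -(-total_headers // batch_size)   # ceiling division
print(batches)              # 7 passes through the download dialog
```

Seven smaller fetches instead of one giant one, each of which Pan can file without thrashing swap.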

It'll be interesting to see how Pan's performance changes when it gets
"libified" to use the SQLite library for its back-end database work.
Currently, AFAIK, it depends to a large extent on the GTK data-aware
widgets, so there's a limit to the amount of tweaking that can be done.
Those widgets are designed primarily for ease of use, both by programmer
and user, and aren't going to be as efficient (or scalable, and when
you're talking a million data points, scalable gets REALLY critical) at
real data grinding as a dedicated database library is likely to be.  I
don't know what the improvement will be, but it could be quite
significant, depending on how efficient GTK really IS at data grinding.
Having some real number comparisons as that switch takes shape is
definitely going to be educational here, anyway, because you are right,
that IS a lot of data to deal with, and I seriously suspect the tools Pan
is currently using simply were NOT designed to scale to that size.
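To make the contrast concrete, here is a minimal sketch (not Pan's actual schema, which isn't public in this thread) of what an SQLite-backed overview store looks like: rows live in the database, not in widget memory, so the working set no longer grows with group size:

```python
import sqlite3

# Minimal sketch of an on-disk overview store; the table layout is invented
# for illustration, not taken from Pan.
conn = sqlite3.connect(":memory:")  # pass a file path for real on-disk storage
conn.execute(
    "CREATE TABLE overview (number INTEGER PRIMARY KEY, subject TEXT, author TEXT)"
)
rows = [(i, f"subject {i}", "poster") for i in range(1000)]
conn.executemany("INSERT INTO overview VALUES (?, ?, ?)", rows)

# The app can pull just the slice it needs to display, instead of holding
# every header in memory at once:
(count,) = conn.execute("SELECT COUNT(*) FROM overview").fetchone()
print(count)
```

Whether the real numbers bear that out is exactly what the comparisons will show once the switch happens.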

Duncan - List replies preferred.   No HTML msgs.
"They that can give up essential liberty to obtain a little
temporary safety, deserve neither liberty nor safety." --
Benjamin Franklin
