Re: [Pan-users] size of newsrc-1 file
From: Heinz Mezera
Subject: Re: [Pan-users] size of newsrc-1 file
Date: Wed, 06 Jul 2016 11:09:26 +0200
Hello Duncan,
a detailed answer, as usual!
Forgive me for not following the standard reply conventions; I'll
summarize here instead:
Pan takes approx. 10 minutes to start,
ls -lh newsrc-1 --> 567M (a manual edit would be an enormous task and I've _never_ done any scripting before)
renamed the article-cache directory and reduced the cache size to 10 MB (from 100)
only one news server
fewer than a dozen groups subscribed (a few text, the others binaries)
my working style: read new headers, decide which ones might be of interest and save them to disk; I never go back to yesterday's or older articles
I don't use scores (or at least I'm not aware of any)
Should I remove or rename the .pan2 directory and start from scratch?
kr Heinz
On Wednesday, 06.07.2016, at 06:22 +0000, Duncan wrote:
> Heinz Mezera posted on Tue, 05 Jul 2016 12:47:21 +0200 as excerpted:
>
> >
> > Hello pan-users,
> >
> > does the size of newsrc-1 influence pan's time to start, to quit, or
> > its performance?
> >
> > I use Ubuntu's 16.04 version of pan (0.139-5build1) and it takes
> > rather long until pan appears on Ubuntu's desktop.
> >
> > Can I compact newsrc-1 or reduce its size somehow?
> I suspect your problem isn't the newsrc file, but something else...
> [discussed below, but first...]
>
> To answer your question somewhat directly, however, the newsrc file(s,
> one per configured server) can indeed be compacted some, and that
> /might/ affect startup time, tho in my own experience there's a far
> worse trigger of startup delay that I suspect is the real problem.
> However, the newsrc files can be made more efficient.
>
> These newsrc files follow a standard text-based format and can be
> edited using a standard text editor. As always, making a backup of the
> unaltered file before you begin is recommended, just in case you screw
> up the edits.
>
> Rather than describe in detail the format, I'll simply provide you a
> google link...
>
> https://www.google.com/search?q=newsrc+file+format
>
> There is however one caveat about pan's usage. (Current) Pan doesn't
> use the subscription info in the newsrc (tho old C-based pan, 0.14.x,
> did, before the C++ rewrite), because a newsrc is inherently
> single-server, and pan's subscriptions apply across all configured
> servers that carry the group. So pan uses a different method to track
> group subscriptions.
>
> What pan /does/ track in the newsrcs, however, is the per-server
> per-newsgroup article sequence numbers, so it knows which ones on each
> server you've already seen and knows not to download those headers
> again.
>
> It's this sequence of comma-separated article numbers that appears at
> the end of the newsrc line for any group you've visited (or seen a
> cross-posted message in).
>
> And you can consolidate these article number lists by removing the
> gaps and making the ranges continuous.
>
> It's worth noting that news servers initially communicate what they
> currently have using only a high-water and a low-water mark, plus an
> /estimated/ count of the number of messages available, with that
> estimate allowed to be /more/ than the number of currently available
> messages, but never /less/. These are IOW the lowest numbered message
> still available (unexpired), and the highest numbered message
> available (the latest message to arrive), plus the estimate. Missing
> article numbers between the high and low water marks are specifically
> allowed -- this lets servers remove messages reported as spam or as
> copyright violations, etc. Sometimes these missing messages will be
> filled in later (some servers are infamous for doing this, infamous
> because it screws up some news clients). Often they're not.
>
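As an illustration of the marks and the estimate described above: in
NNTP these three numbers come back in the reply to the GROUP command,
whose success line has the form "211 count low high group". The sketch
below only parses such a line; the sample values are illustrative, and
the names GroupStatus, parse_group_response and max_possible_articles
are made up for this example, not anything pan itself exposes.

```python
from typing import NamedTuple


class GroupStatus(NamedTuple):
    """Fields of an NNTP GROUP success reply (names are hypothetical)."""
    estimate: int  # estimated article count; may exceed the real count
    low: int       # low-water mark: lowest article number still available
    high: int      # high-water mark: highest article number available
    name: str      # newsgroup name


def parse_group_response(line: str) -> GroupStatus:
    """Parse a line such as '211 1234 3000234 3002322 misc.test'."""
    code, estimate, low, high, name = line.split()[:5]
    if code != "211":
        raise ValueError(f"not a GROUP success reply: {line!r}")
    return GroupStatus(int(estimate), int(low), int(high), name)


def max_possible_articles(status: GroupStatus) -> int:
    """Upper bound on the article count if the range had no gaps."""
    return status.high - status.low + 1


status = parse_group_response("211 1234 3000234 3002322 misc.test")
# The gap between the bound and the estimate hints at missing numbers.
print(status.estimate, max_possible_articles(status))
```
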
> And it's these gaps in the server store, along with simply not
> visiting the newsgroup for longer than its expiration period if your
> server does expire messages (some dedicated news service providers
> effectively don't expire messages, these days), that appear as gaps in
> pan's sequence number lists -- because it never saw those messages.
>
>
> Now, if you're reasonably sure your server doesn't fill in article
> sequence numbers, only ever increasing them, or if you simply don't
> care to see what are likely old messages if they are filled in, you
> can cut out all the commas and make the list a single range, from 1 or
> whatever the lowest number is in the existing list, to the highest
> number. If the server does do fill-ins, you might still be able to
> make the oldest messages a continuous range, while leaving the gaps in
> anything newer than say a month old, just in case.
>
> So, to take one example line from the linuxtopia google hit (the first
> hit in the google search above, as I write this; note that this page
> is from a book copyrighted in 2003, and its mention of pan as an
> exception to the newsrc format is... dated, pan does use the format
> now):
>
> news.software.readers! 1-95504,137265,137274,140059,140091,140117
>
> You can edit that to:
>
> news.software.readers! 1-140117
>
> Much shorter! =:^)
>
> Unfortunately, if you follow a lot of groups, all that manual editing
> could be a big chore (unless you can figure out a nice script to
> automate the process, which should be possible), with, I suspect,
> rather limited results in terms of startup.
>
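For anyone who has not scripted before, the manual edit described above
could be automated along these lines. This is only a sketch, assuming
each line looks like the linuxtopia example (group name, then ":" or
"!", then the comma-separated numbers and ranges); the function names
and the output file name are made up. It collapses every list to one
lowest-to-highest range, so it is only appropriate if you don't care
about fill-ins, and it should be run with pan closed and a backup of
the original file kept.

```python
import re

# group name, subscription mark, optional blanks, then the number list
_NEWSRC_LINE = re.compile(
    r"^(?P<group>[^\s:!]+)(?P<mark>[:!])(?P<gap>[ \t]*)(?P<ranges>[\d,\- \t]*)$"
)


def compact_line(line: str) -> str:
    """Collapse the article-number list on one newsrc line to 'low-high'.

    Lines that do not match the assumed format, or that carry no
    numbers, are returned unchanged.
    """
    body = line.rstrip("\r\n")
    match = _NEWSRC_LINE.match(body)
    if match is None:
        return line
    numbers = [int(n) for n in re.findall(r"\d+", match.group("ranges"))]
    if not numbers:
        return line
    low, high = min(numbers), max(numbers)
    span = str(low) if low == high else f"{low}-{high}"
    ending = line[len(body):]  # keep the original line ending
    return f"{match.group('group')}{match.group('mark')}{match.group('gap')}{span}{ending}"


def compact_file(src: str, dst: str) -> None:
    """Write a compacted copy of src to dst, one line at a time."""
    with open(src, encoding="utf-8", errors="surrogateescape") as fin, \
         open(dst, "w", encoding="utf-8", errors="surrogateescape") as fout:
        for line in fin:
            fout.write(compact_line(line))


print(compact_line(
    "news.software.readers! 1-95504,137265,137274,140059,140091,140117"
))
```

Calling compact_file("newsrc-1", "newsrc-1.compact") writes a new file
and leaves the original alone, so the result can be inspected before it
replaces anything. Reading line by line keeps memory use low even for a
file of several hundred MB.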
>
> Instead, what I've found to take the real time, particularly on
> spinning rust drives (I'm on SSD now and haven't had to worry about it
> since I upgraded to SSD), is large message caches.
>
> Note that pan's cache size is configurable, but defaults to 10 MB,
> which shouldn't be an issue, but also means pan will start dumping
> already downloaded articles to make room for more rather quickly,
> particularly if you do binaries. For a usage pattern that saves off
> attachments directly, with no further use for the messages in cache
> after that, 10 MB is fine. For a usage pattern more like mine,
> however, where I tend to download a bunch of stuff to cache so it's
> local, and then go thru it later, a cache size of several GB may be
> more appropriate. Similarly, if you have groups that you effectively
> archive, keeping all messages without expiring them at all, as I do
> with my text groups, a cache of several gigs will likely hold several
> years worth of text-group messages. (I have text messages going back
> to 2002 in some groups. My cache for my text-groups pan instance[1]
> is, as of now, 1.4 GiB, so the average usage is about 100 MB/year.)
>
> Once that cache gets to a few hundred MiB, you'll start noticing pan
> startup gets slower and slower on *first* startup, as the cache gets
> bigger and bigger. (Pan will start up faster after the first start,
> since everything's already cached. At least it will if you have enough
> memory to cache into RAM the full pan message cache. If you're running
> 1 GiB or less of RAM... probably not so much.) This is because pan
> loads those messages every time it starts, in order to rethread them
> -- it keeps track of message threading in memory.
>
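To see whether a cache has grown into the size range mentioned above, a
small sketch that totals the file sizes under the cache directory. It
assumes the cache is the article-cache directory under $PAN_HOME, or
under ~/.pan2 when that variable is unset; pan_home and
directory_size_bytes are hypothetical helper names.

```python
import os
from pathlib import Path


def pan_home() -> Path:
    """Return $PAN_HOME if set and non-empty, else the default ~/.pan2."""
    configured = os.environ.get("PAN_HOME")
    return Path(configured) if configured else Path.home() / ".pan2"


def directory_size_bytes(directory: Path) -> int:
    """Sum the sizes of all regular files below directory."""
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


cache = pan_home() / "article-cache"
if cache.is_dir():
    print(f"{cache}: {directory_size_bytes(cache) / 2**20:.1f} MiB")
else:
    print(f"{cache} does not exist")
```
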
> Back when I was on spinning rust, I found a few ways to deal with
> this.
>
> One was, set pan to start with my X user session, so it could grind
> away for several minutes loading stuff while I did other things. A few
> minutes later when I had completed other tasks, pan would generally be
> up (in the system tray) and ready to go. I'd normally keep pan running
> constantly, in the system tray, until I was ready to end the user X
> session.
>
> Another I found quite by accident. I periodically do backups of the
> multiple partitions on my system, and every few years, I'll boot to
> the backup, wipe away the normal working partition, and copy things
> back from the backup to the working copy, renewing it.
>
> I found that at least with some filesystems (I was using reiserfs at
> the time), pan evidently fragments the cache files rather heavily. I
> believe this is most likely to happen when multiple threads are
> downloading files at once, writing them in parallel and fragmenting
> them in the process.
>
> By backing up the cache files, erasing the working cache copy, and
> copying everything back into place, the new copy was defragmented due
> to the copy process, and pan started up much faster after that, even
> tho it still had the same size cache.
>
> Of course over time it slowed down again as I added new messages to my
> newsgroup archive, but now that I knew the trick, I could defrag the
> cache any time the start time got too long, and pan would start up
> faster again.
>
> And of course as I mentioned, putting it on SSD sped things up
> dramatically, because SSDs have effectively zero seek time, so
> fragmentation doesn't affect them anything close to as badly (tho it
> can still have some effect due to IOPs per file increasing with the
> number of fragments).
>
>
> That's what definitely took the load time for me, pan reading all
> those files from cache into memory, so it could rethread them.
>
> There's a simple way to confirm whether this is your problem or not.
> With pan closed, simply rename the article-cache directory to
> something else, so pan will recreate a new, empty cache, when it
> starts. If the cache is your slowdown, pan should start much faster,
> likely nearly instantly, with no cache to load.
>
> Tho of course if you've never upped your cache size from the default
> 10 MB, the cache is unlikely to be the problem, and you probably won't
> notice a difference with the above test.
>
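The rename test above is a one-line job in a file manager or shell; for
completeness, here is the same step as a sketch. The helper name
set_aside and the ".bak" suffix are arbitrary choices, and the path in
the usage comment assumes the default ~/.pan2 location. Renaming rather
than deleting means the old cache can simply be renamed back afterwards.
Run it only while pan is closed.

```python
from pathlib import Path


def set_aside(path: Path, suffix: str = ".bak") -> Path:
    """Rename path to path + suffix and return the new location.

    Refuses to overwrite an earlier backup with the same name.
    """
    target = path.with_name(path.name + suffix)
    if target.exists():
        raise FileExistsError(f"{target} already exists; move it first")
    path.rename(target)
    return target


# Usage, with pan closed (path assumes the default location):
#   set_aside(Path.home() / ".pan2" / "article-cache")
```

The helper works for a single file as well as for a directory, so the
same call can set aside any other file you want pan to start without.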
>
> Finally, I should mention that a big scorefile will slow pan down at
> startup. There are ways to dramatically optimize the scorefile, but
> that's a different subject, that we can deal with later if you find it
> to be the problem. Meanwhile, however, you can test it using the same
> technique I suggested above for testing the cache. Simply rename the
> scorefile and see if pan starts faster with an empty one. If the
> scorefile turns out to be your problem, post back with the results and
> we can deal with that, then.
>
> ---
> [1] Text-groups pan instance: It is possible to have several
> separately configured pan instances, each with their own configuration
> and cache. ~/.pan2/ is only the default location. If the $PAN_HOME
> variable is found to be set in pan's environment as it starts, it will
> use the location found in that variable as its configuration and cache
> home, instead. I've taken advantage of this to set up a number of pan
> wrapper scripts here, pan.text, pan.test, and pan.bin, that each point
> at a different config and cache. This lets me manage my unexpiring
> text-group-archive cache separately from my binaries cache, also
> unexpiring and set rather large, but cleared manually from time to
> time.
>
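A wrapper of the kind the footnote describes could look roughly like
the sketch below. The actual pan.text, pan.test and pan.bin scripts are
not shown in this thread, so everything here is an assumption: the
directory in the usage comment, the helper names, and that the
executable on the PATH is called "pan".

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import Mapping, Optional, Sequence


def pan_environment(home: Path,
                    base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and point PAN_HOME at the given directory."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = str(home)
    return env


def launch_pan(home: Path, args: Sequence[str] = ()) -> int:
    """Start pan with its own config/cache home; return its exit code."""
    home.mkdir(parents=True, exist_ok=True)
    return subprocess.call(["pan", *args], env=pan_environment(home))


# Usage for a text-only instance (the directory name is hypothetical):
#   sys.exit(launch_pan(Path.home() / ".pan2-text", sys.argv[1:]))
```
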