
From: Alejandro Colomar
Subject: Re: Compressed man pages (was: Accessibility of man pages (was: Playground pager lsp(1)))
Date: Sun, 9 Apr 2023 15:36:05 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

On 4/9/23 14:29, Colin Watson wrote:
> On Sun, Apr 09, 2023 at 02:05:08PM +0200, Alejandro Colomar wrote:
>> Important note: Sam, are you sure you want your pages compressed
>> with bz2?  Have you seen the 10 seconds it takes man-db's man(1) to
>> find a word in the pages?  I suggest that at least you try to
>> reproduce these tests in your machine, and see if it's just me or
>> man-db's man(1) is pretty bad at non-gz pages.
> man-db is significantly slower with bzip2 than gzip these days, because
> much of the performance work I did in 2.10.0 only applies to gzip:
> there's in-process support for decompressing gzip, but we use
> subprocesses for bzip2.  IMO the relatively small difference in
> compressed size doesn't justify the effort of building in-process
> support for multiple compression algorithms.
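
The cost of one subprocess per file is easy to see in isolation. A rough sketch (synthetic throwaway files, so it's self-contained; numbers will vary by machine) comparing one gzip(1) fork+exec per file against a single invocation:

```shell
# Per-file subprocess cost, illustrated on synthetic data.
# Create 200 tiny .gz files, then decompress them two ways.
d=$(mktemp -d)
for i in $(seq 1 200); do
        printf 'page %s\n' "$i" | gzip >"$d/$i.gz"
done

# One gzip(1) subprocess per file (what a naive loop does):
time sh -c 'for f in "$1"/*.gz; do gzip -dc "$f"; done >/dev/null' sh "$d"

# A single gzip(1) invocation over all files:
time sh -c 'gzip -dc "$1"/*.gz >/dev/null' sh "$d"

rm -r "$d"
```

The second form amortizes the fork+exec over all files, which is roughly what in-process decompression buys a C program like man-db, only more so.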


>  I recommend that
> distributions just use gzip;

I don't agree here.  gzip vs. man(7) source is 5.4M vs. 9.3M.  However, a
simple pipeline searching for a word in .gz pages takes ~114x as long as
the same search over man(7) source.  I don't think that small benefit in
size justifies the slowness.

Of course, this is only about theoretical maximum performance.
Current man(1) has other issues, so it doesn't benefit from this
performance advantage.

> but if distributions _really_ want to use
> something else for whatever reason, then perhaps they should contribute
> code to man-db to ensure similar performance to gzip.  I'm happy to give
> pointers if there's a sufficiently compelling reason to make it worth
> the effort.
>> -  man-db's man(1) is slower with plain man(7) source than with .gz
>>    pages for some mysterious reason.
> Maybe CPU is sufficiently cheaper than I/O that the fact of reading less
> data from disk dominates.

My CPU is powerful, but so is my SSD.  I wouldn't expect decompressing
to be faster than I/O.  I have a Samsung 960 PRO, which is quite fast[1].

$ lscpu
  Model name:            Intel(R) Core(TM) i7-5775C CPU @ 3.30GHz
    CPU family:          6
    Model:               71
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    CPU(s) scaling MHz:  44%
    CPU max MHz:         3700.0000
    CPU min MHz:         800.0000
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
  L4:                    128 MiB (1 instance)

$ lspci | grep -i samsung
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963

$ lsblk -o NAME,FSTYPE,MOUNTPOINT,SIZE,MODEL
NAME                  FSTYPE       MOUNTPOINT    SIZE MODEL
nvme0n1                                        953.9G Samsung SSD 960 PRO
├─nvme0n1p1           vfat         /boot/efi    1023M
├─nvme0n1p2           ext4         /boot           4G
└─nvme0n1p3           crypto_LUKS                948G
  └─nvme0n1p3_crypt   ext4         /             948G

Also, a manual loop should hit the same I/O costs, but it doesn't: if I
loop manually over the files and grep them, it takes 0.01 s, which is
the lowest /bin/time can measure on my system.

I repeated the tests on a tmpfs just to check.  The times are almost the
same (except that bzip2 goes down from 10 s to 9 s :).

$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,noatime,inode64)
$ sudo rm -r /tmp/man
$ sudo make install-man prefix=/tmp/man/gz_ -j LINK_PAGES=symlink Z=.gz | wc -l
$ sudo make install-man prefix=/tmp/man/bz2 -j LINK_PAGES=symlink Z=.bz2 | wc -l
$ sudo make install-man prefix=/tmp/man/man -j LINK_PAGES=symlink Z= | wc -l
$ du -sh /tmp/man/*
5.3M    /tmp/man/bz2
5.4M    /tmp/man/gz_
9.3M    /tmp/man/man

$ export MANPATH=/tmp/man/gz_/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
$ /bin/time -f %e dash -c "find $MANPATH -type f | while read f; do gzip -d - <\$f | grep -l RLIMIT_NOFILE >/dev/null && echo \$f; done | wc -l"

This is quite optimized.  I can't beat man(1) with a shell pipeline
for .gz pages.  :)
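
(For reference, zgrep(1) wraps the same per-file decompression, and a
parallel variant with xargs -P can narrow the gap on multi-core machines.
A sketch, assuming GNU xargs and nproc(1) are available, demonstrated on a
throwaway tree so it's self-contained:)

```shell
# Parallel .gz search sketch (assumes GNU xargs -P, nproc(1), zgrep(1)).
d=$(mktemp -d)
printf 'RLIMIT_NOFILE is documented here\n' | gzip >"$d/getrlimit.2.gz"
printf 'nothing to see here\n'              | gzip >"$d/open.2.gz"

# zgrep -l prints the names of matching files; run batches in parallel.
find "$d" -type f -name '*.gz' \
    | xargs -P"$(nproc)" -n16 zgrep -l RLIMIT_NOFILE \
    | wc -l
# prints 1 (only the matching page)

rm -r "$d"
```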

$ export MANPATH=/tmp/man/bz2/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
$ /bin/time -f %e dash -c "find $MANPATH -type f | while read f; do bzip2 -d - <\$f | grep -l RLIMIT_NOFILE >/dev/null && echo \$f; done | wc -l"

Sam, really consider not using .bz2 for Gentoo's pages.  :)

$ export MANPATH=/tmp/man/man/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
$ /bin/time -f %e dash -c "find $MANPATH -type f | xargs grep -l RLIMIT_NOFILE | wc -l"

man(1) is ~52x slower than my loop.  Similar results from RAM and NVMe,
so I/O is not the issue here.

> (Can I request that any concrete actions that need to be taken based on
> this thread be split out to separate bug reports or something, please?
> This thread is long and I don't really want to have lots of meandering
> discourse in my inbox going back over the tired old man vs. info debate
> or whatever, but if there are actual things I need to fix in man-db then
> I'd rather not miss them.)

Sure; do you have a mailing list, or should I send them to you and CC
linux-man@?  I have at least one bug report for you.


[1]:  <>

GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
