From: Jim Dennis
Subject: Re: LYNX-DEV Since Lynx won't, what will?
Date: Wed, 06 Aug 1997 23:09:12 -0700

> On Sun, 27 Jul 1997, Scott McGee (Personal) wrote:
> 
> Can you do a crawl/traverse, then do a for i in `cat urls.out` ; do lynx
> -source $i | grep IMG >img.urls ; done, and then for i in `cat
> img.urls`... type of script?

        I was the one who posted the request for a -mirror switch
        to supplement the -crawl -traversal options.

        I pointed out that one could write a script that parses
        the traverse.dat file and then fetches each of those
        URLs with -dump -source.

        I do this with a sort | uniq of the traversal output,
        then a grep -v to remove all of the "offsite" references,
        and then a '| while read i ; do ... done' loop.
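
        In outline, the pipeline looks something like this (a
        rough sketch -- the site name in the grep is just a
        placeholder, and this version flattens everything into
        the current directory):

        sort traverse.dat | uniq \
            | grep -v '^http://some.other.site/' \
            | while read -r i ; do
                # fetch each surviving URL, naming the local
                # copy after the last path component
                lynx -dump -source "$i" > "./${i##*/}"
            done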

        It's not a problem to do this -- but it seems terribly
        inefficient compared to the single-pass -crawl method.

        Thus my recommendation -- that we look at a -mirror switch.
        Since that idea was not well-received here I'll have to
        look at doing it myself.  That will be a stretch for me,
        since I don't have nearly the experience in C that this
        project will require -- but it's an opportunity to learn
        more.
 
>> When using -crawl and -traversal, how can I get Lynx to download
>> images or sound files?

>> I seem to recall that the answer was "Lynx wasn't designed for
>> this. Try something that was like ____ or ____."

>> Can someone fill in the blanks in the above? I need something that
>> will fetch such files via http to my own system.

        I think the software that was recommended as an alternative
        to my suggestion was wmir or webmirror and/or wget or geturl
        (which are all Perl libraries, scripts or modules as far as I 
        know).
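
        For what it's worth, wget's recursive mode can do this
        sort of one-shot retrieval by itself -- something like
        the following (the host is a placeholder; check your
        version's documentation for the exact options):

        wget -r -l 5 http://www.example.org/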

>> Trying to download each one separately is tooooo
>> ssssssllllllllloooooooowwwww.  (Oh, they are offered freely, just
>> that the author didn't think to offer an archive!).

        For a one-time mirror you can adapt the script I talked
        about.  Here's a lightly tested version of it:

#! /bin/bash
# lynx.mirror script in two passes
INDEXFILE="index.html"
TMPNAME=/tmp/lymir.$$

lynx -traversal "$1"
#       use -traversal to build the traverse.dat and reject.dat
#       files.  The traverse.dat file is reasonably formatted for
#       our purposes but there is a nagging problem.

#       If a link points to a directory name (forcing the
#       remote web server to return the index.html or equivalent
#       for that directory) then we end up with a file that
#       should be a directory.

#       Our only way (that I can see) of detecting this error
#       is if other entries in the dat file refer to files
#       in this directory (thus revealing that foo is a directory
#       rather than a file).

# The error trap for this condition is: 
mkdir_if_needed () {
        # make the directory if it isn't already there
        [ -d "./$1" ] && return 0
        if [ -e "./$1" ]; then
                # it exists and is not a directory, so it's a
                # collision (a regular file where a dir belongs):
                # move the file out of the way, make the dir by
                # that name, then re-file the original under the
                # new dir as the default index file
                mv "./$1" "$TMPNAME" && mkdir -p "./$1" &&
                        mv "$TMPNAME" "./$1/$INDEXFILE"
        else
                mkdir -p "./$1"
        fi
}
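
# Example of the collision case (hypothetical names): if "pub" was
# first fetched as a regular file and a later entry needs pub/faq.html,
# "mkdir_if_needed pub" moves the file aside, creates ./pub, and
# re-files the original page as ./pub/index.html.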
                        

grep '/$' traverse.dat | sort | uniq | while read -r i ; do
        # parse traverse.dat -- find lines that
        # refer only to directory names and make a unique
        # listing to avoid fetching any cross-linked docs
        # more than once
        # Then fetch those as $INDEXFILE (specified above)
                # trim URL hostname to get dir/filespec
        dir=${i#"$1"}; dir=${dir%/}; dir=${dir#/}
                # directory name
                # if necessary, make directory
        mkdir_if_needed "$dir"
        lynx -dump -source "$i" > "./${dir}/$INDEXFILE"
done

grep -v '/$' traverse.dat | sort | uniq | while read -r i ; do
        # parse traverse.dat -- filter out lines that
        # refer only to directory names and make a unique
        # listing to avoid fetching any cross-linked docs
        # more than once
                # use bash/ksh internal string handling to
                # isolate filename parts
        fspec=${i#"$1"}
                # full file spec: trim off URL hostname
        basename=${fspec##*/}
                # basename: trim off dirname
        dir=${fspec%/*}
                # directory name: trim off basename
        mkdir_if_needed "$dir"
        lynx -dump -source "$i" > "./${dir}/${basename}"
done

# Get the images and non-html on that site
grep ^"$1" reject.dat | { while read i ; do  
        # find urls in reject that start with the site
        # name that we're mirroring -- thus skipping all
        # off site references that were rejected as non-local
                # use bash/ksh internal string handling to
                # isolate filename parts
                fspec=${i#$1}; 
                        # full file spec
                basename=${fspec##*/}; 
                        # basename
                dir=${fspec%/*}; 
                        # directory name

                mkdir_if_needed $dir
                lynx -dump -source "$i" > ./${dir}/${basename}
                        ## dump
                done }
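
        To try it, run it from an empty directory with the top
        of the site as its argument (the host name here is just
        a placeholder):

        mkdir mirror ; cd mirror
        lynx.mirror http://www.example.org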

        Despite its appearance there's only about 30 to 50 lines
        of shell code there.  I haven't tested it aggressively --
        but it does mirror the couple of simple sites that I've
        tried it on.

        I'm convinced that it could be done in a few hundred lines
        of C code embedded in the browser -- and that it would do
        a better job of it.


--
Jim Dennis,                                address@hidden
Proprietor,                          address@hidden
Starshine Technical Services              http://www.starshine.org

        PGP  1024/2ABF03B1 Jim Dennis <address@hidden>
        Key fingerprint =  2524E3FEF0922A84  A27BDEDB38EBB95A 
