Re: LYNX-DEV Since Lynx won't, what will?
From: Jim Dennis
Subject: Re: LYNX-DEV Since Lynx won't, what will?
Date: Wed, 06 Aug 1997 23:09:12 -0700
> On Sun, 27 Jul 1997, Scott McGee (Personal) wrote:
>
> Can you do a crawl/traverse, then do a for i in `cat urls.out` ; do lynx
> -source $i | grep IMG >img.urls ; done, and then for i in `cat
> img.urls`... type of script?
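    A caveat on that one-liner: the '> img.urls' redirection
    inside the loop truncates the file on every iteration, so
    only the last page's IMG lines survive; redirecting the
    whole loop (or using '>>') keeps them all. A minimal
    illustration, with echo standing in for the
    lynx -source | grep IMG stage:

```shell
#!/bin/bash
# stand-in data: pretend urls.out holds three URLs
printf 'u1\nu2\nu3\n' > urls.out

# truncating redirect inside the loop: each pass clobbers
# the file, so only the last line's output remains
for i in `cat urls.out` ; do echo "IMG $i" > img.urls ; done

# redirecting the loop as a whole keeps every line
for i in `cat urls.out` ; do echo "IMG $i" ; done > img.urls
```

    After the second loop, img.urls holds all three lines.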
I was the one who posted the request for a -mirror switch
to supplement the -crawl -traversal options.
    I pointed out that one could write a script that parses
    the traverse.dat file and fetches each of the files
    with -dump -source.
    I do this with a sort | uniq of the traversal, then
    a grep -v to remove all of the "offsite" references,
    and then a '| { while read i ; do ... }' loop.
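    As a concrete (made-up) example of that filter stage --
    here keeping the on-site lines directly rather than
    grep -v'ing each off-site host:

```shell
#!/bin/bash
# hypothetical traverse.dat; the real one is written out
# by lynx -traversal
cat > traverse.dat <<'EOF'
http://www.example.com/a.html
http://www.example.com/a.html
http://other.site.example/x.html
http://www.example.com/b.html
EOF

SITE="http://www.example.com"    # the site being mirrored
sort traverse.dat | uniq | grep "^$SITE" | { while read i ; do
        # a real pass would run: lynx -dump -source "$i"
        echo "would fetch: $i"
done }
```

    The duplicate a.html entry collapses to one fetch and the
    off-site URL drops out.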
    It's not a problem to do this -- but it seems terribly
    inefficient compared to the single-pass -crawl method.
    Thus my recommendation that we look at a -mirror switch.
    Since that idea was not well-received here, I'll have to
    look at doing it myself. That will be a stretch for me,
    since I don't have nearly the experience in C that this
    project will require -- but it's an opportunity to learn more.
>> When using -crawl and -traversal, how can I get Lynx to download
>> images or sound files?
>> I seem to recall that the answer was "Lynx wasn't designed for
>> this. Try something that was like ____ or ____."
>> Can someone fill in the blanks in the above? I need something that
>> will fetch such files via http to my own system.
I think the software that was recommended as an alternative
to my suggestion was wmir or webmirror and/or wget or geturl
(which are all Perl libraries, scripts or modules as far as I
know).
>> Trying to download each one separately is tooooo
>> ssssssllllllllloooooooowwwww. (Oh, they are offered freely, just
>> that the author didn't think to offer an archive!).
    For a one-time mirror you can adapt the script I
    described. Here's a lightly tested version of it:
#! /bin/bash
# lynx.mirror script in two passes
INDEXFILE="index.html"
TMPNAME=/tmp/lymir.$$

## lynx -traversal "$1"
# (uncomment the line above, or run -traversal separately
# first, to build the traverse.dat and reject.dat files)
#
# The traverse.dat file is reasonably formatted for
# our purposes but there is a nagging problem.
# If a link points to a directory name (forcing the
# remote web server to fetch the index.html or equivalent
# for that directory) then we end up with a file
# that should be a directory.
# Our only way (that I can see) of detecting this error
# is if other entries in the dat file refer to other files
# in this directory (thus revealing that foo is a directory
# rather than a file).
# The error trap for this condition is:
function mkdir_if_needed {
        # nothing to do if the directory already exists
        [ -d "./$1" ] && return
        if [ -e "./$1" ] ; then
                # it exists but is not a directory --
                # a collision (regular file vs. dir):
                # move it out of the way, make the dir
                # by that name, and move it back under
                # the new dir as the default index file
                mv "./$1" "$TMPNAME" && mkdir -p "./$1" \
                        && mv "$TMPNAME" "./$1/$INDEXFILE"
        else
                mkdir -p "./$1"
        fi
}

# parse traverse.dat -- find the lines that refer only to
# directory names, and make a unique listing to avoid
# fetching any cross-linked docs more than once.
# Then fetch those as $INDEXFILE (specified above)
grep '/$' traverse.dat | sort | uniq | { while read i ; do
        # trim the URL hostname to get the dir spec
        dir=${i#"$1"}; dir=${dir%/}; dir=${dir#/}
        # if necessary, make the directory
        mkdir_if_needed "$dir"
        lynx -dump -source "$i" > "./${dir}/$INDEXFILE"
done }

# parse traverse.dat again -- this time filter out the
# lines that refer only to directory names, and make a
# unique listing to avoid fetching any cross-linked docs
# more than once
grep -v '/$' traverse.dat | sort | uniq | { while read i ; do
        # use bash/ksh internal string handling to
        # isolate the filename parts
        fspec=${i#"$1"}; fspec=${fspec#/}
        # full file spec: URL hostname trimmed off
        basename=${fspec##*/}
        # basename: trim off the dirname
        dir=${fspec%/*}
        # directory name: trim off the basename
        # (a file at the site's top level has no dirname)
        [ "$dir" = "$fspec" ] && dir=.
        mkdir_if_needed "$dir"
        lynx -dump -source "$i" > "./${dir}/${basename}"
done }

# Get the images and non-HTML files on that site
grep "^$1" reject.dat | { while read i ; do
        # find urls in reject.dat that start with the site
        # name that we're mirroring -- thus skipping all
        # off-site references that were rejected as non-local
        fspec=${i#"$1"}; fspec=${fspec#/}
        basename=${fspec##*/}
        dir=${fspec%/*}
        [ "$dir" = "$fspec" ] && dir=.
        mkdir_if_needed "$dir"
        lynx -dump -source "$i" > "./${dir}/${basename}"
done }
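    Incidentally, the bash/ksh parameter expansions used above
    can be checked in isolation; with a made-up traverse.dat
    entry:

```shell
#!/bin/bash
i="http://www.example.com/pics/photo.gif"   # hypothetical URL
site="http://www.example.com"               # what $1 would hold

fspec=${i#"$site"}      # trim the hostname:   /pics/photo.gif
fspec=${fspec#/}        # trim leading slash:  pics/photo.gif
basename=${fspec##*/}   # trim the dirname:    photo.gif
dir=${fspec%/*}         # trim the basename:   pics
echo "$dir/$basename"
```

    This prints pics/photo.gif, ready to be used as the local
    path for the fetched file.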
    Despite its appearance, there are only about 30 to 50 lines
    of shell code there. I haven't tested it aggressively --
    but it does mirror the couple of simple sites that I have
    tested it on.
    I'm convinced that it could be done in a few hundred lines
    of C code embedded into the browser -- and do a better job
    of it.
--
Jim Dennis, address@hidden
Proprietor, address@hidden
Starshine Technical Services http://www.starshine.org
PGP 1024/2ABF03B1 Jim Dennis <address@hidden>
Key fingerprint = 2524E3FEF0922A84 A27BDEDB38EBB95A