Re: [Lynx-dev] circumventing blocking sites

From: Stefan Caunter
Subject: Re: [Lynx-dev] circumventing blocking sites
Date: Sat, 4 Feb 2017 12:06:39 -0500

On Sat, Feb 4, 2017 at 11:28 AM, Nelson H. F. Beebe <address@hidden> wrote:
> For several years, I have used lynx (and also wget, and rarely, curl)
> to access publisher Web pages for new journal issues.  Recently, I
> noticed that a lynx pull of a page from Elsevier ScienceDirect would
> never complete:
>         % lynx -source -accept_all_cookies -cookies --trace > foo.62
>         parse_arg(arg_name=, mask=1, count=5)
>         parse_arg startfile:
>         ... no further output, and no job completion ...
> Similarly, I also find that wget and curl fail to complete.
> This new behavior suggests that the publisher site has thrown up
> http-agent-specific, rather than IP-address-specific blocks, because
> accessing the same URL in a GUI browser on the SAME machine gets an
> immediate return of the expected journal issue contents.
> If I add the --debug option to wget, I find that it reports
>         ---request begin---
>         GET /science/journal/00978493/62 HTTP/1.1
>         User-Agent: Wget/1.14 (linux-gnu)
>         Accept: */*
>         Host:
>         Connection: Keep-Alive
>         ---request end---
> Thus, it identifies itself as wget, and I assume that lynx probably
> self-identifies as well.
> Does anyone on this list have an idea how to circumvent these apparent
> blocks?

Put -useragent="Googlebot" (or "Mozilla") on your command line:

lynx -useragent="Mozilla" -accept_all_cookies -dump

gets me a long list of links in the HTML result.
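The same override exists in the other tools mentioned in the thread: wget takes --user-agent="..." and curl takes -A "...". As a minimal sketch of what these flags do at the HTTP level, here is a Python snippet that builds a request carrying a browser-like User-Agent header instead of the tool's default self-identification. The URL is a placeholder (not the actual ScienceDirect address from the thread), and nothing is sent over the network:

```python
import urllib.request

# A browser-like User-Agent string; servers that block "Wget/1.14" or
# "Lynx/x.y" by name will typically accept something like this.
UA = "Mozilla/5.0 (X11; Linux x86_64)"

def make_request(url: str) -> urllib.request.Request:
    """Build a GET request that self-identifies as a GUI browser."""
    return urllib.request.Request(url, headers={"User-Agent": UA})

# Placeholder URL for illustration only.
req = make_request("https://example.com/science/journal/00978493/62")
print(req.get_header("User-agent"))  # urllib stores the header as "User-agent"
```

Whether this works depends on how the site blocks: it defeats user-agent filtering, but not cookie, JavaScript, or IP-based checks.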
