[bug #59293] wget downloads nofollow links

bug-wget


From:	--
Subject:	[bug #59293] wget downloads nofollow links
Date:	Sun, 18 Oct 2020 09:22:27 -0400 (EDT)
User-agent:	Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0

URL:
  <https://savannah.gnu.org/bugs/?59293>

                 Summary: wget downloads nofollow links
                 Project: GNU Wget
            Submitted by: nhagea
            Submitted on: Sun 18 Oct 2020 01:22:25 PM UTC
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
                 Release: 1.20
         Discussion Lock: Any
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

I want to crawl/scrape a WordPress website with wget.
Problem: wget downloads linked documents even when the link carries a
`rel=nofollow` attribute.
It is my understanding that when robots.txt is honored, nofollow links are
honored as well, but this is not the case.

Quote from wget.info:
> Specify whether the norobots convention is respected by Wget, “on” by
> default. This switch controls both the ‘/robots.txt’ and the
> ‘nofollow’ aspect of the spec.

Overall, it would be good to have those two separated: a user might want to
ignore robots.txt but still not download nofollow links.
In the case of WordPress, nofollow links are used, for example, to reply to
and like comments, so every comment leads to the same page being downloaded
three times.

Example:

    wget --mirror --page-requisites --adjust-extension --convert-links \
         --restrict-file-names=windows --no-parent --span-hosts \
         --domains=randomascii.wordpress.com,wp.com \
         https://randomascii.wordpress.com/about/

Now open the `about` folder. After a few seconds you will see dozens of HTML
files that stem from nofollow links: `index.html@share=reddit.html`,
`index.html@share=twitter.html`, `index.html@replytocom=74214.html` ...
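
As a possible stopgap, the unwanted URLs could be filtered by their query
parameters rather than by `rel=nofollow`. The sketch below only illustrates
such a pattern in Python; the URLs are reconstructed from the file names above
(assuming `@` stands for `?` because of `--restrict-file-names=windows`), and
the pattern itself is a hypothetical example, not something taken from wget.
A pattern of this kind could presumably be passed to wget's `--reject-regex`
option, but that has not been tested here.

```python
import re

# Hypothetical pattern: matches a `share` or `replytocom` query parameter.
UNWANTED = re.compile(r"[?&](share|replytocom)=")


def is_unwanted(url):
    """Return True if the URL carries one of the unwanted query parameters."""
    return UNWANTED.search(url) is not None


# Assumed URLs, reconstructed from the downloaded file names.
urls = [
    "https://randomascii.wordpress.com/about/",
    "https://randomascii.wordpress.com/about/?share=reddit",
    "https://randomascii.wordpress.com/about/?share=twitter",
    "https://randomascii.wordpress.com/about/?replytocom=74214",
]

# Keep only the URLs that do not match the pattern.
kept = [u for u in urls if not is_unwanted(u)]
print(kept)
```

This only hides the symptom for this one site; it does not replace a proper
`rel=nofollow` check.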

Is `nofollow` only checked when it appears in a <meta> tag and not on <a>
elements? That would be quite bad.
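
To make the expected behavior concrete, here is a minimal Python sketch of a
link extractor that honors `nofollow` both in a page-level robots <meta> tag
and on individual <a> elements. It is not wget's implementation; the class and
function names (`LinkCollector`, `links_to_follow`) and the sample markup are
made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> targets, keeping rel=nofollow links separate."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by a robots <meta> tag with nofollow
        self.follow = []            # links a crawler may follow
        self.skipped = []           # links marked rel=nofollow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            tokens = [t.strip() for t in content.split(",")]
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            # rel is a space-separated token list, e.g. "nofollow noopener"
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "nofollow" in rel_tokens:
                self.skipped.append(attrs["href"])
            else:
                self.follow.append(attrs["href"])


def links_to_follow(html):
    """Return the links that are not excluded by either nofollow mechanism."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    if parser.page_nofollow:
        return []
    return parser.follow


# Hypothetical markup modeled on the WordPress links described above.
SAMPLE = """
<html><body>
<a href="/about/">About</a>
<a rel="nofollow" href="/about/?replytocom=74214">Reply</a>
<a rel="nofollow noopener" href="/about/?share=reddit">Share</a>
</body></html>
"""

print(links_to_follow(SAMPLE))
```

With per-link checking, only `/about/` would be followed for the sample
markup; the reply and share links would be skipped.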

Tested version: `GNU Wget 1.20.1 built on linux-gnu.` (Debian).




