bug-wget

[bug #59293] wget downloads nofollow links


From: --
Subject: [bug #59293] wget downloads nofollow links
Date: Sun, 18 Oct 2020 09:22:27 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0

URL:
  <https://savannah.gnu.org/bugs/?59293>

                 Summary: wget downloads nofollow links
                 Project: GNU Wget
            Submitted by: nhagea
            Submitted on: Sun 18 Oct 2020 01:22:25 PM UTC
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
                 Release: 1.20
         Discussion Lock: Any
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

I want to crawl/scrape a WordPress website with wget.
Problem: wget downloads documents/links even though they carry a
`rel=nofollow` attribute.
My understanding is that when robots.txt is honored, nofollow links are
honored as well, but this is not the case.

Quote wget.info:
> Specify whether the norobots convention is respected by Wget, “on” by
> default. This switch controls both the ‘/robots.txt’ and the
> ‘nofollow’ aspect of the spec.

Overall it would be good to have these two separated: a user might want to
ignore robots.txt but still not download nofollow links.
On WordPress, nofollow links are used, for example, for replying to and liking
comments, so every comment leads to the same page being downloaded three times.

Example:

> wget --mirror --page-requisites --adjust-extension --convert-links \
>   --restrict-file-names=windows --no-parent --span-hosts \
>   --domains=randomascii.wordpress.com,wp.com \
>   https://randomascii.wordpress.com/about/

Now open the `about` folder; after a few seconds you will see dozens of HTML
files that stem from nofollow links: `index.html@share=reddit.html`,
`index.html@share=twitter.html`, `index.html@replytocom=74214.html` ...
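
As a possible stopgap, the unwanted URLs could be filtered by their query
parameters. The sketch below is only an illustration: it assumes the `@` in the
file names above stands for the `?` of the original URL (which is what
`--restrict-file-names=windows` does), and the pattern and the `is_rejected`
helper are hypothetical names, covering only the `share` and `replytocom`
parameters seen in this report. The same pattern should also be usable with
wget's `--reject-regex` option, but that has not been verified here.

```python
import re

# Hypothetical pattern: matches URLs whose query string carries a `share`
# or `replytocom` parameter, the two seen in the file names of this report.
REJECT_PATTERN = r"[?&](share|replytocom)="


def is_rejected(url):
    """Return True if the URL would be skipped under REJECT_PATTERN."""
    return re.search(REJECT_PATTERN, url) is not None


if __name__ == "__main__":
    base = "https://randomascii.wordpress.com/about/"
    for url in (base, base + "?share=reddit", base + "?replytocom=74214"):
        print(url, "-> skip" if is_rejected(url) else "-> download")
```

This filters on URL shape rather than on the `rel` attribute, so it does not
replace proper nofollow handling; it only avoids the duplicate downloads
described here.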

Is `nofollow` only checked when in a <meta> tag and not for <a>? That would be
quite bad.
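
To make the distinction in this question concrete, here is a minimal sketch of
the two places where `nofollow` can appear: page-wide in a
`<meta name="robots">` tag, and per link in the `rel` attribute of an `<a>`
tag. It is not wget's implementation; the `NofollowScanner` class and the
`followable_links` function are made-up names, and it only shows what a
crawler would have to check to honor both forms.

```python
from html.parser import HTMLParser


class NofollowScanner(HTMLParser):
    """Collects <a> links and records which nofollow form applies to them."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="...nofollow...">
        self.links = []             # (href, has_rel_nofollow) for each <a href=...>

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in attrs.get("content", "").split(",")]
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and "href" in attrs:
            rel_tokens = attrs.get("rel", "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel_tokens))


def followable_links(html):
    """Return the hrefs a crawler honoring both nofollow forms would follow."""
    scanner = NofollowScanner()
    scanner.feed(html)
    if scanner.page_nofollow:
        return []
    return [href for href, nofollow in scanner.links if not nofollow]
```

With a page containing `<a href="/about/">` and
`<a rel="nofollow" href="/about/?replytocom=74214">`, only `/about/` would be
returned; a crawler that checks only the `<meta>` form would return both.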

Tested version: `GNU Wget 1.20.1 built on linux-gnu.` (Debian).




    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?59293>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



