[bug #59293] wget downloads nofollow links
From: --
Subject: [bug #59293] wget downloads nofollow links
Date: Sun, 18 Oct 2020 09:22:27 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0
URL:
<https://savannah.gnu.org/bugs/?59293>
Summary: wget downloads nofollow links
Project: GNU Wget
Submitted by: nhagea
Submitted on: Sun 18 Oct 2020 01:22:25 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Release: 1.20
Discussion Lock: Any
Operating System: GNU/Linux
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None
_______________________________________________________
Details:
I want to crawl/scrape a WordPress website with wget.
Problem: wget downloads documents/links even though they carry a
`rel=nofollow` attribute.
My understanding is that when robots.txt is honored, nofollow links are
honored as well, but this is not the case.
Quote from wget.info:
> Specify whether the norobots convention is respected by Wget, “on” by
> default. This switch controls both the ‘/robots.txt’ and the
> ‘nofollow’ aspect of the spec.
Overall it would be good to have those two separated: a user might want to
ignore robots.txt but still not download nofollow links.
In the case of WordPress, such links are used, for example, to reply to and
like comments, so every comment leads to the same page being downloaded 3x.
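
To illustrate the duplication, here is a minimal Python sketch. It assumes that the `replytocom` and `share` query parameters (inferred from the file names listed further down) only trigger an action and do not change which document is served. The URLs with query strings and the helper name `canonical_url` are hypothetical; this is not how wget itself treats URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only trigger an action (reply, share)
# and the server returns the same page with or without them.
ACTION_PARAMS = {"replytocom", "share"}


def canonical_url(url: str) -> str:
    """Drop the action-only query parameters and any fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ACTION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical URLs reconstructed from the saved file names.
urls = [
    "https://randomascii.wordpress.com/about/",
    "https://randomascii.wordpress.com/about/?replytocom=74214",
    "https://randomascii.wordpress.com/about/?share=reddit",
]

# All three collapse to the same canonical URL.
unique = {canonical_url(u) for u in urls}
```

Under that assumption, each of these URLs is a separate download for a crawler even though all of them point at the same page.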
Example:
> wget --mirror --page-requisites --adjust-extension --convert-links
--restrict-file-names=windows --no-parent --span-hosts
--domains=randomascii.wordpress.com,wp.com
https://randomascii.wordpress.com/about/
Now open the about folder and after a few seconds you will see dozens of HTML
files that stem from nofollow links: `index.html@share=reddit.html`,
`index.html@share=twitter.html`, `index.html@replytocom=74214.html` ...
Is `nofollow` only checked when it appears in a <meta> tag and not when it is
set on an <a> element? That would be quite bad.
Tested version: `GNU Wget 1.20.1 built on linux-gnu.` (Debian).
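
To make the distinction in the question above concrete, here is a minimal Python sketch of the behavior I would expect: a page-level `<meta name="robots" content="nofollow">` covers every link on the page, while `rel=nofollow` on an `<a>` covers only that one link. This is not wget's code and says nothing about how wget is implemented; the names `NofollowScanner`, `classify_links`, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class NofollowScanner(HTMLParser):
    """Collect <a href> targets and note which ones carry a nofollow hint."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="...nofollow...">
        self.follow = []            # links without rel=nofollow
        self.skip = []              # links with rel=nofollow

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = [d.strip() for d in attr.get("content", "").lower().split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and attr.get("href"):
            # rel is a space-separated token list, e.g. "nofollow noopener"
            if "nofollow" in attr.get("rel", "").lower().split():
                self.skip.append(attr["href"])
            else:
                self.follow.append(attr["href"])


def classify_links(html):
    """Return (links_to_follow, links_to_skip) for one HTML document."""
    scanner = NofollowScanner()
    scanner.feed(html)
    scanner.close()
    if scanner.page_nofollow:
        # A page-level directive applies to every link on the page.
        return [], scanner.follow + scanner.skip
    return scanner.follow, scanner.skip


# Made-up markup modeled on the reply/share links described in this report.
SAMPLE = """
<html><head><title>About</title></head><body>
<a href="/about/">About</a>
<a rel="nofollow" href="/about/?replytocom=74214">Reply</a>
<a rel="nofollow noopener" href="/about/?share=reddit">Share</a>
</body></html>
"""

if __name__ == "__main__":
    to_follow, to_skip = classify_links(SAMPLE)
    print("follow:", to_follow)
    print("skip:  ", to_skip)
```

With per-link handling like this, the reply and share links would be skipped while the plain link to the page itself would still be followed.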
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?59293>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/