[Bug-wget] [bug #50516] domain.com vs www.domain.com site duplication
From: Ages Ayemtwo
Subject: [Bug-wget] [bug #50516] domain.com vs www.domain.com site duplication
Date: Sat, 11 Mar 2017 15:01:58 -0500 (EST)
User-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.9) Gecko/20100101 Goanna/3.0 Firefox/45.9 PaleMoon/27.1.0
URL:
<http://savannah.gnu.org/bugs/?50516>
Summary: domain.com vs www.domain.com site duplication
Project: GNU Wget
Submitted by: ages2500
Submitted on: Sat 11 Mar 2017 08:01:57 PM UTC
Category: Feature Request
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Discussion Lock: Any
Release: None
Operating System: None
Reproducibility: None
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: No
_______________________________________________________
Details:
When retrieving http://www.domain.com/, the site author may have linked some
files via domain.com, without the www. The opposite can also occur. Either
scenario results in the website being downloaded twice, creating a haphazard
mesh of file links between:
/domain.com/
and
/www.domain.com/
It also means that 404 pages will link to http://domain.com/ in the HTML
files of one folder, and to http://www.domain.com/ in the other.
Even if one overlooks the local mess this creates, it still puts extra
strain on a large wget process, which crawls and downloads nearly twice as
much data as it needs to.
Restricting the crawl with -D www.domain.com runs the risk of missing data. To
ensure I get all of the data from the domain in question, I use -D
domain.com, which accepts both hostnames.
It would be nice to have an extra flag that treats domain.com and
www.domain.com content as the same in wget, and stores the content in the same
folder without content duplication.
I am not requesting that this feature be a default function, but rather an
additional flag/feature that treats www.domain.com and domain.com as coming
from the same domain.
The following command exhibits this behavior in wget:
wget -rkE -np -l inf -D runequake.com http://www.runequake.com/
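To illustrate the requested behavior, here is a rough Python sketch of the host normalization such a flag might perform. It is not wget code and does not reflect wget's internals; the function names canonical_host and local_path are hypothetical, and the index.html handling is only an assumption about how directory URLs would be stored.

```python
from urllib.parse import urlsplit


def canonical_host(url: str) -> str:
    """Return the lowercased host with one leading 'www.' label removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def local_path(url: str) -> str:
    """Map a URL to a local path rooted at the canonical host directory.

    Hypothetical layout: directory URLs are stored as index.html.
    """
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        path += "index.html"
    return canonical_host(url) + path


if __name__ == "__main__":
    # Both spellings of the host end up in the same local folder.
    for u in ("http://www.runequake.com/", "http://runequake.com/"):
        print(u, "->", local_path(u))
```

With a mapping like this, a link to either hostname resolves to one local file, so each page would only need to be fetched and stored once.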
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?50516>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/