bug-wget

[Bug-wget] [bug #50516] domain.com vs www.domain.com site duplication


From: Ages Ayemtwo
Subject: [Bug-wget] [bug #50516] domain.com vs www.domain.com site duplication
Date: Sat, 11 Mar 2017 15:01:58 -0500 (EST)
User-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.9) Gecko/20100101 Goanna/3.0 Firefox/45.9 PaleMoon/27.1.0

URL:
  <http://savannah.gnu.org/bugs/?50516>

                 Summary: domain.com vs www.domain.com site duplication
                 Project: GNU Wget
            Submitted by: ages2500
            Submitted on: Sat 11 Mar 2017 08:01:57 PM UTC
                Category: Feature Request
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: None
        Operating System: None
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: No

    _______________________________________________________

Details:

When retrieving http://www.domain.com/, the site author may link to a file on
domain.com, without the www. The reverse also occurs: a site retrieved from
domain.com may link to www.domain.com.

Either scenario results in the website being downloaded twice, creating a
haphazard mesh of file links between:

/domain.com/

and

/www.domain.com/

It also means that 404 pages will link to http://domain.com/ in the HTML of
files in one folder, and to http://www.domain.com/ in the other.

If one were to overlook the local mess this creates, it still puts extra
strain on a large wget process by crawling and downloading nearly twice as
much data as it needs to.

Restricting the site to -D www.domain.com runs the risk of missing data. To
ensure I get all of the data from the domain in question, I use -D
domain.com.

It would be nice for an extra flag to treat domain.com and www.domain.com
content the same in wget, and store the content in the same folder without
content duplication.

I am not requesting that this feature be a default function, but rather an
additional flag/feature that treats www.domain.com and domain.com as coming
from the same domain.
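
To illustrate the requested behavior, here is a minimal sketch in Python. It
is not wget code and does not describe any existing wget option; the names
canonical_host and local_path are hypothetical, and the "index.html" default
is an assumption made only for this example. The idea is that both host
spellings map to one local folder, so each file is stored once:

```python
from urllib.parse import urlsplit


def canonical_host(url: str) -> str:
    """Return the URL's host in lowercase with one leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def local_path(url: str) -> str:
    """Map a URL to a host-prefixed local path (hypothetical layout)."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # Assumption for this sketch: directory URLs are saved as index.html.
        path += "index.html"
    return canonical_host(url) + path


if __name__ == "__main__":
    for url in ("http://www.domain.com/a/page.html",
                "http://domain.com/a/page.html"):
        print(url, "->", local_path(url))
```

With this mapping, both example URLs resolve to domain.com/a/page.html, so a
crawler that tracks visited files by local path would fetch the page only
once.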

The following command exhibits this behavior in wget:


wget -rkE -np -l inf -D runequake.com http://www.runequake.com/

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?50516>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



