bug-wget

[Bug-wget] [bug #53818] Proposal: Check HTML suffix (for TEXTHTML flag)


From: Tsukasa OI
Subject: [Bug-wget] [bug #53818] Proposal: Check HTML suffix (for TEXTHTML flag) also on unchanged files
Date: Thu, 3 May 2018 06:00:54 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0

URL:
  <http://savannah.gnu.org/bugs/?53818>

                 Summary: Proposal: Check HTML suffix (for TEXTHTML flag) also on unchanged files
                 Project: GNU Wget
            Submitted by: a4lg
            Submitted on: Thu 03 May 2018 07:00:52 PM JST
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: None
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: No
           Work Required: None
          Patch Included: Yes

    _______________________________________________________

Details:

Version: 1.19.4

If both the `-r' (recursive) and `-N' (timestamping) options are given and the
server returns 304 (Not Modified), the already-downloaded HTML file is not
treated as an HTML file, so the links in it are not followed.

If we want to periodically back up a website (where all pages are linked from
index.html directly or indirectly) to track changes while avoiding
unnecessary downloads, we naturally use the `-N' option. However, if some
"leaf" pages have changed but index.html has not, we can miss important
changes.
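
To make the scenario concrete, here is a toy model in Python. It is not Wget
code: the site layout, the `crawl` function, and the `parse_on_304` switch are
all hypothetical and exist only to illustrate the effect described above.

```python
# Toy model of a recursive, timestamp-checking crawl. Not Wget code.
# Each page maps to (changed_since_last_run, links_in_page).
SITE = {
    "index.html": (False, ["news.html", "about.html"]),
    "news.html": (True, []),    # a "leaf" page that changed
    "about.html": (False, []),
}


def crawl(start, parse_on_304):
    """Return the pages that would be re-downloaded in this run."""
    downloaded, seen, queue = [], {start}, [start]
    while queue:
        page = queue.pop(0)
        changed, links = SITE[page]
        if changed:
            downloaded.append(page)   # 200: a new copy is fetched
        elif not parse_on_304:
            continue                  # 304: local copy is not scanned for links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded


print(crawl("index.html", parse_on_304=False))  # []
print(crawl("index.html", parse_on_304=True))   # ['news.html']
```

In this model, an unchanged index.html stops the crawl at the first page
unless the local copy is still scanned for links.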

I hate this behavior (the `-nc' option mostly works because it guesses whether
a file is HTML from its file name suffix, but `-N' does not), so I decided to
propose a small change.

The attached patch reuses `get_file_flags` (which guesses whether a file is
HTML from its file name suffix *when the -nc (no clobber) option is given*)
when the server returns 304 (Not Modified).
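
The idea can be sketched in Python as follows. This is an illustration of the
suffix heuristic only, not the patch itself: the flag value, the suffix list,
and both function names are assumptions made for the example.

```python
import os

# Illustrative flag bit; the name comes from the report, the value is made up.
TEXTHTML = 0x2

# Assumed suffix list; the list Wget actually checks may differ.
HTML_SUFFIXES = {".html", ".htm"}


def has_html_suffix(filename):
    """Guess whether a local file is HTML from its name alone."""
    return os.path.splitext(filename)[1].lower() in HTML_SUFFIXES


def flags_for_unchanged_file(filename, flags=0):
    """Flags for a local file the server reported as unchanged (304).

    No Content-Type for the body is available in this sketch, so the only
    hint is the file name suffix.
    """
    if has_html_suffix(filename):
        flags |= TEXTHTML   # mark the file so its links get followed
    return flags


print(bool(flags_for_unchanged_file("index.html") & TEXTHTML))  # True
print(bool(flags_for_unchanged_file("logo.png") & TEXTHTML))    # False
```

Because the guess relies on the name alone, a file such as `page.php` that
actually contains HTML would not be flagged in this sketch.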

Note that:
1. This patch slightly changes Wget's behavior.
2. It introduces a caveat similar to bug #50935. If a solution to bug #50935
   is found, it can (and should) be applied here as well.
3. I (as the author) consider this patch too small to be copyrightable.

I tested the patch, but I'm not sure whether it is suitable for merging
upstream. I consider it an _improvement_, but you may consider that I
_broke_ the behavior.

Please let me know if you have any feedback about this.



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Thu 03 May 2018 07:00:52 PM JST  Name:
0001-Check-HTML-suffix-also-on-unchanged-files.patch  Size: 2KiB   By: a4lg

<http://savannah.gnu.org/bugs/download.php?file_id=44069>

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?53818>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



