From: Benjamin Esham
Subject: [Bug-wget] Patch: Always surround the "WARC-Target-URI" value with angle brackets
Date: Fri, 3 Mar 2017 09:00:57 -0500
Hello,
When producing WARC files, Wget records the requested URI in the
"WARC-Target-URI" field. I noticed that Wget encloses this value in angle
brackets in records with "WARC-Type: request", but not in records of type
"response", "resource", "revisit", or "metadata". The spec [1] requires URIs
to be enclosed in angle brackets. I'm attaching a patch that adds the angle
brackets for all record types.
(Doing this for "request" records was the subject of bug 47281 [2], which was
fixed almost exactly a year ago. My patch simply extends the use of the
warc_write_header_uri function to the other appropriate places.)
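As a rough illustration of the idea (this is not the patch itself, and it is written in Python rather than Wget's C), every "WARC-Target-URI" field can be routed through a single helper that adds the brackets, whatever the record type. The name format_target_uri is hypothetical and does not appear in Wget:

```python
def format_target_uri(uri: str) -> str:
    """Return a WARC-Target-URI header line with the URI in angle brackets.

    Hypothetical helper for illustration only; Wget's real code lives in
    src/warc.c and is written in C.
    """
    # Avoid double-wrapping a value that is already bracketed.
    if not (uri.startswith("<") and uri.endswith(">")):
        uri = f"<{uri}>"
    # WARC header lines are terminated by CRLF.
    return f"WARC-Target-URI: {uri}\r\n"


# Use the same helper for every record type, not only "request".
for record_type in ("request", "response", "resource", "revisit", "metadata"):
    print(f"WARC-Type: {record_type}")
    print(format_target_uri("https://www.gnu.org/software/wget/"), end="")
```

Keeping the bracketing in one place is what makes it hard for one record type to drift out of step with the others.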
Here is a truncated example of the output from Wget 1.19.1:
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:95D7B77A-C019-4E91-9BBB-7526B68864F2>
WARC-Warcinfo-ID: <urn:uuid:29F863DF-B273-498B-B91C-B50B2FD1BFCD>
WARC-Concurrent-To: <urn:uuid:EDCAF84C-D7A6-43CE-AE78-AEE16D3B7F4B>
WARC-Target-URI: https://www.gnu.org/software/wget/
And from the patched version:
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:54F2170C-C3FA-4B05-A8B1-116466D92401>
WARC-Warcinfo-ID: <urn:uuid:29BCF957-0D4D-4933-9CA3-F7FF2218D144>
WARC-Concurrent-To: <urn:uuid:61FCAFA4-5DF9-4CC0-A6C6-BC233601EF1E>
WARC-Target-URI: <https://www.gnu.org/software/wget/>
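To check an existing WARC file for this problem, something along the following lines could be used. It is only a sketch: the function name unbracketed_target_uris is made up for this example, and it assumes the record headers have already been read into a string (it does not parse whole WARC files or handle compression).

```python
def unbracketed_target_uris(header_text: str) -> list[str]:
    """Return the WARC-Target-URI values that lack angle brackets.

    Hypothetical helper; expects the header lines of one or more records.
    """
    bad = []
    for line in header_text.splitlines():
        # Split "Name: value" at the first colon only, so colons inside
        # the URI itself are left alone.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-target-uri":
            value = value.strip()
            if not (value.startswith("<") and value.endswith(">")):
                bad.append(value)
    return bad


old_header = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://www.gnu.org/software/wget/\r\n"
)
new_header = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: <https://www.gnu.org/software/wget/>\r\n"
)
print(unbracketed_target_uris(old_header))
print(unbracketed_target_uris(new_header))
```

Run against the two examples above, the first header should be reported and the second should not.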
Best regards,
Benjamin
[1] http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
[2] http://savannah.gnu.org/bugs/?47281
0001-src-warc.c-Use-warc_write_header_uri-for-all-WARC-Ta.patch
Description: Binary data