block digest on warc revisit records
From: Thomas Krichel
Subject: block digest on warc revisit records
Date: Thu, 9 Jul 2020 15:08:08 +0000
Hi gang,
it seems to me that the version of wget I have
archec@darni:~/pp$ wget -V | head -1
GNU Wget 1.20.3 built on linux-gnu.
has a bug at the point where it generates block digests for WARC
revisit records. To reproduce this, here are some sample lines of
bash:
################################################################
TARGET=http://openlib.org/home/krichel/debug.css
WARC_1=/tmp/1
WARC_2=/tmp/2
DEDUP="--warc-dedup ${WARC_1}.cdx "
FLAGS='-O /dev/null --no-warc-compression --no-warc-keep-log --warc-cdx'
# start from clean sheet
rm -f $WARC_1.warc $WARC_2.warc ${WARC_1}.cdx
# run twice, dedup in CDX
wget $FLAGS --warc-file $WARC_1 $TARGET
wget $FLAGS --warc-file $WARC_2 $DEDUP $TARGET
# append the second to the first
cat $WARC_2.warc >> $WARC_1.warc
# now verify
python3 -m warcat --verbose verify $WARC_1.warc
################################################################
When I run this, I get:
Opening WARC file ‘/tmp/1.warc’.
--2020-07-09 14:57:34-- http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’
/dev/null 100%[===============================>] 53 --.-KB/s in 0s
2020-07-09 14:57:34 (11.8 MB/s) - ‘/dev/null’ saved [53/53]
Loaded 1 record from CDX.
Opening WARC file ‘/tmp/2.warc’.
--2020-07-09 14:57:34-- http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’
/dev/null 100%[===============================>] 53 --.-KB/s in 0s
Found exact match in CDX file. Saving revisit record to WARC.
2020-07-09 14:57:34 (15.5 MB/s) - ‘/dev/null’ saved [53/53]
INFO:warcat.model.warc:Opened file /tmp/1.warc
ERROR:warcat.tool:Record <urn:uuid:ccc449bb-ddcc-4436-840b-fb143f33f47d> failed validation
Traceback (most recent call last):
File "/home/archec/local/lib/python/warcat/tool.py", line 283, in action
action(record)
File "/home/archec/local/lib/python/warcat/tool.py", line 292, in verify_block_digest
raise VerifyProblem('Bad block digest.', '5.8')
warcat.tool.VerifyProblem: ('Bad block digest.', '5.8', True)
INFO:warcat.model.warc:Finished reading Warc
Validation failed. Problems: 1.
The dedup works, but the block digest on the revisit record is not
correct. All other records validate just fine.
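For anyone who wants to cross-check a digest by hand rather than
trust warcat: as far as I can tell, the digest values in these WARC
files are the SHA-1 of the bytes in question, Base32-encoded and
prefixed with "sha1:". Below is a minimal sketch of that computation,
assuming bash plus GNU coreutils (sha1sum, base32). The function name
warc_sha1_digest is made up for illustration and is not part of wget
or warcat.

```shell
#!/usr/bin/env bash
# Sketch only: print "sha1:<BASE32>" for the bytes arriving on stdin.
# warc_sha1_digest is a hypothetical helper name, not a wget/warcat tool.
warc_sha1_digest() {
    local hex
    # sha1sum prints 40 hex characters followed by the file name
    hex=$(sha1sum | cut -c1-40)
    # rewrite each hex pair as a \xHH escape so that bash's printf
    # emits the 20 raw digest bytes, then Base32-encode them
    printf "$(printf '%s' "$hex" | sed 's/../\\x&/g')" | base32 | sed 's/^/sha1:/'
}
```

To use it, cut the block of the revisit record out of the WARC file
(everything after the blank line that ends the WARC headers, for
Content-Length bytes), pipe it through warc_sha1_digest, and compare
the result with the WARC-Block-Digest header of that record, which
grep -a 'WARC-Block-Digest' /tmp/2.warc will list. If the two differ,
that would agree with what warcat reports.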
--
Cheers,
Thomas Krichel http://openlib.org/home/krichel
skype:thomaskrichel