bug-wget

block digest on warc revisit records


From: Thomas Krichel
Subject: block digest on warc revisit records
Date: Thu, 9 Jul 2020 15:08:08 +0000

  Hi gang,

  it seems to me that the version of wget I have

archec@darni:~/pp$ wget -V | head -1
GNU Wget 1.20.3 built on linux-gnu.

  has a bug where it generates block digests for WARC revisit
  records. The following bash script reproduces the problem:

################################################################

TARGET=http://openlib.org/home/krichel/debug.css
WARC_1=/tmp/1
WARC_2=/tmp/2
DEDUP="--warc-dedup ${WARC_1}.cdx "
FLAGS='-O /dev/null --no-warc-compression --no-warc-keep-log --warc-cdx'

# start from clean sheet
rm -f $WARC_1.warc $WARC_2.warc ${WARC_1}.cdx

# run twice, dedup in CDX
wget $FLAGS --warc-file $WARC_1 $TARGET
wget $FLAGS --warc-file $WARC_2 $DEDUP $TARGET

# append the second to the first
cat $WARC_2.warc >> $WARC_1.warc

# now verify
python3 -m warcat --verbose verify $WARC_1.warc

################################################################

   When I run this, I get:

Opening WARC file ‘/tmp/1.warc’.

--2020-07-09 14:57:34--  http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’

/dev/null               100%[===============================>]      53  --.-KB/s    in 0s

2020-07-09 14:57:34 (11.8 MB/s) - ‘/dev/null’ saved [53/53]

Loaded 1 record from CDX.

Opening WARC file ‘/tmp/2.warc’.

--2020-07-09 14:57:34--  http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’

/dev/null               100%[===============================>]      53  --.-KB/s    in 0s

Found exact match in CDX file. Saving revisit record to WARC.
2020-07-09 14:57:34 (15.5 MB/s) - ‘/dev/null’ saved [53/53]

INFO:warcat.model.warc:Opened file /tmp/1.warc
ERROR:warcat.tool:Record <urn:uuid:ccc449bb-ddcc-4436-840b-fb143f33f47d> failed validation
Traceback (most recent call last):
  File "/home/archec/local/lib/python/warcat/tool.py", line 283, in action
    action(record)
  File "/home/archec/local/lib/python/warcat/tool.py", line 292, in verify_block_digest
    raise VerifyProblem('Bad block digest.', '5.8')
warcat.tool.VerifyProblem: ('Bad block digest.', '5.8', True)
INFO:warcat.model.warc:Finished reading Warc
Validation failed. Problems: 1.

  The dedup works, but the block digest on the revisit record is not
  correct. All other records validate just fine.
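
  To cross-check the result independently of warcat, something like
  the following Python sketch could be used. It is only an
  illustration, not warcat's or wget's actual code. It assumes an
  uncompressed WARC file (as produced with --no-warc-compression),
  header fields without continuation lines, and digests of the form
  sha1:<base32 of the SHA-1 of the record block>. The function names
  block_digest, iter_records and check are made up for this example.

```python
import base64
import hashlib
import io


def block_digest(block: bytes) -> str:
    """Return the SHA-1 of a record block as 'sha1:' + base32 text."""
    raw = hashlib.sha1(block).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def iter_records(stream):
    """Yield (headers, block) for each record of an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating two records
        # 'line' is now the version line (e.g. WARC/1.0); named fields follow
        headers = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break  # empty line ends the header section
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # the block is exactly Content-Length bytes long
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def check(stream):
    """Return the WARC-Record-IDs whose stated block digest does not match."""
    bad = []
    for headers, block in iter_records(stream):
        stated = headers.get("warc-block-digest")
        if stated is not None and stated != block_digest(block):
            bad.append(headers.get("warc-record-id"))
    return bad
```

  Calling check(open("/tmp/1.warc", "rb")) would then list the IDs of
  the records whose WARC-Block-Digest header disagrees with the
  digest recomputed from the block; records without that header are
  skipped.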

--

  Cheers,

  Thomas Krichel                  http://openlib.org/home/krichel
                                              skype:thomaskrichel


