Easier access to the capture URL and capture date for WAT files derived from both ARC and WARC.

Description

When dealing with WAT files generated from a mix of ARC and WARC files, the capture URL and capture date sit at different paths in the JSON block and must be handled differently:

From WARC

Envelope.WARC-Header-Metadata.WARC-Target-URI
Envelope.WARC-Header-Metadata.WARC-Date

From ARC

Envelope.ARC-Header-Metadata.Target-URI
Envelope.ARC-Header-Metadata.Date

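The date formats also differ: the WARC date is an ISO 8601 timestamp (e.g. 2010-01-12T19:00:57Z), while the ARC date is a 14-digit timestamp (e.g. 19961114215505). The Pig script has to deal with both differences itself, which is a pain.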
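To make the branching concrete, here is a minimal sketch of the logic a script currently has to carry. It is written in Python rather than Pig purely for illustration; the function name capture_url_and_date is hypothetical, and it assumes ARC dates are 14-digit UTC timestamps, as the examples in this ticket suggest.

```python
from datetime import datetime


def capture_url_and_date(envelope):
    """Return (url, iso_date) from a WAT JSON "Envelope" object.

    Hypothetical helper: handles both the WARC-style and ARC-style
    header metadata blocks described in this ticket.
    """
    if "WARC-Header-Metadata" in envelope:
        meta = envelope["WARC-Header-Metadata"]
        # WARC dates are already in ISO 8601 form.
        return meta["WARC-Target-URI"], meta["WARC-Date"]

    meta = envelope["ARC-Header-Metadata"]
    # Assumption: ARC dates are YYYYMMDDhhmmss in UTC.
    parsed = datetime.strptime(meta["Date"], "%Y%m%d%H%M%S")
    return meta["Target-URI"], parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Every script that reads mixed WATs needs some equivalent of this branch, which is the duplication this ticket is asking to remove.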

However, if you look at the WARC records in the WAT file itself, each record's header fields contain the URL and date in the same form for both ARC-based and WARC-based WATs. For example:

From WARC

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://state.tn.us/sos/acts/100/pub/PUBC1036.htm
WARC-Date: 2010-01-12T19:00:57Z
WARC-Record-ID: <urn:uuid:d2f04d35-f929-44ac-a7ac-87fec468cf63>
WARC-Refers-To: <urn:uuid:4ff93ea0-710f-40ef-a5f7-841359dba922>
Content-Type: application/json
Content-Length: 1124

{"Envelope":{"Format":"WARC","WARC-Header-Length":"327","Block-Digest":"sha1:TCGFY2I4XEDNNCBNSM2626JYD6XHN4RS","Actual-Content-Length":"134","WARC-Header-Metadata":{"WARC-Type":"metadata","WARC-Date":"2010-01-12T19:00:57Z","Content-Length":"134","WARC-Record-ID":"<urn:uuid:4ff93ea0-710f-40ef-a5f7-841359dba922>","WARC-Target-URI":"http://state.tn.us/sos/acts/100/pub/PUBC1036.htm","WARC-Concurrent-To":"<urn:uuid:a546c62c-4743-4a38-b6ac-2e57a3cce930>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"4","WARC-Metadata-Metadata":{"Trailing-Slop-Length":"0","Metadata-Records":[{"Name":"via","Value":"http://state.tn.us/sos/acts/100/pub/pc1001-1100.htm"},{"Name":"hopsFromSeed","Value":"LLLL"},{"Name":"sourceTag","Value":"http://state.tn.us/sos/"},{"Name":"fetchTimeMs","Value":"174"}],"Actual-Content-Length":"134"},"Actual-Content-Type":"application/metadata-fields"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"317","Header-Length":"10","Inflated-CRC":"528734932","Inflated-Length":"465"},"Offset":"440","Filename":"TENN-000093.warc.gz"}}

From ARC

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.sensenet.com:80/cgi-bin/wtk/wtkjazz.cgi?cmd=showrecording&id=104
WARC-Date: 1996-11-14T21:55:05Z
WARC-Record-ID: <urn:uuid:5d3b5584-dd58-43c9-b8d6-11aef4c518e7>
WARC-Refers-To: <urn:arc:5c2cfaf601dcddde26bf3af0415f2b30.IA-001964.arc.gz:26011>
Content-Type: application/json
Content-Length: 1159

{"Envelope":{"Format":"ARC","ARC-Header-Metadata":{"Date":"19961114215505","Content-Length":"235","Content-Type":"text/html","Target-URI":"http://www.sensenet.com:80/cgi-bin/wtk/wtkjazz.cgi?cmd=showrecording&id=104","IP-Address":"199.33.238.3"},"ARC-Header-Length":"118","Payload-Metadata":{"Trailing-Slop-Length":"1","Actual-Content-Type":"application/http; msgtype=response","HTTP-Response-Metadata":{"Headers":{"Content-type":"text/html","Date":"Thu, 14 Nov 1996 21:55:05 GMT","Server":"Apache/1.1.1"},"Headers-Length":"103","Entity-Length":"132","Entity-Trailing-Slop-Bytes":"0","Response-Message":{"Status":"200","Version":"HTTP/1.0","Reason":"OK"},"HTML-Metadata":{"Links":[{"path":"BODY@/background","url":"/staff/wtk/jazz/music.gif"}],"Head":{"Title":"Jazz"}},"Entity-Digest":"sha1:F27ZVSFCOGMGFUQCYML5XYMVJSV6FLOQ"},"Block-Digest":"sha1:J7HOA52LQPSAZQXCURMTGWSO5NN7ZPYI","Actual-Content-Length":"235"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"281","Header-Length":"10","Inflated-CRC":"375393481","Inflated-Length":"354"},"Offset":"26011","Filename":"5c2cfaf601dcddde26bf3af0415f2b30.IA-001964.arc.gz"}}

In both cases, the WARC record headers carry the "target" URL (WARC-Target-URI) and the capture date (WARC-Date), and the date is in WARC (ISO 8601) form.

It would be nice if our ArchiveJSONViewLoader() could provide access to these, so that a Pig script-writer didn't have to deal with the unimportant differences between the ARC and WARC headers inside the JSON block.
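As a rough sketch of what "providing access" could look like, the snippet below reads the two fields straight from the WARC record headers instead of the JSON block. It is Python for illustration only and says nothing about how ArchiveJSONViewLoader() is implemented; the function name parse_warc_headers is hypothetical, and it assumes the record text starts with the "WARC/1.0" version line and that the header block ends at the first blank line.

```python
def parse_warc_headers(record_text):
    """Parse the header block of one WARC record into a dict.

    Hypothetical helper: stops at the first blank line, which
    separates the headers from the JSON payload.
    """
    headers = {}
    lines = record_text.splitlines()
    for line in lines[1:]:  # skip the "WARC/1.0" version line
        if not line.strip():
            break  # a blank line ends the header block
        # Split on the first colon only, so URL values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers
```

With the headers parsed, `headers["WARC-Target-URI"]` and `headers["WARC-Date"]` give the URL and date in the same form whether the WAT came from an ARC or a WARC file.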

Environment

None

Status

Assignee

Brad Tofel

Reporter

Aaron Binns

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Priority

Major