Option to skip an ARC record based on size or other filtering policy

Description

While testing NutchWAX 0.12.1 with Archive-It collection 499, I found that some of the ARC files contain very large records, such as video files of 500+ MB.

In those cases not only are the contents not something we will index, but we would likely not care about duplicates like we do for contents we do index. What we should be able to do is configure NutchWAX 0.12.1 so that we:

o Read the ARC header
o Determine if the record size is above a threshold
  o If it is, then create a document in the segment for it, like we would if we were going to parse the whole thing
  o Don't parse it, but just skip directly to the next ARC record

In this scheme we wouldn't have a digest since the digest is only available if we actually read the entire contents, digesting as we go. We could use a dummy digest, such as "NONE" or "".
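The steps above could be sketched roughly as follows. This is only an illustration, not NutchWAX code: ArcHeader, SegmentDoc, BodyReader and the maxBytes threshold are hypothetical stand-ins for whatever the real ARC reader and segment writer provide, and "NONE" is the dummy digest suggested above.

```java
/** Sketch of the size-threshold check; every type here is a hypothetical stand-in. */
final class ArcSizeFilterSketch {
    /** Dummy digest for records whose body is never read. */
    static final String NO_DIGEST = "NONE";

    /** The fields we assume are available from the ARC record header alone. */
    static final class ArcHeader {
        final String url;
        final String contentType;
        final long length;

        ArcHeader(String url, String contentType, long length) {
            this.url = url;
            this.contentType = contentType;
            this.length = length;
        }
    }

    /** What ends up in the segment: header metadata, a digest, and a parsed flag. */
    static final class SegmentDoc {
        final ArcHeader header;
        final String digest;
        final boolean parsed;

        SegmentDoc(ArcHeader header, String digest, boolean parsed) {
            this.header = header;
            this.digest = digest;
            this.parsed = parsed;
        }
    }

    /** Hypothetical hook that reads the whole record body and returns its digest. */
    interface BodyReader {
        String readAndDigest(ArcHeader header);
    }

    static SegmentDoc process(ArcHeader header, long maxBytes, BodyReader reader) {
        if (header.length > maxBytes) {
            // Too big: keep the metadata, never touch the body, use the dummy digest.
            return new SegmentDoc(header, NO_DIGEST, false);
        }
        // Normal path: read the body, digesting as we go.
        return new SegmentDoc(header, reader.readAndDigest(header), true);
    }
}
```

The caller would then skip ahead to the next ARC record whenever the returned document has parsed set to false.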

The idea is that we still want the ARC record's URL, content type and other metadata in the search index so people can find the file via a URL search, but we skip the time and effort of pulling the contents across the wire and computing a digest, because we know a priori that we won't index content that is simply too big. Also, a large file is almost always of a type that we can't index for full-text search anyway, such as video files, audio files, zip files, etc.

In the future we might have a parser that can extract metadata from a WMV header, such that we only have to read the first, say, 100K of a WMV file to get the metadata (like title, summary/description, etc.) and not have to read in the whole thing. In that case, we might calculate a digest based on just that first 100K (speculative). It will need more thought when we get there.
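A digest over only the leading bytes could look something like the sketch below. It is as speculative as the idea itself: the class and method names are made up for illustration, and SHA-1 with hex output is an assumption, not necessarily the algorithm or encoding the real digest would use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: digest at most the first maxBytes of a record body (hypothetical helper). */
final class PrefixDigestSketch {
    static String sha1OfPrefix(InputStream in, int maxBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            int remaining = maxBytes;
            while (remaining > 0) {
                // Never ask for more than the remaining budget.
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // body is shorter than maxBytes
                }
                md.update(buf, 0, n);
                remaining -= n;
            }
            // Hex-encode the digest bytes.
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two records that share the same first 100K would get the same digest under this scheme, which is one of the things that needs more thought.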

Also, we can resurrect the URL regex filtering, which is currently disabled because it would prevent the metadata from being stored in the index. But if we tweak things, we might be able to use it to determine which URLs fall into a "metadata-only" bucket vs. a "full content" bucket vs. a "filter out completely" bucket.
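One possible shape for the three-bucket idea is sketched below, using plain java.util.regex. The class name, the rule format (ordered regexes, first match wins, default "full content") and the example patterns are all assumptions made for illustration; this is not the existing URL regex filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: sort URLs into three buckets with ordered regex rules (hypothetical design). */
final class UrlBucketSketch {
    enum Bucket { FULL_CONTENT, METADATA_ONLY, FILTER_OUT }

    private static final class Rule {
        final Pattern pattern;
        final Bucket bucket;

        Rule(String regex, Bucket bucket) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.bucket = bucket;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlBucketSketch add(String regex, Bucket bucket) {
        rules.add(new Rule(regex, bucket));
        return this;
    }

    /** First matching rule wins; URLs that match nothing get full-content treatment. */
    Bucket classify(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.bucket;
            }
        }
        return Bucket.FULL_CONTENT;
    }

    /** Example rule set; the patterns are illustrative only. */
    static UrlBucketSketch example() {
        return new UrlBucketSketch()
            .add("\\.(wmv|avi|mp3|zip)$", Bucket.METADATA_ONLY)
            .add("^https?://[^/]+/calendar/", Bucket.FILTER_OUT);
    }
}
```

With a scheme like this, a URL in the "metadata-only" bucket would be handled the same way as a record over the size threshold, while only the "filter out completely" bucket would keep a record out of the index altogether.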

Environment

None

Status

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major