Bug in shouldProcessRule in WriterPoolProcessor. It doesn't work

Description

    public ProcessResult process(CrawlURI uri)
    throws InterruptedException {
        if (!getEnabled()) {
            return ProcessResult.PROCEED;
        }

        if (getShouldProcessRule().decisionFor(uri) == DecideResult.REJECT) {
            innerRejectProcess(uri);
            return ProcessResult.PROCEED;
        }

The innerRejectProcess implementation used by the WARC and ARC writers doesn't
actually stop the document from being saved. I believe this is a bug.
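
To make the control flow in question easier to follow, here is a minimal standalone sketch of the dispatch in process(). It is not Heritrix code: the class RejectFlowSketch, the string URIs, the ".jpg" rule and the innerProcess call on the accept path are all assumptions for illustration (the excerpt above stops before the accept path). It only records which hook ran for each URI.

```java
import java.util.ArrayList;
import java.util.List;

public class RejectFlowSketch {
    enum DecideResult { ACCEPT, REJECT, NONE }
    enum ProcessResult { PROCEED }

    // Records which hook ran, so the control flow can be inspected.
    static final List<String> calls = new ArrayList<>();

    // Hypothetical stand-in for getShouldProcessRule().decisionFor(uri).
    static DecideResult decisionFor(String uri) {
        return uri.endsWith(".jpg") ? DecideResult.REJECT : DecideResult.ACCEPT;
    }

    // Assumed accept-path hook; not shown in the excerpt from the report.
    static void innerProcess(String uri) {
        calls.add("innerProcess:" + uri);
    }

    // Reject-path hook, named after the method in the excerpt.
    static void innerRejectProcess(String uri) {
        calls.add("innerRejectProcess:" + uri);
    }

    static ProcessResult process(String uri) {
        if (decisionFor(uri) == DecideResult.REJECT) {
            innerRejectProcess(uri);
            return ProcessResult.PROCEED;
        }
        innerProcess(uri);
        return ProcessResult.PROCEED;
    }

    public static void main(String[] args) {
        process("http://example.com/a.html");
        process("http://example.com/b.jpg");
        System.out.println(calls);
    }
}
```

In this model the caller gets PROCEED on both paths, so whether a rejected URI is kept out of the archive depends entirely on what the reject hook (and the code that runs after it) does.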

    /**
     * If this fetch is identical to the last written (archived) fetch, then
     * copy forward the writeTag. This method should generally be called when
     * writeTag is present from a previous identical fetch, even though this
     * particular fetch is not being written anywhere (not even a revisit
     * record).
     */
    protected void copyForwardWriteTagIfDupe(CrawlURI curi) {
        if (IdenticalDigestDecideRule.hasIdenticalDigest(curi)) {
            @SuppressWarnings("unchecked")
            Map<String,Object>[] history =
                (Map<String,Object>[])curi.getData().get(A_FETCH_HISTORY);
            if (history[1].containsKey(A_WRITE_TAG)) {
                history[0].put(A_WRITE_TAG, history[1].get(A_WRITE_TAG));
            }
        }
    }

    @Override
    protected void innerRejectProcess(CrawlURI curi) throws InterruptedException
    {
        copyForwardWriteTagIfDupe(curi);
    }

This code doesn't do anything to stop the write when it is called after the
rule returns REJECT; it only copies the write tag forward for duplicate fetches.
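
A rough standalone model of the copy-forward logic, to show when it has any effect. This is a sketch under assumptions, not the real method: CopyForwardSketch and newHistory are made-up names, the boolean identicalDigest stands in for IdenticalDigestDecideRule.hasIdenticalDigest(curi), and the "write-tag" string is a placeholder for the A_WRITE_TAG constant, whose value is not shown in this report.

```java
import java.util.HashMap;
import java.util.Map;

public class CopyForwardSketch {
    // Placeholder key; the real value of A_WRITE_TAG is not shown in the report.
    static final String A_WRITE_TAG = "write-tag";

    // history[0] is the current fetch, history[1] the previous one,
    // mirroring the indexing used in the excerpt.
    static void copyForwardWriteTagIfDupe(boolean identicalDigest,
                                          Map<String, Object>[] history) {
        if (identicalDigest && history[1].containsKey(A_WRITE_TAG)) {
            history[0].put(A_WRITE_TAG, history[1].get(A_WRITE_TAG));
        }
    }

    // Builds an empty two-entry history array for the examples below.
    @SuppressWarnings("unchecked")
    static Map<String, Object>[] newHistory() {
        Map<String, Object>[] h = new Map[2];
        h[0] = new HashMap<>();
        h[1] = new HashMap<>();
        return h;
    }

    public static void main(String[] args) {
        // Non-duplicate fetch: the method leaves the history untouched.
        Map<String, Object>[] fresh = newHistory();
        copyForwardWriteTagIfDupe(false, fresh);
        System.out.println(fresh[0].containsKey(A_WRITE_TAG)); // prints "false"

        // Duplicate fetch with an earlier write tag: the tag is copied forward.
        Map<String, Object>[] dupe = newHistory();
        dupe[1].put(A_WRITE_TAG, "example.warc.gz");
        copyForwardWriteTagIfDupe(true, dupe);
        System.out.println(dupe[0].get(A_WRITE_TAG)); // prints "example.warc.gz"
    }
}
```

For a URI that is not a duplicate, the modelled method is a no-op, which matches the observation that the reject hook has no visible effect in the common case.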

Environment

None

Status

Assignee: Unassigned
Reporter: adam i
Labels: None
Group Assignee: None
ZendeskID: None
Estimated Difficulty: None
Actual Difficulty: None
Priority: Major