NullPointerException when getting cookies

Description

The following NPE condition has been observed repeatedly in 2 separate crawls:

2014-07-16T14:56:53.257Z -5 1 http://byssan.is/benelli-ultra-light-28-review/ LLLEL http://byssan.is/author/admin/page/9/ unknown #029 - - http://byssan.is/ err=java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:159)
at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:131)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:632)
at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
java.lang.NullPointerException
at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:159)
at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:131)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:632)
at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

Frequently, most of these stacktraces may be squashed unless the -XX:-OmitStackTraceInFastThrow JVM option is set.

Discussing this on the mailing list Noah commented:

I found a few -5s in some of our currently active crawl jobs. Most of them also have no stack trace. I found one with the same stack trace as yours.
Excerpt from RequestAddCookies.java:

154 final List<Cookie> cookies = new
ArrayList<Cookie>(cookieStore.getCookies());
155 // Find cookies matching the given origin
156 final List<Cookie> matchedCookies = new ArrayList<Cookie>();
157 final Date now = new Date();
158 for (final Cookie cookie : cookies) {
159 if (!cookie.isExpired(now)) {
160 if (cookieSpec.match(cookie, cookieOrigin)) {

So it would appear that one of the cookies in the list returned by
cookieStore.getCookies() is null. It could be a threading issue, where another thread adds or removes a cookie at the same time that line 154 there is in the middle of making its copy. If that is the problem we could probably fix it by synchronizing BdbCookieStore.getCookies() and having it return a copy of the list. It seems wasteful to copy the list when it's about to get copied again, but I'm not sure there's a better way, if that's really the issue.

Environment

None

Activity

Show:
Kristinn Sigurðsson
October 3, 2014, 9:46 AM
Noah Levitt
September 30, 2014, 7:10 PM
Gordon Mohr
July 17, 2014, 2:13 AM

Actually, it appears lookup-by-custom-keying was taken out when a new FetchHTTP and supporting classes came in (2012).

So it seems now the default HttpComponents lookup - iterate over all cookies for every lookup - may be what's happening. And, and optimization in February brought all cookies back into an in-memory ArrayList. I think this would risk OOMEs in broad crawls (arbitrarily many alive cookies) and still often be slow (because of the linear probes in a giant list). Also, it looks like the same ArrayList is being returned for HttpComponents' unsynchronized probe as might be updated by other threads, which could account for both NPEs and maybe even other strange results (like Roger's 'wrong' cookies appearing in places).

I suspect returning to the prior style would resolve the errors (and might result in a performance boost), though it will also require patching-around the HttpComponents full-list, full-iteration approach... which isn't really appropriate for arbitrarily-broad crawls anyway.

(The prior patch, an alternate match() method that worked with SortedMap, can be seen at https://github.com/internetarchive/heritrix3/blob/4ffb6865e1cb0b4a002ea661534f0ccba0f4b862/commons/src/main/java/org/apache/commons/httpclient/cookie/CookieSpecBase.java#L592 and contrasted with the match() just above it, which is more like the full-iteration search of http://grepcode.com/file/repo1.maven.org/maven2/org.apache.httpcomponents/httpclient/4.0.1/org/apache/http/client/protocol/RequestAddCookies.java#174 . Perhaps it will be easier to swap in an alternate HttpRequestInterceptor in place of RequestAddCookies, than the prior CookieSpecBase patch?)

Gordon Mohr
July 17, 2014, 12:43 AM

I would tend to suspect any recent revisions to Heritrix's use of HttpComponent (such as moving to a later version or patching around limitations) as possible causes of this, and the not-quite-the-same problem of mixed-up cookies that Roger of BL mentioned on the list 7/4.

The adaptation of cookies to use a persistence-backed data structure required a custom keying of the cookies for ranged lookup. That could be fragile in the cast of unsynchronized HttpComponents changes, or other changes like the port-stripping in .

Fixed

Assignee

Unassigned

Reporter

Kristinn Sigurðsson