Limited Parallelism

Description

When running Heritrix with parallelqueue=100 and 100 threads against a single website (e.g. walmart.com), we observed that only a few of the 100 queues are actually used.

When crawling many websites simultaneously, we are able to saturate all 100 queues.

Is there a better way to exploit more parallelism even for a single website?
It seems the current mechanism keys queues by SURT, and URLs from a single website tend to share the same SURT, so they end up sitting in a couple of very long queues while the others stay empty.
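To illustrate the behavior being described: Heritrix's default policy derives each queue key from the URI's SURT authority, so every URL on one host collapses to one key. The sketch below is not Heritrix's actual API; it is a minimal, self-contained comparison of an authority-based key against a hypothetical hash-based key that would spread one site's URLs over N buckets (at the cost of per-queue politeness no longer corresponding to per-host politeness). The class, method names, and bucket count are illustrative assumptions.

```java
// Illustrative sketch only -- not Heritrix's QueueAssignmentPolicy API.
// Shows why a SURT-authority key serializes one site into one queue,
// and how hashing the full URI would spread it across N queues.
public class QueueKeyDemo {

    // Crude host extraction, sufficient for the demo URLs below.
    static String hostOf(String url) {
        String s = url.replaceFirst("^[a-z]+://", "");
        int slash = s.indexOf('/');
        return slash >= 0 ? s.substring(0, slash) : s;
    }

    // SURT-authority-style key: reversed host labels. Every URL on
    // www.walmart.com maps to the same key, hence the same queue.
    static String surtAuthorityKey(String url) {
        String[] parts = hostOf(url).split("\\.");
        StringBuilder sb = new StringBuilder("http://(");
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]).append(',');
        }
        return sb.append(')').toString();
    }

    // Hypothetical alternative: hash the whole URI into one of n buckets,
    // so a single site's URLs are distributed across n queues.
    static String hashedKey(String url, int n) {
        return "bucket-" + Math.floorMod(url.hashCode(), n);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.walmart.com/",
            "http://www.walmart.com/cp/electronics",
            "http://www.walmart.com/cp/toys"
        };
        for (String u : urls) {
            // Same SURT key every time; hashed keys vary per URL.
            System.out.println(surtAuthorityKey(u) + "  vs  " + hashedKey(u, 100));
        }
    }
}
```

Running the demo shows all three URLs sharing the key `http://(com,walmart,www,)` while the hash-based keys land in different buckets, which is the single-site bottleneck the question describes.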

Thank you!

Environment

None

Status

Assignee

Unassigned

Reporter

Shaofeng Liu

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major