When running Heritrix with parallelqueue=100 and 100 toe threads to crawl a single website (e.g. walmart.com), we observed that only a few of the 100 queues are actually used.
When crawling many websites simultaneously, we are able to saturate all 100 queues.
Is there a better way to exploit more parallelism even for a single website?
It seems the current mechanism assigns URIs to queues by SURT prefix; since all URLs from a single website share the same SURT authority, they end up sitting in a couple of very long queues while the others stay empty.
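To illustrate the skew (this is not Heritrix code, just a sketch of the two keying strategies — the function names and the hash-based alternative are hypothetical):

```python
import hashlib
from collections import Counter
from urllib.parse import urlparse

NUM_QUEUES = 100

def queue_by_authority(url: str) -> str:
    # Authority-style keying (what SURT-prefix assignment effectively does
    # for a single site): every URL of one host maps to the same queue key.
    return urlparse(url).netloc

def queue_by_url_hash(url: str) -> int:
    # Hypothetical alternative: hash the full URL into one of N buckets,
    # spreading a single site's URLs across all queues.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_QUEUES

urls = [f"https://walmart.com/item/{i}" for i in range(10_000)]

by_auth = Counter(queue_by_authority(u) for u in urls)
by_hash = Counter(queue_by_url_hash(u) for u in urls)

print(len(by_auth))  # 1 — every URL lands in the same queue
print(len(by_hash))  # close to 100 — URLs spread across nearly all buckets
```

Note that Heritrix keys queues by authority partly so per-host politeness delays can be enforced per queue, so spreading one host's URLs across many queues would need care not to hammer the server.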