The main idea behind the Crawl-By-Example project is improving crawler ability to find useful and interesting pages. Simple crawl procedure would recursively crawl all outgoing links on processed pages and dump the results to output file as raw data. Often, most of the crawled data is totally irrelevant to the crawl subject (advertisements, unrelated pages, paid links, etc.). This makes manual detection of new interesting pages on the subject virtually impossible, as user will have to go over all the raw data to find which pages yielded the most relevant results. Crawl-By-Example tries to solve this problem by running a crawl which classifies the processed pages by subjects and finds the best pages according to examples provided by the operator. Thus, the process of discovering pages of interest is automated and the user is freed from the grueling task of, so to speak, looking for a needle in a haystack.
Below a brief overview of the project will be presented, including the following points:
- Project milestones
- Algorithm behind the Crawl-By-Example
- Incorporation of Crawl-By-Example into current Heritrix crawl process
- Main design considerations
- Ideas for future work
Crawl-by-Example is now available for download here!
- ::End of May - project start
- ::End of June - completion of clustering algorithm
- ::Middle of July - completion of classification algorithm
- ::End of July - integration of algorithms into a complete flow
- ::End of August - project delivery, incl. code, documentation and examples of test runs
Algorithm behind the Crawl-By-Example
The Crawl-By-Example process is divided into 3 distinct steps:
- Pre-processing. The seed sites (and whatever else is in crawl scope) are crawled and their contents parsed and processed to enable clustering.
- Initial clustering. Clustering structure is created based on the processed documents. This clustering will be used as the train set based on which the following crawls will be classified
- Classification. Additional crawls (with different seeds, but on similar subjects as pre-processing crawl) are performed, and results of the crawls are classified based on the train set received in the initial clustering step.
Pre-processing step receives as input raw html data issued by the crawler and outputs terms-documents inverted index that will be used for clustering. Pre-processing actions include html parsing, stop-words removal and terms stemming.
Click here to see a diagram of pre-processing flow
Clustering step receives as input terms-documents inverted index created by the pre-processing step. Based on it algorithm creates k-frequent items sets (using Apriori). These are the most frequent term combinations in the index. Based on these combinations a clustering hierarchy of the crawled documents is created. This step is based on FIHC algorithm.
Click here to see a diagram of clustering flow
After the clustering step was completed, subsequent crawls can be classified based on clustering hierarchy. During the crawl each document will be dynamically classified (using frequent terms associated with each cluster) to a best matching cluster.
An outline of classification algorithm for a single document:
It should be noted that this algorithm ensures that classification task complexity is independent from the number of classified pages. It only depends on the number of pre-determined clusters and terms associated with them. This number is not expected to be very large and remains constant during the classification algorithm execution.
At the end of the crawl pages that were found as the most relevant to the crawl subject are reported
Incorporation of the algorithm into current crawl process (Possible Use-Cases)
- Initial seed is given by the user
- Crawling this seed provides the initial clustering formation
- Subsequent crawls may be based on this clustering formation and classified according to the clusters
- Clustering/classification may be further improved by receiving feedback from previous stages
Use Case 1: Creating initial clustering formations
This use-case should create a train set for the classification use-case that follows. Hence, input of this step should be a compact and statistically representative group of page topics user wants to cover in the crawl.
- User creates crawl job based on following inputs
- "Auto-in" list. All pages from this list and pages that match pages from this list according to pre-defined rules will be considered as sites relevant to user topic. In actual crawl, "auto-in" list can be added to the crawl seeds.
- "Auto-out" list. All pages from this list and pages that match pages from this list according to pre-defined rules will be considered as sites irrelevant to user topic. These sites can be blocked from crawling by using DecidingScope with rules that would block the "auto-out" sites (e.g. NotSurtPrefixedDecideRule).
- Pre-defined rules to match sites to "auto-in"/"auto-out" lists
- User adds TermsIndexingProcessor module to processors list
- Crawl process proceeds as usual, while in the background TermsIndexingProcessor parses crawl pages to find TFIDF metrics (see above)
- At the end of the crawl process, file output is created under preprocess folder, which includes index of term frequencies for each document
- Based on this output, user can invoke clustering procedure (currently this will be done outside of Heritrix framework) that will result in creation of cluster structure based on TFIDF-distance measure between crawled pages. Clustering output should include the following
- clustering algorithm run-time, number of iterations
- number of clusters
- cluster parameters:
- Cluster labels
- Most frequent terms inside the cluster
- Cluster relevance rate. Cluster relevance rate will determined by the percentage of "auto-in" sites inside the clusters (e.g. cluster with no "auto-in" sites will be deemed as "non-relevant").
Use Case 2: Running classification
This use-case should create a classification according to example clusters provided by previous use-case.
- User creates crawl job based on clustering achieved in Use Case 1.
- User adds ClassifierProcessor module to processors list
- Crawl process proceeds as usual while in the background ClassifierProcessor classifies crawled pages according to given cluster structure
- At the end of the crawl process, an additional folder is created under for the crawl job, named ''classification''. This folder includes all the info of the performed classification, including:
- classification algorithm run-time (incl. the crawl-time), number of classified pages
- list of documents per cluster
- list of outstanding pages:
- the best fitting pages for clusters with high relevance rank. These should be considered as likely additions for future crawls scope.
- the best fitting pages for clusters with low relevance rank and the least fitting pages. These should be considered as likely to be exluded from future crawls scope.
- the unclassified pages. Pages that the algorithm failed to classify based on the given clustering. If percentage of these unclassified pages is high - proposed initial clustering doesn't match well the classified pages.
Use Case 3: Iterative classification improvement
User can run additional clustering/classification steps based on results received by previous classifications.
- Additional inputs can be added to the crawl job to improve algorithm performance:
- Additions to "auto-in" list. These are the pages that were previously associated with high-relevance clusters.
- "Auto-out" list. These are the pages that were previously associated with low relevance clusters. These sites will be excluded from crawl scope as irrelevant.
Demonstrations of Crawl-By-Example in action
- Detailed description of running a complete by-example crawl process in Heritrix
- Clustering step complexity analysis
Main design considerations
An important aspect of this work is to create a framework that will be extensible and modular. Main framework components should be independent. Meaning, changing implementation rules for one component should not affect others. As a chief example for this independence, different clustering and/or classification procedures should be allowed without affecting the rest of the framework. Accordingly:
- Only defined system-wide utilities are shared between by the components
- Each component exposes a well defined API
- Information transfer between clustering and classification steps is done only through step output files
- The output files structure is well-defined and documented (see the detailed design section)
This allows re-writing/modifying clustering or classification algrorithm based solely on the outputs they should provide
More detailed design description can be found here
Ideas for future work
Crawl-by-Example was only a summer project, so naturally there were many interesting things that had to be left out due to lack of time. Below is a partial list of these things. Their future implementation would be of great value to project development and continuation.
- Creating a crawl scope that will allow proceeding with page processing only if the page has passed the classification processor
- Discovery of info hubs during classification crawl. Info hubs are pages, which by themselves may not classified as highly relevant, but which point to many other highly relevant pages
- Integration of clustering offline procedure into Heritrix Web UI
- Developing additional clustering/classification algorithms that could be integrated into Crawl-by-Example framework
Project code can be obtained as a part of Heritrix code from SourceForge CVS. Crawl-by-Example branch is tagged by bemike_byexample
As Crawl-by-Example functionalities combine several different IR problems, the problem is not based on a single ready-made solution. However, the following resources were of great help:
- Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti - a main source of information and ideas on all project related topics
- Hierarchical Document Clustering Using Frequent Itemsets by Benjamin C.M. Fung, Ke Wang, Martin Ester - an academic paper describing the principals of clustering algorithm FIHC applied in this project.
- Fast Algorithms for Mining Association Rules by Rakesh Agrawal, Ramakrishnan Srikant - academic paper including Apriori algorithm description
- Porter Stemming Algorithm - official site by the Porter Stemming algorithm author, which also provides it's implementation in Java.