
Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former sets up the crawler, while the latter contains the logic that controls which pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller because many extraneous pages are ignored.

Let's look at the CrawlerController class first. Several parameters are used to configure the crawler, as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController {

    public static void main(String[] args) throws Exception {
        int numberOfCrawlers = 2;
        CrawlConfig config = new CrawlConfig();
        String crawlStorageFolder = "data";

        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setPolitenessDelay(500);
        config.setMaxDepthOfCrawling(2);
        config.setMaxPagesToFetch(20);
        config.setIncludeBinaryContentInCrawling(false);
        ...
    }
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes, which are used to handle robots.txt files. These files contain instructions intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer =
        new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller =
        new CrawlController(config, pageFetcher, robotstxtServer);
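
For reference, the directives in a robots.txt file are plain text lines. A minimal, hypothetical example (not the actual Wikipedia file) might look like this:

    User-agent: *
    Disallow: /search/
    Disallow: /private/

Here, any crawler (User-agent: *) is asked not to fetch pages under the listed paths; crawler4j uses the RobotstxtServer to honor such rules during a crawl.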

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed(
        "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly");

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 
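
The start method blocks until the crawl has finished. If the crawler4j version you are using provides the startNonBlocking and waitUntilFinish methods, a non-blocking crawl can be sketched as follows; treat the method names as an assumption to verify against your version:

    controller.startNonBlocking(SampleCrawler.class, numberOfCrawlers);
    // Perform other work while the crawl proceeds
    controller.waitUntilFinish();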

The SampleCrawler class contains two methods of interest: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression Pattern object. The pattern provides one way of deciding whether a page will be visited; it lists common image file extensions so that image URLs will be ignored:

    public class SampleCrawler extends WebCrawler {
        private static final Pattern IMAGE_EXTENSIONS =
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

        ...
    }
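
To see what this pattern does in isolation, the following self-contained snippet (the class name PatternDemo is ours, not part of the example) checks two URLs against it:

    import java.util.regex.Pattern;

    public class PatternDemo {
        public static void main(String[] args) {
            Pattern imageExtensions =
                Pattern.compile(".*\\.(bmp|gif|jpg|png)$");
            // A URL ending in .png matches and would be skipped
            System.out.println(imageExtensions
                .matcher("https://example.com/map.png").matches());
            // An article URL does not match and may be visited
            System.out.println(imageExtensions
                .matcher("https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly")
                .matches());
        }
    }

The first call prints true and the second prints false.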

The shouldVisit method is passed a reference to the page where the URL was found, along with the URL itself. If the URL matches any of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/; we added this check to restrict the crawl to the Wikipedia website:

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;
        }
        return href.startsWith("https://en.wikipedia.org/wiki/");
    }
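
The filter could be tightened further. As a hypothetical variation that is not part of the original example, the version below also skips Wikipedia namespace pages such as Talk: and File: pages:

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;
        }
        // Hypothetical extra restriction: skip Wikipedia namespace pages
        if (href.contains("/wiki/talk:") || href.contains("/wiki/file:")) {
            return false;
        }
        return href.startsWith("https://en.wikipedia.org/wiki/");
    }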

The visit method is passed a Page object representing the page being visited. In this implementation, only pages containing the string shipping route will be processed, which further restricts the pages that are handled. When we find such a page, its URL, text, and text length are displayed:

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData =
                (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            if (text.contains("shipping route")) {
                out.println("\nURL: " + url);
                out.println("Text: " + text);
                out.println("Text length: " + text.length());
            }
        }
    }

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page; in the example, we used only the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining (a sketch of how these might be accessed follows the list):

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID
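
A possible way of accessing these values from inside the visit method is sketched below. The getter names (getPath, getParentUrl, getAnchor, getDocid, getHtml, and getOutgoingUrls) are assumptions based on the fields listed above; verify them against the crawler4j version you are using:

    @Override
    public void visit(Page page) {
        WebURL webUrl = page.getWebURL();
        out.println("URL path: " + webUrl.getPath());
        out.println("Parent URL: " + webUrl.getParentUrl());
        out.println("Anchor: " + webUrl.getAnchor());
        out.println("Document ID: " + webUrl.getDocid());

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData =
                (HtmlParseData) page.getParseData();
            out.println("HTML text length: " + htmlParseData.getHtml().length());
            out.println("Outgoing links: "
                + htmlParseData.getOutgoingUrls().size());
        }
    }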