StormCrawler: An Open Source SDK for Building Web Crawlers with Apache Storm

StormCrawler is an open source collection of reusable resources, mostly implemented in Java, for building low-latency, scalable web crawlers on Apache Storm. In his upcoming talk at ApacheCon, Julien Nioche, Director of DigitalPebble Ltd, will compare StormCrawler with similar projects, such as Apache Nutch, and present some real-life use cases.

We spoke with Nioche to learn more about StormCrawler and its capabilities.

Julien Nioche, Director of DigitalPebble Ltd.

Linux.com: What is StormCrawler and what does it do? Briefly, how does it work?

Julien Nioche: StormCrawler (SC) is an open source SDK for building distributed web crawlers with Apache Storm. The project is released under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java. It is used for scraping data from web pages, indexing with search engines, or archiving. It can run on a single machine or an entire Storm cluster with exactly the same code, and a crawler takes only a minimal number of resources to implement.

The code can be built with Maven and has a Maven archetype that helps users bootstrap a fully working crawler project to use as a starting point.

Apache Storm handles the distribution of work across the cluster, error handling, monitoring, and logging, whereas StormCrawler focuses on the resources specific to crawling the web. The project aims to be as flexible and modular as possible and provides code for commonly used third-party tools, such as Apache Solr or Elasticsearch.

Linux.com: Why did you choose Apache Storm for this project?

Julien: I knew from my experience with batch-driven crawlers like Apache Nutch that stream-processing frameworks were probably the way I wanted to go for a new crawler. There were not as many of them around when I started working on StormCrawler two to three years ago (or at least fewer than now; there seems to be a new one cropping up every month), but luckily Storm was in incubation at Apache. I remember finding that its concepts were both simple and elegant, the community was already active, and I managed to leverage it pretty quickly to get some stuff up and running. That convinced me that this was a good platform to build on, and I am glad I chose it because the project has developed very nicely since, going from strength to strength with each release.

Linux.com: Can you describe some use cases? Who should be using StormCrawler?

Julien: There is a variety of web crawlers based on SC, which is possible thanks to its flexible and modular nature. The “Powered By” page on the wiki lists some of them. A very natural fit is processing URLs that arrive as a stream (e.g., pages visited by some users). This is difficult to implement elegantly with batch-driven web crawlers, whereas StormCrawler can be both efficient and elegant for such cases.

There are also users who do traditional recursive crawls with it, for instance with Elasticsearch as a back end, or crawls that are more vertical in nature and aim at data scraping. A number of users are also using it on finite lists of URLs, without any link discovery.

StormCrawler comes out of the box with a set of resources that help build web crawlers with minimal effort. What I really like is that with just a few files and a single Java class, you can build a web crawler that can be deployed on a large Storm cluster. Users also seem to find the flexibility of StormCrawler very attractive, and some of them have added custom components to a basic crawl topology relatively easily. Others love its performance and the fact that with continuous processing, they always get the most from their hardware: StormCrawler uses the CPU, bandwidth, memory, and disk constantly.
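To make the idea of URLs arriving as a stream a little more concrete, here is a minimal sketch in plain Java. It does not use the Storm or StormCrawler APIs at all: the class `StreamCrawlSketch`, the methods `fetch`, `parseTitle`, `consume`, and `runDemo`, and the end-of-stream marker are all hypothetical names chosen for illustration, and the "fetch" step fabricates page content instead of touching the network. It only shows the general pattern of handling each URL as soon as it arrives rather than waiting for a batch to fill up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamCrawlSketch {

    // Hypothetical marker that ends this toy stream; a real stream would keep going.
    private static final String END_OF_STREAM = "__END__";

    // Stand-in for a fetcher: no network access, it just fabricates a page.
    public static String fetch(String url) {
        return "<html><title>Page at " + url + "</title></html>";
    }

    // Stand-in for a parser: extracts the text between <title> and </title>.
    public static String parseTitle(String html) {
        int start = html.indexOf("<title>") + "<title>".length();
        int end = html.indexOf("</title>", start);
        return html.substring(start, end);
    }

    // Handles each URL as soon as it is available on the queue.
    public static List<String> consume(BlockingQueue<String> urls) throws InterruptedException {
        List<String> index = new ArrayList<>();
        while (true) {
            String url = urls.take(); // blocks until the next URL arrives
            if (END_OF_STREAM.equals(url)) {
                return index;
            }
            index.add(parseTitle(fetch(url)));
        }
    }

    // Feeds URLs from a separate thread, the way an external source might.
    public static List<String> runDemo(List<String> incoming) {
        BlockingQueue<String> urls = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            for (String url : incoming) {
                urls.add(url);
            }
            urls.add(END_OF_STREAM);
        });
        producer.start();
        try {
            List<String> index = consume(urls);
            producer.join();
            return index;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming URLs", e);
        }
    }

    public static void main(String[] args) {
        List<String> index = runDemo(Arrays.asList("http://example.com/a", "http://example.com/b"));
        index.forEach(System.out::println);
    }
}
```

In a Storm-based crawler, the queue and the worker loop would be replaced by the framework's own components, which is where the distribution, error handling, and monitoring described earlier come from.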

Linux.com: Are there additional features or capabilities you would like to add? If so, what?

Julien: StormCrawler is constantly improving and, as the number of users and contributors grows, we get more and more bug fixes and new functionality. One thing in particular that has been planned for some time is to have a Selenium-based protocol implementation to handle AJAX-based websites. Hopefully, we’ll get that added in the not-too-distant future. There are also external components, like the WARC resources, that might be moved to the main repository.

Hear from leading open source technologists from Cloudera, Hortonworks, Uber, Red Hat, and more at Apache: Big Data and ApacheCon Europe on November 14-18 in Seville, Spain.