3 Emerging Open Source Data Analytics Tools Beyond Apache Spark


On the data analytics front, profound change is in the air, and open source tools are driving much of it. You are probably familiar with some of the open source stars in this space, such as Hadoop and Apache Spark, but there is now a strong need for new tools that round out the data analytics ecosystem. Notably, many of these tools are designed to process streaming data.

The Internet of Things (IoT), which is giving rise to sensors and other devices that produce continuous streams of data, is just one of the big trends driving the need for new analytics tools. Streaming data analytics are needed for improved drug discovery, and NASA and the SETI Institute are even collaborating to analyze terabytes of complex, streaming deep space radio signals.

While Apache Spark grabs many of the headlines in the data analytics space, given billions of development dollars thrown at it by IBM and other companies, several unsung open source projects are also on the rise. Here are three emerging data analytics tools worth exploring:

Grappa

Big organizations and small ones are working on new ways to cull meaningful insights from streaming data, and many of them are working with data generated on clusters and, increasingly, on commodity hardware. That puts a premium on affordable data-centric approaches that can improve on the performance and functionality of tools such as MapReduce and even Spark. Enter the open source Grappa project, which scales data-intensive applications on commodity clusters and offers a new type of abstraction that can beat classic distributed shared memory (DSM) systems.

You can get the source code for Grappa and find out more about it here. Grappa began when a group of engineers with experience running Big Data jobs on Cray systems wondered if they could match the analytics performance of Cray systems on off-the-shelf commodity hardware.

As the developers note: “Grappa provides abstraction at a level high enough to subsume many performance optimizations common to data-intensive platforms. However, its relatively low-level interface provides a convenient abstraction for building data-intensive frameworks on top of. Prototype implementations of (simplified) MapReduce, GraphLab, and a relational query engine have been built on Grappa that out-perform the original systems.”

Grappa is freely available on GitHub under a BSD license. If you are interested in seeing Grappa at work, you can follow easy quick-start directions in the application’s README file to build and run it on a cluster. To learn how to write your own Grappa applications, check out the tutorial.

Apache Drill

The Apache Drill project is making such a difference in the Big Data space that companies such as MapR have even wrapped it into their Hadoop distributions. It is a Top-Level project at Apache and is being leveraged along with Apache Spark in many streaming data scenarios.

For example, at the New York Apache Drill meeting back in January of this year, MapR system engineers showed how Apache Spark and Drill could be used in tandem in a use case involving packet capture and near-real-time query and search.

Drill is notable in streaming data applications because it is a distributed, schema-free SQL engine. DevOps and IT staff can use Drill to interactively explore data in Hadoop and in NoSQL databases such as HBase and MongoDB. There is no need to explicitly define and maintain schemas, as Drill can automatically leverage the structure that’s embedded in the data. It streams data in memory between operators and writes to disk only when needed to complete a query.
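To make the schema-free idea concrete, here is a minimal Python sketch of submitting a SQL query to Drill through its REST API. It assumes a Drill instance running locally with the web interface on its default port, 8047. The file path and the column names (src_ip, bytes) are hypothetical and only for illustration. The point to notice is that the query reads a raw JSON file directly, with no table definition created beforehand.

```python
import json
import urllib.request

# Assumption: a local Drill instance exposing its REST API on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical example: query a raw JSON file through the dfs storage plugin.
# No schema is declared anywhere; Drill infers the structure from the data.
SQL = "SELECT t.src_ip, t.bytes FROM dfs.`/tmp/packets.json` AS t LIMIT 5"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST that submits a SQL query to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request):
    """Send the request and return the result rows (requires a running Drill)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read()).get("rows", [])


request = build_drill_request(SQL)
# rows = run_query(request)  # uncomment when a Drill instance is available
```

Building the request separately from sending it keeps the sketch usable without a cluster: you can inspect the payload first, then call run_query(request) once Drill is up.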

Apache Kafka

The Apache Kafka project has emerged as a star for handling real-time data feeds. It provides a unified, high-throughput, low-latency platform for processing real-time data. Confluent and other organizations have also produced custom tools for using Kafka with data streams.
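If you have not worked with Kafka before, the core idea is an append-only log of messages per topic, with each group of consumers tracking its own read position (offset). The toy Python class below is only an in-memory illustration of that idea, not Kafka's API or implementation. The class name ToyLog, the topic name, and the group names are all made up for this sketch.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for the log-plus-offsets idea (not the Kafka API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> ordered list of messages
        self._offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


log = ToyLog()
for reading in (21.5, 21.7, 22.0):
    log.publish("sensor-readings", reading)

# Two independent consumer groups read the same stream at their own pace.
dashboard_batch = log.poll("dashboard", "sensor-readings", max_messages=2)
archive_batch = log.poll("archive", "sensor-readings")
```

Because messages stay in the log after being read, the "archive" group still sees all three readings even though the "dashboard" group has already consumed two of them.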

Apache Kafka was originally developed at LinkedIn and was open sourced in early 2011. It is a hardened, tested tool, and many organizations look for workers with Kafka knowledge. Cisco, Netflix, PayPal, Uber, and Spotify are among the well-known companies using Kafka.

The engineers who created Kafka at LinkedIn also founded Confluent, a company that focuses on Kafka. Confluent University offers training courses for Kafka developers and for operators and administrators. Both onsite and public courses are available.

Are you interested in more unsung open source data analytics projects on the rise? If so, you can find more in my recent post on the topic.