The Fedora 20 release is tantalizingly close, but even without the final gold seal of approval we have a clear picture of the features that users will be enjoying very soon. Among the additions in Fedora 20 is a set of Apache Hadoop packages, which will let users easily get up and running with Hadoop right out of the box. Fedora contributor Matthew Farrellee talked to us about the packaging effort, what this means for Fedora 20, and what’s coming in Fedora 21 and beyond.
What’s Hadoop Good For?
For those who missed the Big Data hype a few years ago, Hadoop is the poster child project for organizations that are crunching big data. While not the only solution, it’s certainly the best-known.
Farrellee says, “Hadoop provides a powerful platform for computation and data management. That means when a Fedoran has log, media (identi.ca/twitter), click stream, geo-location, etc. data they can come to Fedora, yum install [Hadoop] and start processing it.”
Typically, big data crunching implies having a fair number of machines or instances to throw at a problem. Is Hadoop useful for those who are just working on a single desktop or laptop? Farrellee says it can be. “Hadoop can be used on a single machine, which is how many of us do our development and basic testing.”
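To give a feel for what “start processing it” looks like, here is a minimal sketch of the classic word-count job written against the standard Hadoop 2.x MapReduce Java API. The input and output paths are placeholders you supply on the command line; on a single machine the job can run in Hadoop’s local mode without any cluster setup.

    // Minimal word-count sketch using the Apache Hadoop 2.x MapReduce API.
    // Input and output paths are placeholders passed as program arguments.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in each input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of log files
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compiled against the Hadoop client libraries and submitted with hadoop jar, the same code runs unchanged whether it executes locally on a laptop or on a full cluster.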
“However,” says Farrellee, “to really get benefits from Hadoop, Fedorans will need to tie together many more machines. To help with that, we’re working on packaging the Apache Ambari project, which does multiple-system deployment and management of Hadoop.”
Ambari isn’t the only Hadoop tool that’s coming to Fedora in the future. Farrellee says that Hadoop is “the foundation of an entire ecosystem of tools and applications. Having Apache Hadoop in Fedora means we can start bringing projects like Hive, HBase, Pig, Solr, Flume and Mahout to Fedora.”
New In This Release
The version packaged for Fedora 20, Hadoop 2.2.0, is hot off the presses of the Apache Hadoop project (released on 15 October 2013). Farrellee says that this version has several interesting new features, in addition to the existing functionality of Hadoop that big data crunchers know and love.
The biggest change in this version, says Farrellee, is the general availability (GA) of Yet Another Resource Negotiator (YARN). “YARN gives Apache Hadoop the ability to concurrently manage multiple types of workloads on the same infrastructure. That means you can have MapReduce workloads on your Hadoop infrastructure right next to Spark workloads (a BDAS, or Berkeley Data Analytics Stack, project). And, it lets you consolidate your Hadoop ecosystem services to run on YARN instead of in parallel. The Hoya project is doing that for HBase.”
Farrellee also says that the release includes many enhancements to the Hadoop Distributed File System, including high availability, namespace federation, and snapshots.
Dependencies, Dependencies, Dependencies!
The hardest part about getting Hadoop into Fedora? “Dependencies, dependencies, dependencies!” says Farrellee. In general, dependencies are often a sticking point, especially (as Farrellee points out) for those tools that depend on “languages other than C/C++, Perl or Python.”
For Hadoop? It was more difficult than usual. “There were some dependencies that were just missing and we had to work through those as you’d expect – there were a lot of these. Then there were dependencies that were older than what upstream was using – rare, I know, for Fedora, which aims to be on the bleeding edge. The hardest to deal with were dependencies that were newer than what upstream was using. We tried to write patches for these, but we weren’t always successful. When we did write patches we worked to get them upstream, but in at least one case, that of Jetty, it’s complicated because the version Fedora has does not work with Java 6 and the upstream community isn’t ready to give up on Java 6.”
Just because Hadoop is in Fedora 20 doesn’t mean the problem goes away. “Dependencies are, and will be, an ongoing effort, as Fedora rapidly consumes new upstream versions.”
With all that work to be done, Farrellee was far from the only person working on the packaging effort for Hadoop. He says that the team “came together under the umbrella of the Big Data SIG that Robyn [Bergeron] kicked off near the beginning of 2013” and has been “awesome” in pulling together to get the job done.
“We include people primarily interested in Hadoop, members from the Java SIG (which is key because the Hadoop ecosystem is mostly Java), random Fedorans who had an itch to scratch, and massively prolific packagers who were already looking at doing packages needed for Hadoop.”
Coming Soon
What’s next in Fedora with the Hadoop ecosystem? Ambari, already mentioned, is a big one. “We’re working with the upstream Ambari community to get it ready for Fedora,” says Farrellee. “It turns out to heavily use node.js, which does not have a strong presence in Fedora. HBase is on its way, along with Hive and a handful of others.”
The Big Data SIG also has a list of what’s been done, what’s in progress, and what’s to come. In standard Fedora fashion, Farrellee adds that “anyone is welcome to add to the future list or take things off and start packaging them!”
Guest contributor Joe Brockmeier works on the Open Source & Standards team at Red Hat.