It only makes sense that as the community of Spark contributors got bigger the project would get even more ambitious. So when Spark 2.0 comes out in a matter of weeks it’s going to have at least three robust new features, according to Ion Stoica, the founder of Databricks and keynote speaker at Apache Big Data in Vancouver on Tuesday afternoon.
“Spark 2.0 is about taking what has worked and what we have learned from the users and making it even better,” Stoica said.
Queries will be more performant – the goal is 10x faster – through the success of Project Tungsten, an ongoing effort which set out to improve the efficiency of memory and CPU for applications. The three ways it’s succeeded is through cache-aware computation that uses algorithms and data structures to exploit memory hierarchy, code generation to exploit modern compilers and CPUs, and using application semantics to eliminate memory getting bogged down on garbage collection and the JVM object model.
“The more semantics you know the better you can optimize the applications,” Stoica said.
Spark 2.0 will ship with even more components from Project Tungsten, which has been rolling out in pieces, across multiple releases, since Spark 1.4 about a year ago.
Spark 2.0 will also feature improved APIs to make it “even easier” to write applications for Spark, a feature for the influx of data scientists that are now using Spark who aren’t necessarily full-blown developers and database admins. Part of this feature is the introduction of the Dataset API. Datasets are static typed extensions that use Resilient Distributed Dataset (RDD)-like operations, and when added to Spark’s dataframes, it creates a best-of-both-worlds approach.
Stoica said each library in Spark – things like graphing libraries, machine learning libraries, and so on – will be rewritten to include datasets.
The third major improvement is increased support for streaming data by creating what Stoica called “infinite dataframes,” which is a table that’s constantly adding new entries.
“We’re going to integrate support for interactive and batch queries,” Stoica said. “It’s not just streaming, it’s what we’re going to call continuous applications.”
Stoica ran a demo of the streaming capabilities through his company Databricks’ Spark distribution by showing Twitter clusters on a map computed in real time. But the implications are much bigger for running analytics on streaming data with all the many systems and languages that connect standard to Spark. This will be huge, he said, for applications like fraud detection, or updating machine learning algorithms in real-time.
“Everyone knows that streaming is more and more important,” Stoica said. “You want to operate and do analytics on fresh data. You want the ability to do queries on the data that was just streamed.”