Scala-based Apache Spark may have only graduated from incubator status last February, but in a short space of time, it’s become the most active open source big data project around – a rise that has coincided with the decline of old faithful Hadoop MapReduce.
Just this week, Spark was named in the ‘InfoWorld Technology of the Year 2015 Awards’, with the panel stating that, when compared to traditional Hadoop machine learning models, Spark was “hands-down the winner in almost every respect.” You may have also read about the Apache trailblazer in a newly released report, ‘APACHE SPARK – Preparing for the Next Wave of Reactive Big Data’, led (naturally enough) by reactive platform and Scala people Typesafe, and Spark guardians Databricks.
According to the survey, whose 2,136 respondents were largely developers (74% to be exact, followed by 7.6% data scientists and 6.5% C-Suite), 13% were actively using Spark in production, 31% were toying with implementing it, and 20% were on schedule to pick it up for production this year.
But what’s driving this enterprise takeover? And is it time to consign MapReduce to the elephant graveyard? Resident Typesafe Big Data Architect Dean Wampler weighs in.
Voxxed: There have been some modifications to the Reactive Manifesto in the past few months – can you clarify what you mean by ‘Reactive Big Data’ in the context of this report?
DW (Dean Wampler): This term speaks to the growing interest in processing streams of events very soon after the events are received, rather than capturing the data to storage and then processing it in batches later on. Reactive Programming also encapsulates a few general-purpose principles about building highly scalable, resilient, and responsive applications, Big Data or otherwise. You want your data fast, so your services need to be reactive, in this sense, to meet this goal.
Do you think that this mass uptake (or pending uptake) of Spark decisively signals the death of MapReduce in 2015?
DW: I think almost all organizations will stop writing new MapReduce apps this year, but many existing apps will be maintained for a long time. As always, it can be expensive and a bit risky to rewrite something that’s working. Some mature users of MapReduce may also have a significant investment in reusable libraries that they’ll have to port to Spark first. However, most new apps will start to use Spark this year.
One of your key findings was that faster data processing and event streaming are the focus for enterprises. Why are speed and velocity more of an issue when it comes to data than size?
DW: In my experience doing Hadoop consulting in prior years, many of the projects I worked on really didn’t have huge data sets like those we associate with Facebook, Google, Twitter, etc. They were leveraging tools like Hadoop due to the cost and flexibility advantages, but these tools weren’t always optimal for their specific needs, such as scaling down to smaller data sets and processing event streams.
Even when some types of analytics can be done with batch-mode MapReduce jobs, say for example, updating your search engine when a user changes a document or website, there is a competitive advantage if the search engine incorporates that change sooner rather than later. So, data organizations want answers faster, ideally in real time.
What makes Spark so much faster at data processing than Hadoop?
DW: First, by design, MapReduce assumed a batch mode model, where performance is achieved with massive parallelism, so startup and shutdown overhead are less important. That’s not true in an event streaming world. The corresponding overhead for Spark is smaller.
Second, when you write a Spark job, you are actually defining a data flow, a pipeline, if you like. This gives Spark a “global” view of your job, so it can apply optimizations like caching intermediate data between steps and merging some steps together. These features, especially the caching, dramatically improve performance over MapReduce, which has no such global view, so it can’t apply similar optimizations. So with Spark, you get the full spectrum, from small-scale event stream processing up to very large batch jobs.
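To make the “global view” idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark’s API: the `Pipeline` class and the `expensive_parse` helper are hypothetical names invented for illustration. The pipeline only records its steps until `collect()` is called, applies all recorded steps in a single pass over the data (a rough analogue of merging steps), and reuses a cached intermediate result when two downstream computations share it.

```python
class Pipeline:
    """Toy lazy data flow (hypothetical; not Spark's API)."""

    def __init__(self, parent, steps=()):
        self.parent = parent        # a plain list, or another Pipeline
        self.steps = tuple(steps)   # recorded ("map" | "filter", function) pairs
        self.keep = False           # True once cache() has been requested
        self.stored = None          # materialized result, if cached

    def _derive(self, step):
        # After a cache point, children read from the cached result;
        # otherwise the new step is simply appended to the same pass.
        if self.keep:
            return Pipeline(self, (step,))
        return Pipeline(self.parent, self.steps + (step,))

    def map(self, fn):
        return self._derive(("map", fn))

    def filter(self, pred):
        return self._derive(("filter", pred))

    def cache(self):
        self.keep = True
        return self

    def collect(self):
        if self.stored is not None:
            return list(self.stored)
        if isinstance(self.parent, Pipeline):
            data = self.parent.collect()
        else:
            data = self.parent
        out = []
        for item in data:           # one pass applies every recorded step
            keep_item = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep_item = False
                    break
            if keep_item:
                out.append(item)
        if self.keep:
            self.stored = out
        return list(out)


calls = []

def expensive_parse(line):
    calls.append(line)              # track how often parsing really happens
    return int(line)

parsed = (Pipeline(["1", "2", "3", "4"])
          .map(expensive_parse)
          .filter(lambda n: n % 2 == 0)
          .cache())

doubled = parsed.map(lambda n: n * 2).collect()
squared = parsed.map(lambda n: n * n).collect()
print(doubled, squared, len(calls))
```

Both results are derived from the same cached intermediate data, so the parsing step runs once per input line rather than once per downstream computation. A chain of separate batch jobs with no shared view of the flow would typically redo that work or write it to storage in between.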
Do you see reactive systems as being more important as big data evolves?
DW: All of the traditional forms of data analytics will remain useful, but people will search for new ways to shorten the time between data arrival and information extraction. So, we’ll see many of these traditional approaches sped up and made more iterative (that is, repeated over and over on smaller batches of data as they arrive). Similarly, new forms of analytics are made easier by Spark and they are easier to integrate, too. You can do classic ETL (extract, transform, and load) from several data streams, where you clean and transform the data, then pass it into a machine learning algorithm that constantly trains itself with the new data, write the results to another data store, etc., all in one Spark job.
Another data point: we were surprised how many people interested in Spark planned to ingest data from Apache Kafka (41%), a popular, highly scalable message queue.
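The single-job flow described in this answer (clean incoming data, feed it to a model that keeps updating, write out results, and repeat on small batches as they arrive) can be sketched in plain Python. This is an illustration under stated assumptions, not Spark code: `micro_batches`, `clean`, `RunningMean` and `run_job` are hypothetical names, a running mean stands in for a real machine learning algorithm, and a Python list stands in for both the event stream and the output data store.

```python
from itertools import islice


def micro_batches(events, size):
    """Yield successive small batches from a (possibly endless) event stream."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def clean(raw):
    """Transform step: parse a raw event, returning None for bad records."""
    try:
        return float(raw.strip())
    except ValueError:
        return None


class RunningMean:
    """Stand-in for a model that is updated incrementally with new data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, values):
        for v in values:
            self.count += 1
            self.mean += (v - self.mean) / self.count


def run_job(events, batch_size, sink):
    """Extract, clean, update the model, and write a result per batch."""
    model = RunningMean()
    for batch in micro_batches(events, batch_size):
        values = [v for v in map(clean, batch) if v is not None]
        model.update(values)
        sink.append({"seen": model.count, "mean": model.mean})
    return model


results = []    # stands in for an external data store
model = run_job(["1.0", " 2.0", "oops", "3.0", "6.0"], batch_size=2, sink=results)
print(results)
```

The same loop structure applies whether the batches are tiny or large, which is the sense in which traditional analytics become more iterative: the model state carries over from one batch to the next instead of being recomputed from stored data.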
Do you think in the future the number of Java users of Spark could leap-frog that of the Scala community, given the comparatively huge number of Java programmers? (Currently 44% of users are working with Spark in Java, followed by 22% Python).
DW: This is probably a safe bet, even though I think they should switch to Scala! There are many, many Hadoop developers adopting Spark now and perhaps most will want to continue with Java. It will be one less new thing to learn.
I can’t blame them for that, but I think their long-term productivity won’t be as good.
Scala use will increase, however. I have talked with many developers who first started using Scala when they started using Spark or predecessor tools like Twitter’s Scalding.
Why do you think Java users are choosing Spark over Java-centric HBase?
DW: They are really distinct things. HBase is one way to persist your data. Spark is one way to process it. Previously, it was common to write MapReduce jobs to process data in HBase. Now Spark is being used for this purpose.
Similarly, our survey showed that about 20% of the people interested in Spark want to use it to process data in Cassandra.
Looking forward, in what areas do you think Spark will be making inroads? Will there be a pivotal place for it in IoT systems, for example?
DW: The answer is yes in two ways. First and most obviously, the ability of Spark to analyze data streams is an important capability for IoT systems. When you’re looking at trends in your system, I can think of no better tool for the job.
The other, less obvious way is the ability of Spark to scale down efficiently to smaller deployments. Recall my previous discussion of the overhead of Spark vs. MapReduce. If you have a service infrastructure managing an IoT network of devices, it will be much easier to “mix” Spark analytics into that infrastructure, without building out a dedicated data cluster, Hadoop or otherwise.
So, in general, I see Spark becoming an integrated piece of infrastructures that don’t fit our usual model of data warehousing, etc. They are just conventional IT environments, with much better analytics baked in.