Real-time processing of streaming data in Hadoop typically comes down to choosing between two projects:
Storm or Spark.
But a third contender, which is being open-sourced from a formerly
commercial-only offering, is about to enter the race, and like those
components, it may have a future outside of Hadoop.
DataTorrent RTS (real-time streaming) has long been a commercial
offering for live data processing apart from the family of Apache
Foundation open source projects around Hadoop. But now DataTorrent (the
company) is preparing to open-source the core DataTorrent RTS engine,
offer it under the same Apache 2.0 licensing as its competitors, and
eventually contribute it to the Apache Foundation for governance.
Built for business
Project Apex, as the open source version of DataTorrent RTS's engine is
to be called, is meant not only to compete with Storm and Spark but to
be superior to them -- to run faster (10 to 100 times faster than Spark,
the company claims), to be easier to program, to better support
enterprise needs like fault tolerance and scalability, and to make it
easier to demonstrate the value of Hadoop to a business owner.
According to DataTorrent VP of Marketing John Fanelli,
DataTorrent RTS/Project Apex is meant to ease the process of working
with Spark's streaming processing. "Spark is very much a development
framework," Fanelli said in a phone conversation, "where you have to
write everything by hand ... and where you have to think and program in
more of a MapReduce paradigm."
Fanelli said that Spark lacks other key features that would
be attractive to enterprises, such as event processing, the ability to
guarantee the order of events, and fault tolerance at the platform
level. Apex doesn't require Scala to program it, meaning existing Java
programmers wouldn't need to do as much retooling to use it. (Spark
is written in Scala and can be programmed with it and a few other
languages, including Python and Java -- but the best results with Spark
generally come from using Scala.)
Fanelli also felt Apex can help Spark users get away from
working with time-consuming batch-oriented methods to generate insights
from existing data. "It's better to use a streaming product to do batch
than it is to use a batch product to do streaming," he said.
Hadoop might only be the beginning
There's little question Apex is being open-sourced in part
to entice users toward the commercial DataTorrent RTS product. Many of
that product's features -- such as graphical app design and dynamic
optimizations of workloads, which expand upon the core that Apex
offers -- are an attempt to address what Fanelli feels are the value
propositions Hadoop doesn't always communicate well to enterprise
customers, like generating real-time actionable insight on ingested data.
If Hadoop isn't taking off in some enterprises because of
its value proposition, the cause isn't any single issue. Aside
from the perception that Hadoop is overkill for the work being done,
there's also the notion that Hadoop is too costly or complex to be
worth the trouble. Hadoop vendors keep trying to address these issues,
but there's reason to believe Hadoop has only so much appeal with
enterprises.
Likely less limited is the culture of reuse and development
around individual pieces within Hadoop, like Spark -- and now Project
Apex. Their real-time processing functionality doesn't have to be
coupled with Hadoop to be useful, although that has been the most
common way they're used. Having Apex as an open
source project will add another option to that toolbelt, one that's
useful apart from any other happenings with Hadoop.
This story, "Spark and Storm face new competition for real-time Hadoop
processing," was originally published by InfoWorld.