Social networks generate colossal amounts of data that defy conventional data-processing tools, so it's no wonder their engineering teams have built their own toolsets -- Facebook and its machine-learning tools, for example.
Enter LinkedIn, now offering its own Apache-licensed, open-source data-processing solution:
Pinot,
a real-time analytics engine and datastore designed to run at scale.
Yes, Hadoop is one of its data sources, providing yet another option for
those looking to perform SQL-style queries.
LinkedIn's own OLAP
As
originally discussed by
LinkedIn's engineers late last year, Pinot was designed to provide the
company with a way to ingest "billions of events per day" and serve
"thousands of queries per second" with low latency and near-real-time
results -- and provide analytics in a distributed, fault-tolerant
fashion.
The original system was assembled from a congeries of existing pieces -- an Oracle database here, a
Project Voldemort key-value
store there -- but LinkedIn found the volume of ingested data was too
great for solutions that were never designed for OLAP-style workloads in the first place.
Like many other data-processing frameworks that live in or near Hadoop, Pinot is written in Java. It uses
Apache Helix -- also developed at LinkedIn -- to perform cluster management. Real-time data comes in by way of
Kafka, with historical data fetched from Hadoop.
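That split between real-time data from Kafka and historical data from Hadoop suggests a lambda-style serving pattern: answer a query by merging results from both sources. As a rough sketch of the idea -- with hypothetical names and data shapes, not anything from Pinot's actual codebase -- a per-dimension count query might look like this:

```python
from collections import Counter

def query_segments(realtime_events, historical_counts, dimension):
    """Merge a hypothetical real-time segment (raw events, as if fed by
    Kafka) with precomputed historical counts (as if loaded from Hadoop).
    Purely illustrative -- not Pinot's actual API."""
    recent = Counter(event[dimension] for event in realtime_events)
    merged = Counter(historical_counts)
    merged.update(recent)
    return dict(merged)

# Usage: count page views per country across both data sources.
realtime = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
historical = {"US": 100, "DE": 50}
print(query_segments(realtime, historical, "country"))
# → {'US': 102, 'DE': 51}
```

The appeal of the pattern is that the heavy historical aggregation can be precomputed offline while only the recent tail is counted at query time.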
Some sacrifices were made
With querying, Pinot shows some of its limitations --
although most are deliberate design decisions, reflecting Pinot's focus
on the specific conditions for which LinkedIn created it.
For instance, the SQL-like query language used with Pinot
does not have the ability to perform table joins, "in order to ensure
predictable latency" (according to LinkedIn's engineers). There's truth
to this, since SQL-on-Hadoop solutions have been known to suffer from
poor performance if they attempt to perform joins between data stored in
highly disparate places. Full-text search and relevance ordering for
results also aren't supported.
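To make the restriction concrete, a toy pre-flight check in the spirit of those limits might flag unsupported constructs before a query is sent -- an illustrative sketch only, not Pinot's actual validation logic:

```python
import re

def check_query(query):
    """Flag constructs the documentation above says Pinot's SQL-like
    language omits. A toy pre-flight check, not Pinot's real parser."""
    tokens = set(re.findall(r"[A-Za-z_]+", query.upper()))
    problems = []
    if "JOIN" in tokens:
        problems.append("table joins are not supported")
    return problems

print(check_query("SELECT country, COUNT(*) FROM views GROUP BY country"))
# → []
print(check_query("SELECT * FROM views JOIN users ON views.uid = users.id"))
# → ['table joins are not supported']
```

Rejecting joins up front, rather than attempting them slowly, is what keeps the promised "predictable latency" predictable.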
Finally, data is strictly read-only -- although given the
number of other SQL-for-big-data solutions that work the same way, this
isn't likely to be a major letdown.
A fairly vertical solution
Each SQL-on-Hadoop solution has so far addressed a slightly
different set of needs -- some for real-time queries (Spark SQL), some
for historical data (Hive), some to emulate as much of SQL's existing
behavior as possible without sacrificing performance (Stinger). Pinot is
similarly narrow in focus, given that it was built to scratch
LinkedIn's specific itches.
With the project going open source, though, LinkedIn clearly
hopes it can scratch other people's itches as well, especially if
existing SQL-for-Hadoop and real-time-data solutions don't cut it. It's less
clear whether LinkedIn wants Pinot to follow in the footsteps of
other Hadoop projects and eventually become Apache-governed, although
the project's choice of license (Apache) would make such a
transition a snap.
This story, "LinkedIn fills another SQL-on-Hadoop niche," was originally published by InfoWorld.