For many, big data defaults to Hadoop. While Hadoop started as a batch system, demand grew for real time responses while processing big data. Storing data at exascale on Hadoop has been practical and affordable, but because of Hadoop's original batch oriented architecture, Hadoop based systems have lagged on real time business queries for actionable insight.
In this blog, you will find that there are viable options for delivering real time responses for big data. I will not go into deep architectural or implementation details, but will give a broad view of the options we have and how we can relate them to a simple paradigm like the SQL interface. I will also give a use case of the lambda architecture, i.e. response = f(all data), where we use components of the Hadoop ecosystem to describe a real time Big Data system.
Introduction
When Yahoo open-sourced Hadoop and MapReduce, it was a batch-oriented system derived from the MapReduce paper published by Google. Many have implemented Big Data solutions on top of Hadoop. Hive, on top of Hadoop, has made it easier to query the data stored in Hadoop. HBase, taking a columnar table approach to stored files, tries to make reads and writes fast through scans based on an intelligent row key with timestamp-glued attribute values. Many SQL interfaces like Cloudera Impala and Pivotal HAWQ, streaming data engines like Storm, and in-memory frameworks like Spark are now available to speed up query responses. Last but not least, NoSQL databases like Cassandra also offer real time big data solutions.
Options
1 - HBase
HBase runs on HDFS in the Hadoop framework. Apache HBase is used when you need random, realtime read/write access to your Big Data. HBase's goal is the hosting of very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
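The "intelligent row key" mentioned above is central to fast HBase scans, since HBase sorts rows lexicographically by key. Below is a minimal Python sketch of one common key design, an entity id plus a reversed timestamp, so the newest versions of an entity sort first; the names and separator are illustrative, not an HBase API.

```python
import struct

MAX_LONG = 2**63 - 1  # Java Long.MAX_VALUE, the usual basis for reversed timestamps

def make_row_key(entity_id: str, ts_millis: int) -> bytes:
    """Compose an HBase-style row key: entity id + reversed timestamp.

    Reversing the timestamp (MAX_LONG - ts) makes the most recent write
    sort first in HBase's lexicographic key order, so a short scan from
    the entity prefix returns the latest versions.
    """
    reversed_ts = MAX_LONG - ts_millis
    return entity_id.encode("utf-8") + b"#" + struct.pack(">q", reversed_ts)

# Newer events produce lexicographically smaller keys under the same prefix
k_old = make_row_key("sensor-42", 1_000)
k_new = make_row_key("sensor-42", 2_000)
assert k_new < k_old
```

A scan starting at the `sensor-42#` prefix then returns the freshest rows first, which is what makes short, targeted HBase reads feel real time.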
2 - Kafka and Storm with Hadoop
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize IO performance, and mirroring to improve availability and scalability and to optimize performance in multi-cluster scenarios.
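Kafka's core abstraction is a topic split into append-only, offset-addressed partitions; producers hash a key to a partition and consumers replay from an offset. The toy Python model below illustrates that shape only; it is not the real Kafka client API, and the class and method names are invented for illustration.

```python
from collections import defaultdict

class MiniLog:
    """Toy model of a Kafka topic: N append-only partitions addressed by offset."""
    def __init__(self, num_partitions: int = 3):
        self.partitions = defaultdict(list)
        self.num_partitions = num_partitions

    def produce(self, key: str, value: str) -> tuple:
        """Hash the key to a partition, append, and return (partition, offset)."""
        p = hash(key) % self.num_partitions
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int) -> list:
        """Read everything from `offset` onward; the log itself is never mutated."""
        return self.partitions[partition][offset:]

log = MiniLog()
p, off = log.produce("user-1", "click")
log.produce("user-1", "scroll")          # same key -> same partition, ordered
assert log.consume(p, off) == ["click", "scroll"]
```

Because consumers track their own offsets against an immutable log, many independent readers (Storm, a Hadoop loader, a monitoring job) can consume the same stream at their own pace, which is why Kafka sits so naturally in front of both batch and real time layers.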
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn't support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.)
Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology, you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
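A Storm topology wires spouts (stream sources) to bolts (processing steps). The sketch below models a classic word-count topology with plain Python generators and functions; it only illustrates the dataflow shape, not Storm's actual API, clustering, or tuple acking.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: emits a stream of tuples (here, raw sentences)."""
    yield from sentences

def split_bolt(stream):
    """Bolt: splits each incoming sentence into word tuples."""
    for sentence in stream:
        yield from sentence.lower().split()

def count_bolt(stream):
    """Bolt: keeps running counts, like a fields-grouped counting bolt."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt
counts = count_bolt(split_bolt(sentence_spout(["to be", "or not to be"])))
assert counts["to"] == 2 and counts["be"] == 2
```

In real Storm, each of these stages runs as many parallel tasks across the cluster and the "stream" never terminates, which is the key contrast with a Hadoop job over a fixed data set.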
This is the foundation of the new lambda architecture, where slow (batch) and fast (speed) data sets are ingested and queried together for fast response.
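The lambda architecture's response = f(all data) idea can be sketched in a few lines: a batch view recomputed over all historical data, a realtime view over events the last batch run has not seen, and a query that merges the two. The function and dataset names below are illustrative.

```python
def batch_view(master_dataset):
    """Batch layer: a full (slow) recomputation over all historical data."""
    view = {}
    for user, amount in master_dataset:
        view[user] = view.get(user, 0) + amount
    return view

def realtime_view(recent_events):
    """Speed layer: an incremental view over events the batch run hasn't seen."""
    view = {}
    for user, amount in recent_events:
        view[user] = view.get(user, 0) + amount
    return view

def query(user, batch, realtime):
    """Serving layer: response = f(all data) = merge of both views."""
    return batch.get(user, 0) + realtime.get(user, 0)

batch = batch_view([("alice", 10), ("bob", 5), ("alice", 7)])
speed = realtime_view([("alice", 3)])
assert query("alice", batch, speed) == 20
```

When the next batch run absorbs the recent events, the speed-layer view is discarded and rebuilt, so the merged answer stays both complete and current.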
3 - SQL on HDFS and GFS
Impala
As useful as Hive is, the latency of Hadoop means that each query in an interactive Hive session takes many seconds, which makes it difficult to explore and evolve ideas quickly. Impala is a new query engine that bypasses MapReduce for very fast queries over data sets in HDFS. It uses HiveQL as the query language. Impala is very new; the first production release is forthcoming. It does not yet support all HiveQL features, but in many scenarios, speedups of 100x over Hive performance are already possible. Apache Drill is another upcoming technology, along with Tez/Stinger from Hortonworks. The goal of all of these is to offer a SQL-like interface to massive datasets with real time response.
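To show the kind of interactive aggregate these engines serve, here is a small sketch using Python's built-in sqlite3 purely as a stand-in for an Impala/HiveQL engine; the table and column names are made up, and a real Impala query would run over files in HDFS rather than a local database.

```python
import sqlite3

# sqlite3 stands in for Impala here only to show the query shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 120), ("blog", 80), ("home", 40)])

# The same aggregate in HiveQL looks essentially identical; the difference
# is that Impala answers it interactively instead of launching MapReduce jobs.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY 2 DESC"
).fetchall()
assert rows[0] == ("home", 160)
```

The point is that the analyst's language does not change; only the engine underneath does, trading batch job startup cost for a long-running daemon that answers in seconds.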
Spark and Shark
While very flexible, MapReduce has a number of constraints that affect performance. Spark is a newer distributed computing framework that exploits sophisticated in-memory data caching to significantly improve many common data operations, sometimes by 10x or more. Shark is a port of Hive to Spark, bringing performance improvements to Hive queries comparable to those Impala provides. Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.
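Spark's speedup comes largely from materializing an intermediate dataset in memory once and reusing it across many actions, instead of rereading and recomputing from disk each time as MapReduce chains do. The toy class below models that caching behavior only; it is not Spark's API, and the names are invented.

```python
class CachedDataset:
    """Toy model of Spark-style caching: compute once, serve from memory after."""
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.recomputes = 0

    def collect(self):
        if self._cache is None:           # first action materializes the data
            self._cache = self._compute()
            self.recomputes += 1
        return self._cache                # later actions hit memory, as in Spark

squares = CachedDataset(lambda: [x * x for x in range(5)])
assert squares.collect() == [0, 1, 4, 9, 16]
squares.collect()                          # second action: no recomputation
assert squares.recomputes == 1
```

Iterative workloads (machine learning, interactive exploration) call `collect`-like actions on the same data over and over, which is exactly where avoiding repeated disk scans pays off.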
Google BigQuery
This is a web service that lets you do interactive analysis of massive datasets, up to billions of rows. It is the first publicly available access to Google's internal Big Data technology stack, which runs on GFS (Google File System).
4 - Special Big Relational Databases
Let us look at real time big data solutions aided by MPP (massively parallel processing) and columnar databases like HP Vertica. Adoption by Twitter, Facebook and Zynga shows that these databases hold a place in the realm of complete real time big data solutions. Depending on the use case, if you want real time analytic output based on a 360 degree view of an entity, you could feed all data streams to an MPP database for final reporting and calculations. SAP HANA, Sybase IQ and VoltDB are databases of similar analytical prowess. Oracle 12c (in-memory) and DB2 10.5 BLU (in-memory) databases also show promise of achieving the same results that Vertica and HANA have shown.
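The reason columnar engines like Vertica answer analytic aggregates so quickly is storage layout: each column is kept contiguously, so an aggregate scans only the column it needs. This tiny Python comparison illustrates the idea; the data and column names are made up.

```python
# Row-oriented layout: each record stored whole, wide or not.
row_store = [
    {"region": "east", "amount": 100, "note": "..."},
    {"region": "west", "amount": 250, "note": "..."},
    {"region": "east", "amount": 50,  "note": "..."},
]

# Columnar layout: one compact array per column.
col_store = {
    "region": ["east", "west", "east"],
    "amount": [100, 250, 50],
}

# SUM(amount) in the columnar layout touches one array and skips every
# other column entirely; a row store must walk whole records.
assert sum(col_store["amount"]) == sum(r["amount"] for r in row_store) == 400
```

At petabyte scale, skipping unneeded columns (plus per-column compression and parallel scans across MPP nodes) is what turns a long batch report into an interactive query.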
5 - Big Data Appliances
Let us now look at big data appliances like Aster, Netezza and Greenplum. They are cluster based, with hardware-driven query execution engine grids that support petabytes of data. Do they have a place in big data as we know it now? I will leave that to the use cases, to how they compete with MPP databases like Vertica in the future, and moreover to how they co-exist with Hadoop. All appliance vendors have shown a spirit of Hadoop co-existence, which is good news. Hardware acceleration of SQL queries is a big benefit. While their analytic calculation engines run at scale, cost and storage are prohibitive.
6 - NoSQL databases
NoSQL databases are a new breed of databases that support unstructured data storage and show promise of scalability and speed. Except for HBase, most of these databases are Hadoop independent. Let us consider Apache Cassandra here. Cassandra is a shared-nothing NoSQL columnar database. Cassandra Query Language (CQL) is a SQL (Structured Query Language)-like language for querying Cassandra. Cassandra's data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key. Tables can be created, dropped, and altered at runtime without blocking updates and queries. Cassandra does not support joins or subqueries, except for batch analysis through Hadoop.
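The partition key / clustering column split described above can be modeled in a few lines of Python: rows land in a partition chosen by the partition key, and are kept sorted by the clustering column within it. This is a conceptual sketch only; real Cassandra distributes partitions across nodes and is queried through CQL, and the names here are invented.

```python
from bisect import insort

class MiniPartitionStore:
    """Toy model of Cassandra's partitioned row store."""
    def __init__(self):
        self.partitions = {}

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions.setdefault(partition_key, [])
        insort(rows, (clustering_key, value))   # clustered order within partition

    def select(self, partition_key):
        """Like `SELECT ... WHERE pk = ?`: one partition, rows in cluster order."""
        return self.partitions.get(partition_key, [])

store = MiniPartitionStore()
store.insert("user-1", 2, "login")
store.insert("user-1", 1, "signup")
store.insert("user-2", 1, "signup")
assert store.select("user-1") == [(1, "signup"), (2, "login")]
```

Because a query names its partition up front, Cassandra can route it to exactly the right node and read a pre-sorted slice, which is why single-partition reads stay fast at scale while joins across partitions are off the table.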
MongoDB is a very popular master-slave architecture NoSQL database. It is a document oriented database: it takes JSON documents and stores them in shards, and it can scale out with shards. It provides a JavaScript-style shell interface to query the database. Like Cassandra, it has Java, Python and .NET interfaces, but unlike Cassandra, it does not have a SQL-like interface.
7 - Solr and Elastic Search
Solr and ElasticSearch leverage Lucene's near real-time capabilities, which make it possible for queries to match documents right after they've been indexed. In addition, both Solr (since 4.0) and ElasticSearch (since 0.15) allow versioning of documents in the index. This feature allows them to support optimistic locking and thus prevent overwriting updates. Search is a basic component of any real time Big Data solution. Unless you properly index all of your data with Cassandra, HBase, MongoDB etc., you will still need a separate indexing system to meet your full text query needs for faceted, fuzzy and natural-language-processed results. Solr and ElasticSearch fit that need very well. Both have strengths and weaknesses, but they are solid partners to any real time Big Data solution that calls for sophisticated text queries over big data.
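The optimistic locking that document versioning enables works like this: an update is accepted only if the caller supplies the version it last read; otherwise it is rejected as a potential lost update. The Python sketch below models that protocol only; it is not the Solr or ElasticSearch API, and the class and method names are invented.

```python
class VersionedIndex:
    """Toy model of version-based optimistic locking in a search index."""
    def __init__(self):
        self.docs = {}   # doc_id -> (version, body)

    def index(self, doc_id, body, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise ValueError("version conflict: document changed since read")
        self.docs[doc_id] = (current + 1, body)
        return current + 1

idx = VersionedIndex()
v1 = idx.index("doc1", "first draft")              # new doc -> version 1
idx.index("doc1", "edit A", expected_version=v1)   # succeeds -> version 2
try:
    idx.index("doc1", "edit B", expected_version=v1)  # stale version, rejected
    conflict = False
except ValueError:
    conflict = True
assert conflict
```

The losing writer re-reads the document, reapplies its change, and retries, so concurrent updaters never silently clobber each other without any locking on the read path.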
SQL as unifier
Now that we have discussed real time big data options on a broad basis, how do we make use of them? What is the best way to calibrate them? What about the SQL way? There is a legacy SQL ecosystem built over the last 40 years, and the language has a very large following. So we see immense effort by many companies to introduce real time SQL response for Big Data. For example, Google BigQuery and Cloudera Impala have production ready SQL interfaces to Big Data. The good news is that more big data systems now support a SQL interface; lightweight and fast NoSQL databases like MongoDB are the only exception. Solr and ElasticSearch can be integrated with the help of a JDBC driver that invokes the right data import utility at the source to build the index shards for real time query (facets, caching etc.) as the user wants. SQL is a unifier. On a grand scale, Apache Hive with Shark, Cloudera Impala, Pivotal HAWQ etc. externalize HBase tables and HDFS files via storage handlers to enable a SQL interface.
Let us consider an example of a realtime streaming and messaging platform where Storm and Kafka ingest data at a very high rate, and you write the data to HBase as part of complex event processing or publication/subscription of incoming high speed data. Then you query the HBase data with a SQL interface. This is a lambda architecture solution, as stated in Option 2 above. Users at large are at a great advantage in reusing their SQL skills across the varieties of data sets that make up the real time big data systems of today.
Conclusion
The Big Data ecosystem is evolving. We reviewed broad choices for real time big data above. While the momentum behind SQL for real time big data is large and growing, some question whether abandoning the MapReduce framework is the best way to achieve low latency in Hadoop. Use of MapReduce will thrive for years, and there will be big advances in real time analytics on top of Hadoop and Big Data.
In the meantime, other technologies, such as very fast in-memory relational analytic databases like SAP HANA, NoSQL databases like Cassandra, MPP databases like Vertica and Big Data appliances, will complement and co-exist with Hadoop for real time business needs, be it player scoring, real time device diagnosis, online ad placement or pin-pointed fraud detection.
Reference URLs:
1. http://www.apache.org
2. http://www.datanami.com/datanami/2014-02-14/rethinking_real-time_hadoop.html?featured=top
3. http://www.vertica.com
4. http://shark.cs.berkeley.edu
5. https://developers.google.com/bigquery/