Introduction
Hadoop and NoSQL databases have helped address the 3Vs (Volume, Velocity and Variety) of Big Data. If you survey the solutions in production today, you will find that the majority of Big Data implementations are Hadoop-based and run in the cloud. As an example, Netflix uses Amazon Web Services extensively to offer its online movie rentals. What about Pinterest? Pinterest also uses Amazon Web Services. Using the cloud to offer big data solutions for social business and enterprise is no longer a stretch. It is practical and here to stay.
In this blog, I will survey cloud offerings for Hadoop-based big data. We will cover Amazon, Google and Microsoft as big data infrastructure providers. We will also cover Infochimps, which is a pure Hadoop services provider.
Amazon - EMR to Kinesis
With Amazon Elastic MapReduce (Amazon EMR), you can analyze and process vast amounts of data. It does this by distributing the computational work across a cluster of virtual servers running in the Amazon cloud. The cluster is managed using an open-source framework called Hadoop. Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the master node, controls the distribution of tasks.
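To make the map and reduce phases concrete, here is a minimal single-process sketch in Python. It is not Hadoop or Amazon EMR code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and on a real cluster the map calls would run on separate servers.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group the emitted values by key, as happens between the two phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for the slice of input handled by one server.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "hadoop in the cloud"]))
```

The point of the split is that `map_phase` never needs to see more than its own chunk, which is what lets the work be spread across many servers before the results are reduced to one output set.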
Amazon EMR has made enhancements to Hadoop and other open-source applications so that they work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. A Hadoop cluster launched and managed this way is called an Amazon EMR cluster.
Amazon offers add-ons such as Hive, Pig, HBase, Impala and, more recently, Kinesis (real-time streaming) as services on top of Hadoop. Starting with AMI 3.0.4, Amazon EMR clusters can read and process Amazon Kinesis streams directly, using familiar tools in the Hadoop ecosystem such as Hive, Pig, MapReduce, the Hadoop Streaming API, and Cascading. You can also join real-time data from Amazon Kinesis with existing data on Amazon S3, Amazon DynamoDB, and HDFS in a running cluster. You can directly load the data from Amazon EMR to Amazon S3 or DynamoDB for post-processing activities. For information about Amazon Kinesis service highlights and pricing, see Amazon Kinesis.
Integration between Amazon EMR and Amazon Kinesis makes certain scenarios much easier; for example:
- Streaming log analysis: You can analyze streaming web logs to generate a list of the top 10 error types every few minutes by region, browser, and access domain.
- Customer engagement: You can write queries that join clickstream data from Amazon Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads that are displayed on particular websites.
- Ad-hoc interactive queries: You can periodically load data from Amazon Kinesis streams into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.
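The streaming log analysis scenario above boils down to counting error types per time window and per region. Here is a minimal sketch of that aggregation in plain Python. It does not use the Kinesis or EMR APIs; the record fields (`timestamp`, `region`, `error_type`), the region names and the five-minute window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 300  # assumed value for "every few minutes"


def top_errors_by_region(records, top_n=10, window_seconds=WINDOW_SECONDS):
    """Return {(window_start, region): [(error_type, count), ...]}.

    Each record is a hypothetical dict with 'timestamp' (epoch seconds),
    'region' and 'error_type' keys.
    """
    counters = defaultdict(Counter)
    for record in records:
        # Align the timestamp to the start of its fixed-size window.
        window_start = record["timestamp"] - record["timestamp"] % window_seconds
        counters[(window_start, record["region"])][record["error_type"]] += 1
    # Keep only the top_n most frequent error types in each group.
    return {key: counter.most_common(top_n) for key, counter in counters.items()}


if __name__ == "__main__":
    sample = [
        {"timestamp": 0, "region": "us-east", "error_type": "404"},
        {"timestamp": 10, "region": "us-east", "error_type": "500"},
        {"timestamp": 20, "region": "us-east", "error_type": "404"},
        {"timestamp": 310, "region": "eu-west", "error_type": "503"},
    ]
    for key, ranking in sorted(top_errors_by_region(sample).items()):
        print(key, ranking)
```

On a cluster, the same grouping would typically be expressed as a Hive or Pig query over the stream; the sketch only shows the shape of the computation.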
Google - BigQuery to Hadoop
Compared to Amazon, Google is a newer Hadoop cloud provider, but a formidable one. Hadoop on Google (https://cloud.google.com/solutions/hadoop/) is gaining steam. Before Hadoop, Google announced its flagship BigQuery service. Querying massive datasets can be time-consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. You simply move your data into BigQuery and let Google handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.
You can access BigQuery by using a browser tool or a command-line tool, or by making calls to the BigQuery REST API using client libraries for languages such as Java, PHP or Python. There are also a variety of third-party tools that you can use to interact with BigQuery, for example to visualize or load data.
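To give a feel for the kind of SQL-like aggregate query over an append-only table described above, here is a small local sketch. It deliberately does not call BigQuery: it uses Python's built-in `sqlite3` module as a stand-in, and the table name `page_views`, its columns and the sample rows are hypothetical.

```python
import sqlite3


def build_events_table():
    """Create an in-memory table and append rows; rows are never updated."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
    rows = [("US", 120), ("IN", 80), ("US", 30), ("DE", 45)]
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    conn.commit()
    return conn


def views_per_country(conn):
    """Run an aggregate query: total views per country, largest first."""
    query = (
        "SELECT country, SUM(views) AS total "
        "FROM page_views GROUP BY country ORDER BY total DESC"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(views_per_country(build_events_table()))
```

The difference with BigQuery is scale and where the work happens: the query is of the same general style, but it runs on Google's infrastructure rather than on a local database file.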
Coming back to Hadoop on Google infrastructure, Hadoop scales fast on Google Cloud Platform. Rather than relying on hardware to deliver high availability, the Hadoop library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The Hadoop cluster setup on Google Compute Engine creates one master node and a user-specified number of worker nodes. The master node runs the HDFS NameNode and MapReduce JobTracker. Each worker node runs an instance of the HDFS DataNode and MapReduce TaskTracker.
The Apache Hive and Pig setup installs Hive and/or Pig on the master node and configures appropriate access to directories within HDFS.
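The node layout described above can be summarized in a few lines of Python. This is only a descriptive sketch, not Google's deployment tooling: the function name `plan_cluster`, the node names and the dictionary shape are assumptions made for illustration.

```python
def plan_cluster(num_workers, with_hive=False, with_pig=False):
    """Describe which software runs on each node of the cluster."""
    if num_workers < 1:
        raise ValueError("a cluster needs at least one worker node")

    # The single master node runs the coordinating services.
    master = {
        "name": "master",
        "software": ["HDFS NameNode", "MapReduce JobTracker"],
    }
    # Hive and Pig, when requested, are installed on the master node only.
    if with_hive:
        master["software"].append("Hive")
    if with_pig:
        master["software"].append("Pig")

    # Every worker runs the storage and task-execution services.
    workers = [
        {
            "name": f"worker-{index}",
            "software": ["HDFS DataNode", "MapReduce TaskTracker"],
        }
        for index in range(num_workers)
    ]
    return {"master": master, "workers": workers}


if __name__ == "__main__":
    layout = plan_cluster(num_workers=3, with_hive=True)
    print(layout["master"])
    for worker in layout["workers"]:
        print(worker)
```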
Google Cloud Storage plays two roles in the solution:
- Stages the software for installation on Google Compute Engine instances. When an instance starts, it downloads the appropriate software package from Google Cloud Storage.
- Provides the durable storage for data. Data is brought onto the Google Compute Engine cluster and pushed into HDFS for processing. Results data is then copied from HDFS into Google Cloud Storage.
Microsoft - Windows to Apache Hadoop
What are the options for Microsoft Windows users? Microsoft offers HDInsight at http://www.windowsazure.com/en-us/services/hdinsight/ .
HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud. It is a cloud-based data platform that manages data of any type, structured or unstructured, and of any size. With HDInsight you can process data of all types through Microsoft's data platform, which provides simple management and an open, enterprise-ready Hadoop service running in the cloud. You can analyze your Hadoop data with PowerPivot, Power View and other Microsoft BI tools, thanks to integration with the Microsoft data platform running on Windows.
Infochimps - Hadoop to Streams
Infochimps™ Cloud (http://www.infochimps.com/infochimps-cloud/overview/) offers a suite of cloud services that make it faster and far less complex to develop and deploy Big Data applications. Big Data applications solve real business problems through analytics, data flows, scalable storage and flexible, interactive interfaces. The consumers of your applications and insights may be your own employees, business partners or customers. Infochimps is a third-party, pure Hadoop service provider, unlike the others, which provide compute, storage and other services in addition to Hadoop. Infochimps provides:
- Cloud::Streams: streaming data and real-time analytics. Infochimps' position is that turning data into value requires enterprises to operationalize their insights in real time and to act on a customer's state of mind in the moment. Cloud::Streams is a stream processing cloud service that provides in-memory, real-time analytics, distributed ETL and complex event processing (CEP), all in a single cloud service.
- Cloud::Queries: NoSQL database and ad hoc, query-based analytics. Enterprises also need to explore their data to discover new insights, and business users need to run ad hoc queries in "query-response" time. Cloud::Queries is an advanced NoSQL cloud service that provides the ability to interact with and analyze billions of data elements.
- Cloud::Hadoop: elastic Hadoop clusters and batch analytics. Understanding how the entire business operates quarter to quarter, season to season and year to year means analyzing all operations holistically (sales, marketing, engineering, operations, finance, HR and so on) over any period of time. Cloud::Hadoop addresses this with simplified batch analytics.
Conclusion
In this blog, we reviewed vendors who provide Hadoop as a service. As we found, Amazon is the most mature and the biggest player. Google is ramping up. Microsoft is now in the market with HDInsight, aimed specifically at the Windows user base. We discussed Infochimps as a distinct solution provider that does not have the infrastructure power of Amazon, Google or Microsoft, but it can offer an alternative if you do not want to go with the big players and just want to focus on pure big data solutions and SLAs.
I have not covered Rackspace, IBM, HP and many other providers for lack of space. There are also solution providers that use Hadoop infrastructure in the back end to offer add-on services like Hive, Pig and so on. They include Qubole, Mortar Data and Treasure Data, among others. The cloud offering market for Hadoop is maturing and expanding. It is now at a stage where there is intense pricing competition between Amazon, Google and Microsoft. Who are the winners? We are: the developer and customer community.
On March 31, 2014, Microsoft announced price cuts to its Azure cloud computing platform, which includes HDInsight. This move appears to be a reaction to Amazon's own price cuts the week before, which were in turn prompted by Google cutting prices on its own cloud platform. Microsoft is keeping its word to price match Amazon in cloud computing. As an Azure developer you can look forward to compute price cuts of up to 35% and storage price cuts of up to 65%, in line with the Amazon and Google cuts.
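To see what cuts of that size mean for a bill, here is a small worked example in Python. The baseline prices are hypothetical round numbers, not real Azure list prices, and the function name `price_after_cut` is my own; only the 35% and 65% figures come from the announcement, and they are maximum cuts, so actual savings may be smaller.

```python
def price_after_cut(old_price, cut_percent):
    """Return the new price after reducing old_price by cut_percent percent."""
    if not 0 <= cut_percent <= 100:
        raise ValueError("cut_percent must be between 0 and 100")
    return round(old_price * (1 - cut_percent / 100), 4)


if __name__ == "__main__":
    # Hypothetical monthly spend before the cuts, in dollars.
    old_compute_bill = 100.0
    old_storage_bill = 100.0

    # Best case: the maximum announced cuts apply to the whole bill.
    print(price_after_cut(old_compute_bill, 35))
    print(price_after_cut(old_storage_bill, 65))
```

In other words, in the best case a compute bill falls to roughly two thirds of what it was and a storage bill to roughly one third.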
So, Big Data and Hadoop as a cloud offering are getting more affordable and more exciting, and are well worth trying.