Big Data applications are changing the way we live, think, and respond. After reading stories about Google, Twitter, Facebook, Pinterest, and the Ubers of the world, I have found that these companies are successful because their apps respond to the needs of millions of people on this earth. Not only is the response timely, it is often customized, meeting a personal need at the moment: the best nearby restaurant within your budget, a mobile phone available at a cheaper store around you, or even a doctor for fast urgent care. The uses are numerous.
How do they do it? To deliver at scale and respond 24x7, they have built scalable, reliable, and secure big data infrastructure that meets the needs of agile development teams. Agile development teams at these organizations are empowered to unleash innovation at the speed of thought. Innovation keeps them alive. They keep scavenging for the best open source technology they can get their hands on, so that they can experiment with new ideas, build mock-ups, and write code close to production standards. There is not much lag between initial code and production code. First-time code has to be the right code!
I have discussed Hadoop and Machine Learning throughout the year here. Let me end this year with a blog on how to implement big data apps by realizing product features fast. I will start with what I mean by agility, and then discuss visualization-driven ideation and mock-up methods. This is very important: businesses need top-notch visualization of big data findings to garner insights and to make useful, timely decisions. Taking cues from those ideas, I will discuss the implementation choices we have. There, I will cover web app implementation and modern data infrastructure choices, urging you to decide on a framework for development, storage, and processing. It is the foundation of what you will build and maintain for years to come. I will touch upon agile development concepts as needed.
Agility
Agile Development fits Big Data development well. Agile methods attempt a useful compromise between no process and too much process, providing just enough process to gain a reasonable payoff at the speed of business. Agile methods have some significant changes in emphasis from engineering methods. The most immediate difference is that they are less document-oriented, usually emphasizing a smaller amount of documentation for a given task. In many ways they are rather code-oriented: following a route that says that the key part of documentation is source code.
Two key characteristics of agile methods apply well to big data development: they are adaptive, and they are people-oriented. These two themes advance modern big data application development. Just ask a Google or Uber developer.
Idea
From understanding the background information, we want to learn the following: What is the main point we want to make to our audience? If there are multiple goals, how can we prioritize them? Knowing these answers helps us decide on the hierarchy of ideas to visually emphasize. At this stage, there is very little discussion about the look and feel; that will emerge with time. Sketching out an assortment of thumbnails of potential designs is a good habit. Keep your mind open to different ideas. Trying out ideas with pen and paper saves a lot of time later on, because if you identify a bad idea, it is much less painful to throw it out since you haven't invested much time and effort.
After sketching ideas, we should experiment with custom data visualizations by coding proofs-of-concept. Using d3.js is advised, but there are many solutions out there. This site is an example of a quick sketch using d3.js to understand how a sketch looks with real data. It doesn't need to be fancy, just enough to get the concept across so you have time to explore other ideas.
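To illustrate just how little code a throwaway proof-of-concept needs, here is a minimal sketch in Python with matplotlib, one of the "many solutions out there" for readers who want to stay outside the browser. The file name and column names are hypothetical stand-ins for your own data:

```python
# A throwaway proof-of-concept plot: load real data and eyeball the idea.
# The file "restaurants.csv" and the columns "distance_km" and "rating"
# are hypothetical placeholders for whatever dataset you are exploring.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("restaurants.csv")            # real data, not dummy data

plt.scatter(df["distance_km"], df["rating"], alpha=0.5)
plt.xlabel("Distance (km)")
plt.ylabel("Rating")
plt.title("Quick sketch: does distance vs. rating tell a story?")
plt.show()                                     # ugly is fine at this stage
```

If the picture does not tell a story, throw it out; you have lost minutes, not days.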
After deciding which ideas are the most promising, we should create a mock-up. At this stage there is plenty of jumping back and forth between pen-and-paper sketching, d3.js sketching, and the mock-up while refining ideas. Let ideas win consensus from stakeholders and developers first. This is what should happen during the initial sprints of Scrum-like agile development. Let us not fall into the trap of coding from a sketch and playing around with the style sheets to figure out the design. Separating the design side from the development side is much more efficient, and the same person should not do both tasks. They require different skills, and trying to design while developing is difficult, error-prone, and frustrating.
Implementation
We are fully aware of LAMP and how open source technology revolutionized web application development forever. LAMP was used to build the prototypes of Facebook, Flipkart, and so many big data applications that we use today. The world has changed since then: MySQL has become NoSQL, and PHP has become JavaScript. What are our choices today? There is good news. Following in the footsteps of LAMP, new frameworks like MEAN and Meteor are advancing rapid application development, using base modules like jQuery, HTML5, Node.js, and Angular.js. They shorten development time while giving developers control by sharing data, models, and code repositories. This approach is rooted in agile development methodology.
If you want to stay conservative, you can stick with vendors, Cloudera/Hortonworks/MapR/IBM/MongoDB/Cassandra, and keep experimenting with what they offer to meet your needs. You still need to keep track of open source offerings that you must integrate with the vendor offerings, but you get the peace of mind of having someone to go to for help as you design your architecture and start data prep, coding, and deployment. If you do not want to go the vendor route, you can pick and choose open source modules. You then have to tap into the right resources (software and skills) to deliver the goods, which may end up being expensive and frustrating. The beauty here is that you devise one blueprint that works for you, so you can keep going for a long time to come. This is what Google, Netflix, Facebook, and others have in practice. Google has innovated relentlessly since it solved the search/rank problem using GFS (the Google File System) for storage and MapReduce for computation, a decade or more back.
Today, Cassandra and HDFS are coupled with multiple execution frameworks, including Spark, Spark Streaming, Spark SQL, Impala, and CQL. Kafka and Storm are used for scalable, fault-tolerant data ingestion. Once data is ingested, it needs to be accessed and processed, so you need a transparent mechanism for data flow and processing.
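As a minimal sketch of the ingestion side, here is a Kafka consumer using the kafka-python package. The broker address and the topic name "events" are assumptions for illustration, not fixtures of any particular deployment:

```python
# Minimal Kafka ingestion sketch (kafka-python package).
# Broker address and topic name are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                                  # assumed topic name
    bootstrap_servers=["localhost:9092"],      # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",              # replay from the beginning
)

for message in consumer:
    event = message.value
    # Hand each event off to downstream processing or storage here.
    print(event)
```

Storm or Spark Streaming would sit downstream of (or in place of) this loop for fault-tolerant processing at scale.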
The ideal scenario for us would be a single copy of the data to serve all our data needs, but this is very difficult to achieve, because the underlying architectures that serve high-volume, low-latency OLTP needs are fundamentally different from those needed to serve the large table scans required in analytical use cases. For example, Cassandra purposely makes it difficult to scan large ranges of data, but allows low-millisecond access to millions of records per second. On the flip side, the fundamental storage pattern in distributed file systems like HDFS purposely does not index data for low-millisecond access, but uses larger block sizes to allow very fast sequential access that can scan billions of records per second.
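To make Cassandra's side of that trade-off concrete, here is a sketch using the DataStax Python driver: the table is modeled around its partition key, so reads are cheap single-partition point lookups rather than large scans. The keyspace, table, and column names are hypothetical:

```python
# Cassandra access-pattern sketch: model tables around point lookups
# by partition key; wide range scans are deliberately awkward.
# Keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])               # assumed contact point
session = cluster.connect("app")               # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id  text,
        event_ts timestamp,
        payload  text,
        PRIMARY KEY (user_id, event_ts)        -- partitioned by user
    )
""")

# Fast: a single-partition lookup by the partition key.
rows = session.execute(
    "SELECT * FROM user_events WHERE user_id = %s LIMIT 100", ("u123",)
)
for row in rows:
    print(row.event_ts, row.payload)
```

A full-table analytical scan over this data is exactly what you would push to HDFS instead.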
Streaming engines like Spark Streaming help mitigate the problem by allowing analytics to be performed on the fly as data is ingested into the system. Apache Storm can be used in place of Spark Streaming.
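As a minimal sketch of on-the-fly analytics, here is the classic Spark Streaming pattern in Python. The socket source on port 9999 is an assumption for illustration; in production the source would more likely be Kafka:

```python
# Minimal Spark Streaming sketch: compute a rolling metric as data
# arrives, instead of waiting for a batch scan later.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, batchDuration=5)    # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # assumed source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                # per-batch counts, on the fly

ssc.start()
ssc.awaitTermination()
```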
When was the last time you were able to predict every metric that needed to be calculated at project inception? Thus lambda architectures have arisen to allow metrics to be added down the road by recomputing history. One important characteristic of a good lambda architecture is that the system recomputing history should be able to operate quickly on a large dataset without disrupting the low-latency user traffic. Therefore, it is not uncommon to store a copy of the data in Parquet format on HDFS for recomputation, and another copy in Cassandra for the low-latency user traffic.
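A sketch of that batch layer with PySpark follows; the HDFS paths and the "event_date" column are hypothetical:

```python
# Batch-layer sketch: keep a scan-friendly Parquet copy on HDFS so
# history can be recomputed without touching the serving store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchLayer").getOrCreate()

# Land the raw immutable log as Parquet (paths are hypothetical).
events = spark.read.json("hdfs:///data/raw/events")
events.write.mode("overwrite").parquet("hdfs:///data/parquet/events")

# A metric added "down the road" is just a full scan over Parquet:
# cheap sequential access, isolated from Cassandra's user traffic.
daily = (spark.read.parquet("hdfs:///data/parquet/events")
              .groupBy("event_date")
              .count())
daily.show()
```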
Modern data architecture requires the ability to use multiple execution frameworks over the same data. By using open data formats and storage engines, we gain the flexibility to use the right tool for the job, and we position ourselves to exploit new technologies as they emerge. That said, we have many choices. The trick is to look at the best practices of what successful companies have deployed, and derive techniques that conform to your timing, budget, and resource guidelines. If skills were unlimited in supply and there were no budget constraints, you would be best positioned to devise the right architecture and process the first time. That is rare. You have to keep experimenting with and tuning the framework. It is always a work in progress, and 100% agile.
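To show what "multiple frameworks over the same data" buys you, the same (hypothetical) Parquet files written above can be exposed to Spark SQL as a view, and those identical files could just as well be queried by Impala or Hive without any conversion:

```python
# The open Parquet format lets several engines share one copy of data.
# Here, Spark SQL queries the hypothetical path from the earlier sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SharedFormat").getOrCreate()

spark.read.parquet("hdfs:///data/parquet/events") \
     .createOrReplaceTempView("events")

top = spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
    LIMIT 10
""")
top.show()
```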
Conclusion
We have only scratched the surface of how this process can be used to solve data engineering problems in the context of Big Data app development, from idea to implementation. I have urged you to practice agile methodology. Hopefully, this post provides some useful ideas. If you want to know more about what Big Data apps for the enterprise are about, please read the famous Forbes article by Edd Dumbill.