In this blog, I present the steps to troubleshoot Hadoop and its components. Hadoop is a complex framework. Consider a typical production deployment today: clusters running HDFS, HBase, Hive, and MapReduce (often driven through Pig or Hive) are fairly common. These deployments run 24x7 and the systems need to be highly available. Still, they fail for basic reasons: configuration, permissions, file locations, inadequate CPU/memory allocation, and so on.
To debug Hadoop issues, you should be aware of the common trouble spots. It is best to gain a thorough understanding of configuration and module-specific troubleshooting, and of the high-availability aspects of running Hadoop clusters, early in the cycle.
There are several ways to troubleshoot Hadoop production deployment issues. In this blog, I focus on MapReduce 1, aka Hadoop 1, because most installations are on that platform. Where needed, I also cover MapReduce 2, aka Hadoop 2, also known as YARN. We will review the high-availability aspects of Hadoop as well, because in production use, Hadoop clusters need to be highly available, and troubleshooting is a key task in keeping them that way.
You will also find step-by-step procedures for module-specific diagnosis from a setup perspective. The assumption is that if you have the right permissions and the right Java configuration, you should be able to run your Hadoop jobs and commands. Each module is vast and has its own diagnostics. For example, region servers in HBase may fill up and the cluster can stall; to fix that issue, you have to use the diagnostics that come with HBase.
These steps assume you have sized your Hadoop deployment properly, i.e., your nodes are configured correctly with CPU, memory, and disks, and are networked per industry standards. Please refer to [1] and [2] below for more information on approaches to Hadoop troubleshooting.
Steps:
Step 1) Validate environment information, including the versions installed and used
1.1 Hadoop distribution and version installed
1.2 Validate that the above documented versions are supported
1.3 Confirm client connectivity by hostname to each of the nodes you have configured for the cluster; a sketch follows. Use ping <node hostname> for each node. If you cannot reach a node, contact:
- your Hadoop administrator, and/or
- your IT administrator
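A minimal loop, where node1.example.com and node2.example.com stand in for your actual node hostnames:
for host in node1.example.com node2.example.com; do
  ping -c 2 "$host" > /dev/null && echo "$host reachable" || echo "$host UNREACHABLE"
done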
Step 2) Check cluster management interfaces and collect user information
2.1 Check your Hadoop cluster management interface for any errors and issues:
- Cloudera Manager
- Apache Ambari (Hortonworks)
- MapR Control System (CLDB web interface)
- NameNode web UI for MapReduce 1
- JobTracker web UI for MapReduce 1; for example, http://mymachine.com:50030 is the default web address for the JobTracker daemon.
- If you have installed YARN (MapReduce 2), it runs the ResourceManager instead.
- To find the YARN ResourceManager UI, check your yarn-site.xml file for the property yarn.resourcemanager.webapp.address.
- By default, it points to resource_manager_hostname:8088.
- Assuming your ResourceManager runs on mymachine, you should see the ResourceManager UI at http://mymachine.com:8088/
- For YARN-related commands, including how to capture logs, please refer to the Apache documentation.
Make sure all your daemons are up and running, whether you are running MapReduce 1 or 2; a quick check is sketched below.
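One way to verify is the jps tool that ships with the JDK; run it on each node. The daemon names below are the usual ones, but your distribution may wrap them differently:
jps
# Master node, MapReduce 1: NameNode, SecondaryNameNode, JobTracker
# Master node, YARN: NameNode, ResourceManager
# Worker nodes: DataNode, plus TaskTracker (MR1) or NodeManager (YARN)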
2.2 Collect the usernames under which you run the various Hadoop services, as below:
- Cluster
- Distributed filesystem
- hdfs from Cloudera or Hortonworks, or
- maprfs from MapR, or
- others, like gpfs from IBM
- Map/Reduce 1 and 2
- Hive
- Hive2
- Impala (Cloudera)
- HBase
- Pig
- Individual directories
Step 3) Troubleshooting HDFS
3.1 Identify the user(s) under which you will access HDFS.
3.2 SSH into the cluster as the identified user(s).
3.3 Issue the following commands against HDFS to verify that you have the authority to execute them:
hadoop fs -ls /
hadoop fs -ls
hadoop fs -cp <somefile> <somefile>.temp
hadoop fs -rm <somefile>.temp
3.4 If any of the above commands fail, contact your Hadoop administrator or your Hadoop vendor.
Step 4) Troubleshooting Hive, Hive2 and Impala
For Hive, Hive2, and Impala, collect the following information:
- Who is the service running as?
- Who owns <fill in distro name>?
- On which ports is the service running?
- Use netstat to verify the ports are open (see the example below).
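For example, on the node running the service (the ports below are common defaults: 10000 for HiveServer2, 9083 for the Hive metastore, 21000 for impalad's impala-shell port; verify the actual ports in your configuration):
netstat -tlnp | grep -E '10000|9083|21000'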
4.1 What Hive Metastore are you using?
- local
- remote
- thrift url
- database
- ports
4.2 You can find this information by:
- Contacting your Hadoop administrator
- Reviewing the Hadoop management console
- Reviewing your configuration files for this service
4.3 SSH into the cluster as the above-selected user(s).
4.4 For Hive, execute
hive
4.5 For Hive2, execute
beeline
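Beeline typically needs a JDBC connection string; a minimal sketch, assuming HiveServer2 listens on mymachine.com at the default port 10000 and <username> stands in for the user identified earlier (your host, port, and authentication settings may differ):
beeline -u jdbc:hive2://mymachine.com:10000 -n <username>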
4.6 For Impala, execute
impala-shell
Once you are connected to the command prompt, refresh your tables; refer to the Impala documentation for the proper refresh command (for example, INVALIDATE METADATA or REFRESH, depending on your version).
4.7 Execute the following command in the shell you connected with:
show tables;
Verify that the correct tables are shown.
4.8 To validate that you have the right access rights, execute a simple SELECT statement against a table of your choosing. To minimize execution time (and keep your frustration level low), pick a small table.
Example:
SELECT COUNT(*) FROM <mytab>;
Step 5) Troubleshooting HBase
5.1 Identify the correct user for HBase.
5.2 SSH into the cluster as the identified user.
5.3 Connect to HBase. For most distributions, use the following command:
hbase shell
5.4 Issue the following commands to verify that you have the authority to execute them. (Note that create requires at least one column family, and a table must be disabled before it can be dropped.)
list
create '<tablename>', '<columnfamily>'
describe '<tablename>'
disable '<tablename>'
drop '<tablename>'
5.5 If any of the above commands fail, contact your Hadoop administrator or your Hadoop vendor.
Step 6) Troubleshooting Pig
6.1 Identify the user(s) under which you will access HDFS (see the information you collected in Step 3).
6.2 SSH into the cluster as the identified user(s).
6.3 Validate that your Pig client is working correctly; a quick check from the grunt prompt is sketched after the examples below.
6.4 The command to start Pig is:
Local Mode
$ pig -x local
... - Connecting to ...
grunt>
Mapreduce Mode
$ pig -x mapreduce
... - Connecting to ...
grunt>
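Once you reach the grunt> prompt, a minimal sanity check is to list HDFS through Pig's built-in fs command:
grunt> fs -ls /
grunt> quit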
Step 7) Troubleshooting MapR
MapR uses its own file system (MapR-FS) rather than HDFS, but it is still Apache Hadoop compatible.
7.1 Validate that the MapR client is installed and configured. You can test your MapR file system with:
hadoop fs -ls maprfs:///
7.2 Issue the following commands from the MapR client to verify that you have the authority to execute them:
hadoop fs -ls
hadoop fs -cp <somefile> <somefile>.temp
hadoop fs -rm <somefile>.temp
Step 8) Sanity checks
8.1 Run the famous word-count (or similar) sample included in your distribution to validate that the cluster is working correctly; a sketch follows. If it fails, contact your Hadoop administrator or your Hadoop vendor.
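A minimal run, assuming the examples jar shipped with your distribution; the jar name and path (/usr/lib/hadoop/hadoop-examples.jar here) vary by distribution and version, and the output directory must not already exist:
hadoop fs -mkdir /tmp/wc-in
hadoop fs -put /etc/hosts /tmp/wc-in/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wc-in /tmp/wc-out
hadoop fs -cat /tmp/wc-out/part-*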
8.2 For most distributions, the Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, capacity-scheduler.xml) are required, depending on whether you are running Hadoop 1, Hadoop 2, or both. Ensure that these files are copied to the Hadoop configuration folder of your application, and that they match the current files on the Hadoop cluster. These XML files tell the application which Hadoop cluster to connect to and carry the default configurations of the various Hadoop services; without them, the application will not know about your Hadoop cluster.
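A minimal sketch of syncing the client configuration, assuming the cluster configuration lives in /etc/hadoop/conf on a node called edge-node and your application reads its Hadoop configuration from ./conf (both paths are assumptions; check your distribution's layout):
scp edge-node:/etc/hadoop/conf/*-site.xml ./conf/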
Step 9) Hadoop cluster in High Availability (HA) mode
The Hadoop stack contains multiple services (HDFS, MapReduce, HBase, etc.), and each of these services has its own co-dependencies. A client application that interacts with Hadoop can depend on one or more of these services. A highly available Hadoop platform must ensure that the NameNode master service, as well as client applications, are resilient to critical service failures.
HA architecture has the following key properties:
- It provides high availability for the NameNode master daemon service.
- When the NameNode master daemon fails over, the HA solution initiates the following actions:
- Dependent services (like the JobTracker) automatically detect the failure or failover of the co-dependent component (the NameNode), and these dependent services pause, retry, and recover once the failed service returns. (For example, the JobTracker does not launch new jobs or kill jobs that have been waiting for the NameNode.)
- Applications running inside and outside the Hadoop cluster also automatically pause and retry their connection to the failed services.
9.1 High Availability (HA) in MapReduce 1
Collect the following configuration files:
- core-site.xml
- hdfs-site.xml
Make sure they are configured correctly, as per your Hadoop administrator. If the dfs.nameservices parameter is configured in hdfs-site.xml, the cluster is running HDFS NameNode HA, and clients should connect through the logical nameservice name rather than an individual NameNode hostname.
9.2 High Availability (HA) in MapReduce 2 (YARN)
To enable HA, set yarn.resourcemanager.ha.enabled to true in yarn-site.xml. YARN uses identifiers (rm-ids) as logical names for each ResourceManager. Set yarn.resourcemanager.ha.rm-ids to the list of identifiers you wish to use, for example rm1,rm2 (the value is a comma-separated list). For each identifier (corresponding to each ResourceManager), define yarn.resourcemanager.hostname.<rm-id> with the hostname of that ResourceManager. NodeManagers and clients use the configured addresses to find the ResourceManager service they need to talk to: they go through the list of rm-ids and successively try the address corresponding to each one.
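A minimal check of the HA settings and of which ResourceManager is currently active, assuming /etc/hadoop/conf/yarn-site.xml is your configuration path and rm1/rm2 are your rm-ids:
grep -A1 'yarn.resourcemanager.ha' /etc/hadoop/conf/yarn-site.xml
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2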
For a YARN application to complete successfully, both the ApplicationMaster and the ResourceManager must be live. Containers with completed tasks need to report to the ApplicationMaster, which in turn needs to report to the ResourceManager to commit the job completion.
The web UI of the standby ResourceManager will automatically redirect to that of the active ResourceManager, which is convenient. If you want to ascertain the HA status of a particular ResourceManager in an HA cluster, you can go to the /cluster/cluster page of the ResourceManager web address (this page does not redirect), or you can use the REST API, by going to /ws/v1/cluster/info of the web address.
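For example, with curl (replace mymachine.com:8088 with your ResourceManager web address; on HA-enabled versions the JSON response includes an haState field such as ACTIVE or STANDBY):
curl -s http://mymachine.com:8088/ws/v1/cluster/info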
9.3 Using the cluster name
Are you using the cluster name, or an IP address and/or hostname, to connect to your cluster? The preferred method is the cluster name. If you are using the cluster name and cannot connect, something is wrong with the cluster itself. If you are using an IP address or hostname and cannot connect, you might be trying to connect to the passive (standby) node.
Step 10) In some cases, you may have to turn off permissions to debug job runs. Permissions can be turned off via:
- cluster management user interface
- the dfs.permissions parameter (by setting it to false) in hdfs-site.xml; see the check below
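A minimal sketch for verifying the current setting from the command line. Note the property is dfs.permissions in Hadoop 1 and dfs.permissions.enabled in Hadoop 2, the getconf subcommand is available on Hadoop 2 clients, and the /etc/hadoop/conf path is an assumption:
hdfs getconf -confKey dfs.permissions.enabled
grep -A1 'dfs.permissions' /etc/hadoop/conf/hdfs-site.xml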
Step 11) If jobs run with permissions disabled, turn the permission flag back on and fix the underlying permission issues in Hadoop.
Step 12) If a Hadoop job fails after permission issues are fixed, use HDFS and module-specific log files to analyze and fix class-library problems. The correct class libraries need to be available to the Java runtime for jobs and commands to complete successfully on your Hadoop infrastructure; a sketch follows.
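To see what is on the client classpath, and to add a missing jar for a client-side run, a minimal sketch (the jar path, job jar, and main class are hypothetical placeholders):
hadoop classpath                                        # print the effective client classpath
export HADOOP_CLASSPATH=/path/to/missing-library.jar    # hypothetical path to the missing jar
hadoop jar myjob.jar com.example.MyJob                  # hypothetical job jar and main class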
Conclusion
In this blog, we reviewed the steps to troubleshoot the Hadoop framework operating at your organization. We covered popular Hadoop distributions, and since each module in the Hadoop framework behaves differently, important module-specific diagnostics were covered as well. I did not go over the log files of each module; log-file analysis helps pinpoint the real issues, but it would make this blog longer and more complex, so it can be addressed in another post. I wish your Hadoop deployments health and smooth operation.
References
1) Hadoop Troubleshooting
2) Avoiding MapReduce 1 and 2 Time-Consuming Gotchas
3) High Availability on Hadoop