In this blog, I present the steps to troubleshoot Hadoop and its components. Hadoop is a complex framework. Consider a typical production deployment today: clusters running HDFS, HBase, Hive, and MapReduce (often driven through Pig or Hive) are fairly common. These deployments run 24x7 and the systems need to be highly available. Still, they fail for basic reasons: configuration, permissions, file locations, inadequate CPU/memory allocation, and so on.
To debug Hadoop issues, you should be aware of the common trouble spots. It is best to gain a thorough understanding of configuration and module-specific troubleshooting, and of the high-availability aspects of running Hadoop clusters, early in the cycle.
There are several ways to troubleshoot Hadoop production deployment issues. In this blog, I focus on MapReduce 1, aka Hadoop 1, because most installations are on that platform. Where needed, I also cover MapReduce 2, aka Hadoop 2, also known as YARN. We will review the high-availability aspects of Hadoop as well, because in production use, Hadoop clusters need to be highly available, and troubleshooting is a key task in keeping them that way.
You will also find step-by-step procedures for module-specific diagnosis from a setup perspective. The assumption is that if you have the right permissions and the right Java configuration, you should be able to run your Hadoop jobs and commands. Each module is vast and has its own diagnostics. For example, region servers in HBase may fill up and the cluster can stall; to fix that issue, you have to use the diagnostics that come with HBase.
These steps assume you have sized your Hadoop deployment properly, i.e., your nodes are configured correctly with CPU, memory, and disks, and are networked per industry standards. Please refer to [1] and [2] below for more information on approaches to Hadoop troubleshooting.
Steps:
Step 1) Validate environment information, including the versions installed and used
1.1 Hadoop distribution and version installed
1.2 Validate that the above documented versions are supported
1.3 Confirm client connectivity by hostname to each of the nodes you have configured for the cluster; a sketch follows. Use ping <node hostname> for each node. If you cannot reach a node, contact:
- your Hadoop administrator, and/or
- your IT administrator
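A minimal loop, where node1.example.com and node2.example.com stand in for your actual node hostnames:
for host in node1.example.com node2.example.com; do
  ping -c 2 "$host" > /dev/null && echo "$host reachable" || echo "$host UNREACHABLE"
done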
Step 2) Check cluster management interfaces and collect user information
2.1 Check your Hadoop cluster management interface for any errors and issues:
- Cloudera Manager
- Apache Ambari (Hortonworks)
- MapR Control System (CLDB web interface)
- NameNode web UI for MapReduce 1
- JobTracker web UI for MapReduce 1; for example, http://mymachine.com:50030 is the default web address for the JobTracker daemon.
- If you have installed YARN (MapReduce 2), it runs the ResourceManager instead.
- To find the YARN ResourceManager UI, check your yarn-site.xml file for the property yarn.resourcemanager.webapp.address.
- By default, it points to resource_manager_hostname:8088.
- Assuming your ResourceManager runs on mymachine, you should see the ResourceManager UI at http://mymachine.com:8088/
- For YARN-related commands, including how to capture logs, please refer to the Apache documentation.
Make sure all your daemons are up and running, whether you are running MapReduce 1 or 2; a quick check is sketched below.
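One way to verify is the jps tool that ships with the JDK; run it on each node. The daemon names below are the usual ones, but your distribution may wrap them differently:
jps
# Master node, MapReduce 1: NameNode, SecondaryNameNode, JobTracker
# Master node, YARN: NameNode, ResourceManager
# Worker nodes: DataNode, plus TaskTracker (MR1) or NodeManager (YARN)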
2.2 Collect the usernames under which you run the various Hadoop services, as below:
- Cluster
- Distributed filesystem
- hdfs from Cloudera or Hortonworks, or
- maprfs from MapR, or
- others, like gpfs from IBM
- Map/Reduce 1 and 2
- Hive
- Hive2
- Impala (Cloudera)
- HBase
- Pig
- Individual directories
Step 3) Troubleshooting HDFS
3.1 Identify the user(s) under which you will access HDFS.
3.2 SSH into the cluster as the identified user(s).
3.3 Issue the following commands against HDFS to verify that you have the authority to execute them:
hadoop fs -ls /
hadoop fs -ls
hadoop fs -cp <somefile> <somefile>.temp
hadoop fs -rm <somefile>.temp
3.4 If any of the above commands fail, contact your Hadoop administrator or your Hadoop vendor.
Step 4) Troubleshooting Hive, Hive2 and Impala
For Hive, Hive2, and Impala, collect the following information:
- Who is the service running as?
- Who owns <fill in distro name>?
- On which ports is the service running?
- Use netstat to verify the ports are open (see the example below).
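For example, on the node running the service (the ports below are common defaults: 10000 for HiveServer2, 9083 for the Hive metastore, 21000 for impalad's impala-shell port; verify the actual ports in your configuration):
netstat -tlnp | grep -E '10000|9083|21000'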
4.1 What Hive Metastore are you using?
- local
- remote
- thrift url
- database
- ports
4.2 You can find this information by:
- Contacting your Hadoop administrator
- Reviewing the Hadoop management console
- Reviewing your configuration files for this service
4.3 SSH into the cluster as the above-selected user(s).
4.4 For Hive, execute
hive
4.5 For Hive2, execute
beeline
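Beeline typically needs a JDBC connection string; a minimal sketch, assuming HiveServer2 listens on mymachine.com at the default port 10000 and <username> stands in for the user identified earlier (your host, port, and authentication settings may differ):
beeline -u jdbc:hive2://mymachine.com:10000 -n <username>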
4.6 For Impala, execute
impala-shell
Once you are connected to the command prompt, refresh your tables; refer to the Impala documentation for the proper refresh command (for example, INVALIDATE METADATA or REFRESH, depending on your version).
4.7 Execute the following command in the shell you connected with:
show tables;
Verify that the correct tables are shown.
4.8 To validate that you have the right access rights, execute a simple SELECT statement against a table of your choosing. To minimize execution time (and keep your frustration level low), pick a small table.
Example:
SELECT COUNT(*) FROM <mytab>;
Step 5) Troubleshooting HBase
5.1 Identify the correct user for HBase.
5.2 SSH into the cluster as the identified user.
5.3 Connect to HBase. For most distributions, use the following command:
hbase shell
5.4 Issue the following commands to verify that you have the authority to execute them. (Note that create requires at least one column family, and a table must be disabled before it can be dropped.)
list
create '<tablename>', '<columnfamily>'
describe '<tablename>'
disable '<tablename>'
drop '<tablename>'
5.5 If any of the above commands fail, contact your Hadoop administrator or your Hadoop vendor.
Step 6) Troubleshooting Pig
6.1 Identify the user(s) under which you will access HDFS (see the information you collected in Step 3).
6.2 SSH into the cluster as the identified user(s).
6.3 Validate that your Pig client is working correctly; a quick check from the grunt prompt is sketched after the examples below.
6.4 The command to start Pig is:
Local Mode
$ pig -x local
... - Connecting to ...
grunt>
Mapreduce Mode
$ pig -x mapreduce
... - Connecting to ...
grunt>
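Once you reach the grunt> prompt, a minimal sanity check is to list HDFS through Pig's built-in fs command:
grunt> fs -ls /
grunt> quit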
Step 7) Troubleshooting MapR
MapR uses its own file system (MapR-FS) rather than HDFS, but it is still Apache Hadoop compatible.
7.1 Validate that the MapR client is installed and configured. You can test your MapR file system with:
hadoop fs -ls maprfs:///
7.2 Issue the following commands from the MapR client to verify that you have the authority to execute them:
hadoop fs -ls
hadoop fs -cp <somefile> <somefile>.temp
hadoop fs -rm <somefile>.temp
Step 8) Sanity checks
8.1 Run the famous word-count (or similar) sample included in your distribution to validate that the cluster is working correctly; a sketch follows. If it fails, contact your Hadoop administrator or your Hadoop vendor.
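A minimal run, assuming the examples jar shipped with your distribution; the jar name and path (/usr/lib/hadoop/hadoop-examples.jar here) vary by distribution and version, and the output directory must not already exist:
hadoop fs -mkdir /tmp/wc-in
hadoop fs -put /etc/hosts /tmp/wc-in/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wc-in /tmp/wc-out
hadoop fs -cat /tmp/wc-out/part-*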
8.2 For most distributions, the Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, capacity-scheduler.xml) are required, depending on whether you are running Hadoop 1, Hadoop 2, or both. Ensure that these files are copied to the Hadoop configuration folder of your application, and that they match the current files on the Hadoop cluster. These XML files tell the application which Hadoop cluster to connect to and carry the default configurations of the various Hadoop services; without them, the application will not know about your Hadoop cluster.
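A minimal sketch of syncing the client configuration, assuming the cluster configuration lives in /etc/hadoop/conf on a node called edge-node and your application reads its Hadoop configuration from ./conf (both paths are assumptions; check your distribution's layout):
scp edge-node:/etc/hadoop/conf/*-site.xml ./conf/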
Step 9) Hadoop cluster in High Availability (HA) mode
The Hadoop stack contains multiple services (HDFS, MapReduce, HBase, etc.), and each of these services has its own co-dependencies. A client application that interacts with Hadoop can depend on one or more of these services. A highly available Hadoop platform must ensure that the NameNode master service, as well as client applications, are resilient to critical service failures.
HA architecture has the following key properties:
- It provides high availability for the NameNode master daemon service.
- When the NameNode master daemon fails over, the HA solution initiates the following actions:
- Dependent services (like the JobTracker) automatically detect the failure or failover of the co-dependent component (the NameNode), and these dependent services pause, retry, and recover once the failed service returns. (For example, the JobTracker does not launch new jobs or kill jobs that have been waiting for the NameNode.)
- Applications running inside and outside the Hadoop cluster also automatically pause and retry their connection to the failed services.
9.1 High Availability (HA) in MapReduce 1
Collect the following configuration files:
- core-site.xml
- hdfs-site.xml
Make sure they are configured correctly, as per your Hadoop administrator. If the dfs.nameservices parameter is configured in hdfs-site.xml, the cluster is running HDFS NameNode HA, and clients should connect through the logical nameservice name rather than an individual NameNode hostname.
9.2 High Availability (HA) in MapReduce 2 (YARN)
To enable HA, set yarn.resourcemanager.ha.enabled to true in yarn-site.xml. YARN uses identifiers (rm-ids) as logical names for each ResourceManager. Set yarn.resourcemanager.ha.rm-ids to the list of identifiers you wish to use, for example rm1,rm2 (the value is a comma-separated list). For each identifier (corresponding to each ResourceManager), define yarn.resourcemanager.hostname.<rm-id> with the hostname of that ResourceManager. NodeManagers and clients use the configured addresses to find the ResourceManager service they need to talk to: they go through the list of rm-ids and successively try the address corresponding to each one.
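A minimal check of the HA settings and of which ResourceManager is currently active, assuming /etc/hadoop/conf/yarn-site.xml is your configuration path and rm1/rm2 are your rm-ids:
grep -A1 'yarn.resourcemanager.ha' /etc/hadoop/conf/yarn-site.xml
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2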
For a YARN application to complete successfully, both the ApplicationMaster and the ResourceManager must be live. Containers with completed tasks need to report to the ApplicationMaster, which in turn needs to report to the ResourceManager to commit the job completion.
The web UI of the standby ResourceManager will automatically redirect to that of the active ResourceManager, which is convenient. If you want to ascertain the HA status of a particular ResourceManager in an HA cluster, you can go to the /cluster/cluster page of the ResourceManager web address (this page does not redirect), or you can use the REST API, by going to /ws/v1/cluster/info of the web address.
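For example, with curl (replace mymachine.com:8088 with your ResourceManager web address; on HA-enabled versions the JSON response includes an haState field such as ACTIVE or STANDBY):
curl -s http://mymachine.com:8088/ws/v1/cluster/info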
9.3 Using the cluster name
Are you using the cluster name, or an IP address and/or hostname, to connect to your cluster? The preferred method is the cluster name. If you are using the cluster name and cannot connect, something is wrong with the cluster itself. If you are using an IP address or hostname and cannot connect, you might be trying to connect to the passive (standby) node.
Step 10) In some cases, you may have to turn off permissions to debug job runs. Permissions can be turned off via:
- cluster management user interface
- the dfs.permissions parameter (by setting it to false) in hdfs-site.xml; see the check below
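A minimal sketch for verifying the current setting from the command line. Note the property is dfs.permissions in Hadoop 1 and dfs.permissions.enabled in Hadoop 2, the getconf subcommand is available on Hadoop 2 clients, and the /etc/hadoop/conf path is an assumption:
hdfs getconf -confKey dfs.permissions.enabled
grep -A1 'dfs.permissions' /etc/hadoop/conf/hdfs-site.xml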
Step 11) If jobs run with permissions disabled, turn the permission flag back on and fix the underlying permission issues in Hadoop.
Step 12) If a Hadoop job fails after permission issues are fixed, use HDFS and module-specific log files to analyze and fix class-library problems. The correct class libraries need to be available to the Java runtime for jobs and commands to complete successfully on your Hadoop infrastructure; a sketch follows.
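To see what is on the client classpath, and to add a missing jar for a client-side run, a minimal sketch (the jar path, job jar, and main class are hypothetical placeholders):
hadoop classpath                                        # print the effective client classpath
export HADOOP_CLASSPATH=/path/to/missing-library.jar    # hypothetical path to the missing jar
hadoop jar myjob.jar com.example.MyJob                  # hypothetical job jar and main class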
Conclusion
In this blog, we reviewed the steps to troubleshoot the Hadoop framework operating at your organization. We covered popular Hadoop distributions, and since each module in the Hadoop framework behaves differently, important module-specific diagnostics were covered as well. I did not go over the log files of each module; log-file analysis helps pinpoint the real issues, but it would make this blog longer and more complex, so it can be addressed in another post. I wish your Hadoop deployments health and smooth operation.
References
1) Hadoop Troubleshooting
2) Avoiding MapReduce 1 and 2 Time-Consuming Gotchas
3) High Availability on Hadoop