Tuesday, 16 February 2016

Troubleshooting Hanging Situations in HANA

Purpose
The purpose of this document is to instruct SAP customers on how to analyse hanging situations in their HANA system.

Overview
Analyzing hanging situations.

So what constituencies a hanging situation in HANA? When we talk about a hanging situation we generally refer to a whole system wide hang, as opposed to highlighting one specific activity, ie: queries or operations. This means we are going to look at the systems conditions (performance) which leads the DB to run slowly, or not at all in some cases.

Like all other software systems, SAP HANA relies on hardware to run its processes. Even when looking at the most basic single sever, single host, we can see many areas which over lap and interact with one another, in other words, when you see a hang do not just assume the cause is all related to HANA.

SAP HANA Troubleshooting Hanging Situations

So a small example of a hang / freeze situation I have witnessed is when a user goes to open the "Administrative Tab" in HANA Studio and the system hangs for a long period of time. Below is a small example of troubleshooting this issue.

Troubleshooting

The first thing you have to think of when you face this situation is to execute Runtime Dumps immediately.

Runtime Dumps can useful for the following situations:
  • Standstill situations
  • High CPU
  • Hanging Threads
  • Query cancellation not working
By checking an RTE dump you can look for certain key indicators such as Large Query plans. Large Query plans can indicate problematic SQL Queries. The thread of this SQL query can then be checked via the parent and also its child threads. These threads then link you back to short stacks which then can checked to see what exactly this stack is doing.  See Troubleshooting High CPU for further info.

As the HANA Studio will more than likely be hanging during the hang / slow performance, you can use SSH to run the Runtime Dumps with 2 minute intervals through the means of this hdbsql script:

DATE=`date +%Y%m%d.%H%M%S` ; for PID in `ps x|grep -E "hdb(index)server"|grep -v grep|awk '{print $1}'` ; do CMDLINE=`ps x|grep -E "^ *${PID}"|grep -v grep |awk '{for(i=5;i<=NF;i++) printf $(i)}'` ; echo $PID - $CMDLINE ; hdbcons -p ${PID} "runtimedump dump -c" >${CMDLINE}-${PID}-${DATE}.dump ; done ; sleep 120 ; DATE=`date +%Y%m%d.%H%M%S` ; for PID in `ps x|grep -E "hdb((index+)server"|grep -v grep|awk '{print $1}'` ; do CMDLINE=`ps x|grep -E "^ *${PID}"|grep -v grep |awk '{for(i=5;i<=NF;i++) printf $(i)}'` ; echo $PID - $CMDLINE ; hdbcons -p ${PID} "runtimedump dump -c" >${CMDLINE}-${PID}-${DATE}.dump ; done

After running the script you can then open the generated RTE dumps. The dumps will show you what exact queries were running at the time of the hang / freeze.

These queries can then either be searched on the SAP search or you can check to see if these are your own customer queries which need to be looked at it terms of optimization. (Also if you have to open an incident with SAP this information will be what the engineer will be looking for)
In relation to the HANA Studio Hang, the solution for this can be found by searching the SQL Query generated which will return the Note High CPU when opening admin console

The vast majority of hanging situations are related to bottleneck issues with with CPU, Storage, Network etc.

Usually the DBA will know the time and date of the hang that is causing the issues, but if this is not known you can always use the Performance load graph. As of SP9 you can now use the HANA Cockpit load graph. (I know this was a function already available in previous revisions but this did not work very well and crashed a lot). This preforms better and does not crash like its predecessor in Studio if the nameserverhistory file was large.

Going to the SAP HANA Cockpit, you can then see the SAP HANA Database Administration section with nice looking Fiori designed tiles:

SAP HANA Troubleshooting Hanging Situations

Here you can check at what time and date did the system experience the issues.

Please also be aware of the HANA Offline Cockpit functionality that became available recently. By logging in with the SIDADM user you can use the "Troubleshoot Unresponsive System" also:

SAP HANA Tutorial

If the load graph cannot be access by either Studio or Cockpit you can also use the TOP command at OS level which will show you the running processes:

SAP HANA Certifications

If the load graph cannot be access by either Studio or Cockpit you can also use the TOP command at OS level which will show you the running processes:

SAP HANA Certifications

So now you have the Time Stamp of the issue. Now you go to HANA Diagnosis Tab in Studio , or its corresponding tile in Cockpit.

Here is where you locate the time stamp in the relevant files so you can see what was happening before, during and after the hang.

The first files to look into are the indexserver + nameserver. Check the corresponding time stamps (Before and during the time) in these files to see if any obvious errors are apparent. Some examples of errors you may see before the system hang can be:

SQL error 131: transaction rolled back by lock wait timeout
SQL error 133: transaction rolled back by detected deadlock

If you see these please see Note on Lock Analysis. Many Useful SQL scripts exist also at Note 1969700

mvcc_anti_ager.cc(01291) : There are too many un-collected versions.
        ('number of versions > 1000000' or 'maximum number of versions per record > 100000')

The cursor possibly block the garbage collection of HANA database.
         mvcc_anti_ager.cc(01291) : There are too many un-collected versions on table "<schema>"."<table>"
         ('number of versions for one of the partitions > 1000000' or 'maximum number of versions per record > 100000')

The transaction blocks the garbage collection of HANA database.
        mvcc_anti_ager.cc(01199) : long running uncommitted write transaction detected.
        mvcc_anti_ager.cc(01082) : The Connection is disconnected forcefully because it is blocking garbage collection for too long period.
        Statement.cc(03190) : session control command is performed by ..., user=SYSTEM, query=ALTER SYSTEM DISCONNECT SESSION '<conn_id>'
        mvcc_anti_ager.cc(00834) : long running cursor detected.

The open cursor possibly blocks the garbage collection of HANA database.
         Please close a cursor in application or kill the connection by "ALTER SYSTEM DISCONNECT SESSION '<conn_id>' "

The above refers to a long running transaction that has yet to be committed and could be causing your system to hang. If you were to see any of these errors please see FAQ on Garbage Collection.


If you have checked the HANA logs and you can nothing obvious or problematic, then here is when you check the var/log/messages.

If you see some irregularities in these files then open a ticket with your hardware provider.

The main point to take from this document is to ALWAYS try and capture the hang with runtime dumps. This will then give you the DBA or SAP a very good chance of identifying the root cause of the hang.

Source: scn.sap.com