Wednesday, 23 March 2016

How to screw up your HANA database in 5 seconds

I like the SAP HANA database. I really do. Writing demanding SQL statements has never been as much fun as it is now that I throw them at SAP HANA. And the database simply answers, really quickly. While the database itself works fine, from time to time I stumble upon strange issues around HANA administration that remind me that SAP HANA is still quite a new database. In certain cases the database is in real danger, so I want to share a perfidious trap with you.

Do you remember that, starting with SAP HANA revision 93, a revision update automatically switched the database from the standalone statisticsserver to the embedded statisticsserver? In theory you could keep the standalone statisticsserver, but I believe no one actually did. So did you ever wonder why the systemOverview.py script shows this irritating warning?

[Screenshot: systemOverview.py output showing the warning]

I double-checked this on revision 111. The warning is still there. Now you could say that this is a harmless warning and should be ignored. Since SPS09, running a standalone statisticsserver goes against SAP's clear recommendation. However, what if a less experienced HANA administrator sees this message, takes it seriously and tries to start the standalone statisticsserver anyway?

TL;DR: DO NOT DO THIS!

First of all, SAP has not yet removed the hdbstatisticsserver binary from the IMDB_SERVER.SAR packages. It is still available, even in revision 112.

[Screenshot: the hdbstatisticsserver binary, still present in revision 112]

However, it should not be possible to run it if you use the embedded statisticsserver, right? Starting the standalone statisticsserver in this scenario should just result in an error message and do no harm? Well, not quite. So far the topology for my HANA instance looks like this:

[Screenshot: topology of the HANA instance before the change]

And now I screw up my HANA database via one simple command:

[Screenshot: the command that starts the standalone statisticsserver]

Oh no! What have I done? The trace file of this new process shows that it detects the embedded statistics server and disables itself, but only after the topology has already been botched:

[31147]{-1}[-1/-1] 2016-03-22 10:16:36.813528 i StatsServ   StatisticsServerStarter.cpp(00081) : new StatisticsServer active. Disabling myself...
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.834024 i StatsServ   StatisticsServerStarter.cpp(00096) : new StatisticsServer active. Disabling myself DONE.
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.836820 i assign       TREXIndexServer.cpp(01793) : assign to volume 5 finished
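If you want to look for this event in trace files without reading them by hand, a small parser can help. The following is only a sketch: the regular expression is derived from the three trace lines above and is an assumption about the general trace format, and the function name find_self_disable_events is made up for this example.

```python
import re

# Pattern derived from the trace lines quoted above (an assumption, not an
# official format description): three bracketed fields, a timestamp, a
# one-letter level, a component name, a source location, then the message.
TRACE_LINE = re.compile(
    r"^\[\d+\]\{-?\d+\}\[[^\]]*\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\w\s+(?P<component>\S+)\s+"
    r"\S+\(\d+\)\s+:\s+(?P<message>.*)$"
)


def find_self_disable_events(lines):
    """Return (timestamp, message) for every StatsServ trace line
    that contains the text 'Disabling myself'."""
    events = []
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match is None:
            continue  # not a trace line in the expected format
        if match["component"] == "StatsServ" and "Disabling myself" in match["message"]:
            events.append((match["timestamp"], match["message"]))
    return events


if __name__ == "__main__":
    sample = [
        "[31147]{-1}[-1/-1] 2016-03-22 10:16:36.813528 i StatsServ   "
        "StatisticsServerStarter.cpp(00081) : new StatisticsServer active. Disabling myself...",
        "[31147]{-1}[-1/-1] 2016-03-22 10:16:36.836820 i assign       "
        "TREXIndexServer.cpp(01793) : assign to volume 5 finished",
    ]
    for timestamp, message in find_self_disable_events(sample):
        print(timestamp, message)
```

A hit only tells you that a standalone statisticsserver was started at some point. It says nothing about whether the topology has been cleaned up since.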

So I stop the ominous process asap:

[Screenshot: stopping the process]

However, in M_SERVICES I still see the "new" service! This is not nice. How do I clean up this mess?

[Screenshots: the new service still listed in M_SERVICES]

This is not just a cosmetic issue. Important systems are protected by HANA system replication. Now this new (but inactive) service breaks the system replication! This is really bad:

[Screenshot: broken system replication status]

How can we fix the system replication? Let's try the obvious way on the secondary site:
HDB stop
hdbnsutil -sr_unregister
hdbnsutil -sr_register --name=site2 --mode=sync --remoteHost=eahhan01 --remoteInstance=10
HDB start

The procedure seems to work. Unfortunately, this does not really reinitialize the replication: when I try a takeover, I get this error:

[Screenshot: takeover error message]

I cannot even perform a backup on the primary site, because that stupid statisticsserver is not active. Dang!

If you have been curious and screwed up your crash&burn instance, then you can try to fix the situation with commands like the following. Proceed at your own risk:
ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host','eahhan01') UNSET ('statisticsserver','instances') WITH RECONFIGURE
ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/host/eahhan01','statisticsserver') WITH RECONFIGURE
ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/volumes','5') WITH RECONFIGURE
For more details, have a look at SAP notes 1697613, 2222249, 1950221.
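The statements above hard-code my host name (eahhan01) and the volume number 5, which presumably corresponds to the "assign to volume 5" line in the trace. As a hedged sketch, the following helper fills in your own values. The function name build_cleanup_statements is made up for this example, and it only builds the statement text; it executes nothing. Check the host and volume against your own topology before running anything.

```python
def build_cleanup_statements(host, volume_id):
    """Build the three cleanup statements from this post for a given
    host name and volume number. Returns plain strings, executes nothing."""
    if "'" in host:
        # Keep the generated SQL literals well-formed.
        raise ValueError("host must not contain single quotes")
    volume = int(volume_id)  # raises ValueError for non-numeric input
    return [
        f"ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host','{host}') "
        "UNSET ('statisticsserver','instances') WITH RECONFIGURE",
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ('/host/{host}','statisticsserver') WITH RECONFIGURE",
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ('/volumes','{volume}') WITH RECONFIGURE",
    ]


if __name__ == "__main__":
    # Values taken from this post; replace them with your own.
    for statement in build_cleanup_statements("eahhan01", 5):
        print(statement)
```

Printing the statements first and pasting them into a SQL console by hand keeps a human in the loop, which seems appropriate for changes to topology.ini.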

Now the Python script shows that the system replication looks fine again:

[Screenshot: check script output showing healthy system replication]

IMPORTANT: Never rely solely on the output of this check script or on what HANA studio shows for system replication. I recommend testing a takeover after every change to the topology. It can happen that all lights are green and the takeover nevertheless fails after a topology change.

Source: scn.sap.com