Saturday, August 29, 2009

High Availability Cluster using Heartbeat

Recently, while using the heartbeat 2.1.4 package, I ran into a problem: my slave node wasn't taking over after a failover. When I checked the logs, I saw messages about a failure to start Apache:


ResourceManager[1769]: 2009/08/27_06:57:37 info: Running /etc/ha.d/resource.d/apache start
apache[2212]: 2009/08/27_06:57:38 INFO: httpd2: Could not reliably determine the server's fully qualified domain name, using 192.168.69.209 for ServerName
apache[2212]: 2009/08/27_06:57:38 INFO: apache not running
apache[2212]: 2009/08/27_06:57:38 INFO: waiting for apache /etc/apache2/httpd.conf to come up
apache[2212]: 2009/08/27_06:57:39 ERROR: command failed: sh -c wget -O- -q -L --bind-address=127.0.0.1 http://localhost:80/server-status | tr '\012' ' ' | grep -Ei "[[:space:]]*" >/dev/null
apache[2201]: 2009/08/27_06:57:39 ERROR: Generic error
ResourceManager[1769]: 2009/08/27_06:57:39 ERROR: Return code 1 from /etc/ha.d/resource.d/apache
ResourceManager[1769]: 2009/08/27_06:57:39 CRIT: Giving up resources due to failure of apache
ResourceManager[1769]: 2009/08/27_06:57:39 info: Releasing resource group: node09 IPaddr::9.12.34.100 ldirectord apache
ResourceManager[1769]: 2009/08/27_06:57:39 info: Running /etc/ha.d/resource.d/apache stop
apache[2456]: 2009/08/27_06:57:41 INFO: Killing apache PID 2363
apache[2456]: 2009/08/27_06:57:41 INFO: apache stopped.
apache[2445]: 2009/08/27_06:57:41 INFO: Success
ResourceManager[1769]: 2009/08/27_06:57:41 info: Running /etc/ha.d/resource.d/ldirectord stop
ResourceManager[1769]: 2009/08/27_06:57:41 info: Running /etc/ha.d/resource.d/IPaddr 9.12.34.100 stop
IPaddr[2679]: 2009/08/27_06:57:41 INFO: ifconfig eth0:0 down
IPaddr[2662]: 2009/08/27_06:57:41 INFO: Success
heartbeat[1755]: 2009/08/27_06:57:41 info: local HA resource acquisition completed (standby).
heartbeat[1666]: 2009/08/27_06:57:41 info: Standby resource acquisition done [foreign].
heartbeat[1666]: 2009/08/27_06:57:41 info: Initial resource acquisition complete (auto_failback)
heartbeat[1666]: 2009/08/27_06:57:42 info: remote resource transition completed.
hb_standby[2751]: 2009/08/27_06:58:12 Going standby [foreign].
heartbeat[1666]: 2009/08/27_06:58:12 info: node09 wants to go standby [foreign]

After inspecting Apache, everything seemed to be fine. Even the HA apache scripts could start and stop Apache successfully, except for this unexplained wget error:

apache[2212]: 2009/08/27_06:57:39 ERROR: command failed: sh -c wget -O- -q -L --bind-address=127.0.0.1 http://localhost:80/server-status | tr '\012' ' ' | grep -Ei "[[:space:]]*" >/dev/null

After some initial debugging, by adding the debug option "set -x" to the apache start script (/etc/ha.d/resource.d/apache), I found the script where the problem was occurring:

/usr/lib/ocf/resource.d//heartbeat/apache

It seems that even if Apache starts successfully, the script returns an error code because this command fails. In general, Apache doesn't seem to have the server-status facility enabled by default (and I don't know how or why I should enable it). So as a quick fix, it's easiest to comment out the failing command:

##ocf_run sh -c "$WGET $WGETOPTS $STATUSURL | tr '\012' ' ' | grep -Ei \"$TESTREGEX\" >/dev/null"
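To see why this check can fail even though Apache is up, here is a rough sketch of the tail of that pipeline run on its own. It assumes the regex is exactly what appears in the log above; status_check is a made-up helper name, and the sample page text is invented just to have some input. The exit status of a pipeline is the exit status of its last command, so what counts is whether grep gets any line to match.

```shell
# Hypothetical helper: the same tr/grep tail as the failing command,
# but reading from stdin instead of from wget.
status_check() {
    tr '\012' ' ' | grep -Ei "[[:space:]]*" >/dev/null
}

# Case 1: wget printed nothing (for example, the status URL is not served).
# grep receives no input at all, so it cannot match and exits non-zero.
printf '' | status_check
empty_rc=$?

# Case 2: wget printed a page (invented sample text).
# grep receives one line and the pattern matches it.
printf 'Total Accesses: 1\nUptime: 42\n' | status_check
page_rc=$?

echo "empty output: $empty_rc, some output: $page_rc"
```

With GNU grep the first case returns 1 and the second returns 0. So if wget gets nothing back from /server-status, the check reports a failure no matter whether Apache itself is running.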

I still don't understand why HA should give up the network resources just because a service failed to start. It might be a bug, though; I never faced such a problem in earlier HA versions. Back then the apache scripts used to be very simple.

Links:
http://www.linux-ha.org/
