April 05, 2012

CRS Daemon Can't Start: Oracle 11g on AIX

Once upon a time I installed RAC for one of my clients. After the installation I rebooted the server, and then found that the CRS daemon would not come up. I was wondering what the problem was, since before the reboot everything was fine..
The error message was:
2010-11-29 11:13:49.817: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203

Below is the explanation of the error message, taken from the Oracle support note.

In this Document
  Symptoms
  Changes
  Cause
    1) AIX-specific cause
    2) UNIX-generic cause
  Solution
    1) Solution for AIX-specific cause
    2) Solution for UNIX-generic cause

Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.2 and later [Release: 11.2 and later]
Information in this document applies to any platform.
Symptoms
11.2.0.2 grid infrastructure upgrade or install on a >1 node cluster.
rootcrs.pl fails, and the following is found in crsd.log:
...
2010-11-29 10:52:38.603: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 2614824036 ms, node 111ea99b0 { host 'racdb1', haName '1e0b-174e-37bc-a515', srcLuid 2612fa8e-3db4fcb7, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [55 : 55], createTime 2614768983, flags 0x4 }
2010-11-29 10:52:42.299: [ CRSMAIN][515] Policy Engine is not initialized yet!
2010-11-29 10:52:43.554: [ OCRMAS][3342]proath_connect_master:1: could not yet connect to master retval1 = 203, retval2 = 203
2010-11-29 10:52:43.554: [ OCRMAS][3342]th_master:110': Could not yet connect to new master [1]
2010-11-29 10:52:43.605: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 2614829038 ms, node 111ea99b0 { host 'racdb1', haName '1e0b-174e-37bc-a515', srcLuid 2612fa8e-3db4fcb7, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [60 : 60], createTime 2614768983, flags 0x4 }
2010-11-29 10:52:43.754: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2010-11-29 10:52:43.955: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
...
2010-11-29 11:13:49.817: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
2010-11-29 11:13:50.018: [ OCRMAS][3342]proath_master:100b: Polling, connect to master not complete retval1 = 203, retval2 = 203
...

evmd.log shows:
2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos op : sgipcnUdpSend
2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos dep : Message too long (59)
2010-11-29 10:52:38.694: [ GIPCNET][2314] gipcmodNetworkProcessSend: slos loc : sendto

The "Message too long" (EMSGSIZE) failure on the sendto call means the UDP datagram that CRS tried to send was larger than the system allows, which ties directly to the udp_sendspace cause described below.

Changes
Upgrade or install of 11.2.0.2 grid infrastructure on a >1 node cluster.
Cause
Two causes have been found for this symptom: one is AIX-specific and the other is UNIX-generic.
1) AIX-specific cause
udp_sendspace is set to its default of 9216 bytes, which is smaller than the 10240 bytes that CRS uses. The current setting can be checked with:
# no -o udp_sendspace
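On a system still at the default, this prints the following (format and value as printed by no on AIX; 9216 is the default cited above):
udp_sendspace = 9216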
2) UNIX-generic cause
A netmask mismatch between the nodes. The private interface must have the same netmask on all nodes; a mismatch between nodes can cause this symptom.
Solution
The two causes have two separate solutions.
1) Solution for AIX-specific cause
Increase udp_sendspace to >= 10240.
# no -o udp_sendspace=65536
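
Be aware that no -o on its own changes only the running value, so the setting is lost at the next reboot (which, as it turned out, is exactly what happened to me). A minimal sketch of the persistent form, assuming the -p flag behaves as on recent AIX levels (it applies the value now and also records it in /etc/tunables/nextboot):
# no -p -o udp_sendspace=65536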

Note that the 11gR2 documentation instructs setting udp_sendspace to 65536:

Network tuning parameter    Recommended value
ipqmaxlen                   512
rfc1323                     1
sb_max                      4194304
tcp_recvspace               65536
tcp_sendspace               65536
udp_recvspace               655360
udp_sendspace               65536

See the Oracle Grid Infrastructure Installation Guide, 11g Release 2 (11.2) for IBM AIX on POWER Systems (64-Bit), section 2.11.7 "Configuring Network Tuning Parameters", for more details:
http://download.oracle.com/docs/cd/E11882_01/install.112/e17210/preaix.htm#CWAIX219
If the problem happens during rootupgrade.sh (usually on the 2nd node), do the following (a quick health-check example follows the steps):
1). Increase udp_sendspace to 65536:
# no -o udp_sendspace=65536

2). Stop CRS on both nodes:
# crsctl stop crs -f
# ps -ef | grep d.bin    (to ensure there are no leftover CRS processes)

3). Restart CRS on node 1:
# crsctl start crs
Wait until CRS starts on node 1.

4). On node 2, rerun rootupgrade.sh
# rootupgrade.sh

It should complete on node 2 this time.
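
Between steps 3 and 4, a quick way to confirm CRS is fully up on node 1 is crsctl check crs. On 11.2 the healthy output looks roughly like this (the CRS-* messages are illustrative, quoted from memory of 11.2 output):
# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online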
Please note: on any platform, if the udp_sendspace (or equivalent) setting is < 10240, this problem will occur.

2) Solution for Unix-generic cause
Check that the netmask matches on the private interface on all nodes:
[grid@mynode1 ~]$ ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:19:B9:1E:6D:97
inet addr:192.168.1.110 Bcast:192.168.1.255 Mask:255.255.255.0
...
[grid@mynode2 ~]$ ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:19:B9:1E:6D:97
inet addr:192.168.1.111 Bcast:192.168.1.255 Mask:255.255.255.0
...
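
To compare all nodes at a glance, a small loop like this works (assuming passwordless ssh between the nodes and that eth1 is the private interface, as in the listing above):
$ for n in mynode1 mynode2; do echo $n; ssh $n "ifconfig eth1 | grep -i mask"; done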


In case of a mismatch, the sysadmin must correct the netmask on the private interface(s) where it is wrong.
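
For example, on Linux with the addresses shown above, resetting the mask on the node where it is wrong might look like this (interface name and values are taken from the listing; the change must also be written to the interface's configuration file, or it will be lost at the next reboot):
# ifconfig eth1 192.168.1.111 netmask 255.255.255.0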

After a long journey of finding the solution, I realized that all the parameters had reverted after rebooting the server, so I asked the sysadmin to make the changes persistent for all the parameters the cluster uses. :) Below are the parameters in detail, with the persistent commands sketched after the table.

Network tuning parameter    Recommended value
ipqmaxlen                   512
rfc1323                     1
sb_max                      4194304
tcp_recvspace               65536
tcp_sendspace               65536
udp_recvspace               655360
udp_sendspace               65536
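
For the record, here is a sketch of how those values can be set persistently with no (flags as on recent AIX levels, so please verify against your release: -p applies a value immediately and records it in /etc/tunables/nextboot, while ipqmaxlen is a reboot-type tunable that takes -r and only becomes effective after the next restart):
# no -r -o ipqmaxlen=512
# no -p -o rfc1323=1
# no -p -o sb_max=4194304
# no -p -o tcp_recvspace=65536
# no -p -o tcp_sendspace=65536
# no -p -o udp_recvspace=655360
# no -p -o udp_sendspace=65536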
