Patch Name: PHSS_27158

Patch Description: s700_800 11.X MC/ServiceGuard and SG-OPS Edition A.11.09

Creation Date: 03/01/28

Post Date: 03/02/04

Hardware Platforms - OS Releases:
    s700: 11.00 11.11
    s800: 11.00 11.11

Products:
    MC/ServiceGuard A.11.09
    ServiceGuard OPS Edition A.11.09

Filesets:
    DLM.CM-DLM,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    DLM.CM-DLM-CMDS,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    DLM-Pkg-Mgr.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    Package-Manager.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    DLM-Clust-Mon.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    Cluster-Monitor.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    DLM-NMAPI.CM-NMAPI,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    CM-Provider-MOF.CM-MOF,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    CM-Provider-MOF.CM-PROVIDER,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP
    ATS-CORE.ATS-RUN,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP

Automatic Reboot?: No

Status: General Release

Critical: Yes
    PHSS_27158: PANIC ABORT
    When using the unsupported contributed cmsetsafety tool to disable and then re-enable safety time, the tool fails to restore safety time protection properly, resulting in node TOC.

    At cmcld start up, i.e. cmrunnode or cmruncl, syslog shows this message, "cmcld: Assertion failed: pnet != NULL, file: comm_link.c, line: 140." cmcld immediately aborts and dumps core.

    In rare conditions, a system clock does not get updated for an extended period of time and the whole ServiceGuard cluster fails. The node which has the system clock problem does a TOC first while the other nodes do a TOC shortly after that. The syslog on the node with the system clock problem stops logging anything. One of the other nodes in the cluster will log the message below in the syslog at an interval equal to the node timeout. For example, if the node timeout is 3 seconds, then every 3 seconds the following message will be seen in syslog (in addition to other messages):
        10:10:03 Timed out node NODEA. It may have failed.
        10:10:03 Attempting to adjust cluster membership
        ......
        ......
        10:10:06 Timed out node NODEA.
        10:10:06 Attempting to adjust cluster membership

    PHSS_26750: ABORT
    A series of single point network card or hub failures may cause a cl_sync timeout resulting in the entire cluster going down.

    The ServiceGuard daemon cmcld may abort if an EMS monitor on the system is not yet ready to monitor a resource on which a package is dependent.

    In a 2-node ServiceGuard cluster, if cmcld on one node experiences a long kernel hang and then tries to rejoin the cluster, the whole cluster can crash. This can also be seen with more than 2 nodes if cmcld on all nodes except one experiences long kernel hangs.

    cmcld aborts with the following message:
        "Aborting: cdb/cdb_xaction.c 803 (Failed to unlock cluster config lock)"

    PHSS_26338: ABORT PANIC
    When halting a package with a DEFERRED resource which has multiple RESOURCE_UP_VALUEs, the package halt will fail with a message in /var/adm/syslog/syslog.log:
        Unable to unregister resource

    Second critical issue fixed: cmcld abort and TOC when cmcld receives a DL_UDERROR_IND error.

    PHSS_25935: ABORT PANIC
    cmcld abort and TOC when halting packages using multiple resources with Resource_Start = "DEFERRED".

    If cmrunnode or cmapplyconf is aborted in the middle of execution and there are multiple such commands running at the same time in the cluster, then the node may fail with cmcld aborting.

    If a configuration operation gets aborted during a cluster reformation with a down node joining the cluster, the cluster may abort.

    PHSS_25499: ABORT
    An SG cluster of more than 3 nodes with dual heartbeat can fail with a core if one heartbeat network was disconnected for a while, then recovered, and later the second heartbeat network failed.

    PHSS_24850: OTHER
    This patch is critical since it contains a fix for the failure to release volume group problem, which, without this patch, would need a reboot to fix.
    PHSS_24536: ABORT PANIC
    SG nodes can TOC when the cluster next reforms after an online node addition when there have been more than 8 nodes involved in the configuration operation.

    PHSS_24033: ABORT HANG
    The stquerycl, cmruncl, or cmrunnode command may abort with an assertion if it receives an incomplete or corrupted message.

    On SG-OPS clusters running cmgmsd, during high transaction times, cmcld can take up a significant amount of CPU time, causing the machine to hang.

    PHSS_23511: ABORT OTHER
    If a multiple-node cluster has only 1 node running and has been running as a one-node cluster for a long period of time, then when another node attempts to join the cluster, the existing node may fail with an assertion.

    Attempting to do cmhaltnode on an SG OPS node during a CDB transaction may cause the node to TOC.

    PHSS_22876: ABORT OTHER
    The ContinentalCluster command, cmrecovercl, can fail and produce a core when the remote cluster restarts or becomes visible again while cmrecovercl is running.

    MC/ServiceGuard nodes can TOC when the serial heartbeat data gets corrupted. This can occur during heavy system loads.

    PHSS_22540: OTHER
    The cluster node may TOC when a corrupted DLPI packet is received.

    PHSS_21996: OTHER
    During heavy network congestion, one of the cluster nodes can get stuck in a "reforming" state.

Category Tags: defect_repair general_release critical panic halts_system

Path Name: /hp-ux_patches/s700_800/11.X/PHSS_27158

Symptoms:
    PHSS_27158:
    1. When using the unsupported contributed cmsetsafety tool to disable and then re-enable safety time, the tool fails to restore safety time protection properly, resulting in node TOC.

    2. When cmcld is running with more than ten network interface cards configured on a cluster node, its CPU utilization percentage rises significantly. This problem is mostly exposed on Superdome machines, or on systems with large VLAN configurations.

    3.
       When a node with a node ID that is not the first or last node ID in the cluster is removed from a ServiceGuard OPS Cluster, the "cmviewcl -l group" command will return an error message like:
           cmviewcl : Failed to convert node_name xxx to node_id.

    4. When a node failure causes packages to go down, cmsnmpd fails to send PkgDown traps and stores the package status as "unknown" instead of "down" in the ServiceGuard MIB table.

    5. If cmrunnode or cmruncl times out, in a subsequent cluster formation a package configured with automatic start resources may fail to come up on its primary node.

    6. The hpmcSGClusterDowntrap is never generated or sent and the hpmcClusterState MIB variable is never set to down by the cmsnmpd subagent when the cluster is halted.

    7. An SG command could hang shortly after a cluster formation.

    8. At cmcld start up, i.e. cmrunnode or cmruncl, syslog shows this message, "cmcld: Assertion failed: pnet != NULL, file: comm_link.c, line: 140." cmcld immediately aborts and dumps core.

    9. Under rare circumstances, if a node cannot update its system clock for an extended period of time, the node and one more node in the cluster will fail. If the cluster has no more than 3 nodes, the whole cluster will fail. If the cluster has 4 nodes with a cluster lock, or more than 4 nodes, the remaining nodes should reform a cluster. The node which experienced the system clock problem does a TOC first while the other node does a TOC shortly after that. The syslog on the node with the system clock problem may not log any information. One of the other nodes in the cluster will log the message below in the syslog at an interval equal to the node timeout. For example, if the node timeout is 3 seconds, then every 3 seconds the following message will be seen in syslog (in addition to other messages):
           10:10:03 Timed out node NODEA. It may have failed.
           10:10:03 Attempting to adjust cluster membership
           ......
           ......
           10:10:06 Timed out node NODEA.
           10:10:06 Attempting to adjust cluster membership
           ......
           ......
           10:10:09 Timed out node NODEA.
           10:10:09 Attempting to adjust cluster membership
           ......
           ......
           10:10:12 Timed out node NODEA.
           10:10:12 Attempting to adjust cluster membership

    10. The cmsnmpd subagent doesn't update the subnet status MIB variables hpmcSGSubnetUp and hpmcSGSubnetDown when a package's subnet fails or comes up.

    11. When a node is halted, cmsnmpd shows inaccurate hpmcNodeRole MIB variables on the halted node.

    12. If a port-scanning utility such as the Linux application "nmap" is executed against a node running ServiceGuard, cmcld on the node may hang and unexpectedly fail.

    PHSS_26750:
    1. When a local LAN failover fails, no error messages about the failure are logged to syslog.

    2. When the SAM GUI switches environments, certain tasks are no longer available.

    3. A series of single point network card or hub failures may cause a cl_sync timeout resulting in the entire cluster going down. syslog errors may look like this:
           - cmcld: Node id 3 did not reach sync step 0 for activity 3 within timeout. This activity appears to be hung at step -1 on that node, so node will be killed.
           - cmcld: Attempting to kill node
           - cmcld: Reason: This node did not reach sync step 0 for activity 3 within timeout
       All 3 nodes abort.

    4. The ServiceGuard daemon cmcld may abort if an EMS monitor on the system is not yet ready to monitor a resource on which a package is dependent. The following messages may be seen in syslog:
           cmcld: Registering resource_up_value for Resource on Package .
           cmcld: Monitor is not ready for resource , package . Retrying the request.
           cmcld: Aborting: cl_rwlock.c 1030 (reader/writer lock not locked)

    5. cmapplyconf continually fails with:
           Error: Unable to begin the configuration change

    6. In a 2-node ServiceGuard cluster, if cmcld on one node experiences a long kernel hang and then tries to rejoin the cluster, the whole cluster can crash. This can also be seen with more than 2 nodes if cmcld on all nodes except one experiences a long kernel hang.
       The syslog on the node that did not experience the kernel hang will log messages like:
           cmcld: Timed out node . It may have failed.
           cmcld: Attempting to form a new cluster
           cmcld: Safety time set for 128.96 seconds from now
           cmcld: Did not receive all votes: 1 out of 2
           cmcld: All votes (100) are required at this point.
           vmunix: SCSI: Reset requested from above -- lbolt: 246237, bus: 2
           cmcld: Got at least 50 votes: 1 out of 2 last active nodes.
           cmcld: Obtaining Cluster Lock
           cmcld: Successfully issued request for cluster lock /dev/dsk/c2t8d0
           vmunix: SCSI: Resetting SCSI -- lbolt: 246337, bus: 2
           vmunix: SCSI: Reset detected -- lbolt: 246337, bus: 2
           cmcld: Cluster lock disk /dev/dsk/c2t8d0 appears healthy
           cmcld: Successfully obtained the Cluster Lock
           cmcld: lock id: 6
           cmcld: Turning off safety time protection since the cluster
           cmcld: may now consist of a single node. If ServiceGuard
           cmcld: fails, this node will not automatically halt
           cmcld: Active node has voted for me
           cmcld: Enabling safety time protection
           cmcld: Enabled safety time with 257774
           cmcld: Attempting to adjust cluster membership
           cmcld: Safety time set for 7.71 seconds from now
           cmcld: Active node has voted for me
           cmcld: Clearing Cluster Lock

    7. When a shutdown(1m) command is run from two nodes concurrently, it can cause cmhaltnode to fail. This can happen if one node has completed its cmhaltnode and the other node is still running cmhaltnode. This problem can also be seen if a cmhaltnode command is halting the cluster on one node and another node in the cluster does a TOC or a reboot before the cmhaltnode command completes. The /etc/rc.log.old file will contain, or the command will exit with, messages like:
           Warning: Do not modify or enable packages until the halt operation is completed.
           Halting Package
           cmhaltnode : Unable to halt package : Socket is not connected
           Check the syslog and pkg log files for more detailed information:
           cmhaltnode : Warning : node failed to HALT
           ERROR: Unable to halt cluster on this node.

    8. If the cmhaltnode command is issued on more than one node with a specific time lag, the command issued later can leave a package running on its node even though it halts the cluster on that node.

    9. cmsnmpd will not store the cluster name in the MIB definition when it is started while the cluster or local node is halted. A call to "resls /cluster/status" will result in output which is missing the cluster name.

    10. cmcld aborts with the following message:
            "Aborting: cdb/cdb_xaction.c 803 (Failed to unlock cluster config lock)"

    11. Large numbers of the following message are logged to the syslog.log file:
            Mar 18 10:00:48 HGALUX07 cmclconfd[15865]: Unable to attach to network interface 1.
        This happens whenever customers try to view properties of objects in SG MGR, or when cmquerycl, cmcheckconf, or cmapplyconf is issued.

    12. A series of short kernel hangs on one node leads to a cluster reformation and continues during the reformation. This opens a small timing window where the node that is healthy hits the assertion failure:
            cmcld: Assertion failed: !node->hb_eligible, file: election.c, line: 5699.

    13. A 2-node SG cluster with a node timeout of more than 8 seconds may crash during reformation if one node does not respond to synchronization at the end of cluster reformation. In clusters of more than 2 nodes, 2 nodes might fail due to this problem, while the remaining nodes can form a cluster or may crash, depending on the number of nodes remaining in the cluster and whether a cluster lock is configured. The syslog on the node which didn't respond to synchronization will contain messages like:
            cmcld: Halting to preserve data integrity
            cmcld: Reason: This node did not reach sync step 0 for activity 1 within timeout
            cmcld: Aborting!
            This node did not reach sync step 0 for activity 1 within timeout (file: utils.c, line: 155)
        During the same time another node will log messages in syslog like:
            cmcld: Attempting to kill node
            cmcld: Reason: This node did not reach sync step 0 for activity 1 within timeout
            cmcld: Attempting to adjust cluster membership
        Approximately 2 minutes later, another node may also crash with messages in syslog like:
            cmcld: Halting to preserve data integrity
            cmcld: Reason: This node did not respond to a request from Node ID#1 - group id=4, port number=5301
            cmcld: Aborting! This node did not respond to a request from Node ID#1 - group id=4, port number=5301 (file: utils.c, line: 155)

    PHSS_26338:
    1. When halting a package with a DEFERRED resource which has multiple RESOURCE_UP_VALUEs, the package halt will fail with a message in /var/adm/syslog/syslog.log:
           Unable to unregister resource

    2. cmcheckconf and cmapplyconf will incorrectly parse the cluster ascii file when a RESOURCE_UP_VALUE has multiple words but is not using quotes. Only the first word is used. For example:
           RESOURCE_UP_VALUE = X Y
       Only X is used.

    3. For strings and enums, a !=x and !=y value in RESOURCE_UP_VALUE is parsed cleanly, but ServiceGuard monitors only the first criterion.

    4. The ServiceGuard daemon cmcld aborts with the message "DLPI error! dl_errno: 1, dl_unix_errno: 0." in syslog. This leads to a system TOC.

    PHSS_25935:
    1. cmcld abort and TOC when halting packages using multiple resources with Resource_Start = "DEFERRED".

    2. cmviewconf displays an incorrect HALT_SCRIPT_TIMEOUT value for a package when the RUN_SCRIPT_TIMEOUT is set to NO_TIMEOUT (0) and the HALT_SCRIPT_TIMEOUT is set to a non-zero value.

    3. If cmrunnode or cmapplyconf is stopped in the middle of execution and there are multiple such commands running concurrently, then cmcld may fail with a SIGSEGV or SIGBUS, creating a core in /var/adm/cmcluster/core.
       The syslog will contain messages like:
           cmlvmd: Could not read messages from /usr/lbin/cmcld: Software caused connection abort
           cmlvmd: CLVMD exiting
           cmsrvassistd[]: The cluster daemon aborted our connection.
           cmsrvassistd[]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort
       The stack trace by GDB typically contains:
           #0 0x105d94 in cdb_client_port_close () from /usr/lbin/cmcld
           #1 0x1413a0 in cl_thread_start () from /usr/lbin/cmcld
           #2 0x1aa8e8 in cma__thread_base () from /usr/lbin/cmcld
           #3 0x1aca38 in cma__thread_start1 () from /usr/lbin/cmcld
           #4 0x1ac4d4 in cma__thread_start0 () from /usr/lbin/cmcld
           #5 0x105f0c in cdb_client_port_close () from /usr/lbin/cmcld

    4. If a configuration operation gets aborted during a cluster reformation with a down node joining the cluster, the cluster may abort with the following messages in syslog.log (and a core will be placed in /var/adm/cmcluster):
           Dec 05 21:46:45:0:CDB:1406: Action - Invalid transaction state of NO_TRANS for node id 2, (ABORTED)
           Dec 05 21:46:45:0:CDB:1406: Internal error - Aborting: cdb/cdb_coord_comm.c 517 (Invalid transaction state)

    5. Primarily on SG OPS clusters, the cmrunnode command executed from the cmcluster rc script may fail. When this happens, other nodes in the cluster may log messages in syslog such as:
           cmcld: Detected different configuration data on node
           cmcld: Can not form cluster with node
           cmcld: Quitting due to configuration data version mismatch

    PHSS_25499:
    1. The ServiceGuard command stviewcl can display a stape device status as "STATUS IS CHANGING" for any node. In addition, auto-reclaim of the device will not occur if that node fails and the cluster reforms.

    2. cmquerycl/cmapplyconf/cmcheckconf report errors and cmclconfd logs errors in syslog for FC60 device files which access the inactive port of the device. The errors include strings like "Unable to open disk /dev/dsk/c7t2d0: Invalid argument".

    3.
       The ServiceGuard commands cmstartres and cmstopres fail for resource names longer than 40 characters with the error, "Resource name should not be longer than 1024 characters."

    4. A ServiceGuard cluster with a serial link will fail if all heartbeat network switches fail. The same failure can be seen with a serial link if crossover cables are used for all heartbeats and one of the nodes fails.

    5. A ServiceGuard cluster of more than 2 nodes with dual heartbeat might fail if one heartbeat network was disconnected for a while (more than tcp_ip_abort_interval), then recovered, and later another heartbeat network failed. The cluster can reform and fail immediately with messages in syslog like:
           "Node id 3 did not reach sync step 0 for activity 3 within timeout. This activity appears to be hung at step -1 on that node, so node will be killed."
           "Attempting to kill node "
           "Reason: This node did not reach sync step 0 for activity 3 within timeout"

    PHSS_24850:
    1. After an OPS instance is shut down normally, cmgmsd could mistakenly kill some OPS process which is still running on the cluster. The problem only affects OPS 8.1.x and later.

    2. "Failed to release volume group " error messages are printed in syslog by cmclconfd when cmcheckconf, cmapplyconf, cmgetconf, or cmquerycl is issued. One possible side effect is that syslog may report on cluster start that the cluster lock is not initialized, although earlier, after cmapplyconf, it did report that the cluster lock was already initialized. Another symptom is that subsequent attempts to create or import a VG can fail.

    3. After the ServiceGuard daemon cmcld aborts, the system sometimes does a TOC and at other times reboots, resulting in inconsistent and unpredictable behavior.

    PHSS_24536:
    1. When certain ServiceGuard events are generated or the cmsnmpd subagent is brought up when the local node or cluster is down, the snmpdm agent will log "CloneVarBind: Unable to clone vb->value.os_value" messages in the /var/adm/snmpd.log file.

    2.
       After starting and stopping Oracle OPS edition, sockets are left connected to cmgmsd. The command "netstat -an | grep 5408" shows the TCP connections in FIN_WAIT_2 state. These are TCP connections between cmgmsd and the OPS process via the NMAPI library. These TCP connections remain until the node is halted.

    3. cmrunnode fails to join a node to an existing cluster running OPS 8.0.x. The syslog on the node that fails to join the cluster shows the following error message:
           "Timeout occurred before receiving GMS connection request."

    4. Some SG nodes TOC when cmlvmd dumps core during a cluster reformation. The problem is seen when the cluster next reforms after an online node addition when there have been more than 8 nodes involved in the configuration operation, either being added or removed.

    5. The cmcluster start-up script fails with the following error in /etc/rc.log:
           Start Highly Available cluster
           Output from "/sbin/rc3.d/S800cmcluster start":
           ----------------------------
           vxvm:vxdisk: ERROR: IPC failure: Configuration daemon is not accessible
           AUTOSTART_CMCLD not set to 1 in /etc/rc.config.d/cmcluster, exiting
           "/sbin/rc3.d/S800cmcluster start" SKIPPED

    6. When the SG daemon cmcld aborts, the system sometimes does a TOC and at other times reboots. This results in inconsistent and unpredictable behavior of the cluster.

    PHSS_24033:
    1. On SG-OPS clusters running cmgmsd, during high transaction times, cmcld can use a large amount of CPU. On single-CPU systems, this can cause cmcld (a real time process) to take over the system, preventing other processes from running.

    2. The vxdg import command within the SG package control script will resync the mirrors (resilver) in the foreground, causing the run script to be delayed until the resilvering has completed. For larger VxVM volumes this can take a large amount of time.

    3. The service status reported by cmsnmpd once a package has been halted is reported as unknown rather than down.

    4.
       The stquerycl command aborts with an assertion. rexec_cmd_reply appears in the stack trace of the core. This core dump may occur with commands other than stquerycl, such as cmrunnode or cmruncl. The core file will be in the directory where the command was executed.

    5. If the /usr/lbin/cmlvmd daemon hangs on one node during a cluster reformation, the entire cluster could go down with the message, "Timed out waiting for replies".

    6. When a package which depends on a deferred resource starts and a second cluster reformation happens, the package will fail to start and the package log file will have the following error:
           cmstartres - Unable to complete command : Text file busy.

    7. Applications that use the cmsnmpd subagent, such as the EMS Package monitor and Clusterview, will not report the correct node names where all ServiceGuard packages are running after the SG coordinator node has been halted.

    8. The SGManager map will show an incorrect node status when connected to cmomd running on an A.11.09 system.

    9. The cmrunnode command fails to start the node when the primary lan card configured for DNS is down, even though there are multiple heartbeat lans still up and the primary lan has a standby lan card which is still up. The command cmrunnode -v fails with:
           Unable to establish a connection to the configuration daemon (cmclconfd) on node : Connection timed out
           Successfully started /usr/lbin/cmcld on
           cmrunnode : Waiting for cluster to form...............
           ................................................
           cmrunnode : Node unable to join Cluster. Check the syslog file on that node for information.

    10. Clusterview does not display the correct status for packages when a package was configured with a subnet that was not available on all nodes in the cluster.

    11. cmrunnode fails to join an existing cluster with the error:
            Unable to start cluster on nodes specified. There appears to be a configuration operation in progress. Attempting the operation again may succeed.
        This is primarily seen on SG-OPS clusters.

    PHSS_23511:
    1. ServiceGuard does not handle non-standard device names when more than one device file is associated with a single device. Attempting to use a non-standard disk device name for a cluster lock PV will fail if there is more than one device file associated with a single disk. For example, take VG vglock which contains disk /dev/dsk/c0lun0 (a non-standard name) while the original device file /dev/dsk/c0t0d0 still exists on the system. The cmapplyconf command will fail with:
           "Error: Unable to determine a unique identifier for physical volume /dev/dsk/c0lun0 on node ...
       This is an additional scenario not resolved with SG11.09 patch PHSS_21996.

    2. Enable VxVM support in ServiceGuard and SG-OPS for non-shared activation. Before using SG or SG-OPS with VxVM please read the whitepaper entitled, "Integrating VERITAS Volume Manager (VxVM) with MC/ServiceGuard 11.09" located on http://docs.hp.com/hpux/ha.

    3. After certain failures VxVM packages fail to start.

    4. cmcld logs the message, "timers delayed x.x seconds". Though this message can be due to a kernel latency issue outside the control of ServiceGuard, there are also circumstances which lead to this occurring without any kernel latency. In these cases, the cluster may reform with the same membership, or there may be no cluster reformation. Another possible symptom of this problem can occur on a very static 1-node cluster where there is no heartbeat activity and no other activity like package failures for a long period of time. In this case, a new node attempting to join the cluster after this semi-dormant period of time may lead to the existing node failing with an assertion. The message would be:
           "Assertion failed: (tsb_tmp).tsb_low <= TICKS_PER_MAX_USEC, file: timers.c, line: 739"

    5. SG commands that query the disk subsystem will generate the following messages on the console:
           Warning: The disk at /dev/dsk/c0t0d0 on node xyz does not have an ID, or a disk label.
           Warning: Disks which do not have IDs cannot be included in the topology description. Use pvcreate to give a disk an ID.
       and the following in syslog:
           Mar 7 09:50:12 hppinf09 cmclconfd[20049]: Unable to read disk /dev/dsk/c0t0d0: Error 0
       In many cases these messages are correct; however, they will also be generated for VxVM-initialized disks.

    6. SG clusters configured with fibre-channel cluster lock disks may fail to obtain the lock before exhausting the allowed retry time. The SG cmcld daemon would log the following message in syslog:
           "Unable to obtain Cluster Lock. Operation timed out."

    7. The SG cmclconfd daemon gives out the error message "Permission denied for user username on node nodename" in syslog after OEUR is cold installed and the system is rebooted.

    8. On an SG OPS Edition cluster, cmcld aborts with the following series of messages:
           "External error - Lost connection with a process participating in configuration changes (235,Socket is not connected)"
           "Event - Callback of type 7 failed."
           "Internal error - Aborting: cdb_db_server.c 2524 (Reconfig Prepare, Commit, or Rollback Callback failed)"
       This may happen when attempting to do a cmhaltnode on a node. In most cases this will result in a TOC of the node being halted.

    9. When OPS 8.0.x is started up immediately after shutdown, the startup fails with ORA-29701. The Oracle trace file (ora_xxxx.trc) reports the following error from the nmapi v1 lib:
           Mon Mar 12 01:15:47 2001 skgxnini: ERROR: Failed to initialize ServiceGuard(238,Connection timed out)
           Mon Mar 12 01:15:47 2001 skgxnini: RETURN: SKGXN_EUNK, FALSE

    PHSS_22876:
    1. cmgmsd reports in syslog.log that a cdb transaction fails to commit with errno set to 22 (invalid argument).

    2. ServiceGuard logs the error message "Unable to open disk" in syslog.log when cmgetconf is issued on clusters with an XP disk array.

    3. ServiceGuard OPS can experience a failed cmgmsd transaction that may result in being unable to halt the node.
       This problem is normally triggered by an Oracle shutdown which then fails with ORA-600. This can happen when the transaction is committed, as the cmgmsd client will call a routine that will attempt to write a temporary configuration file to all nodes in the cluster.

    4. The Continental Clusters command cmrecovercl can fail, and a core file appears in /var/opt/cmom. This can happen when the remote cluster restarts or becomes visible again while cmrecovercl is running.

    5. cmquerycl hangs and numerous "Collision with another configuration processes" messages are shown in syslog.log.

    6. An MC/ServiceGuard node TOCs when the serial heartbeat data gets corrupted. This can happen when the system gets very busy.

    7. ServiceGuard returns a "Non-uniform connections" error message when "cmquerycl -c clustername" is issued after an online node addition. This happens when nodes of a cluster have inconsistent lan configurations, i.e. different lan ids on the same bridged net or a different number of lan cards on each node.

    PHSS_22683:
    1. ContinentalClusters customers can corrupt their data if they run or enable a package (cmrunpkg or cmmodpkg -e, or the analogous SAM GUI operations) when a conflicting package is already running. For example, the customer may attempt to enable a recovery package when the primary package is already running on the remote cluster, or when the data receiver package is already running on the local cluster. cmrunpkg, cmmodpkg, and the SAM GUI do not check to determine if conflicting packages are running.

    2. The MC/ServiceGuard command cmrunnode will hang while an upgrading node is trying to join an existing cluster. The command could freeze the network and thus prevent clients from connecting to the system. This happens only on nodes undergoing an HP-UX migration from 11.0 to 11.11 where APA link aggregates were part of the cluster configuration.

    3. The lock disk warning is not issued in syslog on one node in the cluster.
       This happens on the node for which the cluster lock volume group is activated.

    4. ServiceGuard commands such as cmrunnode could hang when a DLPI error occurs, without actually reporting that a DLPI error had occurred. (In fact the node should TOC when a DLPI error occurs, but it does not in some cases.)

    5. cmsnmpd will return unchanged/incorrect cluster status despite multiple changes in the cluster state. This happens after halting and starting the cluster a substantial number of times. The error message, 'Error: retrieving node status: -7', will be reported in the cmsnmpd log, /var/adm/SGsnmpsuba.log.

    6. On a non-coordinator node, cmsnmpd does not update the hpmcSGPkgStatus MIB value when the package switching option is changed from enabled to disabled using "cmmodpkg -d pkgname".

    7. On SG-OPS clusters running cmgmsd, during high transaction times, cmcld can use a large amount of CPU. On single-CPU systems, this can cause cmcld (a real time process) to take over the system, preventing other processes from running.

    PHSS_22540:
    1. Directory permissions have been repaired and functionality restored.

    2. An MC/ServiceGuard node TOCs when a corrupted DLPI packet is received.

    PHSS_21996:
    This patch is required if you are running on HP-UX 11.11.

    1. cmrunnode fails to join an existing cluster due to a configuration operation in progress. A message is printed in syslog on the running node: "cmcld: Detected different configuration data on node ". If a configuration operation is in progress (either cmapplyconf or a cmgmsd transaction) and a cmrunnode occurs on another node, it may hit a window where the configuration version gets out of sync. In this case the joining node will not be allowed into the cluster, but it will take some time for cmcld to time out. This defect can also be seen by SG-OPS customers.

    2.
       During extremely high memory & I/O loads on a system, if a configuration transaction is begun (either by cmgmsd or cmapplyconf), cmcld may abort with the message, "Aborting! This node did not reply to a CDB request message". This defect can also be seen by SG-OPS customers.

    3. Attempting to use a non-standard disk device name for a cluster lock PV will fail. For example, take VG vglock which contains disk /dev/dsk/c0lun0. Attempting to use /dev/dsk/c0lun0 as either the primary or secondary cluster lock PV will fail. The cmapplyconf command will fail with:
           "Error: Unable to determine a unique identifier for physical volume /dev/dsk/c0lun0 on node ..... When attempting to use a non-standard disk name....".

    4. DLPI checksum messages can occur frequently and do not provide enough detail.

    5. During extremely high memory & I/O load on a system while the cluster is reforming, a node may go down, i.e. TOC. This is because the SG daemon could not get a socket for communicating with other nodes. The socket call returns the error "Resource temporarily unavailable." This message may or may not make it to /var/adm/syslog/syslog.log before the system goes down.

    6. The config daemon was not ignoring a DLPI interface that has no connection.

    7. Sometimes, if a configuration change is being committed (either by cmgmsd or cmapplyconf) and something prevents cmclconfd from responding to cmcld (possibly high CPU consumption), cmcld may abort with the message, "Aborting! This node did not reply to a CDB request message". This defect can also be seen by SG-OPS customers.

    8. On SG OPS clusters where cmgmsd is generating a lot of transactions, syslog can be filled up with the following message:
           "This online configuration request %s has failed - " "another transaction is already in progress.\n"

    9.
During a cmrunnode, if another node in the cluster fails, there is a very small timing window which could prevent the cluster from reforming, and all nodes will TOC with the message "Cluster Formation Failed" appearing in syslog. This can only occur on clusters of 3 or more nodes. 10. The ServiceGuard external logging daemon, cmlogd, exits too early when a node is halted. This prevents some log messages from being written to the syslog file. 11. The cmmodnet command fails to add or remove package IP address(es) with error, "cmmodnet : Unable to complete command : Interrupted system call" 12. The following message is sometimes printed during lan failover. It should not be printed at the default log level: "cmcld: DLPI ack error for primitive 52, errno 4, unix errno 16". 13. Cluster did not reform on reboot. 14. SG Daemon "cmcld" aborts when attempting to log an error, causing the node to TOC. 15. Node does not rejoin the cluster after reboot when AUTO_STARTCMD is set. The following messages are in /var/adm/syslog/syslog.log: cmcld: Timedout waiting for LVM daemon cmcld: Daemon exiting to preserve data integrity cmcld: Reason: LVM daemon did not start 16. When using ServiceGuard OPS Edition version A.11.09 with patch PHSS_21107 and OPS 7.3.3.4, a new file /tmp/udlm_shmem_addr_file.txt is unnecessarily created. For HP-UX, this file is meaningless, but it is causing customers to wonder if it is a cause for concern. 17. A node may fail to automatically join the cluster if multiple nodes are rebooting at the same time. 18. Simultaneous configuration operations, started from two or more nodes within a cluster that share the same machine ID, can potentially cause one of the nodes to TOC with an assertion failure. This issue affects Superdome partitions and some L-Class systems that were manufactured with identical machine IDs. 19. 
During an online replacement operation of a LAN card, ServiceGuard may recover the replaced card incorrectly while its link is not fully UP and the driver is still resetting. 20. If a user inadvertently kills the cmlvmd process, the node will either reboot or TOC. 21. When a system is updated with HP-UX 11.11 Mission Critical Operating Environment (MCOE) media, the ServiceGuard binary configuration file convert utility may fail to execute successfully. (/var/adm/sw/swagent.log shows "Network is unreachable" error.) 22. In a two node cluster, due to heavy network congestion, it was possible for one of the nodes to get stuck in a "reforming" state. Once in this state, cmcld would periodically write the message: "Attempting to form a new cluster" in the syslog file and cmviewcl would always show the cluster state as "reforming". PHSS_21866: 1. A package takeover will fail if the mountpoint is busy, even if the FS_MOUNT_RETRY_COUNT is set to >0 in the package control script. 2. The cmhaltnode command repeatedly fails with the message: "Unable to halt cluster services at this time." This can occur if a lan failover has occurred on the node being halted. 3. Second dual cluster lock not tried if first dual cluster lock fails, leading to a cluster TOC. This problem only impacts customers using Dual Cluster Locks. 4. Virtual file descriptor leak. 5. Cmcld TOCs with a SIGSEGV leading to a cmcld core in /var/adm/cmcluster with a stack trace in dlpi_recv or ns_if_setgood(). The cmcld is not resistant against corrupted DLPI frames. 6. ServiceGuard cluster daemon (cmcld) aborts with error messages: "Interrupted system call. This may be caused by a very high system load." followed by "Aborting! Failed to send over DLPI" 7. DLPI errors returned from disconnected interface cards during configuration stage. PHSS_21425: 1. If cmcld experiences unexpected interrupted system calls, it could hit assertion core "Assertion failed: inp->state == CL_NODE_ADDING" 2. 
ServiceGuard config commands (cmcheckconf, cmapplyconf) consistently fail to configure a cluster with error "Non-uniform connections detected". The problem is more likely to occur on large cluster configurations or when the network is very busy. 3. ServiceGuard daemon (cmcld) generates a core dump upon receiving a message from an unconfigured token ring interface whose DLSAP address size is larger than 20 bytes. 4. ServiceGuard command cmhaltcl fails to halt all nodes with the combination of a high value of HB_INTERVAL and NODE_TIMEOUT. PHSS_21107: 1. When using OPS 8.0.5 or 8.0.6 with ServiceGuard OPS Edition, in certain instances the ogms daemon can abort which will cause the node to TOC. 2. When using OPS 7.3.4 with ServiceGuard OPS Edition, users were unable to define the shared memory attach address for udlm clients. 3. When using OPS 7.3.4 with ServiceGuard OPS Edition in a 2-node cluster, shutting down one instance and its node can cause a hang at the "alter database open" when the node and instance are restarted. 4. If you have patch PHSS_20872 installed and you attempt to swremove the ServiceGuard or ServiceGuard OPS Edition product while the cluster is up on any of the nodes, the swremove will report that it failed, but some filesets will be removed. As a result, ServiceGuard commands on the node where the swremove was done will no longer work. However, the commands will still function on the other nodes. PHSS_20872: 1. If NODE_TIMEOUT exceeds 30 seconds, cmhaltcl command fails to halt the last node with an error message: cmhaltcl : Unable to connect to daemon 2. If an swremove of ServiceGuard is attempted while the cluster is up, the cluster is halted and the config file (cmclconfig) is nulled out. 3. Users were unable to define the shared memory attach address for udlm clients. 4. In 2-node clusters, shutting down one instance and its node can cause a hang at the "alter database open" when the node and instance are restarted. 5. 
When the user specifies -T 6, they should get DEBUG level logging. The problem is that when the user runs cmquerycl -v -n sweetee -T 6 or cmcheckconf, cmapplyconf, they will not get logging at the DEBUG level. 6. Package control script will fail during startup of a package, if the mount point used by the package is busy. The package startup will fail. This has been the intended behavior. 7. Misleading error messages if a service name has a '/' in it. Error: The cmapplyconf command will return the following message: Unable to apply the configuration change due to an invalid request. In addition, cmcld TOCs all nodes in the cluster when an online apply is done with a PKGNAME that has a '/' in it. If the cluster is offline, the cmapplyconf will succeed, but the subsequent cmruncl will fail. 8. ServiceGuard commands fail to configure or re-configure the clusters if X.25 is installed on any node. 9. cmhaltcl showed the cluster was successfully halted, but cmviewcl found the cluster was still up and running. 10. A two-node ServiceGuard cluster configured with dual cluster lock fails to reform when one of the nodes fails. Specifically, node A fails and node B is unable to reform a one node cluster because the cluster lock is found to be held by the failed node. PHSS_20230: This patch is required if you have: - MC/ServiceGuard A.11.09 and need to use the Continental Cluster product. - MC/ServiceGuard A.11.09 and are experiencing any of the following symptoms. - ServiceGuard OPS Edition A.11.09 and are experiencing any of the following symptoms. This patch fixes the following symptoms found in either MC/ServiceGuard or ServiceGuard OPS Edition A.11.09 (or earlier). 1. The entire ServiceGuard cluster TOCd when a local switch occurred in a heartbeat subnet. 2. Node TOCd during a configuration operation with either the message "Timed Out Waiting For Replies" or "A process participating in a configuration operation no longer exists". 3. 
The configuration commands, cmcheckconf and cmapplyconf, failed with the message "Device not found on any node". 4. All nodes TOCd on an Oracle Parallel Server 8.1.5 cluster. 5. Misleading message was logged in the syslog.log for DVD-ROM device. 6. Recently released tape library robotic (autochanger) devices not recognized by MC/ServiceGuard shared tape facility. Defect Description: PHSS_27158: 1. When safety time is disabled, a timer is started to simulate safety time protection, but when safety time is re-enabled, the timer is not cancelled and eventually pops, leading to node TOC. Resolution: Change support tool to cancel the timer when enabling safety time. 2. Due to the support of online hotswap LAN cards and the APA product, ServiceGuard's network manager inefficiently checks for a change of MAC address on each LAN card on a regular basis. This check consumes a lot of CPU power, and the problem becomes apparent when there are many LAN cards configured in the cluster node where cmcld is running. Resolution: Redesign the checking mechanism to be efficient so that it will not take a lot of system CPU power, while keeping the supported features intact. 3. The Node_id is not changed after the cluster is reconfigured. So when a node with a node ID that is not the first or last node ID in the cluster is removed from the cluster, there will be a free slot in the node_id list, and then cmviewcl will not be able to get the node name for the removed node_id. Resolution: Continue to check the next node_id instead of reporting this error. 4. When cmsnmpd tries to determine if a package status has changed after a cluster reformation and a new coordinator is assigned, the previous package status isn't known because only the coordinator node has a record of all of the cluster-wide package status related information. Resolution: Changed cmsnmpd and package manager to replicate package status information on all nodes in the cluster. 
This enables cmsnmpd to identify a package status change when a new coordinator is assigned and a package status changes from up to down after a node failure or cluster reformation. The package status change triggers the appropriate PkgDown trap to be sent with the corresponding node name. 5. Resource monitor requests are not unregistered with EMS when cmcld exits from a cmrunnode/cmruncl time-out, so the next time cmcld starts up and registers the same requests with EMS, it will not get immediate notifications regarding the state of the resources, and a package will not be able to start on that node. Resolution: Unregister resource monitor requests before cmcld exits from a cmrunnode/cmruncl time-out. 6. The hpmcSGClusterDown trap was never implemented because there was no reliable way for the SG subagent api to send a "cluster down" event: each node halts independently, and once cmcld stops running, cmsnmpd can no longer retrieve the cluster status from the SG subagent api. Resolution: When cmsnmpd receives an event that the local node was halted or failed, it will locally check the cluster status and, if the cluster is down, set the hpmcClusterStatus mib variable to "down" and send a cluster down trap. 7. During cluster formation, timers may be inadvertently cancelled. Resolution: Correct the code that cancels timers during cluster reformation. 8. This problem happens when a customer tries to modify the bridged net configuration. If the cluster has an existing binary configuration, cmcheckconf/cmapplyconf are supposed to update the binary configuration accordingly. However, these commands fail to do so, and only when cmruncl/cmrunnode do their own network probing does ServiceGuard realize the bridged net configuration has been changed. At this time, cmcld goes through the list of network cards it found and compares it with what exists in the binary configuration generated by cmapplyconf, but cannot find a match, hence the assertion failure. 
Resolution: Made change so cmcheckconf/cmapplyconf update binary configuration correctly. 9. This problem happens due to the system clock not being updated for a prolonged duration. The ServiceGuard daemon cmcld relies on the system clock to create internal events like sending Heartbeat after each heartbeat interval, etc. But cmcld does respond to external events created by other nodes in the cluster. So if the system clock stops working on a node, then cmcld on that node runs in an unstable manner. This unstable cmcld creates a problem for itself and one of the other nodes in the cluster, which ultimately results in those 2 nodes failing. And if only 1 node remains, then that node will also TOC due to lack of quorum. Resolution: Changes are made to the cmcld daemon so that it can detect a system clock problem. This detection is driven by external events. If the system clock stops working for a time equal to 2 node timeouts then a warning will be logged into syslog until there have been 5 node timeouts. After that the node will kill itself so that other nodes in the cluster can form a new cluster, excluding the problematic node. If a node TOC should not happen after such a short while, then an increase in node timeout is required. For more details on this, see the Special Installation Instructions section. 10. The subnet down/up events did not trigger cmsnmpd to correctly update the package subnet status MIB variables hpmcSGSubnetUp and hpmcSGSubnetDown on all running nodes in the cluster. Resolution: cmsnmpd correctly updates the package subnet mib variable on all running nodes and sets the subnet status to "unknown" on all halted nodes. 11. The node halted event doesn't trigger cmsnmpd to set all hpmcNodeRole mibs to "unknown" on the halted node. Resolution: Set all hpmcNodeRole mibs to "unknown" when a node is halted. 12. This is actually a CMA thread problem. When nmap is running and trying to connect with cmcld ports, cma_accept is invoked to receive data. 
This cma_accept in turn will call fstat on the file descriptor passed to it. There are times when fstat returns error numbers that represent transient problems and cma is supposed to handle these errnos appropriately. However, the problem is that cma_accept exits and the thread terminates abnormally. This leads to cmcld abort. Resolution: A fix is provided from the CMA team. Since ServiceGuard is linked with libcma statically, customers experiencing this problem must install this patch, which is already linked against the fixed libcma archive, to get the fix. PHSS_26750: 1. Error messages describing local LAN failover failures are not logged to syslog in a production environment. Resolution: Make change such that the error messages are logged to syslog in a production environment. 2. The problem occurs due to SAM GUI code not properly going through necessary checks. Resolution: Fix has been implemented to properly transmit code checks. 3. Network connections (heartbeat and general service) are not reestablished when the physical network is restored until cluster reformation time. Connections are not cleaned up fast enough when the physical network goes down. This defect was originally root caused in JAG ad94082. A quick fix was put into PHSS_25499. That fix has been backed out. This is the complete fix for that problem. Resolution: Add the 'rcomm health monitor' to monitor health of connections. Reestablish responding connections, disconnect non-responding connections. 4. When an EMS monitor returns RM_NOT_READY, cmcld retries registering the monitor request. Due to a bug in the retry code, a lock is released twice, leading to cmcld aborting. This was introduced in PHSS_26338. Resolution: Make change such that the lock is released only once in the retry code. 5. A cmapplyconf command is started, but goes away immediately. The proxy server does not know this because the proxy server did not check for a bind failure on the command's lcomm port. 
Proxy server believes the command is there, so it starts a transaction (acquires config lock) and waits for the transaction to start. Proxy will never know that the command is already gone. Subsequent applyconfs will fail since the config lock is held already. Resolution: Make sure the transaction is not started until after the bind has completed successfully. If the command goes away after the bind has completed the transaction will be cleaned up. 6. The problem happens when one node of a cluster hangs, causing a cluster reformation, and then returns immediately before the cluster reformation completes (late vote). If the cluster reformation is in the last phase when the hung node returns and votes, the coordinator must determine if it will accept the node back into the election. There is a small window during which this determination is done incorrectly. Resolution: The fix is to accurately determine whether the hung node should be accepted back into the election. This prevents the election from being restarted and both nodes TOCing by safety timer expiring. In some cases, the hung node will be allowed back in, and in other cases it will TOC. 7. The cmhaltnode command halts packages first. While halting the packages if other nodes in the cluster reboot or halt, the cluster communications for halting the package may get disconnected, resulting in an error, ENOTCONN. This error causes the cmhaltnode command to exit out without halting the cluster. Resolution: If an ENOTCONN error is generated before completing the cmhaltnode command, the command will now handle this and will retry to halt cluster services again, but this time the rebooted or halted node will not be used for the cluster communications for the package being halted. 8. The cmhaltnode command first disables packages from running on any node, then halts packages, enables packages so that they can start on adoptive node and then halts cmcld. 
So on one node the cmhaltnode command might be in the middle of halting cmcld while packages might be starting on another node. During this time if cmhaltnode is issued on another node then that command does not treat starting packages correctly and directly halts cmcld on that node. The packages in the starting stage can then finish starting but there will be no cluster running on that node. Resolution: The cmhaltnode command will wait for packages that got enabled and are in starting state due to another cmhaltnode command halting cmcld on another node. 9. The fix submitted for JAG ad68565 in PHSS_24678 (SG11.13) and PHSS_24536 (SG11.09) to initialize all Emanate Cluster related variables to empty strings when the cluster or local node wasn't running caused resls to show the cluster name as an empty string if the local node is halted. Resolution: Change was made to initialize all status variables to empty string when cmsnmpd first starts, independent of whether the cluster or local node are up or down. 10. There is a very small window where an uninitialized 'status' variable in the cdb coordinator code results in old status being returned for the current transaction. Resolution: Initialize the status variable to 0 when beginning a transaction. 11. At network probing phase, ServiceGuard tries to bind to network interfaces of unsupported type. Resolution: Check for and skip lan cards of unsupported type. 12. SG design assumed once a node votes late and gets deferred, it's no longer heartbeating with coordinator. It turns out that, although rarely, such a node may still be heartbeating. Resolution: At election timeout, drop any node that's hb_eligible but did not send us a vote. 13. At the end of cluster reformation, synchronization is done in order to make sure that all nodes in the cluster are at the same level and have consistent information. This includes cmlvmd reconfiguration completion. 
If cmlvmd or any other activity was not able to sync for a certain time (8 minutes) then the node having trouble will be killed and the cluster will be reformed. But the sync activity does not get aborted, and after 2 more minutes, if cluster reformation is not completed, then one more node gets killed. Resolution: The fix aborts sync activity if any node gets killed due to a synchronization problem. This will allow cluster reformation to finish and start a new synchronization. PHSS_26338: 1. ServiceGuard registers only one request with EMS for each resource of string or enumerated type but unregisters the request multiple times if there are multiple RESOURCE_UP_VALUEs defined for that resource. Furthermore, inconsistencies between ServiceGuard's internal list of monitor requests and the requests that are actually registered with EMS may cause the same monitor request to be unregistered more than once or a non-existent request to be unregistered. Resolution: Strictly enforce synchronization of EMS's list of resource requests with SG's, through much improved locking. ServiceGuard now registers a request for each RESOURCE_UP_VALUE defined for a resource of string/enum type. 2. Parsing in config_ascii.c doesn't handle the unquoted case in RESOURCE_UP_VALUE. Resolution: Make sure multiple words are enclosed within quotes. 3. The second criterion in strings and enums with !=x and !=y values for RESOURCE_UP_VALUE was not being saved and therefore not being monitored by ServiceGuard. Resolution: Save the second criterion since parsing allows this combination. 4. This is actually a DLPI bug. The DLS provider somehow returns dl_errno as 1, which means bad address, for a temporary resource shortage. It should return dl_errno as 4 with unix_errno as ENOBUFS or ENOSR instead, so ServiceGuard could handle this transient problem accordingly. Resolution: A DLPI patch will be released to fix this problem. 
The workaround solution in Service Guard is to abort only if we receive the dl_errno 1 too frequently in a relatively long period of time, which indicates a permanent, serious problem. Otherwise, the problem is transient and will be ignored. PHSS_25935: 1. There is a coding error which causes an infinite loop, leading to a TOC when a package which has 2 or more deferred resources is halted. Resolution: Changed pm_unregister_pkg_resource to properly update link in while loop while handling multiple deferred resources. 2. cmviewconf checks the wrong variable when determining what value to display for HALT_SCRIPT_TIMEOUT. Resolution: Modify cmviewconf to check the correct variable when determining what value to display for HALT_SCRIPT_TIMEOUT. 3. Multiple commands create multiple transactions in the queue. When one of the commands is aborted, the corresponding transaction is also aborted. A lock is released and a pointer is moved to next transaction. As the lock is released another thread may come and delete the next transaction thinking that it has been aborted. Later when that deleted transaction is referenced, cmcld dumps core with SIGBUS or SIGSEGV. Resolution: The fix is to always go back to the first transaction when a transaction is aborted and destroy the transaction. Also make sure that no transaction pointers are held while the lock is released. Instead, re-lookup will be done to find the correct transaction. 4. There is an invalid assertion in the code that checks that all nodes are in a legal state corresponding to the reply message received from a node. It is asserted that a state of NO_TRANS is not legal when it is. Resolution: The fix was to change the code so that NO_TRANS is considered a legal state at this point. 5. The cmrunnode command collects cluster configuration information from all nodes and copies the latest one before starting cluster. 
But sometimes during system startup when all systems are starting, the cmrunnode command can fail to collect the cluster configuration information, which can result in failure of cluster formation. Resolution: The fix makes sure that the cmrunnode command collects the correct cluster configuration version and, if unable to do so, fails with an error. The startup script will then retry the command for 10 minutes and, if not successful, will give up. PHSS_25499: 1. The stape reserve/release functionality is controlled by the kernel tunable parameter st_ats_enabled. Currently this parameter defaults to enabled on 11.0 and 11.i HP-UX systems, and in the near future it will default to disabled. The ATS does not check the state of this tunable before proceeding with shared tape operations and will generate errors if it is set to disabled. Resolution: Add mechanisms to ensure ATS checks the state of the kernel tunable st_ats_enabled and ensures that shared tape functionality is not available if the kernel tunable for stape reserve/release functionality is disabled. 2. When cmclconfd fails to open the inactive FC60 device files, SG assumes an error condition, cmclconfd logs an error message in syslog and cmquerycl/cmcheckconf/cmapplyconf report the error to the user. Resolution: If SG cannot open a block device file, it will try to open the character file and get the device size. If the open succeeds and the size is 0, SG will ignore the device and not report an error. 3. To enforce a length of less than 1024, the resource name length was incorrectly compared with a different variable. Resolution: The fix correctly compares the resource name length with MAX_PATH_LENGTH(1024) and the package name length with MAX_NAME_LENGTH(40). 4. When all heartbeat networks go down and a serial device is configured, the node that noticed these failures will delay itself before going for the cluster lock. 
This will ensure that if the other node is good then it can obtain the cluster lock and form a one node cluster. But if the other node is down or also delayed, then before this node can acquire the cluster lock, safety time can expire and this node can TOC. In case of heartbeat network switch failure both nodes can delay themselves and the whole cluster can fail. Resolution: To enforce the delay, more time is spent in FC state. A very large number of FCs are sent if the node notices that it has a serial device and all heartbeat networks are down. This number of FCs is so large that if the other node is also delayed or failed then this node will TOC, as it does not have enough time to form a one node cluster. The number of FCs to send is now recalculated so that the other node has enough time to get the lock and form a one node cluster, and if it does not do so, then this node has enough time to form a one node cluster. 5. If the heartbeat network is down for more than tcp_ip_abort_interval (default is 10 minutes) then ServiceGuard cleans heartbeat and other connections on that network. If this network comes back then ServiceGuard does not establish new connections until another heartbeat network fails. Under such a condition the cluster reforms and some connections are not cleaned up properly. Due to this the synchronization activity fails, which results in cluster failure. Resolution: The connections are now cleaned up properly during cluster reformation. PHSS_24850: 1. When an OPS instance is halted, cmgmsd cleans up the group membership information related to this instance. During this operation, cmgmsd closes the socket connection to this instance. In some situations, cmgmsd closes the wrong socket connection, which belongs to another running OPS instance. When cmgmsd later discovers it cannot send information to that instance because of the broken connection, cmgmsd sends a SIGKILL signal to the Oracle client process. 
Resolution: Make sure cmgmsd doesn't close the wrong socket connection when cmgmsd cleans up group members. 2. ServiceGuard config daemon cmclconfd passes an array which contains physical volume names to an LVM library function while trying to detach physical volume groups during the device query process of cmcheckconf/cmapplyconf. Later on, cmclconfd frees the memory allocated for the array but the LVM library keeps using it. This leads to memory corruption, and because LVM cannot locate the corrupted data for the physical volume group, it cannot detach it, which leads to the failure to release the volume group. The problem of the cluster lock disk failing to initialize is a side effect of this. Resolution: Make change so that LVM library makes a copy of the actual data, not of the pointer to the physical volume name. 3. While aborting, the ServiceGuard daemon cmcld forks a child which calls reboot to ensure buffers are flushed to the disk. Meanwhile the parent exits, stopping safety time updates. Thus a race between reboot and safety timeout starts, and depending upon which wins, either the reboot completes or the system TOCs. Because the time taken to reboot and the safety timeout (determined by the node timeout value) vary and depend on many other factors, it is nondeterministic which will win. Resolution: The child process of cmcld that previously called reboot will now call sync to flush buffers to disk and then exit. This ensures that the system always TOCs when cmcld aborts. PHSS_24536: 1. When the cmsnmpd agent is brought up or ServiceGuard events are generated when the SG daemon, cmcld, is down, there are several cluster related variables that are left uninitialized. The SG daemon, cmcld, is down when the local node is halted or the cluster is down. Error messages appear in the snmpd.log file when the cmsnmpd subagent sends these uninitialized variables to the SNMP Master agent, snmpdm. Resolution: Initialize all cluster related variables when cmsnmpd is started. 2. 
A TCP resource leak occurred because cmgmsd did not issue a socket close call. Resolution: Issue socket close call. 3. After cmcld starts up GMS from OPS 8.0.x, cmcld waits for a connection message from GMS before cmcld finishes joining the cluster. But cmcld fails to receive that message within the timeout so cmcld exits. It turns out that the nmapi1 lib linked in GMS sends the message to a remote cmcld instead of the local one because nmapi1 lib retrieves a wrong socket connection. Resolution: Establish socket connection using handle to local node instead of handle to cluster. 4. The cmlvmd core is caused by referencing an out-of-bound index into an array when cmlvmd processes a cluster reconfiguration message. The array can only hold 8 nodes of information, so if more than 8 nodes are involved in an online node change (either being added or removed), corruption occurs. This can corrupt memory structures used during a cluster reformation, resulting in a core dump at that time. Resolution: Increase the size of the array. 5. This happens when the VERITAS volume manager is installed but has not been fully initialized. Resolution: During start up, check for successful initialization of VERITAS volume manager before executing the vxdisk command. 6. While aborting, if a core is required then cmcld forks another process to call reboot so that the core will be written and the file system will be flushed. But if reboot takes too long and safety time is reached, then the system will TOC. Also, if a core is not required the system will TOC directly. Thus, depending upon the time taken by reboot, when safety time expires, and whether a core is required, the system can either TOC or reboot. Since the time reboot will take is unpredictable, safety time depends on cluster configuration, and the core requirement depends on the runtime problem encountered by cmcld, whether the system will TOC or reboot is totally unpredictable. 
Resolution: When the core is required, cmcld will fork another process, but instead of doing reboot this process will now call sync and then exit. This will ensure that SG nodes will always TOC. PHSS_24033: 1. The machine hang problem is caused by cmgmsd retrying begin_trans without any delay if begin_trans reports another cdb transaction is in progress. The excessive begin_trans requests could in turn cause cmcld to use a large amount of cpu cycles to respond to those requests. In a single-cpu L class machine, TOP reports cmcld could use 40% of the cpu time and cmgmsd could use another 20% of the cpu. Since both processes are running as high priority processes, regular processes may have a starvation problem. Resolution: Two changes were made. One change is to eliminate unnecessary begin_trans so that begin_trans failure is unlikely to happen. The other change is to slow down retrying begin_trans if begin_trans fails consistently. 2. This is an enhancement from 11.13 that is being backported to 11.09. Resolution: In the package control script a new variable has been added, VXVOL. This variable contains the vx command that starts the VxVM volumes. In the default case any resilvering that needs to be done will be done in the foreground. However, if the user wants to have the resilvering processed in the background, and thus not have the package control script wait until resilvering is completed, one of the alternate VXVOL settings can be selected. 3. The status is not updated when the service is halted. Resolution: The status is marked down before the callback is deleted. 4. This problem occurred because the rexec_cmd_reply() function did not check for a multi-cast send error and assumed that the error was in the ack. The ack pointer is null and so we hit an assertion when we try to use it. Resolution: Check for the send error. 5. This problem occurred because of 2 bugs. 
First, when a node tried to send a cl_kill message to another node, it did not release a mutex and we hit a deadlock situation. Secondly, the nodes that did not have the hung clvmd are waiting for a sync message from the coordinator, which is waiting for the sync message from the hung node. When the coordinator does not respond to the remaining nodes after 10 minutes they will time out and send a cl_kill message to the coordinator even though the coordinator was not the one that was hung. Resolution: We now release the mutex before calling cl_kill(). Also, we will set a timer on the coordinator when the first node sends a sync message. If after 8 minutes we have not received a corresponding sync message from the other nodes, we will send a cl_kill message to any nodes that did not send us a sync message. So, if they are hung, they will be killed. 6. The problem occurs when there are two cluster reformations in close succession and a package which depends on a deferred resource needs to be started. Resolution: Modified cmstartres and cmstopres to retry when errno is ETXTBSY. 7. The cmsnmpd on the non-coordinator nodes assigns a NULL value for the hpmcSGPkgCurrNode mib variables. When the coordinator is halted, the SG API sends an event to cmsnmpd on the new coordinator indicating the current owner of all the packages. This NULL value doesn't get overwritten by cmsnmpd on the new coordinator node for all packages. Resolution: The fix involved updating the hpmcSGPkgCurrNode mib variable when an SG API event is received by cmsnmpd on the new coordinator node indicating the package is "up". 8. The configuration daemon memory leak caused a query to retrieve an incorrect node status. Resolution: Fixed memory leak in the cmclconfd. 9. The cluster configuration commands use gethostbyname() to resolve the ip address for a nodename and try to connect to the configuration daemon on that ip address. If the primary lan card on which DNS is configured is down, gethostbyname fails.
This problem can also happen if the /etc/hosts file is configured and returns an ip address which is configured on a lan card that is down. Subsequent attempts to connect to the configuration daemon will fail. After that cmcld starts with the existing configuration information and the node tries to join the cluster. Meanwhile, if the configuration version number has been changed in the running cluster, any such node attempting to join that cluster will be rejected, since it has a lower configuration version. Resolution: When commands discover that they cannot connect to the configuration daemons of interest, then instead of starting cmcld they will probe all subnets configured on a node to find an alternate path to the nodes running in the cluster. If an alternate path is found, that path will be used to get the latest cluster configuration information and then cmcld will be started. 10. The cluster SNMP agent (cmsnmpd) incorrectly dealt with a return value from a status lookup on nodes on which the package subnet was not available. This caused cmsnmpd to stop retrieving status information from ServiceGuard. Resolution: The subagent was modified to deal with the case where a node does not contain a subnet used by packages configured for other nodes. 11. The coordinator is waiting for a request message from the proxy server which has already gone down. Resolution: When checking for request message from the proxy server, also check if the proxy server is still a part of the cluster. PHSS_23511: 1. The root cause of the problem was that we were not getting all the device entries for a single major minor number pair. This means that if two device files (say c1t1d0 and disk1) point to the same physical device, we would only find one of these during our scans. 2. Enhance the package control script to support VxVM disk groups. 3. VxVM disk group's hostid is not cleared. Therefore the disk group cannot be imported on another host. 4.
This problem occurred because the wrong thread inside the cmcld intercepted the SIGALRM that was intended for the timer loop thread. 5. This problem occurs because SG only understands disks with LVM headers. Anything else is interpreted as an invalid disk, hence the messages to use pvcreate to give them an id. 6. This problem occurs because SG does not explicitly recognize fibre-channel lock disks; rather, they are treated as a generic default and allowed a minimum amount of retry time for cluster lock operations. 7. If an SG command is issued on the local system or somewhere else in the network which requires invoking cmclconfd, cmclconfd will check the .rhosts file to see if the user has permission to execute it. After OEUR is installed and the system is rebooted, there is no .rhosts file in the system for cmclconfd to check. Therefore, this is correct behavior. However, the message should be less generic so customers will not panic. 8. The problem happens if there is a CDB transaction at the same time as halting the node. cmgmsd disconnects from the cmcld and the CDB transaction misinterprets this as an error condition. 9. The problem is caused by OPS calling skgxnini twice, and skgxnini doesn't realize it has already connected to ServiceGuard. The fix is to skip the step of connecting to ServiceGuard in skgxnini if it has already done so. PHSS_22876: 1. The underlying cdb client code in cmgmsd could not communicate with cmclconfd within the timeout (5s). Because of this, cdb marks the connection invalid. All subsequent cdb transactions would fail. On an extremely highly loaded system, since cmclconfd runs at a lower priority than cmcld and cmgmsd, starvation could occur. 2. Problem occurred since cmgetconf tried to open all entries in I/O trees that showed up under /dev/dsk. However, some of these entries are not disks, like in this case, disk controllers for XP disk array. 3.
A networking problem caused the connection between cmgmsd and the remote cmclconfd process to break unexpectedly at the moment that we were committing a cmgmsd transaction as part of shutting down. As part of the commit, we first attempt to check all nodes to make sure there is enough disk space to proceed. We do this by sending the cmclconfig file to the remote cmclconfd processes. Because this is a rather large message, it seems to have a higher chance of encountering the network problem. If this happens, the commit will fail and if it is during shutting down of the OPS node, the halt will fail as well. It turns out that we don't need to do this check on all nodes and should only do it on the local node. Resolution: We detect that cmgmsd is the configuration client by the fact that it connects using a node handle and in that case we will only copy cmclconfig.tmp to the local node, thereby avoiding sending the large message over the network. 4. ContinentalClusters error messages exceed the size of the buffer to which the cmprovider logs such errors, and the way cmprovider logs allowed the buffer to overflow. This resulted in memory corruption and caused cmomd to abort with a segmentation violation. Resolution: Fix logging in the cmprovider to prevent buffer overflow. 5. The problem exists because cmclconfd closes fd 0 during startup. The DLPI network probing module could open fd 0, but it uses fd 0 as an invalid fd, thus it never closes fd 0. This resulted in cmclconfd being perpetually bound to the DLPI port, which eventually blocks all other network probing due to the collision while binding to this port. The resolution is to make sure cmclconfd opens fd 0,1,2 as /dev/null, so DLPI will never be able to open fd 0 as it binds to the DLPI port. In addition, a retry timeout is implemented to break out of the retry in the cmclconfd client in case a deadlock happens from an older version of cmclconfd. 6.
When the serial link experiences an over-run or under-run, the serial link code will attempt to find the next message. In doing so, if it finds what looks like a valid header which has a correct header check sum, it attempts to compute the check sum for the entire message. Sometimes random bytes look like a valid header but the message length is very large. This very large value causes cmcld to get a memory violation. 7. When adding nodes online, ServiceGuard assigns bridged net ids for new nodes from scratch, without using existing bridged net ids. This leads to situations where existing nodes and newly added nodes have different views of the cluster's bridged nets configuration. Once 'cmquerycl' is issued, SG checks to see if lan cards on the same bridged net could talk to each other. Since existing nodes and the newly added node have different views on which lan card is on which bridged net, SG will check for connection between lan cards that are actually not on the same bridged net. This is where SG gives out the "Non-uniform connection" error messages. Resolution: Make changes so that SG will use the existing bridged net ID, if there is any, to assign to lan cards of the new node if they are on the same bridged net as lan cards on existing nodes. PHSS_22683: 1. MC/ServiceGuard is not aware of ContinentalClusters, so it cannot prevent users from starting the wrong package. Resolution: ContinentalClusters configurations will be checked when starting or enabling a package. 2. The MC/ServiceGuard command cmrunnode is not aware of the fact that the ppa values of existing APA link aggregates have been changed from "1xx" to "9xx" after the system is upgraded to HP-UX 11.11. This results in the command hanging indefinitely because it continues to use the old ppa values, which are stored in the cluster binary configuration file.
Resolution: The binary conversion utility, which gets run during post-installation of the patch, has been modified to perform the appropriate LAN conversion prior to running any commands. 3. When the volume group is activated on a node the open for the cluster lock device (which is powered down or failed) succeeds. But the health check will show a problem later when querying the device. Resolution: Modified cluster initialization to start lock health check immediately after open of cluster lock disk. Also modified cmruncl warning to tell system administrator to check syslog on all nodes in cluster. 4. Some of the DLPI errors, especially the ones which are not caused by unix system errors (for example, DL_ATTACH_REQ failing because of trying to attach an incorrect value of PPA to a stream), were not being reported as errors, thus causing incorrect behavior of some commands. Resolution: Modified the error value being returned so that all DLPI errors are reported as errors irrespective of whether they are caused by unix system errors. 5. A file descriptor leak was detected that eventually utilized all of the system's available file descriptors. When cmsnmpd is no longer allowed to open any more file descriptors, it is unable to retrieve and/or report the current correct cluster status. Resolution: An extra file descriptor close call was added. 6. The package switching bit in the hpmcSGPkgStatus mib variable is never cleared when the Package Switching option is changed from enabled to disabled. Without the fix, cmsnmpd only updated this mib variable when the Package Switching option was changed from disabled to enabled. Resolution: This was corrected by adding code to clear the package switching bit in the hpmcSGPkgStatus mib variable when the Package Switching option is changed to disabled. 7. The machine hang problem is caused by cmgmsd retrying begin_trans without any delay if begin_trans reports another cdb transaction is in progress.
The excessive begin_trans requests could in turn cause cmcld to take up significant cpu cycles to respond to those requests. In a single-cpu L class machine, TOP reports cmcld could use 40% of the cpu time and cmgmsd could use another 20% of the cpu. Since both processes are running at a higher priority than regular processes like telnetd, this generates a starvation problem. Resolution: The change is to slow down the rate of cmgmsd retrying begin_trans. PHSS_22540: 1. Directory permissions have been repaired throughout the product. 2. MC/ServiceGuard tries to recover and log messages including the sender's node id when a corrupted DLPI packet is received. The MC/ServiceGuard node TOCd while logging if the sender's node id is corrupted in the DLPI packet. Resolution: Algorithm is added to check validity of sender's node id. PHSS_21996: 1. There was a window between the time that the cmrunnode command checked the configuration version and the time that the cmcld was started where another configuration operation (most likely from cmgmsd) could sneak in. When the cmcld started with the old configuration version and asked the coordinator to be let into the cluster, the coordinator would not let the joining node in. The joining node would keep trying until its autostart timeout was up. In the meantime, the cmrunnode command was kept waiting. Resolution: Use configuration transaction during startup. 2. As part of beginning a transaction when the cluster is online, each cmcld will send a message to its configd requesting that it lock a file on its behalf. This was used to sync up with any cmrunnode commands that might be going on at the same time. Under normal use, the config lock request will return immediately with either success or failure depending on whether the lock was acquired.
However, if the system is overloaded for an extended period of time with I/O or the open of the lock file hangs for some reason, a CDB message timer pops and the node that did not respond to the begin request will be killed. Resolution: Remove unneeded cmcld-to-cmclconfd request. 3. Internally SG will traverse the device tree and then build device names based on these device ids. SG would then check the cluster lock disk PV string against the device names it built. Since the names it built would not actually be real device names, the check would fail. Resolution: Use the HP-UX library routine ftw() to build the device tree using the actual device names on the system. We then use these names to compare against what is in the cluster ascii file. 4. Attempting to understand the cause - The messages encountered should reduce the possibilities. Resolution: Added hardware information to an existing message. Added another data verification to reduce the window of possible causes. 5. When the ServiceGuard daemon 'cmcld' is attempting to establish a heart beat connection, a socket call returns an error EAGAIN. This error causes cmcld to abort. Resolution: SG daemon cmcld will retry socket calls when the errors EAGAIN, ENOBUFS, or ENOSR are returned. 6. If a source interface is not physically connected, then the config daemon should ignore it. In dual-ported and quad-ported hardware, often one or more ports will not be used. However, the config daemon was not ignoring these cases and was logging them as errors at a severity of level zero. Resolution: Increased the log level for this kind of thing to level three, so that it doesn't show up as an error with severity level zero. 7. During the commit phase of the configuration transaction the transaction thread was getting blocked waiting for a message back from the cmclconfd. Resolution: Spawn thread to manage communication with the cmclconfd. 8. The logging message has always been at level 0.
We needed to find a way to get the collision message back to the commands before we could raise it. Resolution: Raise level of log message and change config clients to interpret busy error return. 9. During the quiescence period of the cluster formation for the new node joining, the new node is counted in the last number of active nodes. When an existing node dies, we go back to reform. We have to get quorum of the last number of active nodes. But we only accept votes from old members that were in the last incarnation. So, our last number of active nodes is too high. Resolution: Set a flag to indicate when the last number of active nodes includes nodes that aren't allowed to vote and interpret this flag when calculating quorum. 10. When a node is halted, cmcld shuts down cmlogd in order to complete the service gracefully. This is done with the intent that any last minute messages will be sent directly to syslog. However, there are a few important networking log messages that have already been sent to cmlogd and are waiting to be written to syslog. Therefore, when the cmlogd service exits, these messages will never make it to syslog at all. Resolution: The cmlogd daemon will not be shut down when a node is halted. Instead, it will exit gracefully on its own when it detects that cmcld has already exited. 11. The cmmodnet command fails because one of the socket ioctls, which are called to add or remove an IP address, returns an EINTR (Interrupted system call) error code. Although this is a transient error, ServiceGuard doesn't retry the ioctl. Thus the command exits with this error. Resolution: A retry mechanism has been implemented to allow the cmmodnet command to call such ioctls repeatedly when EINTR is returned. 12. Although lan failover works correctly, while it is occurring, there is a window where DLPI returns errors during driver reset. This produces error messages customers could see in syslog.
Resolution: Log level was changed so error messages will not appear in syslog anymore. 13. The inetd daemon has not completed initialization when the SG start script is executed. Therefore, inetd does not start cmclconfd which starts the cluster. Resolution: Modify script cmcluster.init to retry cmrunnode -v. 14. The function cl_strerror aborts when attempting to print a negative error number. Resolution: Modify cl_strerror to print negative errors as negative numbers. 15. The SG daemon aborts because cmlvmd does not start in time, but cmlvmd stays up. Resolution: Close timing window when starting lvm daemon. 16. The creation of this new file was introduced by a fix included in PHSS_21107. For HP-UX, this file is meaningless and should not be created. Resolution: This file will no longer be created. 17. A prior fix to the cmrunnode command involved adding additional locking to the node startup process. This locking effectively prevents two simultaneous cmrunnode invocations from both succeeding. Resolution: A retry mechanism was added to the system startup script /sbin/init.d/cmcluster. 18. When an online ServiceGuard node initiates a configuration operation, it first generates a transaction ID that will be used by all nodes to identify this operation. This transaction ID must be unique within the cluster. One component of the transaction ID, used to make it node specific, is the machine ID number as returned by uname(2). Within a Superdome cabinet, all partitions return the same machine ID and thus transaction IDs were not unique. Resolution: The algorithm used to generate transaction IDs has been changed such that it no longer relies on the machine ID to make it unique. 19. New PCI LAN drivers, which support online I/O operation, will reset the card's MIB statistics to zero during an online replacement.
This behavior creates a logic problem for the ServiceGuard's network manager so that it could declare a faulty recovery of the replaced card when the link is not fully recovered. Resolution: The algorithm used to determine the recovery of a failed LAN card has been modified to support new driver behavior. 20. For a variety of reasons, system administrators will sometimes need to kill system processes. The cmlvmd process was not masking the signals commonly used for this purpose. Thus if an administrator mistakenly sent a kill signal to cmlvmd, it would terminate. This in turn would cause cmcld to either reboot or TOC the system. Resolution: Added additional signal handling to cmlvmd. 21. The HP-UX 11.11 MCOE ServiceGuard binary file conversion problem is due to the convert utility executing before the operating system is fully functional. Resolution: ServiceGuard configure script, which runs the binary file conversion utility, is rerun by swconfig when this patch is installed on the system. 22. There was a very small window in the cluster reformation protocol that allowed a one node cluster (with safety timer disabled) to attempt to begin to re-form a two node cluster before re-enabling the safety timer. If this was then followed by the right combination of future failures, the single node could get into a state where it was continually attempting cluster reformations. Resolution: The protocol has been modified such that the safety timer must be enabled prior to allowing cluster reformations to proceed. PHSS_21866: 1. The function in the package control script that frees up a busy mount point does not handle the input parameters passed to it correctly in all cases. The problem only occurs if: 1. there are mount options and 2. if the retry count is set to > 0 and 3. 
if the mount point is busy. Resolution: Account for the case that the mount options (which are operated on as a single parameter in the freeup_busy_mountpoint_and_mount_fs() function) are passed correctly. 2. During the halt operation, the cmhaltnode command locks the configuration database to prevent changes while nodes are halting. This locking is done by creating connections to the configuration daemons on remote nodes and sending messages requesting the lock to be held. During halts, the cluster manager (cmcld) will always restore a node to its original configuration. If a node had suffered a NIC failure on its primary network and the cluster manager had switched to a standby NIC then, during halt, the primary IP would be switched back to the failed NIC. This effectively causes a loss of network connectivity to the node. If cmhaltnode had been running on this node, it would lose connectivity to the configuration daemons and be unable to unlock the configuration database. This would result in the configuration database being locked for two hours (the default network keepalive timeout value). Resolution: The locking algorithm has been changed to only lock the local node if it is an active cluster member (this is sufficient to prevent configuration changes). 3. When dual cluster locks are configured, if the request to acquire the first cluster lock fails, the cluster membership election is restarted -- if the request failed because one or the other of the two lock disks is down, the election will be restarted continuously until safety time is exhausted, leading to a cluster TOC. Resolution: changed dual cluster lock such that 1) if first dual lock fails at lock request, move on and try second dual lock; 2) if first dual lock is granted and second dual lock fails at lock request, SG will assume cluster lock is granted and will move on with cluster membership protocol. 4. The KEPD file descriptor (console) is not closed, which may cause a resource leak.
Resolution: Close the file descriptor. 5. DLPI Buffer is corrupted. Resolution: Add checksum and size checks upon receipt. Drop corrupted buffers after logging problem. Add code to check the size of a message before sending. 6. ServiceGuard does not handle the EINTR code properly when putmsg() or getmsg() returns the error. Resolution: Retry the system calls when EINTR error is returned. PHSS_21425: 1. The procedure that makes the connect() call does not handle the EINTR code properly. Resolution: When the connect call returns EINTR, we will retry. 2. There isn't any automatic retry mechanism available in ServiceGuard's network probe area. Thus when several network interfaces couldn't get probe messages from other interfaces due to busy traffic, ServiceGuard config commands prematurely report non-uniform network connections between those interfaces. Resolution: A retry mechanism has been implemented so that missing probe messages will be re-transmitted until they are received by their peers or until the maximum number of retries is reached. 3. A buffer designed to store the ASCII representation of the unconfigured DLSAP address was not allocated with sufficient length, which causes the stack corruption problem. Resolution: The buffer size has been increased to hold the largest DLSAP address of any interface type. 4. The cmhaltcl command fails to stick around long enough to halt the last node in the cluster. Resolution: Use the election timeout instead of the node timeout to determine how long to wait for the cluster to reform after halting each node during cmhaltcl. PHSS_21107: 1. Under certain conditions, the ogms daemon can abort which causes the node to TOC. A special version of ogms is available from Oracle that will not abort. Contact Oracle and make reference to bug# 1107981 to obtain this special version of ogms. 2. Users who so desired were unable to define the shared memory attach address for udlm clients.
Resolution: Users have now been provided with a method to specify the shared memory attach address. 3. In some cases, the instance and node that are still up do not recognize the shutdown of the other node. The shadow process was waiting for the lamport clock status to become normal, but DLM thought that its status was already normal. Resolution: A bug in the distributed clock that improperly updated the clock value during DLM reconfiguration has been fixed. 4. The checkremove script that was released with the previous patch (PHSS_20872) does not work when it is used to remove the SG or SG OPS product. It only works for the removal of the patch, but not for an entire product (ServiceGuard or ServiceGuard OPS Edition). Resolution: The checkremove script has been removed from this patch. A new implementation will be added at the next product release. PHSS_20872: 1. With NODE_TIMEOUT >30 sec, cluster reformation takes longer than 30 sec during the halt operation. The command client code allowed a hard-coded 30 sec limit for a cluster reformation. Any time it took over 30 sec to reform, it failed with the above error message. Resolution: Replaced hard-coded 30 seconds limit with a variable proportionate to the NODE_TIMEOUT. 2. The swremove scripts do not verify the state of the cluster, and do a cmhaltcl -f followed by a cmdeleteconf -f. Resolution: Added a new script that verifies the cluster is down before proceeding with the swremove. 3. Users who so desired were unable to define the shared memory attach address for udlm clients. Resolution: Users have been provided with a method to specify the shared memory attach address. 4. In some cases, the instance and node that are still up do not recognize the shutdown of the other node. The shadow process was waiting for the lamport clock status to become normal, but DLM thought that its status was already normal.
Resolution: A bug in the distributed clock that incorrectly updated the clock value during DLM reconfigurations has been fixed. 5. When the logging interface was changed in 11.09, the log_level parameter was no longer being passed into cf_find_config(). Resolution: Passed in log_level parameter. The fix has been merged to 11.09 and 11.12 branch. 6. The package control script tries to mount all the logical volumes specified in the control script in their specified mount points while starting a package. If any one of these mount points is busy, the mount operation in the control script will fail and the package will not start up. Enhancement: The package control script is enhanced to allow the customer to specify a retry count to retry the mount operation if the mount fails. A new configuration variable, FS_MOUNT_RETRY_COUNT, is added to the package control script to optionally set the number of times the mount operation will be retried. During each retry, fuser -k will be used to remove any processes using the mount point. 7. We didn't check the package name and service name up front. Therefore when the name is stored into the CDB, the '/' causes CDB problems. Resolution: Added code to validate the cluster name, package name and service name up front. 8. Did not allocate enough buffer to store the MAC address of X.25 interfaces, which is 20 bytes long. This causes the program stack to be inadvertently overwritten. Resolution: Increases the buffer size to efficiently store the MAC address of any interfaces including X.25. 9. Did not traverse the whole linked list which contained node information within the cluster if one linked list element was taken off the linked list. Resolution: Remember the next element before taking the current element off the linked list. 10. Previously, the node which acquired the cluster lock was unable to execute clear lock on the first lock device due to continuously busy device status.
Subsequent retries prevented it from going on to clear the second lock device. Resolution: If the clear lock operation in a dual lock configuration encounters lock device busy status, a finite number of retries is attempted and a "cluster lock facility" problem message is logged. PHSS_20230: 1. After a local switch, the ServiceGuard cluster repeatedly fails to reform due to continuous and unexpected broken TCP connections between the members and the co-ordinator node. The safety time eventually expires and thus causes the entire cluster to TOC. Resolution: Re-establish the connection, which has been disconnected unexpectedly, if the victim node is in certain states. 2. Did not handle a timing window where the client doing the configuration operation died after sending the abort of the transaction to the cmcld. Resolution: Added code to manage the loss of a client process better. 3. The inconsistency in the internal data structures caused this failure. Resolution: Store and lookup the device name consistently with the same data structure. 4. Parallel execution of the Oracle databases on the cluster nodes led to a cmcld daemon core followed by the TOC on all the cluster nodes. Resolution: Changed the algorithm that generated the transaction id. The old way used "clock ticks since last reboot", so it could generate the same id within the same clock tick. The new way is to use a counter in the id. The counter is increased by 1 every time we generate a new id. 5. The code didn't differentiate between a regular disk and DVD-ROM disk. Resolution: The DVD-ROM will be skipped during device probing. 6. Tape library robotic (autochanger) not recognized by MC/ServiceGuard Advanced Tape Services stquerycl command. Resolution: Added C7200-8000 and A5617A identifiers to MC/ServiceGuard Advanced Tape Services identification file (/etc/cmcluster/sharedtape/ats_tapelibs).
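The transaction id change described in PHSS_20230 item 4 can be illustrated with a minimal sketch. This is not the ServiceGuard implementation: the class name TransactionIdGenerator, the function tick_based_id, and the (node_id, value) tuple layout are all hypothetical and chosen only to contrast a tick-based id with a counter-based one.

```python
import itertools
import threading


def tick_based_id(node_id, ticks):
    """Old-style id (hypothetical layout): node id plus a clock-tick value.

    Two requests made within the same tick get the same id.
    """
    return (node_id, ticks)


class TransactionIdGenerator:
    """Counter-based id (hypothetical layout): node id plus a sequence number.

    The counter is increased by 1 each time a new id is generated, so ids
    from one generator never repeat regardless of timing.
    """

    def __init__(self, node_id):
        self.node_id = node_id
        self._counter = itertools.count(1)
        self._lock = threading.Lock()  # keep the increment safe across threads

    def next_id(self):
        with self._lock:
            seq = next(self._counter)
        return (self.node_id, seq)


# Two tick-based ids requested in the same tick collide.
same_tick = [tick_based_id(1, 5000), tick_based_id(1, 5000)]

# Counter-based ids stay distinct even when requested back to back.
gen = TransactionIdGenerator(node_id=1)
ids = [gen.next_id() for _ in range(3)]
```

The sketch only covers uniqueness within one generator. How the real id is made unique across nodes is not described in the text above, so the node_id field here is an assumption.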
Enhancement: No SR: 8606106547 8606126610 8606106679 8606112176 8606129359 8606129766 8606141323 8606140372 8606143939 8606145227 8606140033 8606145229 8606139999 8606143005 8606145329 8606158792 8606158292 8606114259 8606155930 8606156893 8606156059 8606160283 8606109633 8606134787 8606159558 8606158175 8606157297 8606126975 8606156970 8606159559 8606159837 8606106319 8606165225 8606165889 8606166114 8606167337 8606167124 8606161913 8606168967 8606160805 8606167187 8606175902 8606167794 8606163578 8606156457 8606107097 8606108252 8606108396 8606110270 8606110445 8606112852 8606112902 8606113872 8606125691 8606126061 8606126065 8606130431 8606134738 8606141299 8606151921 8606155987 8606157100 8606157454 8606158116 8606158704 8606160279 8606165917 8606179400 8606185006 8606184170 8606165415 8606184140 8606184142 8606186540 8606168354 8606188123 8606193289 8606159837 8606189721 8606193167 8606196065 8606194643 8606189594 8606189595 8606194562 8606183649 8606181583 8606198063 8606197317 8606199378 8606207880 8606201766 8606209298 8606206329 8606220129 8606219681 8606172246 8606200990 8606158555 8606140550 8606178310 4701391482 8606217091 8606215621 8606184143 8606225203 8606223632 8606224994 8606228953 8606231688 8606230826 8606229591 8606232561 8606241825 8606245824 8606245827 8606234353 8606242498 8606232772 8606227696 8606244305 8606255606 8606257766 8606254986 8606247612 8606257381 8606251320 8606259570 8606249052 8606244410 8606258410 8606258296 8606259876 8606233259 8606254001 8606264328 8606270663 8606268205 8606264135 8606271637 8606258432 8606284468 8606284273 8606289077 8606280420 Patch Files: DLM.CM-DLM,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/dlm/lbin/cmlkmgrd /opt/dlm/lib/libudlm.a /opt/dlm/sbin/dlmstat /usr/contrib/bin/dlmdump DLM.CM-DLM-CMDS,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/dlm/sbin/dlmapplyconf /opt/dlm/sbin/dlmcheckconf /opt/dlm/sbin/dlmquery /usr/lib/libcmdlm.1 /usr/lib/libcmdlm.dlm.1 
DLM-Pkg-Mgr.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: Package-Manager.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: /usr/lbin/cm/C/CMcoreadmin.ui /usr/lbin/cm/C/CMcoreconf.ui /usr/lbin/cm/C/CMpack.ou /usr/lbin/cm/C/CMpackadmin.ui /usr/lbin/cm/C/CMpackconf.ui /usr/sbin/cmhaltpkg /usr/sbin/cmhaltserv /usr/sbin/cmmakepkg /usr/sbin/cmmigrate /usr/sbin/cmmodnet /usr/sbin/cmmodpkg /usr/sbin/cmrunpkg /usr/sbin/cmrunserv /usr/sbin/cmstartres /usr/sbin/cmstopres DLM-Clust-Mon.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: Cluster-Monitor.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: /sbin/init.d/cmcluster /usr/contrib/bin/cmsetlog /usr/contrib/bin/cmsetsafety /usr/lbin/cm/C/CMcore.ou /usr/lbin/cmclconfd /usr/lbin/cmcld /usr/lbin/cmlogd /usr/lbin/cmlvmd /usr/lbin/cmsnmpd /usr/lbin/cmsrvassistd /usr/lbin/cmui /usr/lib/libcmcore.1 /usr/newconfig/usr/lib/libcmdlm.1 /usr/sbin/cmapplyconf /usr/sbin/cmcheckconf /usr/sbin/cmdeleteconf /usr/sbin/cmgetconf /usr/sbin/cmhaltcl /usr/sbin/cmhaltnode /usr/sbin/cmquerycl /usr/sbin/cmruncl /usr/sbin/cmrunnode /usr/sbin/cmviewcl /usr/sbin/cmviewconf /usr/sbin/convert DLM-NMAPI.CM-NMAPI,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/nmapi/8.0/lib/libnmapi_32.a /opt/nmapi/8.0/lib/libnmapi_64.a /opt/nmapi/nmapi2/lib/libnmapi2.1 /opt/nmapi/nmapi2/lib/pa20_64/libnmapi2.1 /usr/lbin/cmgmsd CM-Provider-MOF.CM-MOF,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: /opt/cmom/mof/CMcluster.mof /opt/cmom/mof/EMScore.mof /opt/cmom/mof/SGcluster.mof /opt/cmom/mof/SGpackage.mof CM-Provider-MOF.CM-PROVIDER,fr=A.11.09, fa=HP-UX_B.11.00_32/64,v=HP: /opt/cmom/providers/cmprovider.omp ATS-CORE.ATS-RUN,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /etc/cmcluster/sharedtape/ats_tapelibs /usr/sbin/stapplyconf /usr/sbin/stcheckconf /usr/sbin/stdeleteconf /usr/sbin/stgetconf /usr/sbin/stquerycl /usr/sbin/streclaim /usr/sbin/stsetlog /usr/sbin/stviewcl /usr/lbin/cmtaped what(1) Output: DLM.CM-DLM,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/dlm/lbin/cmlkmgrd: 
HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ MC Lock Manager A.11.09 PHSS_22540 (lms.c) $Revision: /main/st_odlm_tyurek_update_version/0 $ /opt/dlm/lib/libudlm.a: MC Lock Manager A.11.09 PHSS_22540 (lms.c) $Revision: /main/st_odlm_tyurek_update_version/0 $ /opt/dlm/sbin/dlmstat: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ MC Lock Manager A.11.09 PHSS_22540 (lms.c) $Revision: /main/st_odlm_tyurek_update_version/0 $ /usr/contrib/bin/dlmdump: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ MC Lock Manager A.11.09 PHSS_22540 (lms.c) $Revision: /main/st_odlm_tyurek_update_version/0 $ DLM.CM-DLM-CMDS,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/dlm/sbin/dlmapplyconf: None /opt/dlm/sbin/dlmcheckconf: None /opt/dlm/sbin/dlmquery: None /usr/lib/libcmdlm.1: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ServiceGuard OPS Edition Product $Revision: 82.2 $ /usr/lib/libcmdlm.dlm.1: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ServiceGuard OPS Edition Product $Revision: 82.2 $ DLM-Pkg-Mgr.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /usr/lbin/cm/C/CMcoreadmin.ui: $Revision: 82.2 $ Local Site specific patch for EDS co Inland Revenue /usr/lbin/cm/C/CMcoreconf.ui: $Revision: 82.2 $ Local Site specific patch for EDS co Inland Revenue /usr/lbin/cm/C/CMpack.ou: RCS $Header: CMpack.ou,v 82.2 98/10/19 19:13:55 ssa Exp $ Local Site specific patch for EDS co Inland Revenue /usr/lbin/cm/C/CMpackadmin.ui: $Revision: 82.2 $ Local Site specific patch for EDS co Inland Revenue /usr/lbin/cm/C/CMpackconf.ui: $Revision: 82.2 $ Local Site specific patch for EDS co Inland Revenue /usr/sbin/cmhaltpkg: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmhaltserv: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158
Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmmakepkg: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmmigrate: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmmodnet: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmmodpkg: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmrunpkg: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmrunserv: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmstartres: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmstopres: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util DLM-Clust-Mon.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: /sbin/init.d/cmcluster: $Revision: 82.2 $ /usr/contrib/bin/cmsetlog: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands 
Command Cln Command Srv Config Local Comm Util /usr/contrib/bin/cmsetsafety: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Local Comm Util /usr/lbin/cm/C/CMcore.ou: RCS $Header: CMcore.ou,v 82.2 98/10/19 19:13:56 ssa Exp $ Local Site specific patch for EDS co Inland Revenue /usr/lbin/cmclconfd: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ Build date: Wed Jan 15 14:44:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Config Daemon Config Command Cln Command Srv Local Comm Util Config DB /usr/lbin/cmcld: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ Build date: Wed Jan 15 14:45:45 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Daemon Config DB Cluster Monitor Command Srv CommunicationSrv Config Dlm Local Comm Network Sensor Package Manager Remote Comm API Service Sensor Cluster LVM Status DB Sync Util /usr/lbin/cmlogd: Build date: Wed Jan 15 14:45:45 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Log Daemon Local Comm Util /usr/lbin/cmlvmd: Build date: Wed Jan 15 14:42:56 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Cluster LVM Local Comm Util /usr/lbin/cmsnmpd: Build date: Wed Jan 15 14:47:55 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 API Copyright 1992-1996 SNMP Research, Incorporated SNMP Research Distribution version 14.0.0.0 SNMPSUBAGENT Copyright 1992-1996 SNMP Research, Incorporated SNMP Research Distribution version 14.0.0.0 /usr/lbin/cmsrvassistd: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ Build date: Wed Jan 15 14:42:23 PST 2003 Build id: ibld_sgops_a1109_patch /usr/lbin/cmui: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 GUI Config Command Cln Command Utils Local Comm Util /usr/lib/libcmcore.1: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Cluster Monitor Product $Revision: 82.2 $ /usr/newconfig/usr/lib/libcmdlm.1: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Cluster Monitor Product Only $Revision: 82.2 $ /usr/sbin/cmapplyconf: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmcheckconf: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmdeleteconf: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date:
01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmgetconf: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmhaltcl: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmhaltnode: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmquerycl: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmruncl: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmrunnode: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmviewcl: Build date: Wed Jan 15 14:46:18 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Commands Command Cln Command Srv Config Command Utils Local Comm Util /usr/sbin/cmviewconf: Build date: Wed Jan 15 14:48:04 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Config Command Cln Command Srv Command Utils Local Comm Util Tools /usr/sbin/convert: Build date: Wed Jan 15 14:48:04 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Config Command Cln Command Srv Command Utils Local Comm Util Tools 
DLM-NMAPI.CM-NMAPI,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/nmapi/8.0/lib/libnmapi_32.a: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Local Comm Config API Util /opt/nmapi/8.0/lib/libnmapi_64.a: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Local Comm Config API Util /opt/nmapi/nmapi2/lib/libnmapi2.1: A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 NMAPI2 32 GMAPI 32 Build date: Wed Jan 15 14:50:03 PST 2003 Build id: ibld_sgops_a1109_patch /opt/nmapi/nmapi2/lib/pa20_64/libnmapi2.1: Build date: Wed Jan 15 14:52:50 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 NMAPI2 64 GMAPI 64 /usr/lbin/cmgmsd: HP92453-02A.10.20 HP-UX SYMBOLIC DEBUGGER (END.O) $Revision: 74.03 $ Build date: Wed Jan 15 14:49:38 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 CM-Provider-MOF.CM-MOF,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/cmom/mof/CMcluster.mof: ServiceGuard A.11.09 Date: 12/06/1999 PATCH: PHSS_20230 /opt/cmom/mof/EMScore.mof: ServiceGuard A.11.09 Date: 12/06/1999 PATCH: PHSS_20230 /opt/cmom/mof/SGcluster.mof: ServiceGuard A.11.09 Date: 12/06/1999 PATCH: PHSS_20230 /opt/cmom/mof/SGpackage.mof: ServiceGuard A.11.09 Date: 12/06/1999 PATCH: PHSS_20230 CM-Provider-MOF.CM-PROVIDER,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /opt/cmom/providers/cmprovider.omp: Command Utils Command Cln Config DB Config API MC/ServiceGuard Product $Revision: 82.2 $ Cluster Monitor Product Only $Revision: 82.2 $ Cluster Monitor Product $Revision: 82.2 $ A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 Cluster Management Provider Library Build date: Wed Jan 15 14:52:06 PST 2003 Build id: ibld_sgops_a1109_patch ATS-CORE.ATS-RUN,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: /etc/cmcluster/sharedtape/ats_tapelibs: Advanced Tape Services A.11.09 PHSS_20230 /usr/lbin/cmtaped: HP92453-02A.10.20 HP-UX SYMBOLIC
DEBUGGER (END.O) $Revision: 74.03 $ Advanced Tape Support daemon Build date: Wed Jan 15 14:48:42 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers API Config DB CommunicationSrv Config Local Comm Util /usr/sbin/stapplyconf: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/stcheckconf: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/stdeleteconf: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/stgetconf: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/stquerycl: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/streclaim: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util /usr/sbin/stsetlog: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command
Utils Config Local Comm Util /usr/sbin/stviewcl: Advanced Tape Support commands Build date: Wed Jan 15 14:49:17 PST 2003 Build id: ibld_sgops_a1109_patch A.11.09 Date: 01/15/2003; PATCH: PHSS_27158 ATS Headers ATS Utils Command Cln Command Srv Command Utils Config Local Comm Util cksum(1) Output: DLM.CM-DLM,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: 1023047832 1983176 /opt/dlm/lbin/cmlkmgrd 2388940259 3293720 /opt/dlm/lib/libudlm.a 1831801626 1153664 /opt/dlm/sbin/dlmstat 1428797931 1218896 /usr/contrib/bin/dlmdump DLM.CM-DLM-CMDS,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: 903851096 1527808 /opt/dlm/sbin/dlmapplyconf 903851096 1527808 /opt/dlm/sbin/dlmcheckconf 903851096 1527808 /opt/dlm/sbin/dlmquery 1655436900 12288 /usr/lib/libcmdlm.1 1655436900 12288 /usr/lib/libcmdlm.dlm.1 DLM-Pkg-Mgr.CM-PKG,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: 931980497 67723 /usr/lbin/cm/C/CMcoreadmin.ui 304404735 67483 /usr/lbin/cm/C/CMcoreconf.ui 3195080482 682 /usr/lbin/cm/C/CMpack.ou 3314357551 65747 /usr/lbin/cm/C/CMpackadmin.ui 2024572631 65851 /usr/lbin/cm/C/CMpackconf.ui 2469811764 2007040 /usr/sbin/cmhaltpkg 2469811764 2007040 /usr/sbin/cmhaltserv 2469811764 2007040 /usr/sbin/cmmakepkg 2469811764 2007040 /usr/sbin/cmmigrate 2469811764 2007040 /usr/sbin/cmmodnet 2469811764 2007040 /usr/sbin/cmmodpkg 2469811764 2007040 /usr/sbin/cmrunpkg 2469811764 2007040 /usr/sbin/cmrunserv 2469811764 2007040 /usr/sbin/cmstartres 2469811764 2007040 /usr/sbin/cmstopres DLM-Clust-Mon.CM-CORE,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: 781332524 6519 /sbin/init.d/cmcluster 303882879 897024 /usr/contrib/bin/cmsetlog 303882879 897024 /usr/contrib/bin/cmsetsafety 3068539824 672 /usr/lbin/cm/C/CMcore.ou 2146987687 1251024 /usr/lbin/cmclconfd 1366312202 2328272 /usr/lbin/cmcld 233067838 135168 /usr/lbin/cmlogd 2760774532 249856 /usr/lbin/cmlvmd 1671845562 1773568 /usr/lbin/cmsnmpd 3928503799 149200 /usr/lbin/cmsrvassistd 3053060909 2414288 /usr/lbin/cmui 3988254933 12288 /usr/lib/libcmcore.1 771850430 12288 
/usr/newconfig/usr/lib/libcmdlm.1 2469811764 2007040 /usr/sbin/cmapplyconf 2469811764 2007040 /usr/sbin/cmcheckconf 2469811764 2007040 /usr/sbin/cmdeleteconf 2469811764 2007040 /usr/sbin/cmgetconf 2469811764 2007040 /usr/sbin/cmhaltcl 2469811764 2007040 /usr/sbin/cmhaltnode 2469811764 2007040 /usr/sbin/cmquerycl 2469811764 2007040 /usr/sbin/cmruncl 2469811764 2007040 /usr/sbin/cmrunnode 2469811764 2007040 /usr/sbin/cmviewcl 3926141592 1486848 /usr/sbin/cmviewconf 856896666 1531904 /usr/sbin/convert DLM-NMAPI.CM-NMAPI,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: 2974763347 1976800 /opt/nmapi/8.0/lib/libnmapi_32.a 1208685996 1413670 /opt/nmapi/8.0/lib/libnmapi_64.a 3378388030 225280 /opt/nmapi/nmapi2/lib/libnmapi2.1 329842802 123224 /opt/nmapi/nmapi2/lib/pa20_64/libnmapi2.1 4123675457 927440 /usr/lbin/cmgmsd CM-Provider-MOF.CM-MOF,fr=A.11.09,fa=HP-UX_B.11.00_32/64, v=HP: 3282458452 14942 /opt/cmom/mof/CMcluster.mof 1354007112 323 /opt/cmom/mof/EMScore.mof 3485431659 8033 /opt/cmom/mof/SGcluster.mof 2019823577 10303 /opt/cmom/mof/SGpackage.mof CM-Provider-MOF.CM-PROVIDER,fr=A.11.09, fa=HP-UX_B.11.00_32/64,v=HP: 264906199 2408448 /opt/cmom/providers/cmprovider.omp ATS-CORE.ATS-RUN,fr=A.11.09,fa=HP-UX_B.11.00_32/64,v=HP: 1344553928 606 /etc/cmcluster/sharedtape/ats_tapelibs 2200003968 874192 /usr/lbin/cmtaped 1280859492 1724416 /usr/sbin/stapplyconf 1280859492 1724416 /usr/sbin/stcheckconf 1280859492 1724416 /usr/sbin/stdeleteconf 1280859492 1724416 /usr/sbin/stgetconf 1280859492 1724416 /usr/sbin/stquerycl 1280859492 1724416 /usr/sbin/streclaim 1280859492 1724416 /usr/sbin/stsetlog 1280859492 1724416 /usr/sbin/stviewcl Patch Conflicts: None Patch Dependencies: None Hardware Dependencies: None Other Dependencies: None Supersedes: PHSS_20230 PHSS_20872 PHSS_21107 PHSS_21425 PHSS_21866 PHSS_21996 PHSS_22540 PHSS_22683 PHSS_22876 PHSS_23511 PHSS_24033 PHSS_24536 PHSS_24850 PHSS_25499 PHSS_25935 PHSS_26338 PHSS_26750 Equivalent Patches: None Patch Package Size: 34830 KBytes 
Installation Instructions:
Please review all instructions and the Hewlett-Packard SupportLine User Guide or your Hewlett-Packard support terms and conditions for precautions, scope of license, restrictions, and limitation of liability and warranties before installing this patch.
------------------------------------------------------------
1. Back up your system before installing a patch.
2. Log in as root.
3. Copy the patch to the /tmp directory.
4. Move to the /tmp directory and unshar the patch:

       cd /tmp
       sh PHSS_27158

5. Run swinstall to install the patch:

       swinstall -x autoreboot=true -x patch_match_target=true \
           -s /tmp/PHSS_27158.depot

By default swinstall will archive the original software in /var/adm/sw/save/PHSS_27158. If you do not wish to retain a copy of the original software, include the patch_save_files option in the swinstall command above:

       -x patch_save_files=false

WARNING: If patch_save_files is false when a patch is installed, the patch cannot be deinstalled. Please be careful when using this feature.

For future reference, the contents of the PHSS_27158.text file are available in the product readme:

       swlist -l product -a readme -d @ /tmp/PHSS_27158.depot

To put this patch on a magnetic tape and install from the tape drive, use the command:

       dd if=/tmp/PHSS_27158.depot of=/dev/rmt/0m bs=2k

Special Installation Instructions:
For ServiceGuard OPS Edition Clusters using OPS 8.0.6, do the following:
1) Halt the cluster.
2) Install this patch on all nodes.
3) Relink Oracle applications on all nodes.
4) On all nodes, add this new line to the Oracle initialization file (usually named init.ora):

       ogms_home=/var/opt/ogms

5) Start the cluster and OPS.

For ServiceGuard OPS Edition Clusters using OPS 8.1.6 or higher, do the following:
1) Halt the cluster.
2) Install this patch on all nodes.
3) Start the cluster and OPS.

For MC/ServiceGuard Clusters, do the following:
1) Halt ServiceGuard on the node the patch is to be installed on.
2) Install this patch on that node.
3) Restart ServiceGuard on that node.
4) The patch needs to be installed on all nodes in the cluster.

Defect 9 (SR#: 8606181583) listed in the patch text for PHSS_24033 requires some additional configuration steps to be performed for the fix to function correctly. We recommend performing these steps on all clusters, but they are strictly needed only for the resolution of defect 9 in PHSS_24033 and have no effect on the fixes in any other patch.

For cmrunnode to communicate over all fixed cluster IP addresses, these addresses need to be listed in .rhosts or cmclnodelist:

1) Ensure that the root account's .rhosts file (or /etc/cmcluster/cmclnodelist, if this is used instead) contains all fixed IP addresses and hostnames in the following format:

       192.2.1.1 root
       192.2.1.2 root
       192.2.1.3 root
       15.13.170.1 root
       15.13.170.2 root
       15.13.170.3 root
       hasupt01 root
       hasupt02 root
       hasupt03 root

ServiceGuard needs to be able to resolve hostnames in order for cmrunnode and cmruncl to work. If the cluster relies on DNS or NIS for host name resolution and name service switching is not configured to use /etc/hosts, the cluster will fail to start if a DNS or NIS server is unreachable when the primary LAN card is down. In these configurations the following additional steps should be performed:

2) Edit or create the /etc/nsswitch.conf file on all nodes and specify that /etc/hosts should be used first, before continuing to either DNS or NIS. For example, if DNS were used for name resolution, the entry could be:

       hosts: files [NOTFOUND=continue] dns

3) Ensure that /etc/hosts lists the primary IP addresses for all nodes in the cluster, for example:

       15.13.170.1 hasupt01.cup.hp.com hasupt01
       15.13.170.2 hasupt02.cup.hp.com hasupt02
       15.13.170.3 hasupt03.cup.hp.com hasupt03

These additional steps are only required for defect 9. These steps are not required for any other fix in this or other patches.
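The name service ordering asked for in steps 2 and 3 above can be checked with a small script before the cluster is restarted. The following is only a sketch: the function names are hypothetical, it looks at the hosts line of an nsswitch.conf-style text and nothing else, and it assumes that the source names of interest are "files", "dns" and "nis" as in the example entry.

```python
import re


def hosts_lookup_order(nsswitch_text):
    """Return the sources named on the hosts: line, in order.

    Bracketed criteria such as [NOTFOUND=continue] are ignored.
    Returns an empty list if there is no hosts: line.
    """
    for raw_line in nsswitch_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            rest = re.sub(r"\[[^\]]*\]", " ", rest)  # remove [..] criteria
            return rest.split()
    return []


def files_before_name_service(nsswitch_text):
    """True if 'files' (/etc/hosts) is consulted before dns and nis."""
    order = hosts_lookup_order(nsswitch_text)
    if "files" not in order:
        return False
    files_pos = order.index("files")
    return all(files_pos < order.index(src)
               for src in ("dns", "nis") if src in order)


sample = "hosts: files [NOTFOUND=continue] dns"
print(hosts_lookup_order(sample))
print(files_before_name_service(sample))
```

On a node, the text could be read from /etc/nsswitch.conf and passed to files_before_name_service(); a False result would suggest that host name lookups still depend on DNS or NIS being reachable.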
Defect 3 (JAGae12286) listed in the patch text for PHSS_26338 requires any customer who currently has lower and upper criteria defined for strings/enums to reapply their package configuration. This additional step is only required for defect 3 in PHSS_26338. This step is not required for any other fix in this or other patches.

Defect 9 (JAGae48414) listed for patch PHSS_27158 requires some consideration of the node timeout for some very specific customers. This fix introduces a change in behavior for ServiceGuard in the case where the system clock is not updated for a certain time period. In this situation, the node will TOC if the system clock has not advanced for 5 node timeout periods. This change makes sure that the whole cluster does not fail. It also makes sure that Mission Critical applications are started on another node which does not exhibit the system clock problem.

Large systems with a high number of CPUs, a large amount of memory, or large IO configurations are more susceptible to this phenomenon than small systems. For large systems, a higher node timeout value of 5 to 8 seconds is recommended. A higher node timeout value of 5 to 8 seconds is also recommended for systems where any of the following symptoms have been seen:
- before installation of this patch, a series of reconfigurations spaced by the node timeout value, for no apparent reason and resulting in the same membership.
- or, after installation of this patch, the following message in the syslog:
  - Warning : Kernel ticks_since_boot is not advanced in the past xx seconds.
- or a system crash with the following message on the console or in the crash dump:
  - FAILURE : Kernel ticks_since_boot has not been advanced for xx seconds, which is greater than or equal to maximum allowable interval of XX seconds.

This additional consideration is only required for defect 9 in PHSS_27158. This step is not required for any other fix in this or other patches.
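The relationship between the node timeout and the clock stall limit described for defect 9 can be worked through with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and equating the "maximum allowable interval" in the FAILURE message with exactly 5 node timeout periods is an inference from the text above, not something taken from the cmcld source.

```python
# Number of node timeout periods the system clock may fail to advance
# before the node TOCs, as stated in the patch text.
STALL_PERIODS = 5


def max_clock_stall_seconds(node_timeout_seconds):
    """Longest clock stall tolerated for a given node timeout (assumed model)."""
    return STALL_PERIODS * node_timeout_seconds


def clock_stall_exceeds_limit(stall_seconds, node_timeout_seconds):
    """True if a stall of stall_seconds would reach the assumed limit.

    The comparison is 'greater than or equal to', following the wording
    of the FAILURE message.
    """
    return stall_seconds >= max_clock_stall_seconds(node_timeout_seconds)


for timeout in (2, 5, 8):
    print("node timeout %d s -> clock stall limit %d s"
          % (timeout, max_clock_stall_seconds(timeout)))
```

Under this model, raising the node timeout into the recommended 5 to 8 second range gives the system clock 25 to 40 seconds to resume advancing before the node is taken down.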