I had to change the IP addresses of my two-node GPFS cluster in order to move the nodes from a LAN to a DMZ.

Here is what I did:

I changed the system’s IP address without stopping the cluster (yeah, I know, silly me…), which broke communication between the nodes:

Fri Oct 16 10:51:42.721 2015: GPFS: 6027-2724 [I] Cluster Manager connection broke. Probing cluster GPFS_CLUSTER.LPAR1
Fri Oct 16 10:51:43.392 2015: GPFS: 6027-755 [I] Waiting for challenge -1 (node 1, sequence 6) to be responded during disk election
Fri Oct 16 10:51:51 CEST 2015: mmremote: unmountFileSystems all -f ...
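
In hindsight, here is a minimal sketch of the order I should probably have followed: stop GPFS everywhere first, then change the daemon interface, then start again. It only prints the commands (a dry run). The function name plan_ip_change and its two arguments are placeholders of mine, and the exact point at which the OS-level IP change fits in is something to check against IBM's documentation for your GPFS level.

```shell
# Dry run: print, but do not execute, the command order.
# plan_ip_change is a hypothetical helper; $1 is the node as GPFS knows it,
# $2 is the new daemon host name or IP address (both placeholders).
plan_ip_change() {
    node=$1
    new_name=$2
    echo "mmshutdown -a"                                    # stop GPFS on all nodes first
    echo "mmchnode --daemon-interface=$new_name -N $node"   # tell GPFS about the new interface
    echo "mmstartup -a"                                     # start GPFS again everywhere
}
```

Calling plan_ip_change LPAR2 LPAR2 prints the three commands so they can be reviewed before anything is typed for real.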

So I tried to fix it by changing the IP on the GPFS side, but here’s what I got:

# mmchnode --daemon-interface=LPAR2-N LPAR2 --nonquorum
Fri Oct 16 15:05:31 CEST 2015: 6027-1664 mmchnode: Processing node LPAR2
mmremote: 6027-2188 Unable to determine the local node identity.
mmremote: 6027-1639 Command failed. Examine previous error messages to determine cause.
mmremote: 6027-2188 Unable to determine the local node identity.
mmremote: 6027-1639 Command failed. Examine previous error messages to determine cause.
mmchnode: 6027-1271 Unexpected error from checkExistingClusterNode LPAR2. Return code: 1
mmchnode: 6027-1639 Command failed. Examine previous error messages to determine cause.

Whoops… let’s try some other commands on node 2:

# mmlsconfig
mmlsconfig: 6027-2188 Unable to determine the local node identity.
mmlsconfig: 6027-1639 Command failed. Examine previous error messages to determine cause.

Or these errors from node 1, which was still responding:


[root@LPAR1]/root # mmlsnode
GPFS nodeset Node list
------------- -------------------------------------------------------

GPFS_CLUSTER LPAR1 LPAR2

[root@LPAR1]/root # mmdelnode -N LPAR2
Verifying GPFS is stopped on all affected nodes ...
LPAR2: mmremote: 6027-2188 Unable to determine the local node identity.
LPAR2: mmremote: 6027-1639 Command failed. Examine previous error messages to determine cause.
mmdelnode: 6027-1639 Command failed. Examine previous error messages to determine cause.
[root@LPAR1]/root # mmgetstate -La

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state  Remarks
------------------------------------------------------------------------------------
      1       LPAR1        0        0          2        down        quorum node
      2       LPAR2        0        0          2        unknown     quorum node
[root@LPAR1]/root # mmlscluster
[..]
GPFS cluster configuration servers:
-----------------------------------
Primary server: LPAR1
Secondary server: (none)

 Node  Daemon node name  IP address   Admin node name  Designation
----------------------------------------------------------------------
   1   LPAR1             10.10.10.1   LPAR1            quorum-manager
   2   LPAR2             192.168.1.2  LPAR2            quorum-manager

Well, that doesn’t look good… Node 2 has disappeared from the configuration servers list (because I deleted it; this will be the key later), and its state is still reported as unknown…

I couldn’t add it back to the cluster either. In fact, I couldn’t do anything on node 2. I was stuck.

So after trying everything I could think of without making any progress, I decided to run truss on the mmlsnode command to find out which files it accesses:

# truss -o /tmp/toto mmlsnode
# grep kopen /tmp/toto |grep mm
kopen("/usr/lpp/mmfs/bin/mmlsnode", O_RDONLY|O_LARGEFILE) = 3
kopen("bin/mmglobfuncs", O_RDONLY|O_LARGEFILE) = 3
kopen("/usr/lpp/mmfs/bin/.paths", O_RDONLY|O_LARGEFILE) Err#2 ENOENT
kopen("bin/mmprodname", O_RDONLY|O_LARGEFILE) = 11
kopen("bin/mmglobfuncs.AIX", O_RDONLY|O_LARGEFILE) = 11
kopen("bin/mmsdrfsdef", O_RDONLY|O_LARGEFILE) = 11
kopen("bin/mmsdrfsdef.AIX", O_RDONLY|O_LARGEFILE) = 11
kopen("bin/mmfsfuncs", O_RDONLY|O_LARGEFILE) = 11
kopen("bin/mmfsfuncs.AIX", O_RDONLY|O_LARGEFILE) = 11
kopen("/var/mmfs/gen", O_RDONLY) = 12
kopen("/var/mmfs/gen", O_RDONLY) = 12
kopen("/var/mmfs/gen", O_RDONLY) = 12
kopen("/var/mmfs/gen/nodeFiles", O_RDONLY) = 12
kopen("/var/mmfs/gen/nodeFiles", O_RDONLY) = 12

Now we can see which files are read to gather information before anything is executed on the node.
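
To reduce such a trace to a plain list of files, something like the following would do. It is only a sketch: the function name opened_paths is mine, and it assumes the AIX-style kopen("path", flags) = fd lines shown in the trace, skipping the calls that failed with Err#.

```shell
# List the distinct paths that were opened successfully in a truss trace.
# $1 is the trace file (for example /tmp/toto in this post).
opened_paths() {
    grep 'kopen(' "$1" |
        grep -v 'Err#' |                              # drop failed opens
        sed 's/^.*kopen("\([^"]*\)".*$/\1/' |         # keep only the quoted path
        LC_ALL=C sort -u                              # one line per distinct path
}
```

Running opened_paths /tmp/toto on the trace gives one path per line, which makes the /var/mmfs/gen entries easy to spot.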

So I looked at some of these files (they are shell scripts, lucky me), especially mmsdrfsdef…

Here is an interesting part; look at the comment just above the printErrorMsg call:

# more /usr/lpp/mmfs/bin/mmsdrfsdef
[..]
# If the mmfsNodeData file is still missing, give up.
# Return the locally-determined node number and name
# and hope for the best.
if [[ ! -s $mmfsNodeData ]]
then
  if [[ -z $ourNodeName || -z $ourNodeNumber ]]
  then
    # Unable to determine the local node identity.
    printErrorMsg 716 $mmcmd
    return 1
  else
    ourShortName=${ourNodeName%%.*}
    return 0
  fi
fi # end of if [[ ! -f $mmfsNodeData ]]

fi # end if [[ ! -f $mmfsNodeData ]]

[..]

Hmm… so it seems the -s test does not find a non-empty $mmfsNodeData file, which is normally located at /var/mmfs/gen/mmfsNodeData

(hypothesis confirmed by its presence on the other node, which was responding correctly)
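
The same check can be run by hand on each node. This is a minimal sketch that mirrors the -s test from the excerpt; the function name check_node_data is mine, and the default path is simply the one mentioned in this post.

```shell
# Report whether the node data file exists and is non-empty,
# like the "[[ ! -s $mmfsNodeData ]]" test in mmsdrfsdef.
# $1 is optional; it defaults to the path seen in this post.
check_node_data() {
    f=${1:-/var/mmfs/gen/mmfsNodeData}
    if [ -s "$f" ]; then
        echo "present: $f"
    else
        echo "missing or empty: $f"
        return 1
    fi
}
```

On my cluster this would have reported the file as present on node 1 and as missing or empty on node 2.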

My guess is that I broke communication in the cluster by changing node 2’s IP, then deleted the node from the cluster and, with it, this file. From that moment on, mm commands could not be executed because the cluster didn’t know who the heck this node was!

Solution:

Modify the mmsdrfs file on both nodes so that it reflects the new IP addresses (it’s a bit brutal, but it works fine):

[root@LPAR1]gen # grep MEMBER_NODE mmsdrfs
%%home%%:20_MEMBER_NODE::1:1:LPAR1:10.10.10.1:LPAR1:manager::::::LPAR1:LPAR1:1413:4.1.0.7:AIX:Q::::::server::
%%home%%:20_MEMBER_NODE::2:2:LPAR2:192.168.1.2:LPAR2:manager::::::LPAR2:LPAR2:1413:4.1.0.7:AIX:Q::::::server::

to

%%home%%:20_MEMBER_NODE::1:1:LPAR1:10.10.10.1:LPAR1:manager::::::LPAR1:LPAR1:1413:4.1.0.7:AIX:Q::::::server::
%%home%%:20_MEMBER_NODE::2:2:LPAR2:10.10.10.2:LPAR2:manager::::::LPAR2:LPAR2:1413:4.1.0.7:AIX:Q::::::server::

!!! DO THIS ON BOTH NODES !!!
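
If you would rather not edit the file by hand, the change can be sketched with awk. This is an assumption-laden example, not an IBM tool: the function name set_member_node_ip is mine, and the field positions (6 for the node name, 7 for the IP address) are inferred only from the two MEMBER_NODE lines shown above. It keeps a .bak copy next to the file.

```shell
# Rewrite the IP field of one node's MEMBER_NODE line.
# $1 = path to an mmsdrfs file, $2 = node name, $3 = new IP address.
# Field numbers are guessed from the lines in this post, not documented.
set_member_node_ip() {
    file=$1
    node=$2
    new_ip=$3
    cp -p "$file" "$file.bak" || return 1      # keep a backup before touching anything
    awk -F: -v OFS=: -v node="$node" -v ip="$new_ip" '
        $2 == "20_MEMBER_NODE" && $6 == node { $7 = ip }   # only the matching node
        { print }                                          # every other line unchanged
    ' "$file.bak" > "$file"
}
```

For the change described here, the call would be set_member_node_ip mmsdrfs LPAR2 10.10.10.2, run in the directory that holds mmsdrfs on each node, ideally on a copy first.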

Restart GPFS

# mmstartup -a

… and ta-da, /var/mmfs/gen/mmfsNodeData is recreated on node 2:

[root@LPAR2]mmfs # ls -ltr /var/mmfs/gen/mmfsNodeData
-rw-r--r-- 1 root system 150 Oct 16 16:02 /var/mmfs/gen/mmfsNodeData

# cat /var/mmfs/gen/mmfsNodeData
%%home%%:20_MEMBER_NODE::2:2:LPAR2:10.10.10.2:LPAR2:manager::::::LPAR2:LPAR2:1413:4.1.0.7:AIX:Q::::::server::
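
As a quick sanity check, the recreated file can be compared with mmsdrfs. The sketch below assumes, based only on the output above, that the line in mmfsNodeData is an exact copy of the node's MEMBER_NODE line in mmsdrfs; the function name node_data_matches_sdrfs is mine.

```shell
# Succeed if a line of the node data file appears verbatim in mmsdrfs.
# $1 = path to mmfsNodeData, $2 = path to mmsdrfs.
node_data_matches_sdrfs() {
    node_data=$1
    sdrfs=$2
    [ -s "$node_data" ] || return 1            # missing or empty file never matches
    grep -F -x -q -f "$node_data" "$sdrfs"     # fixed-string, whole-line match
}
```

A typical use would be node_data_matches_sdrfs /var/mmfs/gen/mmfsNodeData mmsdrfs && echo OK, with the second argument pointing at the edited mmsdrfs file.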

… And the cluster is running fine in its comfy DMZ. Phew!

 

Hope it helps.
