Hi everyone,

Hope you’re doing well!

I was working with a client that was upgrading their servers from RHEL 7 to RHEL 8, and the database servers were part of the upgrade scope.

This client uses a legacy naming convention for network interfaces (ethN), which is no longer the default on Linux.

So, as part of the pre-upgrade tasks, they renamed the network interfaces to the default predictable naming scheme (enpN). They then ran the upgrade, and during the post-upgrade tasks they renamed the interfaces back to their original ethN names.
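
For context, keeping the legacy ethN names is usually done by disabling predictable interface naming. A minimal sketch, assuming the common GRUB-based approach (the client’s exact mechanism may differ):

# Assumption: ethN naming is kept by adding net.ifnames=0 biosdevname=0 to GRUB_CMDLINE_LINUX in /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg    # regenerate the GRUB configuration
reboot                                    # interfaces come back as eth0, eth1, ...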

OK. With that said, they were upgrading the OS of a 3-node cluster in a rolling fashion.

So, after each server was upgraded, it was handed back to the DB team, which basically did the following (a hedged command sketch follows the list):

  • Checking if ASM disks were present;
  • Relinking the binaries of the RDBMS Home;
  • Running $GRID_HOME/crs/install/rootcrs.sh -updateosfiles;
  • Starting up the cluster.
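
A rough sketch of those handover steps as commands (the exact commands, users, and device paths are my assumptions, not the client’s runbook):

ls -l /dev/oracleasm/disks/                        # 1) confirm the ASM disks are visible (path depends on the storage/udev setup)
su - oracle -c '$ORACLE_HOME/bin/relink all'       # 2) relink the RDBMS Home binaries
$GRID_HOME/crs/install/rootcrs.sh -updateosfiles   # 3) refresh the OS-dependent clusterware files
$GRID_HOME/bin/crsctl start crs -wait              # 4) start the cluster stack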

OK, so on the 2nd node, during the last step (starting the cluster), we faced the following error (output truncated for readability):

[root@dbnode02 install]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'dbnode02'
CRS-2672: Attempting to start 'ora.mdnsd' on 'dbnode02'
CRS-2676: Start of 'ora.evmd' on 'dbnode02' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'dbnode02'
CRS-2676: Start of 'ora.gpnpd' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'dbnode02'
CRS-2676: Start of 'ora.gipcd' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'dbnode02'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'dbnode02'
CRS-2676: Start of 'ora.cssdmonitor' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'dbnode02'
CRS-2672: Attempting to start 'ora.diskmon' on 'dbnode02'
CRS-2676: Start of 'ora.diskmon' on 'dbnode02' succeeded
CRS-2676: Start of 'ora.crf' on 'dbnode02' succeeded
CRS-2676: Start of 'ora.cssd' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dbnode02'
CRS-2672: Attempting to start 'ora.ctssd' on 'dbnode02'
CRS-2676: Start of 'ora.ctssd' on 'dbnode02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'dbnode02'
CRS-2676: Start of 'ora.asm' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'dbnode02'
CRS-2676: Start of 'ora.storage' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'dbnode02'
CRS-2676: Start of 'ora.crsd' on 'dbnode02' succeeded
CRS-2672: Attempting to start 'ora.asmnet1.asmnetwork' on 'dbnode02'
CRS-5017: The resource action "ora.asmnet1.asmnetwork start" encountered the following error:
CRS-5006: Unable to automatically select a network interface which has subnet mask  and subnet number 192.168.118.0
. For details refer to "(:CLSN00107:)" in "/oracle/app/grid/diag/crs/dbnode02/crs/trace/crsd_orarootagent_root.trc".

Just for awareness: the cluster did start, and the DB instance and database services were available.

But this is interesting, isn’t it?

CRS-5006: Unable to automatically select a network interface which has subnet mask and subnet number 192.168.118.0

I also checked the cluster alert.log and noticed this message flooding it:

2024-06-23 08:32:57.998 [GIPCD(74101)]CRS-42216: No interfaces are configured on the local node for interface definition eth1(:.*)?:192.168.118.0: available interface definitions are [eth0(:.*)?:10.250.118.0][eth1(:.*)?:192.168.118.18][eth2(:.*)?:192.168.119.0][eth2:1(:.*)?:169.254.0.0][eth2:2(:.*)?:169.254.16.0][eth3(:.*)?:10.250.113.0][eth4(:.*)?:10.250.113.0][eth5(:.*)?:10.250.111.0][eth3(:.*)?:[fe80:0:0:0:0:0:0:0]][eth5(:.*)?:[fe80:0:0:0:0:0:0:0]][eth1(:.*)?:[fe80:0:0:0:0:0:0:0]][eth2(:.*)?:[fe80:0:0:0:0:0:0:0]][eth0(:.*)?:[fe80:0:0:0:0:0:0:0]][eth4(:.*)?:[fe80:0:0:0:0:0:0:0]].

OK, on this client we have two network interfaces for the cluster interconnect. Let’s check them:

[root@dbnode02 ~]# oifcfg getif
eth0  10.250.118.0  global  public
eth1  192.168.118.0  global  cluster_interconnect,asm
eth2  192.168.119.0  global  cluster_interconnect,asm

Confirmed. So eth1 is the network interface we are having issues with. Looking at the output shown above, we can see that:

eth1 is configured as interconnect, with subnet 192.168.118.0

Having two network interfaces configured as cluster interconnect explains why the cluster was still able to start. If this environment had only one interconnect interface, the cluster would be unable to start on this node.
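
By the way, a quick way to cross-check what the clusterware sees at the OS layer (assuming the Grid Infrastructure environment is set; $GRID_HOME is my shorthand here) is oifcfg iflist, which prints each OS interface with its subnet, type, and netmask:

$GRID_HOME/bin/oifcfg iflist -p -n    # -p adds the interface type, -n adds the netmask

Note in the GIPCD message above that eth1 is listed with the host address 192.168.118.18 instead of the subnet 192.168.118.0, which already hints at a wrong netmask.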

Let’s proceed with the troubleshooting.

OK, let’s now check at the OS layer:

[root@dbnode02 ~]# ifconfig eth1

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.118.18  netmask 255.255.255.255  broadcast 192.168.118.18
        inet6 fe80::250:56ff:fe82:93de  prefixlen 64  scopeid 0x20<link>
        ether 00:50:56:82:93:de  txqueuelen 1000  (Ethernet)
        RX packets 1125140563  bytes 835592309784 (778.2 GiB)
        RX errors 0  dropped 24  overruns 0  frame 0
        TX packets 1214647607  bytes 849189780308 (790.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Hmmm, did you notice that?

netmask 255.255.255.255
broadcast 192.168.118.18

The broadcast address is pointing to the interconnect IP itself: 192.168.118.18.

This happens because the netmask is 255.255.255.255 (as shown in the ifconfig output) instead of the expected 255.255.255.0.
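
We can confirm the same thing with the ip command (a quick sketch, same information in CIDR form):

ip -4 addr show dev eth1    # a 192.168.118.18/32 here confirms the wrong netmask; a healthy node shows /24
ip route show dev eth1      # with a /32 address there is no connected 192.168.118.0/24 route, so the other nodes are unreachable over this interface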

Let’s check the communication between the cluster nodes on this private network.

From first node to second node:

[root@dbnode01 ~]# ping 192.168.118.18
PING 192.168.118.18 (192.168.118.18) 56(84) bytes of data.
^C
--- 192.168.118.18 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2069ms

All attempts failed.

Now, from third node to second node:

[root@dbnode03 ~]# ping 192.168.118.18
PING 192.168.118.18 (192.168.118.18) 56(84) bytes of data.
^C
--- 192.168.118.18 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2069ms

Again, all attempts failed.

Now, let’s check from second node to first node:

[root@dbnode02 ~]# ping 192.168.118.17
PING 192.168.118.17 (192.168.118.17) 56(84) bytes of data.
^C
--- 192.168.118.17 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2069ms

Again, all attempts failed.

Let’s check now from second node to third node:

[root@dbnode02 ~]# ping 192.168.118.19
PING 192.168.118.19 (192.168.118.19) 56(84) bytes of data.
^C
--- 192.168.118.19 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2069ms

Again, all attempts failed.

Let’s now check the communication between the first node and the third node, which is supposed to be working:

[root@dbnode01 ~]# ping 192.168.118.19
PING 192.168.118.19 (192.168.118.19) 56(84) bytes of data.
64 bytes from 192.168.118.19: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.19: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.19: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.19 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

Perfect! Working fine. If we try from the third node to the first node, it also works:

[root@dbnode03 ~]# ping 192.168.118.17
PING 192.168.118.17 (192.168.118.17) 56(84) bytes of data.
64 bytes from 192.168.118.17: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.17: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.17: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.17 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

Great! Let’s try to understand why the netmask is wrong.

OK, let’s check the network interface configuration file at the OS level:

[root@dbnode02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static 
IPADDR=192.168.118.18
PREFIZ=24
DEFROUTE=yes
NAME=eth1
DEVICE=eth1
ONBOOT=yes
MTU=9000

OK, can you spot something weird/wrong here?

See, there is a typo in the configuration file: PREFIZ.

This parameter does not exist; it should be PREFIX.

Because the parameter name is invalid, it is simply ignored, so the interface came up with a 255.255.255.255 (/32) netmask and was unable to communicate with the other cluster nodes on that subnet.
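
A simple way to compare this setting across all three nodes (a rough sketch, assuming root SSH connectivity between the nodes):

for n in dbnode01 dbnode02 dbnode03; do
  echo "== $n =="
  ssh "$n" "grep -E '^(PREFIX|NETMASK)=' /etc/sysconfig/network-scripts/ifcfg-eth1"
done
# dbnode02 returning nothing here (because of the PREFIZ typo) is the tell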

So, what we did: we fixed the parameter:

PREFIX=24

After this, we brought the interface down and up:

ifdown eth1
ifup eth1
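
Side note: if the interface is managed by NetworkManager rather than the legacy network-scripts package (an assumption about this environment), the same change can be applied with nmcli:

nmcli connection reload                                   # re-read the ifcfg files from disk
nmcli connection down eth1 && nmcli connection up eth1    # bounce the connection (connection name assumed to be eth1)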

Double-checking the network interface configuration:

[root@dbnode02 ~]# ifconfig eth1

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.118.18  netmask 255.255.255.0  broadcast 192.168.118.255
        inet6 fe80::250:56ff:fe82:93de  prefixlen 64  scopeid 0x20<link>
        ether 00:50:56:82:93:de  txqueuelen 1000  (Ethernet)
        RX packets 1125140563  bytes 835592309784 (778.2 GiB)
        RX errors 0  dropped 24  overruns 0  frame 0
        TX packets 1214647607  bytes 849189780308 (790.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

After this, the error message that was flooding the log stopped appearing.
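
To double-check on the clusterware side as well (resource name taken from the original error), we can query the state of the ASM network resource that had failed to start:

$GRID_HOME/bin/crsctl stat res ora.asmnet1.asmnetwork -t    # expected to report ONLINE on dbnode02 now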

Let’s check the communication between the cluster nodes.

From the first node to second node:

[root@dbnode01 ~]# ping 192.168.118.18
PING 192.168.118.18 (192.168.118.18) 56(84) bytes of data.
64 bytes from 192.168.118.18: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.18: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.18: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.18 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

All good!

Let’s try now from the third node to second node:

[root@dbnode03 ~]# ping 192.168.118.18
PING 192.168.118.18 (192.168.118.18) 56(84) bytes of data.
64 bytes from 192.168.118.18: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.18: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.18: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.18 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

All good!

Let’s try now from the second node to first node:

[root@dbnode02 ~]# ping 192.168.118.17
PING 192.168.118.17 (192.168.118.17) 56(84) bytes of data.
64 bytes from 192.168.118.17: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.17: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.17: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.17 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

All good!

And now from second node to third node:

[root@dbnode02 ~]# ping 192.168.118.19
PING 192.168.118.19 (192.168.118.19) 56(84) bytes of data.
64 bytes from 192.168.118.19: icmp_seq=1 ttl=64 time=0.294 ms
64 bytes from 192.168.118.19: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 192.168.118.19: icmp_seq=3 ttl=64 time=0.214 ms
^C
--- 192.168.118.19 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2083ms
rtt min/avg/max/mdev = 0.169/0.225/0.294/0.054 ms

OK, all good!

Communication is now working fine between the cluster nodes on this private network. Issue fixed!

Hope it helps!

Peace!

Vinicius