Yesterday I had a weird network issue at one of our customers:
An ESX host had lost its connection to the vCenter server and was marked as "disconnected". The host and its virtual machines were still running (checked via the console), but reconnecting the host to vCenter was not possible. Further investigation revealed that some VMs also had sporadic network issues.
After going through the log files of the ESX host, I found the following messages:
Mar 6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)BUG: warning at vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3235/netdev_watchdog() (inside vmklinux)
Mar 6 17:31:17 ESX01 vmkernel: 15:07:33:11.285 cpu3:4235)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:18 ESX01 vmkernel: 15:07:33:12.287 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:19 ESX01 vmkernel: 15:07:33:13.287 cpu5:4231)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:20 ESX01 vmkernel: 15:07:33:14.289 cpu1:4234)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:21 ESX01 vmkernel: 15:07:33:15.291 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:22 ESX01 vmkernel: 15:07:33:16.293 cpu5:4244)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
Mar 6 17:31:23 ESX01 vmkernel: 15:07:33:17.294 cpu2:4244)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out
So it seemed that something was wrong with one NIC of the host. The date and time matched the disconnect message in the vCenter logs. However, this NIC is dedicated to virtual machine traffic (together with three other NICs) and not to the Service Console, so at first it did not seem to be the reason for the disconnected ESX host.
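To see quickly which NIC is affected and how often, the watchdog warnings can be counted per NIC. The following is only a minimal sketch in Python: the regular expression is derived from the log excerpt above, while the function name and the sample list are my own and purely illustrative.

```python
import re
from collections import Counter

# Pattern taken from the vmkernel warnings shown in the log excerpt.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (vmnic\d+): transmit timed out")


def count_watchdog_timeouts(lines):
    """Return a Counter mapping NIC name -> number of transmit timeouts."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt; the BUG line must not be counted.
SAMPLE_LINES = [
    "Mar 6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out",
    "Mar 6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)BUG: warning at vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3235/netdev_watchdog() (inside vmklinux)",
    "Mar 6 17:31:17 ESX01 vmkernel: 15:07:33:11.285 cpu3:4235)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out",
]

for nic, hits in count_watchdog_timeouts(SAMPLE_LINES).most_common():
    print(f"{nic}: {hits} transmit timeouts")
```

In practice you would pass the opened log file instead of the sample list, for example a copy of the host's vmkernel log; the exact path depends on your ESX version.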
After a short search in the VMware KB I found the following patch: http://kb.vmware.com/kb/1017458. It was released on 3 March 2010.
The description says the following:
On some systems under heavy networking and processor load (large number of virtual machines), some NIC drivers might randomly attempt to reset the device and fail.
Then I noticed that the virtualized vCenter server itself was running on this ESX host, so vCenter had a network problem, too. This seems to be the reason why the ESX host lost its connection. The curious thing is that the other three ESX hosts still appeared to be connected, and even an RDP connection to the vCenter server was possible.
Conclusion: I assume that the NIC ran into the error described above due to high network I/O load. Other people in the VMware communities have reported the same issue, so I highly recommend installing the patch.
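The reason for the sporadic behaviour is the following: the load balancing of the respective vSwitch is set to "IP Hash" in conjunction with an EtherChannel configuration on the physical switches. This improves the overall performance of a VM by letting it use all NICs instead of the single NIC it would get in "Port ID" mode. So every time traffic was sent over the affected NIC, the packets were lost. Connections that were hashed to one of the other three NICs kept working, which would also explain why the RDP connection to the vCenter server was still possible.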
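To illustrate why only some connections were affected, here is a small Python sketch of IP hash uplink selection. It is a simplified model (XOR of source and destination IPv4 address, modulo the number of uplinks) and not the exact ESX implementation. The IP addresses, the uplink names other than vmnic3 and their order are assumptions for illustration only. The sketch also only models the outbound path; the inbound path is chosen by the EtherChannel hash on the physical switch.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Simplified IP hash: XOR both IPv4 addresses, then modulo uplink count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


# Hypothetical uplink order; only vmnic3 is known from the logs.
UPLINKS = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]
FAULTY_NIC = "vmnic3"

VM_IP = "10.0.0.10"  # hypothetical address of a VM on this vSwitch

for last_octet in range(20, 28):
    peer_ip = f"10.0.0.{last_octet}"  # hypothetical communication partners
    nic = UPLINKS[ip_hash_uplink(VM_IP, peer_ip, len(UPLINKS))]
    state = "packets lost" if nic == FAULTY_NIC else "ok"
    print(f"{VM_IP} -> {peer_ip}: {nic} ({state})")
```

With four uplinks, roughly a quarter of all source/destination pairs end up on the faulty NIC in this model, while the same VM talks to other peers without any problem. That matches the sporadic behaviour described above.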
If you are now asking yourself why this failure was not detected by the ESX host, consider the following: with the failover detection setting "Link Status Only", only physical link-down failures are detected, not logical failures like this one. The other option, "Beacon Probing", would detect such a logical failure as well, but it cannot be used in conjunction with IP hash load balancing and EtherChannel because of possible network flapping errors (see KB1012819). So you have to decide whether you want performance or better failure detection.