I recently solved a very nasty authentication problem with two Exchange Servers running as VMs on two separate ESX hosts. At random intervals, the servers would log Event ID 1035 in the Application Event Log: Inbound authentication failed with error UnexpectedExchangeAuthBlob for the Receive connector. The error occurred when the Exchange Servers tried to send mail to each other, and whenever it did, mail backed up in the mail queues. Sometimes the Exchange Servers would fix themselves, and sometimes I had to manually restart the Exchange Transport Services on both servers to get the mail flowing again.
Because the errors occurred randomly, the problem was very difficult to troubleshoot. Sometimes mail would work for a few days, and sometimes it would break once or twice a day. I also noticed I was receiving a lot of time adjustment notifications in the System Event Log. Most of the adjustments were minimal, but every so often the clock was adjusted by roughly six minutes. The 1035 errors in the Application Event Log correlated with the six-minute time adjustments in the System Event Log.
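If you want to check for this kind of correlation yourself, here is a minimal sketch of one way to do it once you have the event timestamps out of both logs. The function name, the ten-minute window, and the sample timestamps are all my own illustrative assumptions, not values from my servers.

```python
from datetime import datetime, timedelta

def errors_near_adjustments(error_times, adjustment_times,
                            window=timedelta(minutes=10)):
    """Return the error timestamps that fall within `window` of any time adjustment."""
    return [
        e for e in error_times
        if any(abs(e - a) <= window for a in adjustment_times)
    ]

# Hypothetical timestamps, made up purely to show how the function is called.
errors = [datetime(2024, 1, 1, 9, 2), datetime(2024, 1, 1, 15, 30)]
adjustments = [datetime(2024, 1, 1, 9, 0)]

for hit in errors_near_adjustments(errors, adjustments):
    print("1035 error close to a time adjustment:", hit)
```

If most of the 1035 errors show up in the result, the time adjustments are worth a closer look.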
Time synchronization was properly configured according to Microsoft’s best practices by synchronizing time of the PDC Emulator to an external NTP time source and then all domain members synchronizing their time to the PDC Emulator. This appeared to work fine. In the VMware Tools, both VMs were NOT set to synchronize their time with the ESX host. Obviously the VMs were getting their time from somewhere other than the PDC Emulator, but where?
One of my co-workers, Kris Kroll, suggested that I look at the ESX hosts. Bingo! Even though both ESX hosts were configured to synchronize their time to an external NTP server, they were roughly six minutes off from the time on the PDC Emulator. As soon as I adjusted the time on the ESX hosts, the problem went away. Even when the checkbox in VMware Tools for synchronizing time between the VM and the ESX host is unchecked, the VM can still receive time updates from the ESX host. Some of these occasions are pretty obvious, such as a reboot or server startup. Because I had finally narrowed the authentication problem down to being time related, I ran the command:
W32tm /resync /rediscover
Evidently, after you run this command, a Windows Server has to get approximately five good time synchronizations before it “calms down” and stops trying to synchronize time repeatedly. Because time was getting adjusted by roughly six minutes during this period, the VMs never calmed down and kept trying to resynchronize. This caused the Exchange Servers to have random authentication problems: whenever their clocks skewed by more than five minutes, Kerberos authentication failed.
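To make the arithmetic concrete, here is a small sketch of the skew check that matters here. It assumes the five-minute Kerberos tolerance mentioned above; the function name and the sample clock readings are hypothetical and only illustrate the six-minute case.

```python
from datetime import datetime, timedelta

# Kerberos clock skew tolerance of five minutes, as described above.
MAX_SKEW = timedelta(minutes=5)

def exceeds_kerberos_skew(local, reference, max_skew=MAX_SKEW):
    """Return True when the two clocks differ by more than max_skew."""
    return abs(local - reference) > max_skew

# Hypothetical readings: a VM that picked up a six-minute offset from its host.
pdc_time = datetime(2024, 1, 1, 12, 0, 0)
vm_time = pdc_time + timedelta(minutes=6)

print(exceeds_kerberos_skew(vm_time, pdc_time))
```

A six-minute offset is just over the limit, which fits with authentication failing only while a VM was carrying the ESX host's time and recovering once it synchronized with the domain again.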
Bottom line – ESX Host time does matter, even when it’s not supposed to matter. Hopefully this blog post will help you solve any Kerberos authentication issues with VMs running on ESX.