We recently solved a very nasty HA SimpliVity storage problem that had plagued a client for three weeks. On 4/2/18 at 4:50 p.m., the SimpliVity storage failed over from one OmniStack Virtual Controller (OVC) to the other OVC in a two-node cluster. The failovers then recurred once a day, approximately 10 minutes later each day: 4/2/18 at 4:50 p.m., 4/3/18 at 5:00 p.m., 4/4/18 at 5:10 p.m., and so on. A Windows Virtual Machine (VM) would log these as System Event 129 Lsi_sas write errors in the Event Viewer.
We initially suspected a problem with the hyperconverged storage, so we opened a case with SimpliVity. We sent the SimpliVity support logs and the Integrated Lights-Out (iLO) Server Health Logs each time we had a storage failover event. After reviewing the logs, SimpliVity said that the OVCs were running out of memory and went into a panic state because they couldn't write to disk. SimpliVity did identify one disk in the array that was having intermittent write problems, so they sent a replacement drive. However, the problem continued after we replaced the drive.
SimpliVity support was leaning towards a network problem that was causing the storage to fail over. We reviewed all of our VLANs and the switch configuration, and they looked OK. No changes had been made to the switch configuration prior to 4/2/18, when the failures started. Because SimpliVity suspected a networking problem (I didn't believe them at the time), I reviewed the switch logs on the HPE 6600 10GB Switches by issuing the following command in an SSH session:
Show log
It's always better to view the logs from an SSH session, because the GUI just shows a summary of the errors on the switch. Dumping the logs from an SSH session gives a lot more detail. The logs showed multiple failed telnet attempts just before each storage failover. The source IP belonged to the AlienVault Security Information and Event Management (SIEM) server. It appeared that the switch was detecting a Denial of Service (DoS) attack and disconnecting from the network during the attack. AlienVault was configured to perform a daily Network Discovery Scan. We disabled this scan, but it continued to run even after it was disabled.
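If you save the switch log to a text file, a short script can show which hosts are behind the telnet attempts. The following is only a rough sketch in Python: the function name telnet_sources, the sample log lines, and the IP addresses are all made up for illustration and are not real HPE switch output. It simply counts every IPv4 address that appears on a line mentioning telnet, so it does not depend on the exact log format.

```python
import re
from collections import Counter

# Loose IPv4 matcher; good enough for spotting addresses in log text.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def telnet_sources(log_text):
    """Count IPv4 addresses found on log lines that mention telnet."""
    counts = Counter()
    for line in log_text.splitlines():
        if "telnet" in line.lower():
            counts.update(IPV4.findall(line))
    return counts

# Hypothetical sample lines (NOT real switch output), using
# documentation-only addresses from the 192.0.2.0/24 range.
sample = """\
04/04/18 17:09:58 auth: telnet login failed from 192.0.2.50
04/04/18 17:09:59 auth: telnet login failed from 192.0.2.50
04/04/18 17:10:03 ports: port 1 is now off-line
04/04/18 17:12:41 auth: ssh login from 192.0.2.10
"""

# Most frequent telnet sources first.
print(telnet_sources(sample).most_common())
```

In our case, a count like this would have pointed at a single address, the AlienVault server, and comparing the timestamps on those lines with the failover times is what tied the two together.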
We opened a support call with AlienVault, and they verified that the Network Discovery Scan was disabled and that there were no scheduled Vulnerability Scans. AlienVault support deleted the entry for the Network Discovery Scan and restarted the ossim process on the AlienVault Server. They also ran a SQL query against the back-end database to confirm that it contained no Scheduled Vulnerability Scan entries. After these fixes, AlienVault did not launch a scan and there were no storage failovers. The AlienVault SIEM had been upgraded to version 5.5.1, which is the current version, in February 2018.
Based on this experience, we suggest disabling any AlienVault scheduled network scans, and we strongly advise against running network vulnerability scans outside a maintenance window, because AlienVault has the potential to take down your infrastructure. Even if you disable a network scan, it may continue to run until the Network Scan Entry is deleted and the ossim process is restarted on the AlienVault Server. We burned about 80 hours of troubleshooting time to resolve this issue. If you are running AlienVault with any scheduled scans and are experiencing storage issues, this blog post may save you a lot of time.