4.5.2.34. 20200528 – XRACK – Australia – Client Network loss
NOTICETIMEZONE
Australia Brisbane GMT+10
STATUS (Open/Closed)
Closed
INCIDENTSTARTDATE
20200528
INCIDENTSTARTTIME (HH:MM)
13:05
OUTAGEDURATION
20 Minutes
ESTIMATEDTIME TO RESOLUTION
Resolved
XCRMTICKETNUMBER
Not Applicable
BRAND
XRACK
PRIORITY
P2
CUSTOMERSAFFECTED
4 clients: X32, X67, X73 & X98
DESCRIPTION OF INCIDENT
Multiple virtual machines lost network connectivity. Initial findings pointed to a faulty NIC. A switch to the redundant NIC was attempted, but the hypervisor locked up and prevented changes. The server was rebooted and services resumed, resulting in a 20-minute outage.
DESCRIPTIONIMPACT
Primary Effect
6 virtual machines lost network connectivity through the primary network port
Secondary Effect
Some users were disconnected from the VMs and the services they run
Residual Effect
None expected once the problem is resolved
EVENTTIMELINE
13:05
PRTG alerts indicated a problem with numerous client services
13:06
NOC staff alerted senior technicians of services offline
13:11
Technician Luke commenced identification of primary cause
13:12
Technician Luke isolated the cause to one server, specifically its network interface for customers
13:13
Technician Luke attempted to migrate a virtual machine to the redundant network interface
13:17
Technician Luke found that changes were not applying because the hypervisor had entered a hung state
13:18
Technician Luke initiated a restart of the server
13:23
Technician Luke confirmed server was back online and accessible
13:25
Machine was fully operational again and all services resumed their normal operation
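As a cross-check, the reported 20-minute outage can be reproduced from the first and last timeline entries. The following is a minimal sketch in Python, assuming all timestamps fall on the same day in the notice time zone. The names TIMELINE and outage_minutes are illustrative only and are not part of any XSTRA or PRTG tooling.

```python
from datetime import datetime

# Timestamps copied from the event timeline (local time, GMT+10).
# The short labels are paraphrased for illustration.
TIMELINE = [
    ("13:05", "PRTG alerts raised"),
    ("13:06", "NOC alerted senior technicians"),
    ("13:11", "Diagnosis started"),
    ("13:12", "Cause isolated to one server"),
    ("13:13", "Migration to redundant interface attempted"),
    ("13:17", "Hypervisor found in hung state"),
    ("13:18", "Server restart initiated"),
    ("13:23", "Server back online"),
    ("13:25", "All services resumed"),
]


def outage_minutes(timeline):
    """Return whole minutes between the earliest and latest entries."""
    times = [datetime.strptime(stamp, "%H:%M") for stamp, _ in timeline]
    return int((max(times) - min(times)).total_seconds() // 60)


print(outage_minutes(TIMELINE))  # 20
```

The result matches the OUTAGEDURATION field above.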
RECOVERY & RESOLUTION
XSTRA identified that the issue was related to communication between the network interface and the hypervisor of the host server. A reboot allowed the hypervisor to return to normal operation.
ROOTCAUSE
The hypervisor entered a hung state, resulting in no communication between the virtual switch and the physical network interface
CORRECTIVE & PREVENTATIVEMEASURES
Remove the host from production and perform routine maintenance and updates