NOTICE TIME ZONE | Australia Brisbane GMT+10 |
---|---|
STATUS (Open/Closed) | Closed |
INCIDENT START DATE | 20200528 |
INCIDENT START TIME (HH:MM) | 13:05 |
OUTAGE DURATION | 20 Minutes |
ESTIMATED TIME TO RESOLUTION | Resolved |
XCRM TICKET NUMBER | Not Applicable |
BRAND | XRACK |
PRIORITY | P2 |
CUSTOMERS AFFECTED | Approximately 4 clients: X32, X67, X73 & X98 |
DESCRIPTION OF INCIDENT | Multiple virtual machines lost network connectivity. Initial findings pointed to a faulty NIC. A switch to the redundant NIC was attempted, but the hypervisor locked up and prevented the change. The server was rebooted and services resumed, resulting in a 20-minute outage. |
DESCRIPTION IMPACT | |
|
6 virtual machines lost network connectivity through the primary network port
|
Some users were disconnected from the VMs and the services they run
|
None expected once the problem is resolved |
EVENT TIMELINE | |
|
PRTG alerts indicated a problem with numerous client services
|
NOC staff alerted senior technicians of services offline |
|
Technician Luke commenced identification of primary cause |
|
Technician Luke isolated the cause to one server, specifically its network interface for customers
|
Technician Luke attempted to migrate a virtual machine to the redundant network interface
|
Technician Luke found changes were hanging and not applying due to the hypervisor entering a hung state |
|
Technician Luke initiated a restart of the server |
|
Technician Luke confirmed server was back online and accessible |
|
Machine was fully operational again and all services resumed their normal operation |
RECOVERY & RESOLUTION | XSTRA identified that the issue was related to communication between the network interface and the hypervisor of the host server. A reboot allowed the hypervisor to return to normal operation. |
ROOT CAUSE | The hypervisor entered a hung state, resulting in no communication between the virtual switch and the physical network interface |
CORRECTIVE & PREVENTATIVE MEASURES | Remove the host from production and perform routine maintenance and updates |
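
When correlating this notice with logs kept in UTC, the outage window can be derived from the start date, start time, and duration recorded above. The following is a minimal sketch, assuming the fixed GMT+10 offset stated in the notice. The function name `outage_window` and the `BRISBANE` constant are illustrative and not part of any XSTRA tooling, and the end time is calculated from the stated duration rather than taken from the report.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the notice ("GMT+10"); assumed constant for this date.
BRISBANE = timezone(timedelta(hours=10), "GMT+10")


def outage_window(start_date: str, start_time: str, duration_minutes: int):
    """Return (start, end) as timezone-aware datetimes in notice-local time.

    start_date uses the notice's YYYYMMDD format, start_time uses HH:MM.
    """
    start = datetime.strptime(
        f"{start_date} {start_time}", "%Y%m%d %H:%M"
    ).replace(tzinfo=BRISBANE)
    end = start + timedelta(minutes=duration_minutes)
    return start, end


# Values copied from the notice fields above.
start, end = outage_window("20200528", "13:05", 20)

print("Local:", start.isoformat(), "->", end.isoformat())
print(
    "UTC:  ",
    start.astimezone(timezone.utc).isoformat(),
    "->",
    end.astimezone(timezone.utc).isoformat(),
)
```

Under these assumptions the window works out to 13:05 to 13:25 local time, or 03:05 to 03:25 UTC on the same date.
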
Revision: 1
Last modified: May 28, 2020