NOTICE TIME ZONE |
Australia Brisbane GMT+10 |
STATUS (Open/Closed) |
Closed |
INCIDENT START DATE |
20200520 |
INCIDENT START TIME (HH:MM) |
15:50 |
OUTAGE DURATION |
8 Minutes |
ESTIMATED TIME TO RESOLUTION |
Resolved |
XCRM TICKET NUMBER |
Not Applicable |
BRAND |
XRACK |
PRIORITY |
P2 |
CUSTOMERS AFFECTED |
Approximately 25 Clients including X26, X32, X40, X52 |
DESCRIPTION OF INCIDENT |
An operating system issue with a physical host, Z-0-0-VH94, caused the physical host machine to restart. The machine was operational again 8 minutes later at 3:58 pm and service returned to normal. |
DESCRIPTION OF IMPACT |
|
|
Access to virtual machines on the host would have been offline for 8 minutes.
|
Many other clients would have been affected because one of the virtual machines on this host was a router that trunked traffic for multiple clients. This would have affected phones, virtual servers, cloud desktops, and internet access.
|
None expected once the problem is resolved |
EVENT TIMELINE |
|
|
PRTG Alerts indicated a problem with Z-0-0-VH94 |
|
Physical Host produced an operating system error and restarted |
|
Technician Luke logged in via iDRAC |
|
Memory Dump had started |
|
Technician Luke force restarted the Machine |
|
Machine was fully operational again and all virtual machines resumed their normal operation |
RECOVERY & RESOLUTION |
XSTRA made sure the machine booted up again as quickly as possible. The error stop code was noted as well. |
ROOT CAUSE |
The stop code on Z-0-0-VH94 was 0x000000a. We did not have a memory dump file from this host to interrogate, but the same stop code was identified on a different host, Z-0-0-VH92, on Saturday and Sunday, and for that host we did have a memory dump file to analyze. From that analysis we determined that the issue in both cases was caused by the file bxnd60a.sys. Problems with this file appear to occur when Hyper-V is in use on the server and relate to communication through the hypervisor to the underlying network hardware, primarily Broadcom BCM6716C and BCM5709 NetXtreme II GigE. But why would this be a problem on these two hosts when they have been in production for years with no recent changes? There has been an ongoing Hyper-V extended replication event for some of X40’s virtual machines, and this is not normal. The project to move an X40 physical host was delayed because of COVID-19, which has meant that these extended replicas have been active for longer than planned. We suspect that this activity, being the only new activity and common to both physical machines that had exactly the same problem, could be the cause of the restarts. |
CORRECTIVE & PREVENTATIVE MEASURES |
All Hyper-V extended replicas on Z-0-0-VH92 and Z-0-0-VH94 have been removed to eliminate this suspected root cause. Both hosts have been stable since 20 May 2020. At this point, there is nothing more to do except continue to monitor these machines as usual and see whether the problem recurs. If the hosts have not repeated the same error by 20 June 2020, we will close this ticket and mark it as resolved. |
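The dates and times in this notice can be cross-checked with a short calculation. The following is only a minimal sketch: it assumes the incident start time, outage duration, and 20 June review date exactly as stated above, and models Brisbane time as a fixed GMT+10 offset (the notice's stated time zone). All variable names are illustrative and not part of any XSTRA tooling.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed GMT+10 offset, taken from the NOTICE TIME ZONE field (assumption:
# no daylight-saving adjustment applies).
BRISBANE = timezone(timedelta(hours=10), name="GMT+10")

# Values copied from the notice fields.
incident_start = datetime(2020, 5, 20, 15, 50, tzinfo=BRISBANE)
outage_duration = timedelta(minutes=8)
review_date = date(2020, 6, 20)

# Time at which service was restored, in local time and in UTC.
restored_local = incident_start + outage_duration
restored_utc = restored_local.astimezone(timezone.utc)

# Length of the monitoring window between the incident and the review date.
monitoring_days = (review_date - incident_start.date()).days

print("Restored (local):", restored_local.strftime("%Y-%m-%d %H:%M"))
print("Restored (UTC):  ", restored_utc.strftime("%Y-%m-%d %H:%M"))
print("Monitoring window (days):", monitoring_days)
```

The computed local restoration time should agree with the 3:58 pm stated in the incident description, and the UTC value is useful when correlating with logs from systems that do not record local time.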