NOTICE TIME ZONE Australia Brisbane GMT+10
STATUS (Open/Closed) Closed
INCIDENT START DATE 20200520
INCIDENT START TIME (HH:MM) 15:50
OUTAGE DURATION 8 Minutes
ESTIMATED TIME TO RESOLUTION Resolved
XCRM TICKET NUMBER Not Applicable
BRAND XRACK
PRIORITY P2
CUSTOMERS AFFECTED Approximately 25 Clients including X26, X32, X40, X52
DESCRIPTION OF INCIDENT An operating system issue on a physical host, Z-0-0-VH94, caused the host to restart. The machine was operational again 8 minutes later, at 15:58, and service returned to normal.
DESCRIPTION OF IMPACT
  • Primary Effect
Access to virtual machines on the host would have been offline for 8 minutes
  • Secondary Effect
Many other clients would have been affected because one of the virtual machines on this host was a router that trunked traffic for multiple clients. This would have affected Phones, Virtual Servers, Cloud Desktops, and Internet Access
  • Residual Effect
None expected once the problem is resolved
EVENT TIMELINE
  • 15:50
PRTG Alerts indicated a problem with Z-0-0-VH94
  • 15:50
Physical Host produced an operating system error and restarted
  • 15:53
Technician Luke logged in via iDRAC
  • 15:56
Memory Dump had started
  • 15:56
Technician Luke force-restarted the machine
  • 15:58
Machine was fully operational again and all virtual machines resumed their normal operation
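The 8-minute outage duration quoted in this notice can be cross-checked against the timeline entries. The following is a minimal sketch only: the `timeline` list and the `minutes_between` helper are illustrative names, not part of any XSTRA tooling, and the short event labels are paraphrased from the entries above.

```python
from datetime import datetime

# Timeline entries copied from this notice (24-hour local time, GMT+10).
# The labels are paraphrased for illustration.
timeline = [
    ("15:50", "PRTG alert raised for Z-0-0-VH94"),
    ("15:50", "Host operating system error and restart"),
    ("15:53", "Technician logged in via iDRAC"),
    ("15:56", "Memory dump had started"),
    ("15:56", "Technician force-restarted the machine"),
    ("15:58", "Host fully operational"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Outage duration: first alert to the host being fully operational.
outage_minutes = minutes_between(timeline[0][0], timeline[-1][0])
print(outage_minutes)
```

The helper assumes both times fall on the same calendar day, which holds for this incident.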
RECOVERY & RESOLUTION XSTRA made sure the machine booted up again as quickly as possible. The error stop code was noted as well.
ROOT CAUSE The stop code on Z-0-0-VH94 was 0x000000a. Although we did not have a memory dump file from this host to interrogate, the same stop code was identified on a different host, Z-0-0-VH92, on Saturday and Sunday, and for that host we did have a memory dump file to analyze. After analysis of that dump file we could determine that the issue in both cases was caused by the file bxnd60a.sys. Problems with this file seem to occur when Hyper-V is in use on the server, and they relate to communication through the hypervisor to the underlying network hardware, primarily Broadcom BCM6716C & BCM5709 NetXtreme II GigE. But why would this be a problem on these 2 hosts when they have been in production for years and there have been no recent changes to them? There has been an ongoing extended Hyper-V replication event for some of X40's virtual machines, and this is not normal. The project to move an X40 physical host was delayed because of COVID-19, which has meant that these extended replicas have been active for longer than planned. We suspect that this activity, being the only new activity and common to both physical machines that had exactly the same problem, could be the cause of the restart.
CORRECTIVE & PREVENTATIVE MEASURES All Hyper-V extended replicas on Z-0-0-VH92 and Z-0-0-VH94 have been removed to eliminate this suspected root cause. Both hosts have been stable since the 20th of May 2020. At this point, there is nothing more to do except continue to monitor these machines as usual and see whether the problem recurs. If the hosts have not repeated the same error by the 20th of June, we will close this ticket and mark it as resolved.
