Recently, I was working with a customer who was seeing an intermittent blue screen on a number of Windows 8.1 PVS target devices. The error they were getting was NETWORK_BOOT_DUPLICATE_ADDRESS.
This same image was being used on VDI desktops in their first data centre (DC1) without issue, but in their second data centre (DC2) there was about a 50% failure rate at boot time. A reboot would often clear the error and allow the VMs to boot correctly.
Upon questioning the customer, I found out that both sites used the same DHCP server (with different scopes) and that the DHCP server was located in DC1.
I confirmed that the DHCP server was configured to enable conflict detection as per our recommendation in CTX135938. Everything looked OK with the scopes and DHCP reservations were configured for all VMs.
At this point I suspected a timing issue, explained by DHCP client requests being serviced more slowly in DC2 than in DC1. During that delay, the clients were occasionally reverting to the cached DHCP information left in the registry by the master VM from which the image was sealed.
What I Suspect is Happening:
VMs are using the cached registry information to send a renewal request to the DHCP servers as they come up.
About 50% of the time, the DHCP server responds within the expected time and issues the VM a new address.
The other 50% of the time, the DHCP server doesn’t respond within the expected timeframe, so the client enters a rebinding state, broadcasting a DHCPDiscover message to any available DHCP servers to update its current IP address lease.
While it waits for responses, it uses the cached registry information for its address.
Because PVS streaming happens over this address, it conflicts with other VMs that are also experiencing the issue, and the VM blue screens before the DHCP server can sort out the mess. (DHCP is pretty good at this, but if it doesn’t do it quickly enough it will be too late, as the PVS target device has already failed.)
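To make the suspected sequence concrete, here is a minimal sketch of it as a toy model. It is not a capture of real DHCP traffic: the addresses, VM names, and the `find_conflicts` helper are all hypothetical, and the only behaviour modelled is "a clone that gets no timely DHCP answer falls back to the address cached in the master image".

```python
from collections import defaultdict

# Hypothetical address left in the master image's registry when it was sealed.
CACHED_ADDRESS = "10.0.0.50"


def address_in_use(assigned, dhcp_answered_in_time):
    """Return the address a clone streams over while it boots."""
    # A timely DHCP answer gives the clone its own address; otherwise it
    # falls back to the stale address inherited from the master image.
    return assigned if dhcp_answered_in_time else CACHED_ADDRESS


def find_conflicts(clones):
    """Map each address used by more than one clone to the clones using it.

    clones: iterable of (name, assigned_address, dhcp_answered_in_time).
    """
    users = defaultdict(list)
    for name, assigned, in_time in clones:
        users[address_in_use(assigned, in_time)].append(name)
    return {addr: names for addr, names in users.items() if len(names) > 1}


# Illustrative boot storm: vm2 and vm3 do not get a timely DHCP answer.
clones = [
    ("vm1", "10.2.0.11", True),
    ("vm2", "10.2.0.12", False),
    ("vm3", "10.2.0.13", False),
]
print(find_conflicts(clones))  # {'10.0.0.50': ['vm2', 'vm3']}
```

In this model a single slow clone causes no conflict on its own; the duplicate only appears once two or more clones fall back to the same cached address at the same time, which fits an intermittent failure that a reboot often clears.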
Implementing the following steps to clear this cached information before the VM was sealed resolved the issue, with 100% of machines able to boot in both DC1 and DC2.
I created a simple script that the customer could tag on to the end of their existing shutdown process for the master VM whenever it is updated. The script:
Stops the DHCP client service: “Net stop DHCP”
Blanks the following registry keys: “regedit /s DHCP_clear.reg”
The DHCP_clear.reg file contained blank values for the following keys:
Windows Registry Editor Version 5.00
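As a rough sketch of how those two steps could be chained, the snippet below wraps the commands quoted above in a small Python function. The customer's actual script is not shown in this post, so the function name `seal_dhcp_cache` and the injectable `run` parameter are my own assumptions, and the contents of DHCP_clear.reg are not reproduced here.

```python
import subprocess

# The two commands quoted in the post, in the order they should run.
# DHCP_clear.reg is expected to sit in the current working directory.
SEAL_COMMANDS = [
    ["net", "stop", "DHCP"],
    ["regedit", "/s", "DHCP_clear.reg"],
]


def seal_dhcp_cache(run=subprocess.run):
    """Stop the DHCP client service, then import the blanking .reg file.

    `run` is injectable (an assumption for this sketch) so the sequence can
    be dry-run on a machine that is not the master VM.
    """
    for command in SEAL_COMMANDS:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the .reg import is skipped if stopping the service fails.
        run(command, check=True)
```

On the master VM this would be called as `seal_dhcp_cache()` from an elevated prompt, as the last step before shutdown. Stopping the service first matters: if the DHCP client is still running, it may write lease information back into the registry after the values have been blanked.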
We always recommend a well-defined process to clear down a PVS master image before deployment, including the removal of DHCP lease info, but this is the first time I have personally come across a real-world issue caused by not following this advice. I will use it as a cautionary tale in the future 🙂
Additionally, I recommended that the customer locate some DHCP services in DC2, not to resolve this issue but to ensure that DC2 carries on working if DC1 goes down.
Dave Brear – Citrix Consulting