A quick post on making the services on Microsoft High Performance Computing (HPC) Pack 2019 Head Nodes more resilient by setting their recovery options.
Run the following PowerShell code in an elevated session on each of your Head Nodes. It helps avoid failures for users when they start HPC Job Manager, and for admins when they start HPC Cluster Manager.
# Find every service whose display name contains "HPC "
$HPCServices = Get-Service | Where-Object { $_.DisplayName -like '*HPC *' }

ForEach ($HPCService in $HPCServices) {
    $ServiceDisplayName = $HPCService.DisplayName
    $ServiceName = $HPCService.Name

    Write-Verbose "Setting the `"$ServiceDisplayName`" service recovery options to:" -Verbose
    Write-Verbose "- Restart the service 1 minute (60000 ms) after the first failure" -Verbose
    Write-Verbose "- Restart the service 1 minute (60000 ms) after the second failure" -Verbose
    Write-Verbose "- Restart the service 3 minutes (180000 ms) after subsequent failures" -Verbose
    Write-Verbose "- Reset the failure count every 1 day (86400 seconds)" -Verbose

    # Call sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content.
    # The space after "reset=" and "actions=" is required by sc.exe.
    sc.exe failure $ServiceName reset= 86400 actions= restart/60000/restart/60000/restart/180000 | Out-Null
}
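To confirm the settings were applied, a minimal sketch along these lines should work. It assumes the same display-name filter as the script above and uses sc.exe qfailure, which prints the reset period and the configured failure actions for a service.

```powershell
# Show the recovery configuration of every HPC service (read-only, changes nothing)
Get-Service | Where-Object { $_.DisplayName -like '*HPC *' } | ForEach-Object {
    Write-Verbose "Recovery options for `"$($_.DisplayName)`":" -Verbose
    sc.exe qfailure $_.Name
}
```

Each service should report a reset period of 86400 seconds and three restart actions with delays of 60000, 60000 and 180000 milliseconds.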
The Service Control Manager (SCM) counts the number of times each service has failed since the system booted. The count resets to 0 once the service has gone without a failure for the number of seconds set in “Reset fail count after”. Windows places no explicit limit on how many times the “Subsequent failures” action repeats. With “Restart the Service” as the recovery action, the service keeps being restarted after each delay indefinitely, unless it ends up in a state it cannot recover from, which is often caused by dependency issues, missing files, or persistent crashes.
You may need to tune the values above for your environment. Alternatively, instead of “Restart the Service” for “Subsequent failures”, you can choose “Run a Program” to execute a PowerShell or batch script that addresses the underlying issue, or simply “Restart the Computer”.
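As a rough sketch of the “Run a Program” option, the same sc.exe failure command accepts a run action together with a command= value. The service name and script path below are placeholders for illustration only, not values from a real deployment; replace them with your own.

```powershell
# Placeholder values: substitute a real service name and your own repair script.
$ServiceName  = 'HpcScheduler'
$RepairScript = 'C:\Scripts\Repair-HpcService.ps1'   # hypothetical path without spaces

# First and second failure: restart after 1 minute.
# Subsequent failures: run the repair script after 3 minutes.
sc.exe failure $ServiceName reset= 86400 `
    actions= restart/60000/restart/60000/run/180000 `
    command= "powershell.exe -NoProfile -ExecutionPolicy Bypass -File $RepairScript"

# Variant for "Restart the Computer" on subsequent failures:
# sc.exe failure $ServiceName reset= 86400 actions= restart/60000/restart/60000/reboot/180000
```

Test the repair script by hand before relying on it, and treat the reboot variant with care on a Head Node because it interrupts every service on the machine.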
I don’t know why Microsoft doesn’t enable service recovery on these services by default.
I also recommend deploying the latest update. At the time of writing, HPC Pack 2019 Update 3 is the latest release. Make sure you follow the upgrade process.
Hope you find this helpful.