A quick post to help make the services on the Microsoft High Performance Computing (HPC) 2019 Head Nodes more resilient by setting their recovery options.
Run the following PowerShell code on each of your Head Nodes which helps avoid failures for the users when starting the HPC Job Manager, or Admins when starting HPC Cluster Manager.
$HPCServices = Get-Service | where-object {$_.DisplayName -like '*HPC *'} ForEach ($HPCService in $HPCServices) { $ServiceDisplayName = $HPCService.DisplayName $ServiceName = $HPCService.Name write-verbose "Setting the `"$ServiceDisplayName`" service recovery options to:" -verbose write-verbose "- Restart the service 1 minute (60000 ms) after the first failure" -verbose write-verbose "- Restart the service 1 minute (60000 ms) after the second failure" -verbose write-verbose "- Restart the service 3 minute (180000 ms) after subsequent failures" -verbose write-verbose "- Reset the failure count every 1 day (86400 seconds)" -verbose Invoke-Command {cmd /c sc failure "$ServiceName" reset= 86400 actions= restart/60000/restart/60000/restart/180000} | out-null }
{ 0 comments }