Microsoft High Performance Computing (HPC) Pack 2019 Service Resilience

by Jeremy Saunders on February 19, 2025

A quick post to help make the services on the Microsoft High Performance Computing (HPC) Pack 2019 Head Nodes more resilient by setting their recovery options.

Run the following PowerShell code on each of your Head Nodes. It helps avoid failures for users when starting HPC Job Manager, and for admins when starting HPC Cluster Manager.

# Run from an elevated PowerShell session on each Head Node.
# Select every service whose display name contains "HPC ".
$HPCServices = Get-Service | Where-Object { $_.DisplayName -like '*HPC *' }
ForEach ($HPCService in $HPCServices) {
  $ServiceDisplayName = $HPCService.DisplayName
  $ServiceName = $HPCService.Name
  Write-Verbose "Setting the `"$ServiceDisplayName`" service recovery options to:" -Verbose
  Write-Verbose "- Restart the service 1 minute (60000 ms) after the first failure" -Verbose
  Write-Verbose "- Restart the service 1 minute (60000 ms) after the second failure" -Verbose
  Write-Verbose "- Restart the service 3 minutes (180000 ms) after subsequent failures" -Verbose
  Write-Verbose "- Reset the failure count after 1 day (86400 seconds) without a failure" -Verbose
  # Call sc.exe explicitly: in Windows PowerShell, "sc" on its own is an alias for Set-Content.
  sc.exe failure "$ServiceName" reset= 86400 actions= restart/60000/restart/60000/restart/180000 | Out-Null
}
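
To confirm that the recovery options were applied, one option is to read them back with `sc.exe qfailure`. The following is a minimal sketch that assumes the same display-name filter as the script above; the helper name `Get-HPCServiceRecovery` is hypothetical and not part of HPC Pack.

```powershell
# Hypothetical helper: prints the current recovery configuration of each HPC service.
function Get-HPCServiceRecovery {
  Get-Service | Where-Object { $_.DisplayName -like '*HPC *' } | ForEach-Object {
    # Header line so each service's settings are easy to find in the output
    Write-Output "=== $($_.DisplayName) ($($_.Name)) ==="
    # qfailure reports the reset period and the configured failure actions
    sc.exe qfailure $_.Name
  }
}
```

Running `Get-HPCServiceRecovery` on a Head Node should list the reset period and the three restart actions for each service.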

The service control manager (SCM) counts the number of times each service has failed since the system booted. The count is reset to 0 if the service has not failed for the period (in seconds) set in “Reset fail count after”. Windows places no explicit limit on how many times the “Subsequent failures” action repeats. If that action is “Restart the Service”, the SCM keeps restarting the service after each delay indefinitely, unless the service enters a critical failure state, which is often caused by dependency issues, missing files, persistent crashes and so on.

You may therefore need to tweak the values above for your environment. Alternatively, instead of “Restart the Service” for “Subsequent failures”, you can choose “Run a Program” to execute a PowerShell or batch script that addresses the underlying issue, or simply “Restart the Computer”.
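
As a rough illustration of the “Run a Program” alternative, the sketch below builds the `sc.exe failure` arguments for a policy that restarts the service on the first two failures and runs a program on subsequent ones. The helper name `Get-ScFailureArgs`, the service name `ExampleHpcService` and the script path are all hypothetical placeholders; only the `reset=`, `command=` and `actions=` options come from `sc.exe` itself.

```powershell
# Hypothetical helper: builds the sc.exe argument list for a recovery policy
# whose third (subsequent) action runs a program instead of restarting.
function Get-ScFailureArgs {
  param(
    [Parameter(Mandatory)] [string] $ServiceName,
    [Parameter(Mandatory)] [string] $Command,   # program to run on subsequent failures
    [int] $ResetSeconds = 86400,                # failure-free period before the count resets
    [int] $RestartDelayMs = 60000,              # delay before the first and second restart
    [int] $RunDelayMs = 180000                  # delay before the program is run
  )
  $actions = "restart/$RestartDelayMs/restart/$RestartDelayMs/run/$RunDelayMs"
  # sc.exe expects each "option=" and its value as separate arguments
  @('failure', $ServiceName, 'reset=', "$ResetSeconds", 'command=', $Command, 'actions=', $actions)
}

# Placeholder service name and script path, for illustration only
$scArgs = Get-ScFailureArgs -ServiceName 'ExampleHpcService' `
  -Command 'powershell.exe -NoProfile -File C:\Scripts\Repair-HPCService.ps1'

# Show the command line that would be passed to sc.exe
$scArgs -join ' '

# To apply it on a Head Node (elevated session), uncomment:
# sc.exe @scArgs
```

For “Restart the Computer”, the corresponding action token is `reboot/<delay in ms>` in place of `run/<delay in ms>`, and no `command=` is needed.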

I don’t know why Microsoft doesn’t enable service recovery on these services by default.

I would also recommend deploying the latest update. At the time of writing, HPC Pack 2019 Update 3 is the latest release. Ensure you follow the upgrade process.

Hope you find this helpful.

Jeremy Saunders

Technical Architect | DevOps Evangelist | Software Developer | Microsoft, NVIDIA, Citrix and Desktop Virtualisation (VDI) Specialist/Expert | Rapper | Improvisor | Comedian | Property Investor | Kayaking enthusiast at J House Consulting
Jeremy Saunders is the Problem Terminator. He is a highly respected IT Professional with over 35 years’ experience in the industry. Using his exceptional design and problem solving skills with precise methodologies applied at both technical and business levels he is always focused on achieving the best business outcomes. He worked as an independent consultant until September 2017, when he took up a full time role at BHP, one of the largest and most innovative global mining companies. With a diverse skill set, high ethical standards, and attention to detail, coupled with a friendly nature and great sense of humour, Jeremy aligns to industry and vendor best practices, which puts him amongst the leaders of his field. He is intensely passionate about solving technology problems for his organisation, their customers and the tech community, to improve the user experience, reliability and operational support. Views and IP shared on this site belong to Jeremy.