Maintenance and Downtime Policy
The ULHPC team schedules maintenance in one of three ways:
- Rolling reboots: Whenever possible, ULHPC applies updates and performs other maintenance in a rolling fashion, so that the impact on ULHPC services is either nil or as small as possible.
- Partial outages: We perform these as needed, but in a manner that impacts only some ULHPC services at a time.
- Full outages: These outages affect all ULHPC services. Examples include outages of core datacenter networking services, datacenter power or HVAC/cooling system maintenance, or global GPFS/Spectrum Scale filesystem updates. Such maintenance windows typically happen on a quarterly basis. Note that we cannot always anticipate when these outages are needed.
ULHPC's goal is to complete these downtimes as fast as possible. However, validation and qualification of the full platform typically takes one working day, and unforeseen or unusual circumstances may occur. You should therefore plan for a multi-day downtime for such outages.
We normally inform users of cluster maintenance at least 3 weeks in advance by email using the HPC User community mailing list (moderated):
A second reminder is sent a few days before the actual downtime.
News about downtimes is also posted on the Live status page.
Finally, a colored "message of the day" (motd) banner is displayed on all access/login servers so that you are quickly informed of any upcoming maintenance operation when you connect to the cluster. You can see this banner when you log in or, again, at any time by issuing the command:
Detecting maintenance... During the maintenance
- During the maintenance period, access to the access/login servers of the involved cluster is DENIED, and any users still logged in are disconnected at the beginning of the maintenance.
- We will notify you of the end of the maintenance with a summary of the performed operations.
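The notification timeline described above (an announcement at least 3 weeks ahead, then a reminder a few days before) can be turned into concrete dates with simple date arithmetic. The following is only a minimal sketch, not an official ULHPC tool. The function name `notification_dates`, the 3-day reminder offset (the policy only says "a few days"), and the sample maintenance date are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Policy value stated in the text: announcement at least 3 weeks in advance.
ANNOUNCE_LEAD = timedelta(weeks=3)
# Assumption: "a few days" is modelled here as 3 days; the policy gives no exact figure.
REMINDER_LEAD = timedelta(days=3)


def notification_dates(maintenance_start: date) -> dict[str, date]:
    """Return the latest announcement date and an illustrative reminder date."""
    return {
        "announce_by": maintenance_start - ANNOUNCE_LEAD,
        "reminder_around": maintenance_start - REMINDER_LEAD,
    }


if __name__ == "__main__":
    # Hypothetical maintenance date, used for illustration only.
    plan = notification_dates(date(2030, 6, 24))
    for label, day in plan.items():
        print(f"{label}: {day.isoformat()}")
```

Because full outages can last several days, a user planning long-running work might use the announcement date as the point from which to start scheduling around the maintenance window.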
Exceptional "EMERGENCY" maintenance
Unscheduled downtimes can occur for any number of reasons, including:
- Loss of cooling and/or power in the data center.
- Loss of supporting infrastructure (i.e. hardware).
- Critical need to change hardware or software that is negatively impacting performance or access.
- Application of critical patches that can't wait until the next scheduled maintenance.
- Safety or security issues that require immediate action.
We will try to notify users by email in the event of such a downtime.
The ULHPC team reserves the right to intervene in user activity without notice when such activity may destabilize the platform and/or is at the expense of other users, and/or to monitor/verify/debug ongoing system activity.