Virtualization Issue Affecting Login Nodes

A technical issue arose this evening with the VMware environment, which hosts the Frank login nodes, head nodes, and queue server. All of these nodes were rebooted and are just now coming back into service.

Jobs that were running on compute nodes may be able to run to completion. More information about these jobs will become available once the queue services have restarted. Due to the nature of the crash, we are uncertain how many jobs are still running. Jobs that were submitted within the last day and had not yet begun execution will likely be missing from the queue.

The login nodes are now reachable and data can be accessed. Another announcement will follow with more information about queued and running jobs.

Queues online

Torque and Moab are now online. The queue services have been restored to their state as of 5 PM on Oct 29.

Service Unit charges made after 5 PM yesterday will likely not be present. Jobs submitted after 5 PM yesterday may also be missing. Please check your queued and running jobs.

If you believe a job is no longer executing correctly, please attempt to delete it. If the qdel command indicates that it cannot connect to a MOM process, the job may go away when the node is rebooted over the next day or two.
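
For users who want to script the checks described above, here is a minimal sketch. It assumes the standard Torque client commands qstat and qdel are on your PATH. The function names list_my_jobs and delete_job are hypothetical helpers for illustration, not part of Torque or of the cluster's tooling.

```shell
# Hypothetical helpers; only qstat and qdel are real Torque commands.

# List every job (queued or running) that the queue server has for you.
list_my_jobs() {
    qstat -u "${USER:-$(id -un)}"
}

# Attempt to delete one job by ID and report whether qdel succeeded.
delete_job() {
    job_id="$1"
    if qdel "$job_id"; then
        echo "Deleted $job_id"
    else
        # qdel can fail when it cannot reach the MOM on the compute node.
        echo "Could not delete $job_id. It may clear when its node is rebooted."
        return 1
    fi
}
```

Run list_my_jobs first and compare the result against what you expect to be queued or running, then call delete_job with the ID of any job that looks stale. A failed delete is not necessarily a problem, since such jobs may clear on their own as nodes are rebooted.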

The qsub command has been disabled while we investigate the state of running and queued jobs.

qsub enabled

qsub has been enabled on the cluster. Many of the compute nodes are online and should accept new jobs. Again, please check your running and queued jobs. As more compute nodes are rebooted, stale jobs will clear.