Automatically cancel slurm jobs if there are insufficient instances on AWS ParallelCluster

Question

I recently started experimenting with AWS ParallelCluster. I noticed that when I submit a job that requires more instances than are currently available in my region/AZ, the instances that are available are brought up and then sit idle until the remaining instances become available. This can sometimes take a very long time. SLURM reports the following in /var/log/parallelcluster/slurm_resume.log:

ERROR - Error in CreateFleet request (...): InsufficientInstanceCapacity - We currently do not have sufficient c6i.metal capacity in the Availability Zone you requested (us-east-1a)

The problem is that I still pay for the nodes that are up and waiting. Is there a way to cancel the job after a certain timeout instead, so that I can try again later?

score 2 · Accepted Answer · answered Jun 28 '23 at 11:13

There might be a better solution than canceling the job in the face of limited capacity. ParallelCluster has a hidden capability called "all or nothing instance launching" that you can turn on by editing your cluster configuration.

Enabling this instructs ParallelCluster to launch new instances for a job only if it can get all of the requested instances. If it cannot, the job will not proceed to a running state, and you will not accrue charges for unused instances. This should prevent the situation you describe.

Here's a link to an AWS HPC blog article that will tell you all about it and show you how to use it: https://aws.amazon.com/blogs/hpc/minimize-hpc-compute-costs-with-all-or-nothing-instance-launching/
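If you would still like to cancel jobs that have been waiting too long, as the question asks, here is a minimal sketch of one way to do it from the head node. It is not part of ParallelCluster. It assumes that `squeue` prints submission times in its default `YYYY-MM-DDTHH:MM:SS` format and that you run it periodically yourself (for example from cron). The function names and the 30-minute threshold are hypothetical choices for illustration.

```python
import subprocess
from datetime import datetime, timedelta


def stale_pending_jobs(squeue_output, now, max_wait):
    """Return the IDs of jobs that have been pending longer than max_wait.

    squeue_output is expected to hold one "<job_id> <submit_time>" pair per
    line, as produced by: squeue --noheader --states=PENDING --format="%i %V"
    """
    stale = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, submitted = fields
        try:
            submit_time = datetime.strptime(submitted, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip timestamps that are not in the assumed format
        if now - submit_time > max_wait:
            stale.append(job_id)
    return stale


def cancel_stale_jobs(max_wait=timedelta(minutes=30)):
    """Query Slurm for pending jobs and cancel those waiting longer than max_wait."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING", "--format=%i %V"],
        check=True,
        capture_output=True,
        text=True,
    )
    for job_id in stale_pending_jobs(result.stdout, datetime.now(), max_wait):
        # The job may have started or finished in the meantime, and scancel
        # refuses jobs owned by other users, so a failure here is not fatal.
        subprocess.run(["scancel", job_id], check=False)
```

Calling `cancel_stale_jobs()` on a schedule would cancel every job that has been pending for more than 30 minutes, whatever the reason it is pending, so you may want to filter on the pending reason or the job owner before using something like this on a shared cluster.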

Thanks a lot! I tried this out and SLURM seems to behave now much more as I hoped it would. One thing that is confusing me, though: The blog post you refer to seems to indicate that the job fails and needs to be resubmitted if the allocation fails due to capacity limitations, but I am seeing that the job remains in the queue in PD state with reason "BeginTime" (but no instances are allocated and scontrol shows the nodes as IDLE and POWERING_DOWN). Do you know if the behavior has changed slightly since the blog post? — Omar Awile, Jun 29 '23 at 09:32
