
I am using Slurm on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:

  • When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that?
  • When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly? Would the right approach be to, for example, stop the idle node after some time and record that in a log? How can I achieve that?
FenryrMKIII
  • For your second question - was the job still on the queue? Or did it "complete" and there were no more jobs on the queue? In either case you should get a log from Slurm about what happened with the process using `sacct` if the job exited or `slist` if the job is running. – Angel Pizarro Apr 13 '21 at 12:24
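
For reference, a minimal sketch of checking on a job after the fact with standard Slurm commands (the job ID 1234 and the format fields are just placeholders):

```sh
# State and exit code of a job that has finished (requires Slurm accounting)
sacct -j 1234 --format=JobID,JobName,State,ExitCode,Elapsed

# State of a job that is still queued or running
squeue -j 1234
```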

1 Answer


Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html

The bottom line is that instances that have had no jobs for longer than the scaledown_idletime setting (the default is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
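
If you want to watch this happen, the standard Slurm views from the head node are enough (nothing ParallelCluster-specific here):

```sh
squeue          # jobs still pending or running
sinfo -N -l     # per-node state; nodes sitting idle are the scale-down candidates
```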

You can tweak the setting in the config file when you build your cluster if 10 minutes is too long. Just think about your workload first: you don't want small delays between jobs to cause a lot of churn while you wait for nodes to die and then get created again shortly after, which is why the default is 10 minutes.
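
As a rough sketch of what that looks like in a ParallelCluster 2.x config file (section and key names as described in the linked docs; the 5-minute value is only an example):

```ini
[cluster default]
scheduler = slurm
scaling_settings = custom

[scaling custom]
# minutes a compute node may sit idle before it is terminated (default: 10)
scaledown_idletime = 5
```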

boofla