I am using Slurm on AWS to manage jobs as part of an AWS ParallelCluster setup. I have two questions:
- When using `scancel <jobid>` to cancel a job, the associated node(s) do not stop. How can I make the node(s) stop as well? (A sketch of what I observe is below.)
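
  For context, this is roughly the sequence I run and what I see afterwards (the job ID is just an example):

  ```bash
  # Submit, then cancel (job ID 42 is an example)
  sbatch script.sh     # prints "Submitted batch job 42"
  scancel 42

  # The job is gone from the queue...
  squeue

  # ...but the compute node is still reported as up (idle) instead of stopping
  sinfo
  ```
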
- When starting out, I made the mistake of not making my script executable, so `sbatch script.sh` succeeded but the compute node did nothing. How could I identify such behaviour and handle it properly? For example, is the proper approach to stop a node after it has been idle for some time and record that in a log? How can I achieve that? (A sketch of the kind of check I have in mind is below.)
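
  In case it clarifies what I am after, here is a minimal sketch of the kind of watchdog I imagine, assuming it would run periodically (e.g. from cron) on the head node. The log path and the overall approach are my own guesses, not something I found in the documentation:

  ```bash
  #!/bin/bash
  # Hypothetical sketch (my own idea, not from the docs): log idle compute
  # nodes so that a "job is doing nothing" situation like mine becomes visible.
  # The log path below is made up.
  IDLE_LOG=/var/log/slurm_idle_nodes.log

  # Standard Slurm: list nodes currently in the "idle" state,
  # without the header line, printing node names only.
  idle_nodes=$(sinfo --noheader --states=idle --format=%N)

  if [ -n "$idle_nodes" ]; then
      echo "$(date -Is) idle nodes: $idle_nodes" >> "$IDLE_LOG"
  fi
  ```

  Would something along these lines be the recommended way, or does ParallelCluster/Slurm already provide a built-in mechanism for this?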