AWS ParallelCluster compute nodes failing to start properly

Question

I am new to AWS ParallelCluster (version 2.11) and my compute nodes fail to spin up properly, which eventually causes pcluster create to fail. Here is my config file:

[aws]
aws_region_name = us-east-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = <keypair>
scheduler = slurm
master_instance_type = c5n.2xlarge
base_os = centos7
vpc_settings = default
queue_settings = compute
master_root_volume_size = 1000
compute_root_volume_size = 35

[vpc default]
vpc_id = <my-default-vpc>
master_subnet_id = <my-subnetc>
compute_subnet_id = <my-subnetb>
use_public_ips = false

[queue compute]
enable_efa = true
compute_resource_settings = default
compute_type = ondemand
placement_group = DYNAMIC
disable_hyperthreading = true

[compute_resource default]
instance_type = c5n.18xlarge
initial_count = 1
min_count = 1
max_count = 32

[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500

When I run pcluster create, I get the following error after about 15 minutes:

The following resource(s) failed to create: [MasterServer]. 
    - AWS::EC2::Instance MasterServer Failed to receive 1 resource signal(s) within the specified duration

If I log into the master node before the failure above, I see the following in the /var/log/parallelcluster/clustermgtd log file:

2021-09-28 15:42:41,168 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy static nodes: (x1) ['compute-st-c5n18xlarge-1(compute-st-c5n18xlarge-1)']
2021-09-28 15:42:41,168 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Setting unhealthy static nodes to DOWN

However, even though the node is set to DOWN, the EC2 compute instance stays in the running state, and the same log keeps emitting the following message:

2021-09-28 15:54:41,156 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x1) ['compute-st-c5n18xlarge-1']

This state persists until the pcluster create command fails with the error noted above. I suspect there is something wrong with my configuration -- any help or further troubleshooting advice would be appreciated.

Quick question - I see that you have two separate subnets for the head and compute node. Can you confirm that the two subnets are able to communicate? I think you are using the default VPC and default subnets, which should allow for communication but best to rule out networking. — Angel Pizarro, Sep 29 '21 at 12:37
@Angel Pizarro it was a networking issue. It turns out that you can’t use two public subnets. You can either use a private subnet for the compute nodes or set use_public_ips to true. Once I did that it worked fine! — notKnotTheory, Sep 30 '21 at 01:48

score 0 · Answer 1 · answered Sep 29 '21 at 00:24

Can you set up the cluster without the min_count parameter in the configuration file? That is, tell ParallelCluster to create the cluster without spinning up a compute node.

austin

This allows cluster creation to finish successfully, but any jobs run through Slurm end up stuck in the CF state because the head node can’t communicate with the compute nodes. – notKnotTheory Sep 30 '21 at 01:54

score 0 · Answer 2 · answered Sep 30 '21 at 01:51

I was originally using two public subnets: one for the head node and one for the compute nodes. Switching the compute nodes to a private subnet solved the problem. Alternatively, not specifying a compute subnet and setting use_public_ips to true also solved the problem.

After either change, the compute nodes spun up successfully and I was able to run my jobs through Slurm.
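
For reference, the two fixes described above might look roughly like this in the [vpc default] section of the config from the question. This is a sketch, not a verified configuration: <my-private-subnet> is a hypothetical placeholder, and it is assumed to be a private subnet whose route table gives outbound internet access (for example through a NAT gateway). The second option assumes that ParallelCluster 2.x falls back to master_subnet_id when compute_subnet_id is omitted.

```ini
# Option A (assumed layout): keep use_public_ips = false, but move the
# compute nodes to a private subnet. <my-private-subnet> is a placeholder.
[vpc default]
vpc_id = <my-default-vpc>
master_subnet_id = <my-subnetc>
compute_subnet_id = <my-private-subnet>
use_public_ips = false

# Option B (alternative, shown commented out): drop compute_subnet_id so the
# compute nodes share the head node's public subnet, and assign public IPs.
# [vpc default]
# vpc_id = <my-default-vpc>
# master_subnet_id = <my-subnetc>
# use_public_ips = true
```

A plausible reading of why the original config failed: with use_public_ips = false and the compute nodes in a public subnet, the instances may have had no public IP and therefore no outbound route, so they could not finish bootstrapping.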
