"K" == Kwinten Nelissen <kwinten.nelissen@gmail.com> writes:
Hi Kwinten,
K> Hi, Something went wrong when adding new nodes. The configuration
K> file was somehow not fully written. So I wrote it again and
K> restarted slurm on all nodes. All nodes were affected, but
K> the problem seems to be resolved now.
the available RAM reported by nodes cannot be predicted with absolute accuracy, since the size of the kernel and other factors influence it. So it can happen that, after an OS update, the amount of RAM available to slurm jobs on the nodes changes, and as a consequence the slurm config needs to be rewritten.
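One way to leave room for such fluctuations (a sketch of mine, not something from this thread): run `slurmd -C` on the node to print the hardware slurmd actually detects, and set `RealMemory` in slurm.conf a little below the reported value, so a kernel update cannot push the node under its configured memory. The node name and all numbers below are made up for illustration:

```
# Hypothetical slurm.conf fragment: suppose slurmd -C on beo-01 reports
# RealMemory=64215. Configure a value slightly below that to leave
# headroom for kernel/OS variation across updates.
NodeName=beo-01 CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000
```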
K> The Error read as follows:
K> [2018-10-10T09:14:43.306] error: _slurm_rpc_node_registration
K> node=beo-02: Invalid argument
K> [2018-10-10T09:14:43.306] error: Node beo-01 appears to have a
K> different slurm.conf than the slurmctld. This could cause issues
K> with communication and functionality. Please review both files
K> and make sure they are the same. If this is expected ignore, and
K> set DebugFlags=NO_CONF_HASH in your slurm.conf.
K> [2018-10-10T09:14:43.306] error: Node beo-01 has low
K> socket*core*thread count (16 < 32)
K> [2018-10-10T09:14:43.306] error: Node beo-01 has low cpu count (16 < 32)
K> [2018-10-10T09:14:43.306] error: _slurm_rpc_node_registration
K> node=beo-01: Invalid argument
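The "different slurm.conf" error comes from slurmctld comparing configuration-file hashes when a node registers. A minimal sketch of doing that check by hand (the helper name `conf_same` and the node/path names are my assumptions, not from the thread); in practice you would first fetch each node's copy with scp or pdsh:

```shell
# conf_same FILE_A FILE_B
# Succeeds when both slurm.conf copies have the same md5 checksum,
# i.e. slurmctld and slurmd would agree on the config hash.
conf_same() {
  a=$(md5sum "$1" | cut -d' ' -f1)
  b=$(md5sum "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}

# Hypothetical usage once a node's copy has been fetched:
#   scp beo-01:/etc/slurm/slurm.conf /tmp/beo-01.conf
#   conf_same /etc/slurm/slurm.conf /tmp/beo-01.conf || echo "beo-01 differs"
```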
K> Moreover, it seems that when a node is added, at first boot
K> slurm detects the RAM etc. wrongly. This can be resolved by simply
K> restarting slurm on the newly added node. Perhaps the update would
K> indeed resolve this issue.
Indeed, Qlustar 10.1 substantially improved upon the reporting of available RAM for slurm jobs.
Best,
Roland