"K" == Kwinten Nelissen <kwinten.nelissen@gmail.com> writes:
Hi Kwinten,
K> Hi, Something went wrong when adding new nodes. The configuration
K> file was somehow not fully written. So I wrote it again and
K> restarted slurm on all nodes. All nodes were affected, but
K> the problem seems to be resolved now.
the available RAM reported by nodes cannot be predicted with absolute accuracy, since the size of the kernel and other factors influence it. So it can happen that, after an OS update, the amount of RAM available to slurm jobs on the nodes changes, and as a consequence the slurm config needs to be rewritten.
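One way to leave room for such fluctuations (a sketch of mine, not something from this thread): run `slurmd -C` on the node to print the hardware slurmd actually detects, and set `RealMemory` in slurm.conf a little below the reported value, so a kernel update cannot push the node under its configured memory. The node name and all numbers below are made up for illustration:

```
# Hypothetical slurm.conf fragment: suppose slurmd -C on beo-01 reports
# RealMemory=64215. Configure a value slightly below that to leave
# headroom for kernel/OS variation across updates.
NodeName=beo-01 CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000
```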
K> The Error read as follows:
K> [2018-10-10T09:14:43.306] error: _slurm_rpc_node_registration
K> node=beo-02: Invalid argument
K> [2018-10-10T09:14:43.306] error: Node beo-01 appears to have a
K> different slurm.conf than the slurmctld. This could cause issues
K> with communication and functionality. Please review both files
K> and make sure they are the same. If this is expected ignore, and
K> set DebugFlags=NO_CONF_HASH in your slurm.conf.
K> [2018-10-10T09:14:43.306] error: Node beo-01 has low
K> socket*core*thread count (16 < 32)
K> [2018-10-10T09:14:43.306] error: Node beo-01 has low cpu count (16 < 32)
K> [2018-10-10T09:14:43.306] error: _slurm_rpc_node_registration
K> node=beo-01: Invalid argument
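The "different slurm.conf" error comes from slurmctld comparing configuration-file hashes when a node registers. A minimal sketch of doing that check by hand (the helper name `conf_same` and the node/path names are my assumptions, not from the thread); in practice you would first fetch each node's copy with scp or pdsh:

```shell
# conf_same FILE_A FILE_B
# Succeeds when both slurm.conf copies have the same md5 checksum,
# i.e. slurmctld and slurmd would agree on the config hash.
conf_same() {
  a=$(md5sum "$1" | cut -d' ' -f1)
  b=$(md5sum "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}

# Hypothetical usage once a node's copy has been fetched:
#   scp beo-01:/etc/slurm/slurm.conf /tmp/beo-01.conf
#   conf_same /etc/slurm/slurm.conf /tmp/beo-01.conf || echo "beo-01 differs"
```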
K> Moreover, it seems that when a node is added, at first boot
K> slurm detects the RAM etc. wrongly. This can be resolved by simply
K> restarting slurm on the newly added node. Perhaps the update would
K> indeed resolve this issue.
Indeed, Qlustar 10.1 substantially improved upon the reporting of available RAM for slurm jobs.
Best,
Roland