Hi Roland:
On Fri, Apr 5, 2019 at 9:26 AM Roland Fehrenbacher <rf@q-leap.de> wrote:
"B" == Bryan Hill <bhill@ucsd.edu> writes:
Hi Bryan,
B> [2019-04-05T08:54:15.636] error: Node gpu-11 appears to have a
B> different slurm.conf than the slurmctld. This could cause issues
B> with communication and functionality. Please review both files
B> and make sure they are the same. If this is expected ignore, and
B> set DebugFlags=NO_CONF_HASH in your slurm.conf.
B> [2019-04-05T08:54:15.636] error: Setting node gpu-11 state to
B> DRAIN
B> [2019-04-05T08:54:15.637] drain_nodes: node gpu-11 state
B> set to DRAIN
B> [2019-04-05T08:54:15.637] error:
B> _slurm_rpc_node_registration node=gpu-11: Invalid argument

B> The error about the differing slurm.conf shows up for all nodes
B> whenever a config change occurs, but I'm assuming (maybe
B> incorrectly) this is because the slurm.conf is on an NFS mount
B> and the error can be ignored?
Often it can be ignored, but some changes obviously require a restart of slurmd on the nodes. Restarting does no harm even while jobs are running, and it is easily done via the Slurm 'Node State Management' dialog in the GUI.
Great, thanks for the tip!
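One way to check whether a node really sees a different slurm.conf (for example a stale copy behind the NFS mount) is to compare checksums of the file as read on the head node and on the compute node. The Python sketch below is only an illustration: the helper names conf_digest and same_conf are made up here, and SHA-256 over the raw bytes is an assumption chosen for the comparison, not necessarily the hash slurmctld itself computes.

```python
import hashlib
from pathlib import Path


def conf_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_conf(path_a, path_b):
    """Return True if both files are byte-for-byte identical."""
    return conf_digest(path_a) == conf_digest(path_b)
```

Running conf_digest on the slurm.conf path on each machine and comparing the two strings shows whether the copies differ at the byte level; it says nothing about which changes need a slurmd restart.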
Concerning the invalid argument: Can you please post the line in slurm.conf corresponding to node gpu-11?
NodeName=gpu-11 CoresPerSocket=12 Gres=gpu:titan:2 RealMemory=189773 Sockets=2 ThreadsPerCore=1
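Since the node line declares Gres=gpu:titan:2, one thing worth checking is that the generated gres.conf has a matching entry for the same node. The Python sketch below compares the two; it is a rough illustration, not a statement about what Slurm validates internally. The function names are hypothetical, the gres.conf line for gpu-11 is an assumed example modelled on the beo-204 line quoted further down, and only the single name:type:count form of Gres is handled.

```python
def parse_conf_line(line):
    """Split a 'Key=Value Key=Value ...' line into a dict with lower-cased keys."""
    return {key.lower(): value
            for key, value in (token.split("=", 1) for token in line.split())}


def gres_matches(node_line, gres_line):
    """Check that a slurm.conf node line and a gres.conf line agree.

    Assumes Gres has the form name:type:count (e.g. gpu:titan:2).
    """
    node = parse_conf_line(node_line)
    gres = parse_conf_line(gres_line)
    name, gres_type, count = node["gres"].split(":")
    return (node["nodename"] == gres.get("nodename")
            and name == gres.get("name")
            and gres_type == gres.get("type")
            and int(count) == int(gres.get("count", 0)))


# Node line as posted above.
node_line = ("NodeName=gpu-11 CoresPerSocket=12 Gres=gpu:titan:2 "
             "RealMemory=189773 Sockets=2 ThreadsPerCore=1")
# Hypothetical gres.conf line for gpu-11 (assumed, not taken from the real file).
assumed_gres_line = "NodeName=gpu-11 Name=gpu Type=titan Count=2"

print(gres_matches(node_line, assumed_gres_line))
```

If the type or count in gres.conf differs from what the node line declares, gres_matches returns False, which would be a reason to look more closely at the two files.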
Best,
Roland
B> On Fri, Apr 5, 2019 at 7:44 AM Bryan Hill <bhill@ucsd.edu> wrote:
>>
>> Hi Goswin:
>>
>> When I do as you did, all of my GPU nodes show up as "Wrongly
>> configured" and the nodes set themselves to "drained". I'll have
>> a closer look at the log files.
>>
>> On Fri, Apr 5, 2019 at 7:39 AM Goswin von Brederlow
>> <brederlo@q-leap.de> wrote:
>> >
>> > Hi Bryan,
>> >
>> > I'm not sure I understand the problem or where the problem
>> > is. I created a GPU gres group on our test system set to have 2
>> > geforce-rtx-2080-ti GPUs and assigned it to a host. I then
>> > edited the GPU type by double-clicking it and shortening it to
>> > 2080ti. When I then click preview the gres.conf reads as:
>> >
>> > NodeName=beo-204 Name=gpu Type=2080ti Count=2
>> >
>> > So there doesn't seem to be a problem in qluman-qt preventing
>> > the gres group from being simplified. After clicking "Write" the
>> > change should be written to the slurm config and slurm
>> > restarted to recognise it. That is all there should be to it.
>> >
>> >
>> > If the generated gres.conf looks right on your end too then
>> > please check the log files of slurm itself to see if it
>> > complained about something on restart that would make it ignore
>> > the specified gres group. Please attach the generated slurm
>> > config and log file if you notice anything odd and include the
>> > full command you tried that failed and the exact error it
>> > produced.
>> >
>> > Greetings, Goswin von Brederlow