Hi All:
Is there a preferred way to change the name of a GPU gres type in Qluman? For example, we have to use
--gres=gpu:geforce-rtx-2080-ti:2
to request that specific model of GPU. I would like to simplify that to something like:
--gres=gpu:2080ti:2
When I attempt to change the names from what the GPU Wizard originally set up, things go badly: the gres resource is no longer recognized. Is there a clean way to do this?
Thanks, Bryan
Hi Bryan,
I'm not sure I understand the problem or where it is. I created a GPU gres group on our test system with 2 geforce-rtx-2080-ti GPUs and assigned it to a host. I then edited the GPU type by double-clicking it and shortening it to 2080ti. When I then click preview, the gres.conf reads as:
NodeName=beo-204 Name=gpu Type=2080ti Count=2
So there doesn't seem to be anything in qluman-qt preventing the gres type from being simplified. After clicking "Write" the change should be written to the slurm config and slurm restarted to recognise it. That is all there should be to it.
If the generated gres.conf looks right on your end too then please check the log files of slurm itself to see if it complained about something on restart that would make it ignore the specified gres group. Please attach the generated slurm config and log file if you notice anything odd and include the full command you tried that failed and the exact error it produced.
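As a sketch of what the two generated files need to agree on (the node name and count below are illustrative, matching the preview above):

```
# gres.conf (as previewed)
NodeName=beo-204 Name=gpu Type=2080ti Count=2

# slurm.conf: the node's Gres= entry must use the same type name
NodeName=beo-204 Gres=gpu:2080ti:2 ...
```

If the type names in the two files disagree, slurmd and slurmctld won't agree on what the node offers.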
Greetings, Goswin von Brederlow
Hi Goswin:
When I do as you did, all of my GPU nodes show up as "Wrongly configured" and the nodes set themselves to "drained". I'll have a closer look at the log files.
When I make the change, the preview of gres.conf looks good, but the logs look like this when I apply the write:
This is a single node (called gpu-11) with 2 x Titan Xp cards, which originally show up as TITAN-Xp in the gres groups; I then changed the type to simply "titan".
[2019-04-05T08:54:15.636] error: Node gpu-11 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-04-05T08:54:15.636] error: Setting node gpu-11 state to DRAIN
[2019-04-05T08:54:15.637] drain_nodes: node gpu-11 state set to DRAIN
[2019-04-05T08:54:15.637] error: _slurm_rpc_node_registration node=gpu-11: Invalid argument
The error about the differing slurm.conf shows up for all nodes whenever a config change occurs, but I'm assuming (maybe incorrectly) this is because the slurm.conf is on an NFS mount and the error can be ignored?
Thanks, Bryan
"B" == Bryan Hill bhill@ucsd.edu writes:
Hi Bryan,
B> The error about the differing slurm.conf shows up for all nodes
B> whenever a config change occurs, but I'm assuming (maybe incorrectly)
B> this is because the slurm.conf is on an NFS mount and the error can
B> be ignored?
Often it can be ignored, but some changes obviously require a restart of slurmd on the nodes. It does no harm to do this even while jobs are running, and it's done easily via the Slurm 'Node State Management' dialog in the GUI.
Concerning the invalid argument: Can you please post the line in slurm.conf corresponding to node gpu-11?
Best,
Roland
Hi Roland:
Great, thanks for the tip!
>> Concerning the invalid argument: Can you please post the line in
>> slurm.conf corresponding to node gpu-11?
NodeName=gpu-11 CoresPerSocket=12 Gres=gpu:titan:2 RealMemory=189773 Sockets=2 ThreadsPerCore=1
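The mismatch Roland is probing for can also be checked by hand; a throwaway sketch (not part of QluMan; the file paths are illustrative and the sample lines mirror this thread):

```shell
# Write the two sample lines to temp files (stand-ins for the real
# slurm.conf node entry and gres.conf entry).
cat > /tmp/slurm.conf.node <<'EOF'
NodeName=gpu-11 CoresPerSocket=12 Gres=gpu:titan:2 RealMemory=189773 Sockets=2 ThreadsPerCore=1
EOF
cat > /tmp/gres.conf.node <<'EOF'
NodeName=gpu-11 Name=gpu Type=titan Count=2
EOF

# Extract the type name from each file; slurmctld rejects the node's
# registration ("Invalid argument") when these disagree.
slurm_type=$(grep -oE 'Gres=gpu:[^: ]+' /tmp/slurm.conf.node | cut -d: -f2)
gres_type=$(grep -oE 'Type=[^ ]+' /tmp/gres.conf.node | cut -d= -f2)

if [ "$slurm_type" = "$gres_type" ]; then
  echo "gres type names match: $slurm_type"
else
  echo "MISMATCH: slurm.conf=$slurm_type gres.conf=$gres_type"
fi
```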
"B" == Bryan Hill bhill@ucsd.edu writes:
B> The error about the differing slurm.conf shows up for all nodes B> whenever a config change occurs, but I'm assuming (maybe B> incorrectly) this is because the slurm.conf is on an NFS mount B> and the error can be ignored? >> >> often it can be ignored. But some changes obviously require the >> restart of slurmd on the nodes. It doesn't harm to do this even >> while jobs are running and it's done easily via the Slurm 'Node >> State Management' dialog in the GUI.
B> Great, thanks for the tip!
Actually, thinking about it, this is almost certainly a situation where a restart of slurmd is required, since gres.conf would have also changed and there is no way that slurmd would know about this change without a restart unless it does an automatic reread as a result of something like inotify which I highly doubt ...
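A minimal sketch of that manual path (assuming the sysvinit-style service commands used elsewhere in this thread; the node list and the use of pdsh are illustrative):

```
# On the head node: restart slurmctld so slurm.conf/gres.conf are reread
service slurmctld restart

# On the affected compute nodes: restart slurmd (pdsh assumed available)
pdsh -w gpu-11 'service slurmd restart'

# Once the node re-registers cleanly, clear the DRAIN state
scontrol update NodeName=gpu-11 State=RESUME
```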
Just an update: doing a "service slurmctld restart" on the head node seemed to fix the gres rename issue I was having.
"B" == Bryan Hill bhill@ucsd.edu writes:
B> Just an update: doing a "service slurmctld restart" on the head B> node seemed to fix the gres rename issue I was having.
OK, good. It's odd that this was necessary, though, since writing the slurm config with QluMan restarts slurmctld as well, as the last step of the 'write config' procedure.
Yeah, I don't know why. I'm going to try the rename again to confirm when I get a chance.
The only other thing is that all of the nodes now show the GPUs as "wrongly configured" in the GUI.
"B" == Bryan Hill bhill@ucsd.edu writes:
Hi Bryan,
sorry for the late answer.
Yes, this is "by design" if you don't accept the QluMan-suggested name. Fixing that would need a more complicated change; we have it as a low-priority item on our ToDo list.
Best,
Roland
Not a problem at all.
Got it. The reason I needed the change was that users had to type:
geforce-rtx-2080-ti
so I shortened it to just 2080ti
_______________________________________________
Qlustar-General mailing list -- qlustar-general@qlustar.org
To unsubscribe send an email to qlustar-general-leave@qlustar.org