Hi everyone,
This is my first post, and so far I love Qlustar! Just what I've been looking for, you've done amazing work here.
I noticed that my physical compute nodes, which are all diskless, fail when Qluman tries to write the Nameservice Configs. After trying lots of things, I found that the file /etc/sssd.conf is written with '644' permissions, which prevents sssd.service from starting. However, when I manually change this with 'chmod 600 /etc/sssd.conf' and restart the service with 'service sssd restart' while SSH'ed into a compute node, everything works perfectly.
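In case it helps anyone check their own nodes, here's roughly what I do by hand, as a quick Python sketch (my own throwaway code, not anything from Qlustar; the path is as written on my setup):

    import os
    import stat
    import subprocess

    SSSD_CONF = "/etc/sssd.conf"  # path as written on my nodes

    mode = stat.S_IMODE(os.stat(SSSD_CONF).st_mode)
    if mode != 0o600:
        # sssd refuses to start when its config is group/world-readable
        os.chmod(SSSD_CONF, 0o600)
        subprocess.run(["service", "sssd", "restart"], check=True)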
So my question is, how can I change the permissions of a config file which Qluman writes to all nodes? I think this is my last barrier to a properly provisioned cluster.
Thanks, -Mike
Hi again,
OK, I got brave and apparently fixed the issue. I edited two files:
/usr/lib/python3/dist-packages/qluman-12/server/admin.py
and
/usr/lib/python3/dist-packages/qluman-12/server/cfgman/genconfs.py
In this block of Python code in admin.py, I changed the mode argument to 0o600 on lines 998 and 1003:
file = CT.FileContent(name, "root", "root", 0o600, CT.FILE_KIND.PLAIN, sssd_conf)
...
file = CT.FileContent(cert, "root", "root", 0o600, CT.FILE_KIND.PLAIN, content.decode())
And the same in genconfs.py, on lines 1636 and 1642:
file = CT.FileContent(file_name, "root", "root", 0o600, CT.FILE_KIND.PLAIN, content)
...
files[file_name] = CT.FileContent(file_name, "root", "root", 0o600, CT.FILE_KIND.PLAIN, content.decode())
That seems to have done the trick!
Anyone know if this could cause any unexpected side effects?
Thanks, -Mike
Hello Mike,
On Thu, Mar 04, 2021 at 09:11:56PM +0000, hereiam@mit.edu wrote:
> That seems to have done the trick!
> Anyone know if this could cause any unexpected side effects?
I don't know, but in general, fixing such a thing by changing the code tends to carry more risk than doing it through the user interface, if you can manage that.
I have some additional information that might help pin down the cause: I have seen the problem you describe for some time, but on only one of a bunch of identical nodes. At first I thought this was a fluke, so I tried rebooting, but the problem persisted. I changed the permissions locally on the node, but after a reboot the problem was back. I couldn't think of any differences between that node and its siblings, so I just let it be.
After reading your post today, I looked at this again: our nodes aren't diskless; they use the default "ZFS" config for local scratch and /var. So there must be some other difference. Finally, I found it: the problem node had a Nameservice Config attached individually. After removing it, the problem is gone.
Of course, this doesn't answer the question why the Nameservice Config clobbers sssd.conf's permissions...
A.
Hi Mike,
I can reproduce this problem. I actually noticed it on Tuesday but didn't have time to track down why sssd was failing. Changing the permissions in the .py files is the right fix for this.
As for why it clobbers it: the sssd.conf is generated on the headnode according to the configuration of the node. It is then written by qluman-execd on boot, or whenever you write the Nameservice Config later. Qluman-execd only has a command to write a file with content, owner, and mode; there is no modify or update mode, and it always overwrites any existing file. So if it gets told to set the wrong mode, you get the wrong mode.
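Roughly speaking, the primitive behaves like this (just a sketch of the idea, not the actual qluman-execd code):

    import os
    import pwd
    import grp

    def write_file(path, owner, group, mode, content):
        uid = pwd.getpwnam(owner).pw_uid
        gid = grp.getgrnam(group).gr_gid
        # O_TRUNC: an existing file is simply replaced, never merged
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
        try:
            os.write(fd, content.encode())
            os.fchown(fd, uid, gid)
            os.fchmod(fd, mode)  # enforce mode even if the file pre-existed
        finally:
            os.close(fd)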
Hope that explains it a bit.
Goswin von Brederlow
Hi Ansgar and Goswin,
Thank you very much for the explanation! It did seem that qluman-execd always writes a fresh file: any manual changes I made to a file or its permissions over SSH were overwritten during each Qluman node synchronization.
All is working well now; even after reboots, sssd.conf gets written with the correct permissions. Is there any mechanism for me to submit a patch to Qluman, or can you take it from here?
Next up, I'm chasing down where /etc/systemd/logind.conf gets written to the compute nodes, as I'm building somewhat of a Frankenstein cluster: a big pile of old laptops discarded by others, serving as a hodge-podge cluster of diskless, very power-efficient nodes. I can start a new thread on that if I don't figure it out soon. That's my home/hobby cluster; the one I wrote about above is a regular rack of identical, diskless 1U server nodes.
Cheers, -Mike
Hi everyone,
Got it! There is a line in /etc/systemd/logind.conf that controls what happens when a laptop lid is closed:
#HandleLidSwitch=suspend
Changing this to:
HandleLidSwitch=ignore
while editing the disk image with 'qlustar-image-edit -s Standard-focal', then rebooting the nodes, did the trick. Now I can use all sorts of old laptops as additional nodes: some lack screens or keyboards, none have hard drives, and all have various quirks.
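For reference, the edit itself boils down to flipping that one line; a throwaway Python snippet like this (my own, run while editing the image) does it:

    import re
    from pathlib import Path

    conf = Path("/etc/systemd/logind.conf")
    text = conf.read_text()
    # uncomment/override the default "suspend" action for the lid switch
    text = re.sub(r"^#?HandleLidSwitch=.*$", "HandleLidSwitch=ignore",
                  text, flags=re.M)
    conf.write_text(text)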
Cheers, -Mike
"M" == hereiam hereiam@mit.edu writes:
Hi Mike,
M> Hi everyone, Got it! There is a line in /etc/systemd/logind.conf
M> that controls what happens when a laptop lid is closed:
>>>> #HandleLidSwitch=suspend
M> Changing this to:
>>>> HandleLidSwitch=ignore
M> while editing the disk image with 'qlustar-image-edit -s
M> Standard-focal', then rebooting the nodes, did the trick. Now I can
M> use all sorts of old laptops as additional nodes: some lack screens
M> or keyboards, none have hard drives, and all have various quirks.
Thanks for reporting. Glad to see you're able to do clustering with your old laptops :)
Note that changes made with 'qlustar-image-edit -s' will be gone when you update your cluster next time. If you want to make this permanent, you will have to use 'qlustar-image-edit -e' as described at http://docs-dev.qlustar.com/Qlustar/12.0/ClusterOS/administration-manual/nod...
Best,
Roland
Hi Roland,
Thank you for that tip; it worked perfectly. I had indeed been editing the images using the '-s' option, noticed the changes were overwritten upon update, and had just been remembering to make them again afterwards. Of course, that's not a sustainable mode of operation, as problems typically arise somewhere between my keyboard and my chair.
I've got another question for you, related to this one. No matter what I've tried, I can't get <headnode>:/srv/data NFS-mounted to /data on my diskless compute nodes using the Qluman GUI, and the same goes for <headnode>:/srv/apps to /apps. I've confirmed that my Network FS Mounts map the Export Paths /srv/data and /srv/apps to the Mountpoints /data and /apps via NFS, and that my Filesystem Exports list /srv/data and /srv/apps as Export Paths on both my IB and Boot networks (I've got an Infiniband switch and a Gigabit one too) from my headnode server (beosrv-c).
I've also confirmed that they do show up in /etc/exports on my headnode.
My permanent but sloppy fix has been to add NFS mount entries to the /etc/fstab file in my Qlustar image using the editing mode mentioned above. While this works perfectly, I realize that if my headnode's IP address were ever to change, this file would also need to be updated.
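For reference, the workaround entries look roughly like this (with <headnode-ip> standing in for my headnode's actual address):

    <headnode-ip>:/srv/data  /data  nfs  defaults  0  0
    <headnode-ip>:/srv/apps  /apps  nfs  defaults  0  0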
Is anyone able to reproduce this issue using a headnode with only diskless nodes?
I'd be happy to upload any relevant Qluman config files if it would help.
Thanks as always, -Mike
"M" == hereiam hereiam@mit.edu writes:
Hi Mike,
M> I've got another question for you, related to this one. No matter
M> what I've tried, I can't get <headnode>:/srv/data NFS-mounted to
M> /data on my diskless compute nodes using the Qluman GUI, and the
M> same goes for <headnode>:/srv/apps to /apps. I've confirmed that my
M> Network FS Mounts map the Export Paths /srv/data and /srv/apps to
M> the Mountpoints /data and /apps via NFS, and that my Filesystem
M> Exports list /srv/data and /srv/apps as Export Paths on both my IB
M> and Boot networks (I've got an Infiniband switch and a Gigabit one
M> too) from my headnode server (beosrv-c).
M> I've also confirmed that they do show up in /etc/exports on my
M> headnode.
M> My permanent but sloppy fix has been to add NFS mount entries to
M> the /etc/fstab file in my Qlustar image using the editing mode
M> mentioned above. While this works perfectly, I realize that if my
M> headnode's IP address were ever to change, this file would also
M> need to be updated.
M> Is anyone able to reproduce this issue using a headnode with only
M> diskless nodes?
M> I'd be happy to upload any relevant Qluman config files if it would
M> help.
Please check the files generated with 'Preview config' in the GUI, as I explained in my previous mail. There you can see the systemd mount units relevant for a node, from which you can deduce the mount command used.
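For a mount like yours, such a unit typically looks something like this (illustrative only; check 'Preview config' for the real generated content):

    # data.mount (illustrative)
    [Unit]
    Description=NFS mount of /srv/data from the headnode

    [Mount]
    What=<headnode-ip>:/srv/data
    Where=/data
    Type=nfs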
Best,
Roland