Dear Qlustar experts,
after a successful installation of the latest version of Qlustar (ver. 11.0.1), I am facing a problem with the setup of the compute nodes: they do not mount the NFS directories for apps and home. To me, everything looks correctly set up (checking with the qluman-qt application): the Filesystem Exports and Network FS Mounts look correctly defined and assigned to the configs of the nodes.
The only sign that something in the configuration of the nodes may not be correct is an empty /etc/fstab, but probably the mount mechanism is different from standard mounts.
What happens is that, after ssh-ing to a compute node, neither the apps nor the home directories are found. With the previous version 10.0 I did not have this issue (on a different, older cluster without IB).
Please, can you give me some hints about what I should check to fix this issue?
Thank you in advance. Best regards,
Franco
"F" == marra marra@irc.cnr.it writes:
Hi Franco,
F> Dear Qlustar experts, after a successful installation of the latest version of Qlustar (ver. 11.0.1), I am facing a problem with the setup of the compute nodes: they do not mount the NFS directories for apps and home. To me, everything looks correctly set up (checking with the qluman-qt application): the Filesystem Exports and Network FS Mounts look correctly defined and assigned to the configs of the nodes.
Have you tried doing an ls on them? They are automount units and will only be mounted once accessed.
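For example, assuming the /data/home mount point that appears later in this thread (systemd derives the unit name from the mount path, so /data/home becomes data-home):

  # Accessing the mount point triggers the automount unit
  ls /data/home

  # Inspect the state of the automount and the mount it wraps
  systemctl status data-home.automount data-home.mount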
F> The only sign that something in the configuration of the nodes may not be correct is an empty /etc/fstab, but probably the mount mechanism is different from standard mounts.
Yes, it is: it uses systemd mount units. You can check the systemd config files for these mount units using the context menu (right mouse button) of a node in the QluMan GUI's Enclosure view and selecting the entry "Preview config". In the displayed filesystem tree, look for the *.mount files and select them. Their content shows the arguments of the corresponding mount command. On the net-booted node itself, you can look into the systemd journal using 'journalctl -xe' and check what error (if any) occurs when the mount is activated.
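As an illustrative sketch (not the exact files QluMan generates), such a unit pair for the /data/home mount from this thread might look like:

  # data-home.mount -- describes how to mount the filesystem
  [Unit]
  Description=Mount point /data/home

  [Mount]
  What=beosrv-ib:/srv/data/home
  Where=/data/home
  Type=nfs
  Options=proto=rdma,port=20049

  # data-home.automount -- mounts it on first access
  [Unit]
  Description=Automount point /data/home

  [Automount]
  Where=/data/home

The Options= line is where the choice of transport (RDMA vs. TCP) would show up.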
Best,
Roland
F> What happens is that, after ssh-ing to a compute node, neither the apps nor the home directories are found. With the previous version 10.0 I did not have this issue (on a different, older cluster without IB).
F> Please, can you give me some hints about what I should check to fix this issue?
F> Thank you in advance. Best regards,
F> Franco
"F" == marra@irc cnr it marra@irc.cnr.it writes:
Hi Franco,
It looks as if the IB network might not be exported in /etc/exports. Also, please check whether you have set NEED_RDMA="yes" in /etc/default/nfs-kernel-server; this is needed to enable NFSoRDMA.
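For illustration, the export and the RDMA switch would look roughly like this (the 192.168.53.0/24 subnet is taken from the ib0 address shown later in this thread; substitute your own IB network and export options):

  # /etc/exports -- export the path to the IB subnet as well
  /srv/data/home  192.168.53.0/24(rw,no_subtree_check)

  # /etc/default/nfs-kernel-server -- enable the RDMA transport
  NEED_RDMA="yes"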
If you don't want to use NFSoRDMA, you can remove the IB network from your Filesystem Exports resources in QluMan altogether, or just uncheck "Allow RDMA" in the FS mounts definitions. In the latter case, the IB network can still be used for NFS, but with TCP/IP rather than RDMA.
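The difference amounts to the transport option handed to mount.nfs; roughly (20049 is the standard NFS/RDMA port):

  # With "Allow RDMA" checked: NFS over RDMA
  mount -t nfs -o proto=rdma,port=20049 beosrv-ib:/srv/data/home /data/home

  # With it unchecked: plain TCP over the IB interface (IPoIB)
  mount -t nfs -o proto=tcp beosrv-ib:/srv/data/home /data/home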
Best,
Roland
F> Dear Roland, first of all I would like to thank you very much for your kind and detailed answer, which allowed me to understand more details about this very nice Qlustar distribution and to investigate my issue better. I hope I will not bore you too much with my reply.
F> When I try to log in as a regular user from the frontend to a compute node (standard-node), I get a password request and the NFS directories are not mounted. These are the relevant lines of the output of the journalctl -xe command:
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.automount: Got automount request for /data/home, triggered by 38058 (sshd)
F> Jul 25 12:16:34 HP4 systemd[1]: Mounting Mount point /data/home...
F> Jul 25 12:16:34 HP4 kernel: RPC: Registered rdma transport module.
F> Jul 25 12:16:34 HP4 kernel: RPC: Registered rdma backchannel transport module.
F> Jul 25 12:16:34 HP4 mount[38060]: mount.nfs: access denied by server while mounting beosrv-ib:/srv/data/home
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Mount process exited, code=exited status=32
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Failed with result 'exit-code'.
F> Jul 25 12:16:34 HP4 systemd[1]: Failed to mount Mount point /data/home.
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.automount: Got automount request for /data/home, triggered by 38058 (sshd)
F> Jul 25 12:16:34 HP4 systemd[1]: Mounting Mount point /data/home...
F> Jul 25 12:16:34 HP4 mount[38065]: mount.nfs: access denied by server while mounting beosrv-ib:/srv/data/home
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Mount process exited, code=exited status=32
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Failed with result 'exit-code'.
F> Jul 25 12:16:34 HP4 systemd[1]: Failed to mount Mount point /data/home.
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.automount: Got automount request for /data/home, triggered by 38058 (sshd)
F> Jul 25 12:16:34 HP4 systemd[1]: Mounting Mount point /data/home...
F> Jul 25 12:16:34 HP4 mount[38067]: mount.nfs: access denied by server while mounting beosrv-ib:/srv/data/home
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Mount process exited, code=exited status=32
F> Jul 25 12:16:34 HP4 systemd[1]: data-home.mount: Failed with result 'exit-code'.
F> Jul 25 12:16:34 HP4 systemd[1]: Failed to mount Mount point /data/home.
F> Jul 25 12:16:34 HP4 sshd[38058]: Rhosts authentication refused for marra: no home directory /data/home/marra
F> From this I understand that the NFS request goes to the InfiniBand interface of the head-node. However, I have a virtual front end that lacks an IB interface, and the Filesystem Exports config looks like:
F> Name: Home
F> Server: beosrv-c
F> Export Path: /srv/data/home
F> Network priorities: Boot IB
F> I do not know exactly how the priority is managed, but the mount command on the VM-FE shows me:
F> beosrv-c:/srv/data/home on /data/home type nfs ...
F> so I am sure the home directory is mounted over the Ethernet network. Is it normal to mix mounting options for the FE and the standard nodes?
F> The Network FS Mounts dialog for the same directory in qluman-qt looks like:
F> Resource: Home
F> Export Path: /srv/data/home
F> [blank]
F> [ ] Override Network: (grayed Boot)
F> [X] Allow RDMA
F> The preview config of the node HP4 shows me some alerts (red dots or green/red dots) for the following items:
F> (RED/GREEN) /etc -> (RED) Network -> (RED) interfaces.d/qluman
F> (I suppose this is not relevant to my issue)
F>
F> ##################################################################
F> #------------- File is auto-generated by Qluman! -------------#
F> #------------ Manual changes will be overwritten! ------------#
F> #----------------------------------------------------------------#
F>
F> auto BOOT
F> iface BOOT inet dhcp
F>     metric 10
F>
F> auto ib0
F> iface ib0 inet static
F>     address 192.168.53.104
F>     netmask 24
F>     pre-up /lib/qlustar/ib-initialize
F> (RED/GREEN) /etc -> (RED/GREEN) qlustar -> (RED) Disk config
F>
F> # ZFS config for single disk (/dev/sda):
F> # Zpool name: SYS, 8GB zvol for swap (not activated)
F> # Filesystems: /var (max 2GB) + /scratch - both compressed
F>
F> [BASE]
F> ZPOOLS = SYS
F> ZFS = var, scratch
F> #ARC_LIMIT = 1024
F> ZVOLS = swap
F>
F> [SYS]
F> vdevs = V-SYS
F>
F> [V-SYS]
F> devs = /dev/sda
F> type =
F>
F> [swap]
F> zpool = SYS
F> size = 8G
F>
F> [var]
F> zpool = SYS
F> quota = 20G
F> reservation = 20G
F> compress = lz4
F>
F> [scratch]
F> zpool = SYS
F> compress = lz4
F> (RED/GREEN) sysconfig -> (RED/GREEN) network-scripts -> (RED) ifcfg-BOOT
F>
F> ##################################################################
F> #------------- File is auto-generated by Qluman! -------------#
F> #------------ Manual changes will be overwritten! ------------#
F> #----------------------------------------------------------------#
F> DEVICE=BOOT
F> BOOTPROTO=dhcp
F> ONBOOT=yes
F> TYPE=Ethernet
F> HWADDR=a0:d3:c1:fd:9c:a8
F> Maybe a solution could be to delete the IB network from the config of the Filesystem Exports, so as to be consistent with the network protocol both for the VM-FE and the nodes.
F> If you have time to give me your hints, I would really appreciate your help.
F> Thank you and best regards,
F> Franco
Dear Roland,
sorry for not having replied earlier: due to the COVID emergency, I was only able to get my hands on the machine today.
Good news: I was able to get the NFS filesystems mounted on the nodes! Thank you so much for your help.
I think it may be useful to tell you what I did and what did not work. You were right about the entries in /etc/exports: no path was exported on the IB network, even though this network was listed in the qluman-qt Filesystem Exports menu under the Network priorities submenu.

First I tried to copy and paste all the Ethernet (Boot) exports and change the IP entries to reflect the IB network. After a restart of the full cluster (I was not sure which services needed to be restarted), I got a different behavior when trying to log in on the nodes: the login was unsuccessful because of a timeout on the NFS mount of the home directory. By the way, I was unable to find this modification reflected in the qluman-qt entry "Network FS Mounts".

Then I checked the entry NEED_RDMA="yes" in the file /etc/default/nfs-kernel-server: it was missing, even though the corresponding checkbox was activated in qluman-qt. I did not find any way to edit this configuration within qluman-qt.

At this point, I chose to avoid NFS over IB altogether, so I edited /etc/exports again, deleting the IB entries, and in qluman-qt I deleted the reference to IB in the Network priorities submenu of the Filesystem Exports menu. Et voilà: after another restart of the cluster, users are now able to log in to the compute nodes, landing in their home directories.
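For what it's worth, when only /etc/exports changes, a full cluster restart should not be necessary; the standard nfs-kernel-server tooling can apply it in place (the NEED_RDMA change, on the other hand, does need the NFS server restarted):

  # Re-read /etc/exports and apply added/removed exports
  exportfs -ra

  # Show what is currently exported, with the effective options
  exportfs -v

  # Needed after changing NEED_RDMA in /etc/default/nfs-kernel-server
  systemctl restart nfs-kernel-server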
Your suggestions were essential in solving this issue; thank you again. If I can be of any help for a better analysis of the reported behavior, do not hesitate to contact me.
A side note: I get continuous disconnections from the qluman-qt interface, while all connections to the server console or to the nodes are perfectly stable.
My best regards,
Franco
"F" == marra marra@irc.cnr.it writes:
Hi Franco,
F> Dear Roland, sorry for not having replied earlier: due to the COVID emergency, I was only able to get my hands on the machine today.
F> Good news: I was able to get the NFS filesystems mounted on the nodes! Thank you so much for your help.
glad to hear that.
F> I think it may be useful to tell you what I did and what did not work. You were right about the entries in /etc/exports: no path was exported on the IB network, even though this network was listed in the qluman-qt Filesystem Exports menu under the Network priorities submenu. First I tried to copy and paste all the Ethernet (Boot) exports and change the IP entries to reflect the IB network. After a restart of the full cluster (I was not sure which services needed to be restarted), I got a different behavior when trying to log in on the nodes: the login was unsuccessful because of a timeout on the NFS mount of the home directory. By the way, I was unable to find this modification reflected in the qluman-qt entry "Network FS Mounts". Then I checked the entry NEED_RDMA="yes" in the file /etc/default/nfs-kernel-server: it was missing, even though the corresponding checkbox was activated in qluman-qt. I did not find any way to edit this configuration within qluman-qt.
Please note that many configurations on the head-node cannot be done with the GUI. This is one of them.
F> At this point, I chose to avoid NFS over IB altogether, so I edited /etc/exports again, deleting the IB entries, and in qluman-qt I deleted the reference to IB in the Network priorities submenu of the Filesystem Exports menu. Et voilà: after another restart of the cluster, users are now able to log in to the compute nodes, landing in their home directories.
F> Your suggestions were essential in solving this issue; thank you again. If I can be of any help for a better analysis of the reported behavior, do not hesitate to contact me.
F> A side note: I get continuous disconnections from the qluman-qt interface, while all connections to the server console or to the nodes are perfectly stable.
This could be Slurm-related and might be fixed in an upcoming release.
Best,
Roland