NFS boot scripts do not run
by stephanie.kerckhof@ec.europa.eu
Hello,
We are encountering problems with the NFS boot scripts. We already used this mechanism on another Qlustar installation, where it worked fine, but on our new install the scripts no longer seem to execute at boot. Here is the procedure we followed:
1. We created a script in /etc/qlustar/common/rc.boot
2. We made the script executable using chmod +x
If I look on the node, the script is visible in the NFS folder /etc/qlustar/common/rc.boot. If I execute it manually, it behaves as I expect.
Do you have any idea why the NFS scripts might fail to execute at boot time?
I was also wondering whether there is a way to start a service automatically at boot on a node?
Thanks already for your help.
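For the two steps listed in the post above, a quick sanity check on the node can rule out the most common generic causes of a script being skipped. This is only a sketch: the function name check_boot_scripts is made up for illustration, and the check does not reproduce whatever logic Qlustar itself uses to run these scripts at boot.

```shell
# List files in a directory that would likely be skipped or fail when run
# automatically: not executable, or lacking a "#!" interpreter line.
# Generic sanity check only (assumption: it says nothing about Qlustar's
# own boot logic).
check_boot_scripts() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip subdirectories and empty globs
        if [ ! -x "$f" ]; then
            echo "not executable: $f"
        fi
        first=
        IFS= read -r first < "$f" || true   # first line, even without newline
        case $first in
            '#!'*) ;;                       # has an interpreter line
            *) echo "no shebang: $f" ;;
        esac
    done
}

# Typical use on a node, with the path from the post:
#   check_boot_scripts /etc/qlustar/common/rc.boot
```

No output means every regular file in the directory is executable and starts with an interpreter line, so the cause would have to be looked for elsewhere.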
4 months, 3 weeks
Compute nodes do not boot
by Alexey Yaremchuk
Dear experts,
I am trying to install a Qlustar OS on several Ethernet-linked
machines and still compute nodes do not boot. The installer reports
that there have been no errors, head and FE VM nodes do start, Qluman
sees them, and I could not find any error messages in dmesg except
an insignificant note that the firmware load for regulatory.db failed. I
installed a Debian system on one of compute nodes and upon configuring
the network interface with a static address, established an ssh
connection both with the head and VM-FE nodes. At least this ensures
that the cables are OK. If I configure it for DHCP, it does not discover
a DHCP server. If I force the BIOS to boot from the network, I receive either
PXE-E51: No DHCP or proxyDHCP offers were received
from Intel Boot Agent, if I make BIOS use only Legacy OpROM, or
Media Present ...
Start PXE Boot over IPv4 ...
with a silent subsequent fall back into BIOS Setup, if I choose to use
UEFI driver. I tried different combinations of
Boot Device Control {UEFI_and_Legacy_OpROM|UEFI_only|Legacy_OpROM_only},
Boot from Network Devices {Legacy_OpROM_first|UEFI_driver_first},
OS TYPE {Other_OS|Windows_UEFI_mode} in BIOS Setup, and none of them
worked. Qluman confirms that Boot network is configured as dhcp,
and systemctl says that there are no failed services.
Since I am the only one on the forum with such a problem, it can
probably be fixed easily. However, I have no idea what to check and look
forward to your help.
Thanks,
Alexei
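Since the PXE message quoted above says that no DHCP offers were received, one generic first check on the head node is whether anything is listening on UDP port 67, the standard DHCP server port. The sketch below assumes a Linux head node with the ss tool from iproute2; the function name has_dhcp_listener is made up for illustration, and the post does not say which service is meant to answer DHCP requests.

```shell
# Succeed (exit status 0) if the socket listing on standard input contains
# any address field ending in ":67", the UDP port used by DHCP servers.
has_dhcp_listener() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /:67$/) found = 1 }
         END { exit !found }'
}

# Typical use on the head node:
#   ss -uln | has_dhcp_listener && echo "something listens on UDP 67"
```

If nothing is listening, the client side (BIOS settings, cabling) is probably not the place to look. If something is listening, the next question is whether it is bound to the interface that faces the compute nodes.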
5 months, 1 week
nfs fails on compute nodes
by marra@irc.cnr.it
Dear Qlustar experts,
after a successful installation of the latest version of Qlustar (ver 11.0.1), I am facing a problem with the setup of the compute nodes: they do not mount the NFS directories for apps and home.
To me, everything looks correctly set up (checking with the qluman-qt application): Filesystem Exports and Network FS Mounts look correctly defined and assigned to the configs of the nodes.
The only sign that something in the node configuration may be wrong is an empty /etc/fstab, but the mount mechanism is probably different from standard mounts.
What happens is that when I ssh to a compute node, neither the apps nor the home directories are found. With the previous version 10.0 I did not have this issue (on a different, older cluster without IB).
Please, can you give me some hints about what I have to check to fix this issue?
Thank you in advance. Best regards,
Franco
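Because an empty /etc/fstab does not by itself show what is or is not mounted, one way to see the actual state on a compute node is to read the kernel's mount table. The sketch below is a generic Linux check, not a description of how Qlustar sets up its mounts; the function name list_nfs_mounts is made up for illustration.

```shell
# Print the mount point of every NFS (nfs or nfs4) entry in a mount table
# read from standard input. The expected input format is that of
# /proc/mounts: device, mount point, filesystem type, options, ...
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }'
}

# Typical use on a node:
#   list_nfs_mounts < /proc/mounts
```

Empty output would mean that no NFS file system is currently mounted on that node, whatever the configuration in the GUI shows.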
5 months, 1 week
OpenMPI failure
by Kim Peterson
Qlustar
Kim here again; new Qlustar user, local university CS student. I have installed Qlustar on a Master node with 2 compute nodes attached, and I have installed the QluMan GUI Singularity container on a standalone Front End computer. All seems to be working well. I got to Section 1.8 near the end of the First Steps Guide and followed the instruction to log in to the User account I had set up on the Head Node (Section 1.7) and run a test to verify the installation of MPI. I get no response to the instruction
mpicc.openmpi-gcc -o hello-world-c hello-world.c
I note that I get a response of: "no such instruction found" when I try:
mpirun --version
Does anyone have an idea what I might be doing wrong, or what I might do to get OpenMPI working?
Kim
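Before looking at MPI itself, it can help to confirm whether the two commands named in the post are on the PATH of the account being used. A minimal sketch follows; the function name report_cmds is made up for illustration, and the check says nothing about which package would provide a missing command.

```shell
# Report whether each named command can be found on the current PATH.
report_cmds() {
    for c in "$@"; do
        if command -v "$c" >/dev/null 2>&1; then
            echo "found: $c"
        else
            echo "missing: $c"
        fi
    done
}

# Typical use, with the command names from the post:
#   report_cmds mpicc.openmpi-gcc mpirun
```

A "missing" line for mpirun would match the error quoted above and would point to an installation or PATH problem rather than to the test program.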
5 months, 2 weeks
install qluman-qt fails
by Kim Peterson
I have set up a small cluster with a Master and 2 compute nodes. The Master has public and private Ethernet interfaces. The public interface is the connection to the internet. The private interface is the connection with the compute nodes.
I installed Qlustar 11.0.1-0 on the Master, which completed with no issue. I had to restart the network services with
systemctl restart systemd-resolved.service
after the second re-boot during installation, and then proceeded with the First Steps checklist. I did not configure a Demo Cluster (Step 1.3) and I did not install additional software (Step 1.5). First Steps went fine up until Step 1.6, "Running the Cluster Manager QluMan". Using the Head Node, I obtained a one-time token and saved it to /media/token (Step 1.6.1). When I attempted to install QluMan (Step 1.6.2), the review of packages that appeared on the monitor looked normal, but the installation of each package failed. The failure indication was the same for each package: a notice that started with "Failed to fetch http://repo.qlustar.com/repo/ubuntu/pool/ ..." and ended with "... temporary failure resolving 'repo.qlustar.com'".
I did a little communication check of the public network:
ping www.google.com
produced "Temporary failure in name resolution"
ping 8.8.8.8
produced a steady stream of reports of "64 bytes from 8.8.8.8: icmp_seq=<n> ttl=116 time=78.2 ms", with <n> incrementing each time.
I also note that when I power up the 2 compute nodes after the successful Qlustar installation on the Master, they do not boot, which I presume may be related to whatever has made the Master unable to resolve network IP addresses and names.
I am a CS student working on a project that requires a small cluster computer, so I have no particular skill with the technology here. I spent the better part of the day trying to figure out what I'm doing wrong, but I have failed. The issue is reproducible: I have done multiple installs from scratch and always get the same result. I note that the Qlustar OS installation and the early First Steps require access to the Qlustar.com repo, and that the "Final Reboot" at Step 1.2 appears to be when the Master node loses its ability to resolve network addresses and names and becomes unable to access the Qlustar.com repo. Do any of the kind listeners recognize my error, or could anyone recommend a sequence of steps I might take to isolate the problem?
Kim
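The two ping results in the post above (a host name fails, a numeric address answers) can be turned into a small repeatable check. This is a sketch assuming a Linux system with getent and ping available; the function names diagnose and probe are made up for illustration, and the script only classifies the symptom, it does not fix the resolver configuration.

```shell
# Interpret two observations: whether a host name resolved ("yes"/"no")
# and whether a numeric IP address answered a ping ("yes"/"no").
diagnose() {
    resolved=$1
    reachable=$2
    if [ "$reachable" = yes ] && [ "$resolved" = no ]; then
        echo "connectivity works, name resolution fails"
    elif [ "$reachable" = yes ]; then
        echo "connectivity and name resolution both work"
    else
        echo "no connectivity to the tested IP address"
    fi
}

# Gather the two observations on a live system (needs network access).
# $1 is a host name to resolve, $2 is a numeric IP address to ping.
probe() {
    if getent hosts "$1" >/dev/null 2>&1; then r=yes; else r=no; fi
    if ping -c 1 "$2" >/dev/null 2>&1; then p=yes; else p=no; fi
    diagnose "$r" "$p"
}

# Typical use, with the names from the post:
#   probe repo.qlustar.com 8.8.8.8
```

The results reported in the post correspond to the first case, which narrows the problem to name resolution on the Master rather than to the public network link.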
6 months
Security/Bug fix releases 11.0.1.1-b519f1302 / 10.1.1.14-b521f1301
by Roland Fehrenbacher
The Qlustar releases 11.0.1.1-b519f1302 and 10.1.1.14-b521f1301 are ready,
including a number of security and bug fixes. Please check the following
web pages for details about security fixes and special update instructions:
https://qlustar.com/news/qsa-0715201-linux-kernel-vulnerabilities
https://qlustar.com/news/qsa-0715202-security-update-bundle
The following non-security related enhancements/bug fixes are included:
Qlustar 11.0.1.1-b519f1302 updates to/adds
+ the latest Lustre LTS release 2.12.5
+ the latest Nvidia graphics drivers 440.100 supporting recently introduced
  Nvidia hardware
+ Several Nvidia GPU metrics (temperature, load, fan) were added to Ganglia
+ Fixes
  - QluMan
    * Improve slurm job information updates on busy head-nodes
    * Allow multiple networks for external hosts in Filesystem Mounts
    * Fix exceptions when editing properties
  - GPU detection (possibly missing /dev/nvidia-uvm on hardware with
    many GPUs)
  - NIS was failing in certain situations due to wrong /etc/hosts
6 months