Dear experts,
I am trying to install Qlustar OS on several Ethernet-linked machines, but the compute nodes still do not boot. The installer reports no errors, the head and FE VM nodes do start, Qluman sees them, and I could not find any error messages in dmesg except an insignificant note that the firmware load for regulatory.db failed.
I installed a Debian system on one of the compute nodes and, after configuring the network interface with a static address, established an ssh connection with both the head and VM-FE nodes. At least this confirms that the cables are OK. If I configure the interface for DHCP, it does not discover a DHCP server. If I force the BIOS to boot from the network, I receive either
PXE-E51: No DHCP or proxyDHCP offers were received
from the Intel Boot Agent, if I make the BIOS use only the Legacy OpROM, or
Media Present ... Start PXE Boot over IPv4 ...
with a silent subsequent fallback into BIOS Setup, if I choose to use the UEFI driver.
I tried different combinations of Boot Device Control {UEFI_and_Legacy_OpROM|UEFI_only|Legacy_OpROM_only}, Boot from Network Devices {Legacy_OpROM_first|UEFI_driver_first}, and OS TYPE {Other_OS|Windows_UEFI_mode} in BIOS Setup, and none of them worked. Qluman confirms that the Boot network is configured as dhcp, and systemctl says that there are no failed services.
Since I am the only one on the forum with such a problem, it can probably be fixed easily. However, I have no idea what to check and look forward to your help.
Thanks, Alexei
"A" == Alexey Yaremchuk a.yaremch@yandex.ru writes:
Hi Alexey,
A> Dear experts, I am trying to install a Qlustar OS on several
A> Ethernet-linked machines and still compute nodes do not boot. The
A> installer reports that there have been no errors, head and FE VM
A> nodes do start, Qluman sees them and I could not find any error
A> messages from dmesg except insignificant note that firmware load
A> for regulatory.db failed. I installed a Debian system on one of
A> compute nodes and upon configuring the network interface with a
A> static address, established an ssh connection both with the head
A> and VM-FE nodes. At least this ensures that cables are OK. If I
A> configure it as dhcp, it does not discover dhcp server. If I
A> force BIOS boot from the network, I receive either
A> PXE-E51: No DHCP or proxyDHCP offers were received
A> from Intel Boot Agent, if I make BIOS use only Legacy OpROM, or
A> Media Present ... Start PXE Boot over IPv4 ...
A> with a silent subsequent fall back into BIOS Setup, if I choose
A> to use UEFI driver. I tried different combinations of Boot Device
A> Control {UEFI_and_Legacy_OpROM|UEFI_only|Legacy_OpROM_only}, Boot
A> from Network Devices {Legacy_OpROM_first|UEFI_driver_first}, OS
A> TYPE {Other_OS|Windows_UEFI_mode} in BIOS Setup, and none of them
A> worked. Qluman confirms that Boot network is configured as dhcp,
A> and systemctl says that there are no failed services.

A> Since I am the only one on the forum with such a problem,
A> probably it can be fixed easily. However, I have no idea what to
A> check and look forward for help.
You can check in /var/log/syslog on the head whether DHCP and TFTP requests arrive. If they don't, your UEFI BIOS might be broken. It's also possible that your legacy BIOS mode is broken and can't handle grub as the bootloader, which we use in our latest versions. You can try legacy PXElinux support by installing the qlustar-netboot-compat package as described in the release notes:
https://docs.qlustar.com/en-US/Qlustar_Cluster_OS/11.0/html-single/Release_N...
If you do so, you will have to restart qlumand (service qluman-server restart) and write the dnsmasq config.
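As a minimal sketch of the syslog check suggested above: the helper below prints only the lines that dnsmasq logs under its DHCP and TFTP tags. The function name `dnsmasq_boot_log` is hypothetical, and it assumes that dnsmasq on the head logs to /var/log/syslog with the usual `dnsmasq-dhcp[pid]` and `dnsmasq-tftp[pid]` tags.

```shell
#!/bin/sh
# Hypothetical helper: show DHCP/TFTP activity logged by dnsmasq.
# $1: log file to scan (assumed default: /var/log/syslog on the head node)
dnsmasq_boot_log() {
    log=${1:-/var/log/syslog}
    # -E: extended regex; matches the tags "dnsmasq-dhcp[" and "dnsmasq-tftp["
    grep -E 'dnsmasq-(dhcp|tftp)\[' "$log"
}
```

Running `dnsmasq_boot_log | tail -n 20` while a compute node attempts a network boot should show whether its requests reach the head at all; piping the output through `grep -i` with the node's MAC address narrows the DHCP lines to that node.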
If all this doesn't work, you will have to check at a low level with tcpdump to see what the node sends out to the head when trying to boot.
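A sketch of such a capture, restricted to boot-related traffic, could look as follows. The wrapper name `capture_boot_traffic` and the `DRY_RUN` switch are hypothetical, and the interface name must be replaced by whichever interface of the head is attached to the boot network. Only DHCP (UDP ports 67 and 68) and the initial TFTP request (UDP port 69) are matched; the TFTP data transfer itself moves to other ports and is not captured by this filter.

```shell
#!/bin/sh
# Hypothetical wrapper: capture DHCP and TFTP requests on one interface.
# $1: interface on the boot network (an assumption; adjust to your head node)
# Set DRY_RUN=1 to print the command instead of running it (tcpdump needs root).
capture_boot_traffic() {
    iface=$1
    filter='udp port 67 or udp port 68 or udp port 69'
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "tcpdump -vvne -i $iface '$filter'"
    else
        # -vv: verbose decode, -n: no name resolution, -e: print MAC addresses
        tcpdump -vvne -i "$iface" "$filter"
    fi
}
```

Printing the MAC addresses with `-e` makes it easy to tell which machine sent each DHCP Discover and whether any offer comes back from the head.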
Best,
Roland
Roland,
Thank you very much for your help and instructions. However, I could not track down the origin of the problem, and the problem itself is reproducible: an independent installation on different hardware behaves similarly.
# tcpdump -vvne -i br-int
shows that the booting node sends
13:03:31.877187 2c:fd:a1:c7:1c:72 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 389: (tos 0x0, ttl 64, id 1512, offset 0, flags [none], proto UDP (17), length 375)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 2c:fd:a1:c7:1c:72, length 347, xid 0xacb7dc7, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 2c:fd:a1:c7:1c:72
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        MSZ Option 57, length 2: 1464
        Parameter-Request Option 55, length 35:
          Subnet-Mask, Time-Zone, Default-Gateway, Time-Server
          IEN-Name-Server, Domain-Name-Server, Hostname, BS
          Domain-Name, RP, EP, RSZ
          TTL, BR, YD, YS
          NTP, Vendor-Option, Requested-IP, Lease-Time
          Server-ID, RN, RB, Vendor-Class
          TFTP, BF, GUID, Option 128
          Option 129, Option 130, Option 131, Option 132
          Option 133, Option 134, Option 135
        GUID Option 97, length 17: 0.96.45.126.76.135.44.200.244.97.208.44.253.161.199.28.114
        NDI Option 94, length 3: 1.3.16
        ARCH Option 93, length 2: 7
        Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003016"
and gets no response. There is also a client with an unknown MAC that keeps knocking, sending
13:03:30.633239 00:fd:45:86:bf:2c > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 303: (tos 0x10, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 289)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 00:fd:45:86:bf:2c, length 261, xid 0xc09adca9, secs 121, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 00:fd:45:86:bf:2c
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        MSZ Option 57, length 2: 576
        Parameter-Request Option 55, length 11:
          Option 125, Subnet-Mask, TFTP, LOG
          NTP, Hostname, Domain-Name, Domain-Name-Server
          Default-Gateway, BF, TFTP-Server-Address
Accordingly, /var/log/syslog is filled with messages like
Aug 7 06:25:23 cl-head qluman-dhcpscanner[1044]: - Reporting MAC 00:fd:45:86:bf:2c (timestamp: 07 Aug 20, 06:25)
Aug 7 06:25:28 cl-head dnsmasq-dhcp[1073]: DHCPDISCOVER(br-int) 00:fd:45:86:bf:2c no address available
Also, the head and VM-FE nodes keep exchanging packets like
192.168.52.253.54304 > 192.168.52.254.6001: Flags [P.], cksum 0xeb97 (incorrect -> 0x45fe), seq 37:74, ack 38, win 501, options [nop,nop,TS val 464898062 ecr 3557167855], length 37
and send reports like
192.168.52.254.47883 > 239.2.11.71.8649: [bad udp cksum 0xf039 -> 0x812d!] UDP, length 48
In addition,
# systemctl list-units |grep -i fail
shows that
jobmonarch-jobmond.service loaded failed failed Ganglia Job Monitoring Daemon
slurmctld.service loaded failed failed Slurm controller daemon
but Qluman-qt, as far as I can see, is green everywhere, and Manage_Cluster/Networks shows that the Boot network is configured as dhcp. Searching for a difference between the boot and external (static in my case) networks, I found that /etc/dnsmasq.d/00-header.conf contains
# Definition for Network 'Boot'
dhcp-range=set:Boot,192.168.52.0,static
and it seems to me that /etc/dnsmasq.conf says that this specifies a subnet which cannot be used for dynamic address allocation. I did not try to change it and look forward to your help.
Alexey
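The Vendor-Class string in the first capture above carries the client architecture code, which tells the boot server what kind of bootloader the node expects. The sketch below decodes that field using the architecture codes defined in RFC 4578; the function name `pxe_arch` is hypothetical, and only a few common codes are listed.

```shell
#!/bin/sh
# Hypothetical helper: decode the architecture field of a PXE Vendor-Class string.
# $1: string such as "PXEClient:Arch:00007:UNDI:003016"
pxe_arch() {
    # The architecture code is the third ':'-separated field.
    code=$(printf '%s\n' "$1" | cut -d: -f3)
    case $code in
        00000) echo "x86 BIOS" ;;
        00006) echo "EFI IA32" ;;
        00007) echo "EFI BC (x86-64 UEFI in practice)" ;;
        00009) echo "EFI x86-64" ;;
        *)     echo "unknown ($code)" ;;
    esac
}
```

For the capture above, `pxe_arch 'PXEClient:Arch:00007:UNDI:003016'` reports a UEFI client, which matches the "ARCH Option 93, length 2: 7" line in the same packet.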
"A" == Alexey Yaremchuk a.yaremch@yandex.ru writes:
Hi Alexey,
A> Roland, Thank you very much for help and instructions. However, I
A> could not trap the problem's origin and the problem itself is
A> reproducible -- an independent installation on different hardware
A> exhibits similar features.
A> # tcpdump -vvne -i br-int
A> shows that booting node sends ....
sorry, but I don't have the resources to debug networking problems.
A> In addition,
A> # systemctl list-units |grep -i fail
A> shows that

>> jobmonarch-jobmond.service loaded failed failed Ganglia Job
>> Monitoring Daemon
>> slurmctld.service loaded failed failed Slurm controller daemon
This means that Slurm is not properly configured.
A> but Qluman-qt, as far as I see, is green everywhere and in
A> Manage_Cluster/Networks shows that Boot network is configured as
A> dhcp. Searching for a difference between boot and external
A> (which is static in my case) networks, I found that
A> /etc/dnsmasq.d/00-header.conf contains

>> # Definition for Network 'Boot'
>> dhcp-range=set:Boot,192.168.52.0,static

A> and it seems to me that /etc/dnsmasq.conf says that this
A> specifies a subnet which can't be used for dynamic address
A> allocation. I did not try to change it and look forward for help.
Yes, QluMan does not configure dynamic DHCP. You need to know which host has which IP etc ...
Best,
Roland
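With a `static` dhcp-range, dnsmasq answers only hosts whose address it has been given explicitly (for example through dhcp-host entries). That is consistent with the "no address available" lines in the syslog excerpt. A quick way to see whether a given MAC is known is to search the generated configuration for it. The sketch below assumes that the per-host entries end up somewhere under /etc/dnsmasq.d, which the thread does not confirm; the function name `mac_known_to_dnsmasq` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a MAC address appears in a dnsmasq config directory.
# $1: MAC address
# $2: config directory (assumed default: /etc/dnsmasq.d)
mac_known_to_dnsmasq() {
    mac=$1
    dir=${2:-/etc/dnsmasq.d}
    # -r: recurse, -q: quiet, -i: ignore case, -F: match the MAC literally
    if grep -rqiF -- "$mac" "$dir"; then
        echo "known"
    else
        echo "unknown"
    fi
}
```

If `mac_known_to_dnsmasq 2c:fd:a1:c7:1c:72` prints "unknown" on the head, the node has most likely not yet been registered with its MAC and IP, so the head has no address to offer it.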