Roland,
Thank you very much for your help and instructions. However, I could not track down the origin of the problem, and the problem itself is reproducible: an independent installation on different hardware shows the same behaviour. Running

# tcpdump -vvne -i br-int

shows that the booting node sends
13:03:31.877187 2c:fd:a1:c7:1c:72 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 389: (tos 0x0, ttl 64, id 1512, offset 0, flags [none], proto UDP (17), length 375)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 2c:fd:a1:c7:1c:72, length 347, xid 0xacb7dc7, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 2c:fd:a1:c7:1c:72
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        MSZ Option 57, length 2: 1464
        Parameter-Request Option 55, length 35:
          Subnet-Mask, Time-Zone, Default-Gateway, Time-Server
          IEN-Name-Server, Domain-Name-Server, Hostname, BS
          Domain-Name, RP, EP, RSZ
          TTL, BR, YD, YS
          NTP, Vendor-Option, Requested-IP, Lease-Time
          Server-ID, RN, RB, Vendor-Class
          TFTP, BF, GUID, Option 128
          Option 129, Option 130, Option 131, Option 132
          Option 133, Option 134, Option 135
        GUID Option 97, length 17: 0.96.45.126.76.135.44.200.244.97.208.44.253.161.199.28.114
        NDI Option 94, length 3: 1.3.16
        ARCH Option 93, length 2: 7
        Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003016"
and gets no response. There is also a client with an unknown MAC that keeps knocking, sending
13:03:30.633239 00:fd:45:86:bf:2c > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 303: (tos 0x10, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 289)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 00:fd:45:86:bf:2c, length 261, xid 0xc09adca9, secs 121, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 00:fd:45:86:bf:2c
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        MSZ Option 57, length 2: 576
        Parameter-Request Option 55, length 11:
          Option 125, Subnet-Mask, TFTP, LOG
          NTP, Hostname, Domain-Name, Domain-Name-Server
          Default-Gateway, BF, TFTP-Server-Address
Accordingly, /var/log/syslog is filled with messages like
Aug 7 06:25:23 cl-head qluman-dhcpscanner[1044]: - Reporting MAC 00:fd:45:86:bf:2c (timestamp: 07 Aug 20, 06:25)
Aug 7 06:25:28 cl-head dnsmasq-dhcp[1073]: DHCPDISCOVER(br-int) 00:fd:45:86:bf:2c no address available
Also, the head and VM-FE nodes keep exchanging annoying traffic like
192.168.52.253.54304 > 192.168.52.254.6001: Flags [P.], cksum 0xeb97 (incorrect -> 0x45fe), seq 37:74, ack 38, win 501, options [nop,nop,TS val 464898062 ecr 3557167855], length 37
and send reports like
192.168.52.254.47883 > 239.2.11.71.8649: [bad udp cksum 0xf039 -> 0x812d!] UDP, length 48
In addition,

# systemctl list-units | grep -i fail

shows that
jobmonarch-jobmond.service loaded failed failed Ganglia Job Monitoring Daemon
slurmctld.service loaded failed failed Slurm controller daemon
but Qluman-qt, as far as I can see, is green everywhere, and Manage_Cluster/Networks shows that the Boot network is configured as dhcp. Looking for a difference between the boot and external networks (the external one is static in my case), I found that /etc/dnsmasq.d/00-header.conf contains
# Definition for Network 'Boot'
dhcp-range=set:Boot,192.168.52.0,static
and it seems to me that /etc/dnsmasq.conf says that this specifies a subnet which can't be used for dynamic address allocation. I have not tried to change it and look forward to your help.
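
A hedged illustration of that reading: the dnsmasq man page describes the static mode of dhcp-range as enabling DHCP on the network without dynamic allocation, so only hosts with a matching dhcp-host entry (or an /etc/ethers entry) are served. The Python sketch below is a deliberately simplified check, not dnsmasq's own logic. The function names are hypothetical, the IP address in the second sample is made up for illustration, and tags, dhcp-hostsfile, /etc/ethers and multi-file configurations are ignored.

```python
import re

# Matches a colon-separated Ethernet address such as 2c:fd:a1:c7:1c:72.
MAC_RE = re.compile(r"[0-9a-f]{2}(:[0-9a-f]{2}){5}")


def _directives(lines, key):
    """Yield the comma-separated fields of every 'key=...' line, skipping comments."""
    prefix = key + "="
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if line.startswith(prefix):
            yield [field.strip() for field in line[len(prefix):].split(",")]


def static_ranges(lines):
    """Return the dhcp-range entries that carry the 'static' mode keyword."""
    return [fields for fields in _directives(lines, "dhcp-range") if "static" in fields]


def host_macs(lines):
    """Collect every MAC address that appears in a dhcp-host entry."""
    macs = set()
    for fields in _directives(lines, "dhcp-host"):
        macs.update(f.lower() for f in fields if MAC_RE.fullmatch(f.lower()))
    return macs


def lacks_static_entry(lines, mac):
    """True if a static range exists but no dhcp-host entry names this MAC."""
    return bool(static_ranges(lines)) and mac.lower() not in host_macs(lines)


# The two lines quoted from /etc/dnsmasq.d/00-header.conf in this post.
HEADER_CONF = [
    "# Definition for Network 'Boot'",
    "dhcp-range=set:Boot,192.168.52.0,static",
]

# Same config plus a hypothetical reservation (the IP address is an assumption).
WITH_HOST = HEADER_CONF + ["dhcp-host=2c:fd:a1:c7:1c:72,192.168.52.10"]
```

With only the two quoted lines, lacks_static_entry(HEADER_CONF, "2c:fd:a1:c7:1c:72") is true, which would be consistent with the "no address available" messages in syslog. Whether the real configuration has dhcp-host entries in other files under /etc/dnsmasq.d/ is not something this sketch can tell.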
Alexey