Hello List,
recently, I've moved /apps to its own device. This invalidates NFS handles on the nodes, of course, so I started to reboot them. To my surprise, they don't come up again. The nodes complain about a time-out, "Failed to request QluMan node config in time", and ask me to check qlumand and qluman-route on the head.
These two processes are indeed running. I've checked the logs, but couldn't find anything helpful (to me) in there.
qlumand seems to see the node briefly:

2019-03-05 11:45:06,615 [29219] INFO server.admin - Identifying node from '00-25-90-d9-08-86'
2019-03-05 11:45:06,617 [29219] INFO server.admin - Registering Execd 'node31-35'
2019-03-05 11:45:39,645 [29219] INFO server.admin - Execd '00-25-90-d9-08-86' disconnected
I've attached the router log.
There's one thing that's conspicuous, but it doesn't seem to be correlated with the nodes booting: a stack trace when accessing the database.
I'll be grateful for any pointers.
Thanks,
A.
Hello Ansgar Esztermann,
I'm sorry to hear about the boot failure. Unfortunately the logs you provided don't shed any light on the cause. But let me explain what I see in the logs:
router.log
==========
The "Zap request" is ZeroMQ's way of authenticating peers, so its showing up at all means the node can talk to the router. Next, "Asking master" shows the router forwarding the authentication request to qlumand, and "Authenticated peer" shows that qlumand has responded. The peer was correctly identified and authenticated as an execd. So far everything is going right.
Even the last line, that the execd has expired, might or might not be normal. During boot the node requests its configuration, and then the qluman-execd process restarts with the new configuration. From the router log it's impossible to say whether the disconnect is due to a failure to get the configuration or to the planned restart. From what I can tell, the router did its job fine; it routed all the traffic correctly.
That leaves two other components in the mix:
1) The qlumand running on the headnode.
Please have another look at /var/log/qluman/qlumand.log and try to capture everything between the time you turn on a node and the node failing with the message that it failed to request the node config.
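If it helps, a minimal way to grab exactly that window might be something like the following (a sketch only; the helper name is made up, and the log path is the one mentioned above):

```shell
# Hypothetical helper: follow a log file for a limited window and save a copy.
capture_log_window() {
  local log=$1 out=$2 secs=${3:-300}
  # -n 0: only lines written from now on; -F: survive log rotation;
  # timeout ends the capture after $secs seconds.
  timeout "$secs" tail -n 0 -F "$log" | tee "$out"
}

# On the head node, start this just before powering the node on:
# capture_log_window /var/log/qluman/qlumand.log qlumand-boot.log 300
```

Five minutes should comfortably cover the node's boot and the timeout, and the saved file can be attached whole.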
2) The qluman-execd running on the node itself.
If you can, please include the execd.log. Depending on the Qlustar version, the log should be in either /root/log or /run/log. Alternatively, you can run execd manually from the shell you get after the timeout using:
timeout -s TERM 120 /usr/sbin/qluman-execd --write-conf 2>&1 | tee execd.log
trace.ksh
=========
"Lost connection to MySQL server during query" means exactly that: the qluman-server lost the connection to the database where all the configurations are stored. But you are correct that this isn't correlated with the nodes booting. "scan_apt_database" is invoked on updates to detect new versions of Qlustar image modules. If the mysql package is updated at the same time, then the mysql server is shut down for restart, causing the connection to be dropped.
So this exception is not totally unexpected. You can check whether mysqld is running, but if the error does not repeat itself, it is nothing to worry about. The qlumand automatically reconnects to mysqld once it has restarted.
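For illustration, the reconnect behaviour follows a common pattern; this is not QluMan's actual code, just a generic sketch of retry-on-lost-connection (all names here are made up):

```python
import time

def with_reconnect(op, connect, retries=5, delay=0.1,
                   lost_errors=(ConnectionError,)):
    """Run op(conn); on a lost connection, reconnect and retry.

    Generic sketch of how a daemon survives a database restart: the
    dropped connection raises an error, we back off briefly, open a
    fresh connection, and retry the query.
    """
    conn = connect()
    for attempt in range(retries):
        try:
            return op(conn)
        except lost_errors:
            time.sleep(delay)   # give the server time to come back up
            conn = connect()    # fresh connection after the restart
    raise RuntimeError("server did not come back after %d retries" % retries)
```

From the daemon's point of view, a single dropped connection during a package upgrade is therefore just one retried query, not a failure.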
I hope this brings us closer to finding the problem.
Yours, Goswin von Brederlow
Hi Goswin,
> I'm sorry to hear about the boot failure. Unfortunately the logs you
> provided don't shed any light on the cause. But let me explain what I
> see in the logs:
Thank you for your explanation. I have now taken a closer look at the execd.log, and the nodes are booting again. Here's roughly what happened (although a few details remain unclear to me):

- I switched on the "Initialize IPMI" property in the generic property set I use for compute nodes;
- upon reboot, the nodes couldn't find the command that's needed to configure the IPMI interface (I don't remember the name, and the logs are probably gone, but the wrapper script called by init.qlustar complained that "Freeimpi doesn't work");
- I did in fact notice that error at the time, but a) thought it was probably not fatal, and b) set "Initialize IPMI" to false just in case;
- when things still did not work, I looked at the head node for problems, and wrote to the list when I couldn't find any;
- following your suggestions, I looked at the execd log again, regenerated the image, and voilà: the nodes boot again;
- I have now set "Initialize IPMI" to true again, and the nodes boot and do configure their IPMI cards.
Two observations remain:

- There seems to be a typo in init.qlustar. I've attached a patch in case it's useful.
- The head node does not have a network interface in the (in-band) IPMI net. Do I configure this manually, or is this supposed to happen automatically?
Thanks a lot,
A.
"A" == Ansgar Esztermann-Kirchner aeszter@mpibpc.mpg.de writes:
Hi Ansgar,
A> - following your suggestions, I looked at the execd log again,
A> regenerated the image, and voilà: the nodes boot again;
A> - I have now set "Initialize IPMI" to true again, and the nodes
A> boot and do configure their IPMI cards.
glad your nodes boot again and have the IPMI configured.
A> Two observations remain:
A> - There seems to be a typo in init.qlustar. I've attached a patch
A> in case it's useful.
Seems to be missing.
A> - The head node does not have a network interface in the (in-band)
A> IPMI net. Do I configure this manually, or is this supposed to
A> happen automatically?
This needs to be done manually at this stage.
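For reference, a manual setup might look roughly like this (a hedged sketch only: the interface name and subnet below are placeholders, not values from this thread; use the interface and the IPMI subnet actually configured on your cluster):

```shell
# Hypothetical example: give the head node an address in the IPMI net.
# "eth1" and "192.168.3.1/24" are placeholders; substitute your actual
# interface and the subnet assigned to the IPMI network.
ip addr add 192.168.3.1/24 dev eth1
ip link set eth1 up
```

To survive reboots, the equivalent stanza would also need to go into the head node's persistent network configuration (e.g. /etc/network/interfaces on Debian-based systems).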
Best,
Roland
Hi,
A> Two observations remain:
A> - There seems to be a typo in init.qlustar. I've attached a patch
A> in case it's useful.
Seems to be missing.
Indeed.
A> - The head node does not have a network interface in the (in-band)
A> IPMI net. Do I configure this manually, or is this supposed to
A> happen automatically?
This needs to be done manually at this stage.
OK, fine by me.
A.
"A" == Ansgar Esztermann-Kirchner aeszter@mpibpc.mpg.de writes:
A> Hi,
A> Two observations remain:
A> - There seems to be a typo in init.qlustar. I've attached a patch
A> in case it's useful.
>> Seems to be missing.
A> Indeed.
Thanks for the patch, we had already noticed and fixed it recently when overhauling the multicast process ...