Dear Qlustar Users,
I'm running what I think may be the world's worst supercomputer (for fun), composed entirely of laptops with different specs. This was a pandemic project run amok, just to see what is possible. I've used the opportunity to figure out how to do some strange things, one of which is replacing nodes quickly and frequently on a Qlustar-based system, as laptops fail fairly often when used as HPC nodes.
The Problem: After deleting a host from Qluman (13.6.0, and many previous versions) and assigning a new physical machine to the exact same hostname (an attempt at drop-in replacement), the new machine never acquires a DHCP lease. Everything looks good from Qluman's side, but the new node never gets the DHCP lease it needs to start loading/booting Qlustar. It does get one after the head node is restarted, but I'd rather not do that after every drop-in replacement.
The Solution: I found that each host has an entry in /var/lib/misc/dnsmasq.leases, and the stale entry blocks Qluman and such from truly assigning it as a new host. If I instead first delete that entry from dnsmasq.leases, restart the dnsmasq service by typing 'service dnsmasq restart', and also rewrite the Slurm config file (causing the slurmctld service to restart as well), it works!
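For anyone who wants to script the lease cleanup, here is a minimal sketch of just the "delete that entry" step. It assumes the usual dnsmasq lease layout of one lease per line with whitespace-separated fields (expiry time, MAC, IP, hostname, client ID), and that the hostname stored there matches the Qluman hostname exactly. The function name drop_host_leases and the sample hostnames are mine, purely for illustration. It only filters text; reading and writing the lease file and restarting dnsmasq are left to you.

```python
def drop_host_leases(lease_text: str, hostname: str) -> str:
    """Return lease_text without the lines whose hostname field equals hostname."""
    kept = []
    for line in lease_text.splitlines(keepends=True):
        fields = line.split()
        # Assumed field order: expiry, MAC, IP, hostname, client ID.
        # Exact match on the 4th field, so "node-0" does not remove "node-01".
        if len(fields) >= 4 and fields[3] == hostname:
            continue
        kept.append(line)
    return "".join(kept)


if __name__ == "__main__":
    # Hypothetical lease entries, not taken from a real cluster.
    sample = (
        "1700000000 aa:bb:cc:dd:ee:01 10.0.0.11 node-01 *\n"
        "1700000000 aa:bb:cc:dd:ee:02 10.0.0.12 node-02 *\n"
    )
    print(drop_host_leases(sample, "node-01"), end="")
```

I'd apply this to a copy of dnsmasq.leases first and compare the result before touching the real file, then restart dnsmasq as described above.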
The Request: Could the Qlustar mods please consider adding a step to the scripts behind the "Delete Hosts" menu option in Qluman's Enclosure View, which greps dnsmasq.leases for any line with that hostname, deletes it, and restarts the dnsmasq service? I think that would do the trick, but I may be mistaken. Happy to discuss more if you'd all like!
Sincerely, -Mike