cannot start MPI jobs
by Rolandas
Hi,
In a fresh Qlustar installation without InfiniBand, we cannot start MPI
jobs. The error message is long, but the essential part is:
[beo-01:19012] pmix_mca_base_component_repository_open: "mca_bfrops_v21"
does not appear to be a valid bfrops MCA dynamic component (ignored):
/usr/lib/openmpi/3.1.2/gcc/lib/openmpi/mca_pmix_ext2x.so: undefined
symbol: mca_bfrops_v21_component. ret -1
The Qlustar image for the nodes has these modules: core, slurm, ofed,
lustre-client. The UnionFS chroot (xenial/10.1) is freshly installed
without additional packages (dist-upgrade was executed).
Regards,
Rolandas
1 year, 7 months
Slurm srun & MPI
by Ansgar Esztermann-Kirchner
Hello List,
According to https://slurm.schedmd.com/mpi_guide.html#intel_srun, srun
should be able to launch MPI jobs on its own. However, my test jobs
fail with errors like these:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1832)...........: channel initialization failed
...
Prior to the above errors, I see:
mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so:
Incompatible Slurm plugin version (17.11.9)
mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so:
Incompatible Slurm plugin version (17.11.9)
mdrun_mpi_AVX_256: error: Couldn't load specified plugin name for
auth/munge: Incompatible plugin version
mdrun_mpi_AVX_256: error: cannot create auth context for auth/munge
Here, mdrun_mpi_AVX_256 is my executable. I've regenerated the Qlustar
images and rebooted the nodes, but to no avail. I've also checked that
the md5sums for auth_munge.so and slurmstepd match the ones given in
/usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents, and now
I'm out of ideas.
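The md5sum check described above can be scripted roughly like this (the "<md5sum>  <path>" line format of the contents file is an assumption, and the file names here are stand-ins for illustration):

```shell
# Sketch: verify a file's md5sum against the sum recorded in a contents list.
contents=slurm.contents.demo
file=auth_munge.so.demo
# Create a stand-in file and a matching contents entry for the demo.
: > "$file"
printf '%s  %s\n' "$(md5sum "$file" | awk '{print $1}')" "$file" > "$contents"
recorded=$(awk -v f="$file" '$2 == f {print $1}' "$contents")
actual=$(md5sum "$file" | awk '{print $1}')
if [ "$recorded" = "$actual" ]; then echo match; else echo MISMATCH; fi
```

On the real system the comparison would of course run against the installed auth_munge.so and the sums listed in slurm.contents.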
I guess it would be nice to know which version Slurm considers correct,
but unfortunately, only the offending version is printed...
Any ideas what I should try next?
Thanks,
A.
--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esztermann
1 year, 7 months
Re: multipath in qlustar storage nodes
by Roland Fehrenbacher
>>>>> "R" == Rolandas <rolnas(a)gmail.com> writes:
R> So we need to use /etc/qlustar/common/rc.boot and
R> /etc/qlustar/common/image-files to configure nodes? But I already
R> have a problem with the rc.boot scripts: they run too late (e.g. for
R> the ZFS pool import for Lustre, I have to run hostid >
R> /etc/hostid), or copy-files runs in parallel to the Lustre startup.
You can apply the patch below to the generated image (use
'qlustar-image-edit -s <image name>' and type 'exit' when done). This
will make spl use a hostid that doesn't change if the hardware of the
machine doesn't change. The patch will be part of future releases.
-----------------------------------------------------------------------------
diff --git a/sbin/init.qlustar b/sbin/init.qlustar
--- a/core/image/common/sbin/init.qlustar
+++ b/core/image/common/sbin/init.qlustar
@@ -91,8 +91,13 @@ echo FIXME
echo -e "### Starting Qlustar pre-systemd boot script /sbin/init.qlustar ###"
-# generate the systemd machine id here
+# generate the systemd machine id here ...
systemd-machine-id-setup
+# ... and also the hostid ...
+hostid > /etc/hostid
+# ... and generate a unique spl_host_id from it
+echo "options spl spl_hostid=$(( 16#$(cat /etc/hostid) ))" >> \
+ /etc/modprobe.d/spl.conf
-----------------------------------------------------------------------------
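For reference, the arithmetic in the patch converts the hex string written by hostid into the decimal number spl expects (the 16# radix prefix is bash arithmetic; the hostid value below is made up for illustration):

```shell
# Convert a hex hostid to the decimal spl_hostid, as the patch does.
hostid_hex=007f0101                 # hypothetical output of `hostid`
spl_hostid=$(( 16#$hostid_hex ))    # 16# = interpret the string as base 16
echo "options spl spl_hostid=$spl_hostid"
```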
R> But qlustar defines only one MAC address per host, but in the
R> case if first link is down, PXE booting will run on second link
R> with different MAC address.
Adapter redundancy upon boot is unfortunately not supported.
1 year, 7 months
Trouble installing
by Ian Kaufman
Hi all,
I have a server that already has Ubuntu 16.04 installed on a RAID 1 device
(two SSDs, software RAID). When I try to run the latest installer (Qlustar
10.1.1-3) from a live DVD, it fails to boot with the following error:
No supported filesystem images found at /live
Looking in the boot.log, I see
error reading /lib/udev/hwdb.bin: No such file or directory
Do you think that is a result of the drives being md devices? The
initramfs kernel sees /dev/md* and /dev/md12[67]
Thanks,
Ian
--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
1 year, 8 months
bug in slurm-config
by Tobias Moehle
Dear all,
I ran into two problems with the configuration of slurm:
a) Among others, I have the nodes 'compute-0-0' and 'compute-1-0'. If I
put them into a common partition, QluMan writes 'compute-[0-1]-0', but
this format does not seem to be readable by Slurm.
I am not sure whether this is a bug in Slurm or in QluMan, but it
requires manual reconfiguration.
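For what it's worth, Slurm can expand such bracketed expressions itself via `scontrol show hostnames 'compute-[0-1]-0'`. The hosts the pattern is meant to denote can also be listed with a plain loop:

```shell
# Expand compute-[0-1]-0 by hand to see which hosts it should denote.
for i in 0 1; do
    echo "compute-$i-0"
done
```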
b) When I try to assign some new properties to the 'Node Groups', the
fields 'Delete' and 'New Node Property' are grayed out; I can only add
values to existing properties.
Can you perhaps tell me what is going on? Is it maybe an issue
with my configuration somewhere?
Many thanks in advance,
Tobias Moehle
1 year, 8 months
Re: setup infiniband fails
by Tobias Moehle
Hello Roland,
Great! After the upgrade, IB works perfectly.
Thank you very much!
Best regards
> On 11.05.19 09:42, Roland Fehrenbacher wrote:
>>>>>>> "T" == Tobias Moehle <tobias.moehle(a)uni-rostock.de> writes:
>> Hi Tobias,
>>
>> please upgrade to the just released 10.1.1.4-b509f1240. Support for the
>> IBA7322 was missing and has been re-added.
>>
>> T> Dear all, On our cluster, we have some nodes with infiniband
>> T> which I am trying to set up at the moment.
>>
>> T> After some time of struggling, I finally understand:
>> T> 1) The hardware is found:
>>
>> T> #lspci | grep fini 04:00.0 InfiniBand: QLogic
>> T> Corp. IBA7322 QDR InfiniBand HCA (rev 02)
>> T> 2) The kernel-modules(?) seem to be loaded: ..................
>>
>> T> Also, "the pre-defined hardware property IB Adapter with a value
>> T> of true must be assigned to a host, to explicitly enable IB for
>> T> it"
>>
>> This is not necessary anymore. Needs to be fixed in the docs.
>>
>> T> But in my hardware properties no "IB Adapter" is available (only
>> T> #CPU cores, #CPU sockets, HW Type, Size of RAM and Chassis
>> T> Color). I expect this to be the issue; but what is wrong there?
>>
>> This HW property is obsolete. Just follow
>>
>> https://docs.qlustar.com/en-US/Qlustar_Cluster_OS/10.1/html-single/QluMan...
>>
>>
>> to configure the IB network.
>>
>> Best,
>>
>> Roland
>> _______________________________________________
>> Qlustar-General mailing list -- qlustar-general(a)qlustar.org
>> To unsubscribe send an email to qlustar-general-leave(a)qlustar.org
1 year, 8 months
setup infiniband fails
by Tobias Moehle
Dear all,
On our cluster, we have some nodes with infiniband which I am trying to
set up at the moment.
After some time of struggling, I finally understand:
1) The hardware is found:
#lspci | grep fini
04:00.0 InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)
2) The kernel-modules(?) seem to be loaded:
#lsmod | grep ib_
ib_ipoib 122880 0
ib_cm 53248 2 rdma_cm,ib_ipoib
ib_uverbs 90112 1 rdma_ucm
ib_umad 28672 0
ib_core 217088 7
rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
ipv6 405504 59 rdma_cm,ib_ipoib,ib_core
#lsmod | grep verbs
ib_uverbs 90112 1 rdma_ucm
ib_core 217088 7
rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
3) But 'ibv_devices' gives an empty list, as do ibstat etc.
Maybe helpful:
#ibnodes
ibwarn: [3220] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:784; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [3225] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:784; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
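Possibly relevant to the empty 'ibv_devices' output: the modules listed above are only the generic IB stack; the HCA additionally needs its hardware driver, which for the QLogic IBA7322 is ib_qib. A quick check (shown here against a simulated lsmod listing; on the node, pipe the real `lsmod` instead):

```shell
# Check whether the ib_qib HCA driver appears in (simulated) lsmod output.
lsmod_out='ib_ipoib 122880 0
ib_core 217088 7'
if echo "$lsmod_out" | grep -q '^ib_qib'; then
    echo "HCA driver loaded"
else
    echo "HCA driver missing"
fi
```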
From the Qlustar documentation
(https://qlustar.com/book/docs/qluman-guide#Configuring-IB) I know that I
need to set up nodes that run OpenSM (Set Generic Property->OpenSM
Ports->ALL); this I have done, and the service is running.
Also, "the pre-defined hardware property IB Adapter with a value of true
must be assigned to a host, to explicitly enable IB for it". But in my
hardware properties no "IB Adapter" is available (only #CPU cores, #CPU
sockets, HW Type, Size of RAM and Chassis Color).
I suspect this is the issue, but what is wrong here?
Any help is highly appreciated.
Best regards,
Tobias Moehle
1 year, 8 months
qlustar-initial-config fails on qluman db bootstrap
by rolnas@gmail.com
Hi,
In about 50% of cases, qlustar-initial-config fails in the QluMan
bootstrap step with duplicate-table errors such as "Table 'Nets' already
exists".
I'm using qlustar-installer-10.1.1-2 version.
Regards,
Rolandas
1 year, 8 months
Missing swap after qlustar installation
by rolnas@gmail.com
Hi,
I just installed from qlustar-installer-10.1.1-2.iso, and the installed
head node is missing swap. Looking into FaiConfig.py inside the
installer, it does not define a swap partition for FAI to create:
swapSize is defined and calculated, but not used anywhere.
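If that is the cause, the fix would presumably be for the installer to emit a swap line in the generated FAI disk_config. In FAI's disk_config syntax such an entry looks roughly like this (disk name and sizes are illustrative, not taken from FaiConfig.py):

```
disk_config disk1
primary /     20G   ext4   rw
primary swap  8G    swap   sw
```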
Best regards,
Rolandas
1 year, 8 months