"A" == Ansgar Esztermann-Kirchner <aeszter@mpibpc.mpg.de> writes:
Hi Ansgar,
A> Hello List,
A>
A> according to https://slurm.schedmd.com/mpi_guide.html#intel_srun,
A> srun should be able to launch MPI jobs on its own. However, my
A> test jobs fail with errors like these:
A> Fatal error in PMPI_Init_thread: Other MPI error, error stack:
A> MPIR_Init_thread(805).....: fail failed
A> MPID_Init(1832)...........: channel initialization failed
A> ...
A> Prior to the above errors, I see:
A> mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
A> mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
A> mdrun_mpi_AVX_256: error: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
A> mdrun_mpi_AVX_256: error: cannot create auth context for auth/munge
A> Here, mdrun_mpi_AVX_256 is my executable. I've regenerated the
A> Qlustar images and rebooted the nodes, but to no avail. I've also
A> checked that the md5sums for auth_munge.so and slurmstepd match
A> the ones given in
A> /usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents, and
A> now I'm out of ideas. I guess it would be nice to know what Slurm
A> thinks the correct version is, but unfortunately, only the
A> offending version is printed...
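The md5sum comparison described above can be scripted roughly as follows. This is a minimal Python sketch, assuming that slurm.contents uses md5sum-style `<digest>  <path>` lines (its real format is not shown in this thread); the helper names `md5_of`, `parse_manifest` and `mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path

# Manifest path quoted in the message. ASSUMPTION: each line looks like
# md5sum output, "<32-hex-digest>  <path>"; the real format may differ.
MANIFEST = Path("/usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents")


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Map path -> expected digest from md5sum-style lines (assumed format)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and len(parts[0]) == 32:
            # md5sum marks binary mode with a leading "*" on the file name.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


def mismatches(expected: dict[str, str], root: Path = Path("/")) -> list[str]:
    """Return manifest paths that are missing under root or whose MD5 differs."""
    bad = []
    for rel, want in expected.items():
        path = root / rel.lstrip("/")
        if not path.is_file() or md5_of(path) != want:
            bad.append(rel)
    return bad


if __name__ == "__main__" and MANIFEST.exists():
    for rel in mismatches(parse_manifest(MANIFEST.read_text())):
        print("mismatch:", rel)
```

The `root` parameter only exists so the same check can be pointed at an unpacked image or chroot instead of the running system.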
What is the Slurm package version mentioned in /usr/share/doc/qlustar-image/slurm.packages.version (on the booted node)? It should be 17.11.9.2-ql.3+10-xenial.
So I assume 17.11.9.2 should be the correct version. Maybe some mismatch with the chroot?
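The version reasoning above can be sketched in Python. This assumes the upstream Slurm version is the first dotted numeric run in the package string (as read above: 17.11.9.2 out of 17.11.9.2-ql.3+10-xenial) and compares only major.minor.micro, the three components printed in the plugin error. Both the comparison rule and the helper names `upstream_triple` and `matches_plugin` are illustrative assumptions, not Slurm's own check.

```python
import re
from pathlib import Path

# File named in the reply. ASSUMPTION: it contains the package version
# string somewhere; its exact layout is not shown in this thread.
VERSION_FILE = Path("/usr/share/doc/qlustar-image/slurm.packages.version")

# First dotted numeric run, e.g. "17.11.9.2" out of "17.11.9.2-ql.3+10-xenial".
_VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def upstream_triple(text: str) -> tuple[int, int, int]:
    """Return (major, minor, micro) from the first version-like token in text.

    Extra components such as the trailing ".2" are ignored (an assumption).
    """
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    if len(parts) < 3:
        raise ValueError(f"need major.minor.micro, got {match.group(0)!r}")
    major, minor, micro = parts[:3]
    return major, minor, micro


def matches_plugin(package_version: str, plugin_version: str) -> bool:
    """Compare the package version with the one in the plugin error message."""
    return upstream_triple(package_version) == upstream_triple(plugin_version)


if __name__ == "__main__":
    # Version printed in "Incompatible Slurm plugin version (17.11.9)".
    plugin = "17.11.9"
    if VERSION_FILE.exists():
        package = VERSION_FILE.read_text()
    else:
        package = "17.11.9.2-ql.3+10-xenial"  # value quoted in the reply
    print(upstream_triple(package), matches_plugin(package, plugin))
```

If the triples agree, as they do for the two strings quoted here, the mismatch presumably lies somewhere other than this package version, e.g. in the chroot as suggested above.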
A> Any ideas what I should try next?
See above.
Best,
Roland