Hello List,
according to https://slurm.schedmd.com/mpi_guide.html#intel_srun, srun should be able to launch MPI jobs on its own. However, my test jobs fail with errors like these:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).....: fail failed
MPID_Init(1832)...........: channel initialization failed
...
Prior to the above errors, I see:

mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
mdrun_mpi_AVX_256: error: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
mdrun_mpi_AVX_256: error: cannot create auth context for auth/munge
Where mdrun_mpi_AVX_256 is my executable. I've regenerated qlustar images and rebooted the nodes, but to no avail. I've also checked that the md5sums for auth_munge.so and slurmstepd match the ones given in /usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents, and now I'm out of ideas. I guess it would be nice to know what slurm thinks the correct version is, but unfortunately, only the offending version is printed...
Any ideas what I should try next?
Thanks,
A.
"A" == Ansgar Esztermann-Kirchner aeszter@mpibpc.mpg.de writes:
Hi Ansgar,
A> Hello List, according to
A> https://slurm.schedmd.com/mpi_guide.html#intel_srun, srun should
A> be able to launch MPI jobs on its own. However, my test jobs fail
A> with errors like these:

A> Fatal error in PMPI_Init_thread: Other MPI error, error stack:
A> MPIR_Init_thread(805).....: fail failed
A> MPID_Init(1832)...........: channel initialization failed ...

A> Prior to the above errors, I see: mdrun_mpi_AVX_256:
A> /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm
A> plugin version (17.11.9) mdrun_mpi_AVX_256:
A> /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm
A> plugin version (17.11.9) mdrun_mpi_AVX_256: error: Couldn't load
A> specified plugin name for auth/munge: Incompatible plugin version
A> mdrun_mpi_AVX_256: error: cannot create auth context for
A> auth/munge

A> Where mdrun_mpi_AVX_256 is my executable. I've regenerated
A> qlustar images and rebooted the nodes, but to no avail. I've also
A> checked that md5sums for auth_munge.so and slurmstepd match the
A> ones given in
A> /usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents, and
A> now I'm out of ideas. I guess it would be nice to know what slurm
A> thinks the correct version is, but unfortunately, only the
A> offending version is printed...
what is the slurm package version mentioned in /usr/share/doc/qlustar-image/slurm.packages.version (on the booted node)? Should be 17.11.9.2-ql.3+10-xenial.
So I assume 17.11.9.2 should be the correct version. Maybe some mismatch with the chroot?
A> Any ideas what I should try next?
See above.
Best,
Roland
Hello Roland,
thanks for your reply.
> what is the slurm package version mentioned in /usr/share/doc/qlustar-image/slurm.packages.version (on the booted node)? Should be 17.11.9.2-ql.3+10-xenial.
Yes, it is.
> So I assume 17.11.9.2 should be the correct version. Maybe some mismatch with the chroot?
As far as I can tell from the source, slurm versions are encoded in three bytes only, so the patchlevel (".2") is left out.
I tried to run my code on the master, and it worked without the error. I also checked md5sums between node and master for auth_munge.so (and all other plugins in the same location), and for slurmd and slurmstepd, and they all match up. So something fishy is going on, but I don't know what yet.
I'll keep you posted.
A.
Hello List,
I've finally found the culprit: in addition to libslurm32 from the slurm module, there's a libslurm31 in the chroot. For some reason, this is used when srun starts MPI jobs, and this triggers the error message (libslurm31 is 17.02).
I didn't notice this until I re-compiled Slurm with a custom message. Plus, it's not clear to me why the outdated libslurm.so gets used at all.
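In case it's useful to others: before resorting to a recompile, one can hunt for stray generations of the library by walking the usual library directories for anything matching libslurm*.so. A minimal Python sketch (the directory list is just a guess; adjust for your distribution):

```python
import os
import re

def find_libslurm(roots=("/usr/lib", "/usr/lib/x86_64-linux-gnu",
                         "/usr/local/lib")):
    """Walk the given directories and list every libslurm shared object.

    Seeing more than one soname generation (e.g. libslurm31 next to
    libslurm32) is the red flag: the dynamic linker may pick the old one.
    """
    hits = []
    for root in roots:
        for dirpath, _, files in os.walk(root):
            for name in files:
                # matches libslurm.so, libslurm31.so, libslurm32.so.0, ...
                if re.match(r"libslurm(\d*)\.so", name):
                    hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Running this inside the chroot as well as on the booted node should show immediately whether an old generation is still lying around.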
Anyway, case closed since I can start MPI jobs now.
A.
"A" == Ansgar Esztermann-Kirchner aeszter@mpibpc.mpg.de writes:
Hi Ansgar,
A> Hello List, I've finally found the culprit: in addition to
A> libslurm32 from the slurm module, there's a libslurm31 in the
A> chroot. For some reason, this is used when srun starts MPI jobs,
A> and this triggers the error message (libslurm31 is 17.02).
strange indeed.
A> I didn't notice this until I re-compiled Slurm with a custom
A> message. Plus, it's not clear to me why the outdated libslurm.so
A> gets used at all.
It shouldn't have been.
A> Anyway, case closed since I can start MPI jobs now.
Glad to hear that, and thanks for reporting.
Best,
Roland