Hello List,
According to https://slurm.schedmd.com/mpi_guide.html#intel_srun, srun should be able to launch MPI jobs on its own. However, my test jobs fail with errors like these:
  Fatal error in PMPI_Init_thread: Other MPI error, error stack:
  MPIR_Init_thread(805).....: fail failed
  MPID_Init(1832)...........: channel initialization failed
  ...
Prior to the above errors, I see:
  mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
  mdrun_mpi_AVX_256: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
  mdrun_mpi_AVX_256: error: Couldn't load specified plugin name for auth/munge: Incompatible plugin version
  mdrun_mpi_AVX_256: error: cannot create auth context for auth/munge
Here, mdrun_mpi_AVX_256 is my executable. I've regenerated the Qlustar images and rebooted the nodes, but to no avail. I've also checked that the md5sums of auth_munge.so and slurmstepd match the ones given in /usr/lib/qlustar/modules/xenial-amd64/10.1.1/slurm.contents, and now I'm out of ideas. I guess it would help to know which plugin version Slurm expects, but unfortunately only the offending version is printed.
Any ideas what I should try next?
Thanks,
A.