diff -u /usr/sbin/ql-adjust-oomscore ql-adjust-oomscore
--- /usr/sbin/ql-adjust-oomscore	2025-04-15 11:41:22.000000000 +0300
+++ ql-adjust-oomscore	2025-05-28 15:52:27.993037862 +0300
@@ -32,6 +32,7 @@
     systemd Xorg sssd rsyslogd dbus-daemon lightdm mysqld \
     qluman-execd.py qlumand.py qluman-dhcpscanner.py qluman-router.py \
     slurmctld slurmd"
+USERS="root,_rpc,statd,messagebus,systemd-timesync,ganglia"
 DEFAULTS=/etc/default/oomscore

 function set_oom() {
@@ -42,7 +43,7 @@
         SCORE="$COMPAT_SCORE"
     fi
     for proc in $procs; do
-        pids=$(pgrep -f $proc)
+        pids=$(pgrep -u $USERS -f $proc)
         for i in $pids; do
             oom_adj=/proc/$i/${target_file}
             if [ -f $oom_adj ]; then
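The effect of the added -u filter can be sketched with a small shell experiment; "demo-marker" below is a hypothetical stand-in for a daemon name, and "nobody" stands in for the script's USERS list:

```shell
#!/bin/sh
# Sketch of the fix's effect: a process whose command line contains the
# pattern, but whose owner is not in the -u user list, is no longer selected.
sh -c 'sleep 3' demo-marker &      # child's command line contains "demo-marker"
bg=$!
pgrep -f demo-marker                     # matches: the pattern alone suffices
pgrep -u nobody -f demo-marker || true   # no match: owner filter excludes $bg
kill "$bg"
```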
On 2025-05-28 14:15, rolnas@gmail.com wrote:
Hi,
Finally found the source of the problem. The Qlustar 13 distribution has a
periodic (hourly) script that makes some processes unkillable by the OOM
killer, but the pattern it uses to select processes by their command line is
too wide, and process 182580 was matching it (because of "slurmd" in its
command line).
Regards Rolandas
P.S. The script is ql-adjust-oomscore.
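The failure mode can be reproduced with a small shell sketch; the marker argument is hypothetical, standing in for a job-script path like /var/lib/slurm-llnl/slurmd/job177126/slurm_script:

```shell
#!/bin/sh
# `pgrep -f PATTERN` matches against the *full* command line, so any process
# whose mere arguments contain "slurmd" is caught, not just the daemon itself.
sh -c 'sleep 3' fake-slurmd-jobpath &   # child's command line contains "slurmd"
bg=$!
pgrep -f slurmd      # lists $bg alongside any real slurmd
kill "$bg"
```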
On 2025-05-28 11:29, rolnas@gmail.com wrote:
Hi,
We are using SLURM 24.11.5 and maintain an HPC cluster. Several times we had nodes get stuck after user jobs. Investigation led to OOM situations. Deeper investigation found that, in some strange situations, the user jobs' oom_score_adj is left at -1000, the same as the slurmstepd process.
For example (some information is changed for privacy):

root      182565  0.0  0.0   947952    30128 ?  SLl  May27    0:04 slurmstepd: [177126.batch]
hpc_user  182580  0.0  0.0  1845524    70696 ?  Sl   May27    0:04  \_ /scratch/lustre/home/hpc_user/miniforge-pypy3/envs/snakemake/bin/python3.12 /var/lib/slurm-llnl/slurmd/job177126/slurm_script -p all --use-conda --rerun-incomplete
hpc_user  241662 99.9  7.2 30762668 28598376 ?  R    May27 1418:09   \_ ClonalFrameML ...

cat /proc/{182565,182580,241662}/oom_score_adj
-1000
-1000
-1000
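One way a job process can end up at -1000 is plain inheritance: oom_score_adj is copied to children on fork() and survives exec(), so anything spawned while the parent sits at -1000 starts there too. A minimal sketch on Linux (using a positive value, since lowering the score needs CAP_SYS_RESOURCE; the inheritance mechanism is the same):

```shell
#!/bin/sh
# oom_score_adj is per-process but inherited by children.
echo 500 > /proc/$$/oom_score_adj    # raise our own adjustment (no root needed)
sh -c 'cat /proc/$$/oom_score_adj'   # child prints the inherited value: 500
```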
I'm trying to replicate that situation, with no luck. Every time I get
user jobs with oom_score_adj=0, as it should be.
Regards Rolandas
P.S. The command used to launch this job was:
sbatch -A alloc_2a21c_test -c 12 --time=100:00:00 snakemake -p all --use-conda --rerun-incomplete